# mutual_information

Returns the mutual information between two partitions.

Parameters:
x : iterable of int values

First partition.

y : iterable of int values

Second partition.

norm : bool (optional, default: True)

If True, the result will be normalized in the range $$[0,1]$$.

adjusted : bool (optional, default: False)

If True, the adjusted mutual information will be computed. Note that if adjusted == True, the parameter norm is ignored.

avg_method : str (optional, default: "arithmetic")

Determines how the normalization average is computed for the adjusted mutual information. Must be one of: "min", "max", "arithmetic", or "geometric". This option has an effect only if adjusted == True.

Returns:
MI : float

Mutual information value

Notes

The mutual information is defined as

$\text{MI}(\boldsymbol x,\boldsymbol y) = \frac{1}{N}\sum_{rs}m_{rs}\ln\frac{N m_{rs}}{n_rn'_s},$

with $$m_{rs}=\sum_i\delta_{x_i,r}\delta_{y_i,s}$$ being the contingency table between $$\boldsymbol x$$ and $$\boldsymbol y$$, and $$n_r=\sum_sm_{rs}$$ and $$n'_s=\sum_rm_{rs}$$ are the group sizes in both partitions.
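For illustration, this plug-in estimate can be computed directly from the contingency table with NumPy. The sketch below (the name `mi_plugin` is a stand-in for illustration, not part of the library) simply mirrors the formula above:

```python
import numpy as np

def mi_plugin(x, y):
    """Plug-in estimate of MI(x, y) in nats, from the contingency table m_rs."""
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    xi = np.unique(x, return_inverse=True)[1]   # relabel groups of x as 0..R-1
    yi = np.unique(y, return_inverse=True)[1]   # relabel groups of y as 0..S-1
    m = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(m, (xi, yi), 1)                   # m[r, s] = sum_i delta(x_i, r) delta(y_i, s)
    n_r = m.sum(axis=1)                         # group sizes in x
    n_s = m.sum(axis=0)                         # group sizes in y
    nz = m > 0                                  # empty cells contribute nothing
    return (m[nz] / N * np.log(N * m[nz] / np.outer(n_r, n_s)[nz])).sum()
```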

If norm == True, the normalized value is returned:

$2\frac{\text{MI}(\boldsymbol x,\boldsymbol y)}{H_x + H_y}$

which lies in the unit interval $$[0,1]$$, and where $$H_x = -\frac{1}{N}\sum_rn_r\ln\frac{n_r}{N}$$ and $$H_y = -\frac{1}{N}\sum_sn'_s\ln\frac{n'_s}{N}$$ are the entropies of the two partitions.
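A corresponding sketch of the normalization, reusing the hypothetical `mi_plugin()` above, could look like:

```python
import numpy as np

def normalized_mi(x, y):
    """2 MI / (H_x + H_y); the degenerate case of a single group in both
    partitions (H_x + H_y == 0) is not handled."""
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    n_r = np.unique(x, return_counts=True)[1]   # group sizes in x
    n_s = np.unique(y, return_counts=True)[1]   # group sizes in y
    H_x = -(n_r / N * np.log(n_r / N)).sum()
    H_y = -(n_s / N * np.log(n_s / N)).sum()
    return 2 * mi_plugin(x, y) / (H_x + H_y)
```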

If adjusted == True, the adjusted mutual information is returned, defined as [vinh_information_2009]:

$\text{AMI}(\boldsymbol x,\boldsymbol y) = \frac{\text{MI}(\boldsymbol x,\boldsymbol y) - E[\text{MI}(\boldsymbol x,\boldsymbol y)]} {\text{avg}(H_x, H_y) - E[\text{MI}(\boldsymbol x,\boldsymbol y)]},$

where

$\begin{split}E[\text{MI}(\boldsymbol x,\boldsymbol y)] = \sum_r\sum_s\sum_{m=\max(1,n_r+n'_s-N)}^{\min(n_r, n'_s)} \frac{m}{N} \ln \left( \frac{Nm}{n_rn'_s}\right) \times \\ \frac{n_r!n'_s!(N-n_r)!(N-n'_s)!}{N!m!(n_r-m)!(n'_s-m)!(N-n_r-n'_s+m)!}\end{split}$

is the expected value of $$\text{MI}(\boldsymbol x,\boldsymbol y)$$ if either partition has its labels randomly permuted, and $$\text{avg}$$ is the function corresponding to the parameter avg_method.
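Purely as a sanity check of the definition (not how the library evaluates it, and considerably slower than the O(N) figure quoted below), the expected value can be computed with log-factorials via scipy.special.gammaln; the names `expected_mi` and `avg_funcs` below are illustrative only:

```python
import numpy as np
from scipy.special import gammaln   # log k! = gammaln(k + 1)

def expected_mi(n_r, n_s, N):
    """E[MI] under random label permutations, via the hypergeometric weights."""
    emi = 0.0
    for a in n_r:
        for b in n_s:
            for m in range(max(1, a + b - N), min(a, b) + 1):
                # log of the hypergeometric weight of a contingency cell count m
                log_w = (gammaln(a + 1) + gammaln(b + 1)
                         + gammaln(N - a + 1) + gammaln(N - b + 1)
                         - gammaln(N + 1) - gammaln(m + 1)
                         - gammaln(a - m + 1) - gammaln(b - m + 1)
                         - gammaln(N - a - b + m + 1))
                emi += (m / N) * np.log(N * m / (a * b)) * np.exp(log_w)
    return emi

# AMI = (MI - E[MI]) / (avg(H_x, H_y) - E[MI]), with avg selected by avg_method:
avg_funcs = {"min": min, "max": max,
             "arithmetic": lambda u, v: (u + v) / 2,
             "geometric": lambda u, v: np.sqrt(u * v)}
```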

This algorithm runs in time $$O(N)$$ where $$N$$ is the length of $$\boldsymbol x$$ and $$\boldsymbol y$$.


If adjusted == False, the algorithm will not run in parallel.

References

[vinh_information_2009] Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. "Information theoretic measures for clusterings comparison: is a correction for chance necessary?", in Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). Association for Computing Machinery, New York, NY, USA, 1073–1080. DOI: 10.1145/1553374.1553511

Examples

>>> x = np.random.randint(0, 10, 1000)
>>> y = np.random.randint(0, 10, 1000)
>>> gt.mutual_information(x, y)
0.036350...
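
As a rough cross-check (our assumption, not part of the library's documented behavior), the unnormalized value should coincide with the plug-in estimate from scikit-learn, which is also expressed in nats:

>>> from sklearn.metrics import mutual_info_score
>>> mi_gt = gt.mutual_information(x, y, norm=False)
>>> mi_sk = mutual_info_score(x, y)   # for comparison with mi_gt

Up to floating-point error, mi_gt and mi_sk should agree if both implement the same plug-in estimate.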