variation_information#

graph_tool.inference.variation_information(x, y, norm=False)[source]#

Returns the variation of information between two partitions.

Parameters:
xiterable of int values

First partition.

yiterable of int values

Second partition.

norm(optional, default: False)

If True, the result will be normalized in the range $$[0,1]$$.

Returns:
VIfloat

Variation of information value.

Notes

The variation of information [meila_comparing_2003] is defined as

$\text{VI}(\boldsymbol x,\boldsymbol y) = -\frac{1}{N}\sum_{rs}m_{rs}\left[\ln\frac{m_{rs}}{n_r} + \ln\frac{m_{rs}}{n_s'}\right],$

with $$m_{rs}=\sum_i\delta_{x_i,r}\delta_{y_i,s}$$ being the contingency table between $$\boldsymbol x$$ and $$\boldsymbol y$$, and $$n_r=\sum_sm_{rs}$$ and $$n'_s=\sum_rm_{rs}$$ are the group sizes in both partitions.

If norm == True, the normalized value is returned:

$\frac{\text{VI}(\boldsymbol x,\boldsymbol y)}{\ln N}$

which lies in the unit interval $$[0,1]$$.

This algorithm runs in time $$O(N)$$ where $$N$$ is the length of $$\boldsymbol x$$ and $$\boldsymbol y$$.

References

Marina Meilă, “Comparing Clusterings by the Variation of Information,” in Learning Theory and Kernel Machines, Lecture Notes in Computer Science No. 2777, edited by Bernhard Schölkopf and Manfred K. Warmuth (Springer Berlin Heidelberg, 2003) pp. 173–187. DOI: 10.1007/978-3-540-45167-9_14 [sci-hub, @tor]

Examples

>>> x = np.random.randint(0, 10, 1000)
>>> y = np.random.randint(0, 10, 1000)
>>> gt.variation_information(x, y)
4.525389...