Metrics

class gct.GraphMetrics(data)

metrics for a graph. e.g. density, diameter etc.

The class is here just for convenience. It is not powerful and efficient. One may analyze the graph use other graph libraries (e.g. SNAP, igraph, networkit, etc) using converting functions.

__init__(data)
Parameters

data – a gct.Dataset object

property directed

Return True if the graph is directed

property weighted

Return True if the graph is weighted

property num_edges

Return number of edges

property num_vertices

Return number of vertices (nodes)

property density1

Return density of \(\frac{|E|}{|V|}\)

property density

Return density of \(\frac{|E|}{|all\ possible\ edges|}\).

property degrees

return weighted node degrees

property unweighted_degrees

return unweighted node degrees (e.g. number of out edges)

class gct.SNAPGraphMetrics(data)

metrics for a graph using SNAP

__init__(data)
Parameters

data – a gct.Dataset object

property degree_histogram

return list[(degree,num_node)]

property degree_in_histogram

return list[(degree,num_node)] of in degree

property degree_out_histogram

return list[(degree,num_node)] of out degree

property node_degrees

return list[(node,degree)]

property num_self_edges

return number of self edges

property scc_distribution

return list[(scc_size, count)] for strongly connected components

property wcc_distribution

return list[(cc_size, count)] for weakly connected components

property edge_bridges

return list[edge] where edge is a bridge.

An edge is a bridge if, when removed, increases the number of connected components.

effect_diameter(n_node=100, isDir=False)

Returns the (approximation of the) Effective Diameter (90-th percentile of the distribution of shortest path lengths) of a graph

Parameters
  • n_node – number of nodes to sample

  • isDir – consider direct or not

diameter(n_node=100, isDir=False)

Computes the diameter, or ‘longest shortest path’, of a Graph

This diameter is approximate, as it is calculated with an n_node number of random starting nodes.

Parameters
  • n_node – number of nodes to sample

  • isDir – consider direct or not

sample_shortest_path(n_node=100, isDir=False)

sample diameter, e.g. ‘shortest path’, of a Graph

Parameters
  • n_node – number of nodes to sample

  • isDir – consider direct or not

sample_degree_centrality(n_node=100)

Degree centrality of a node is defined as its degree/(N-1), where N is the number of nodes in the network.

Parameters

n_node – number of nodes to sample

sample_betweenness_centrality(quality=1, isDir=False)

Computes (approximate) Node and Edge Betweenness Centrality based on a sample

Parameters
  • quality – Quality of the approximation. 1.0 gives exact betweenness values.

  • isDir – consider direct or not

sample_closeness_centrality(n_node=100, normalized=True, isDir=False)

Returns closeness centrality sample in Graph. Closeness centrality is equal to 1/farness centrality.

Parameters
  • n_node – number of nodes to sample

  • normalized – Output should be normalized (True) or not (False).

  • isDir – consider direct or not

sample_farness_centrality(n_node=100, normalized=True, isDir=False)

Returns farness centrality sample in Graph. Farness centrality of a node is the average shortest path length to all other nodes that reside in the same connected component as the given node.

Parameters
  • n_node – number of nodes to sample

  • normalized – Output should be normalized (True) or not (False).

  • isDir – consider direct or not

page_rank_score(C=0.85, Eps=0.0001, MaxIter=100)

Computes the PageRank score of every node in Graph

Parameters
  • C – Damping factor.

  • Eps – Convergence difference.

  • MaxIter – Maximum number of iterations.

hubs_and_authorities_score(MaxIter=20)

Computes the Hubs and Authorities score of every node in Graph

return tuple of hubs score and authorrities score :param MaxIter: Maximum number of iterations.

sample_node_eccentricity(n_node=100, normalized=True, isDir=False)

Returns node eccentricity, the largest shortest-path distance from the node to any other node in the Graph.

return tuple of hubs score and authorrities score :param MaxIter: Maximum number of iterations.

eigenvector_centrality(Eps=0.0001, MaxIter=100)

Computes eigenvector centrality of all nodes in Graph. Eigenvector Centrality of a node N is defined recursively as the average of centrality values of N’s neighbors in the network.

Parameters
  • Eps – Convergence difference.

  • MaxIter – Maximum number of iterations.

average_clustering_coefficient(sample_node=False)

Computes the average clustering coefficient as defined in Watts and Strogatz, Collective dynamics of ‘small-world’ networks

Parameters

sample_node – compute clustering coefficient only for a random sample of SampleNodes nodes. Useful for approximate but quick computations.

distribution_clustering_coefficient(sample_node=False)

Computes the distribution of clustering coefficient as defined in Watts and Strogatz, Collective dynamics of ‘small-world’ networks

Parameters

sample_node – compute clustering coefficient only for a random sample of SampleNodes nodes. Useful for approximate but quick computations.

k_core(K)

Returns the K-core of the graph Graph. If the core of order K does not exist, the function returns an empty graph.

class gct.GraphClusterMetrics(data, clusteringobj)

metrics for a clustering of a graph. e.g. modularity.

__init__(data, clusteringobj)
Parameters
property num_edges

number of edges of the graph

property num_vertices

number of vertices (nodes) of the graph

property node_degrees

degrees for all the nodes

property node_weights

weight sum for each node

property unweighted_degrees

unweighted degrees of nodes

property weighted_degrees

weighted degrees of nodes

property num_clusters

number of partitions

property cluster_indexes

cluster identifiers

property cluster_sizes

return cluster size for each cluster

property cluster_expansions

return expansions for each cluster where expansions is

\[\frac{c_s}{n_s}\]

where \(c_s\) is the cluster size of cluster \(s\) and \(n_s\) is the number of vertices for the cluster.

property cluster_cut_ratios

return cut ratio for each cluster which is

\[\frac{c_s}{n_s(n-n_s)}\]

where \(c_s\) is the cluster size of cluster \(s\), n is the number of node of graph and \(n_s\) is the number of vertices for the cluster.

property intra_cluster_densities

return internal cluster density for each cluster which is

\[\frac{m_s}{n_s(n-n_s)/2}\]

where \(m_s\) is number of edges of cluster \(s\), n is the number of node of graph and \(n_s\) is the number of vertices for the cluster.

property inter_cluster_density

return internal cluster density for the clustring which is

\[\frac{|\{ (u,v) | u \in C_i, v \in C_j, i \neq j \}|}{n(n-1)/2 + \sum_{s=1}^{k} n_s(n-n_s)/2}\]

where \(m_s\) is number of edges of cluster \(s\), n is the number of node of graph, \(n_s\) is the number of vertices for the cluster, \(C_i\) is the set of nodes in i-th cluster

property relative_cluster_densities

return relative cluster density for each cluster which is

\[\frac{deg_{intra}}{deg_{intra}+deg_{out}}\]

where the degree is weighted degree.

property modularity

return modularity of the clustering for weighted or unweighted graph. For unweighted graph it is

\[Q=1/(2m) \sum_{i,j} (A_{ij}- \frac{k_i k_j}{2m}) 1_{(i=j)}\]

where m is the number of edges, A is the adjacency matrix, 1 is indicator function, \(k_i k_j\) is the expected number of random edges between the two nodes.

property conductance

return conductance of a cluster S, for unweighted graph which is

\[\frac{Cut_S}{\min (deg_S, deg_{(V \setminus S}))}\]

where S is a cluster

property normalized_cut

normalized cut

property cluster_max_out_degree_fraction

max out degree fraction

property cluster_avg_out_degree_fraction

average out degree fraction

property cluster_flake_out_degree_fraction

Flake out degree fraction

property separability

separability \(m_s/c_s\)

property cluster_clustering_coefficient

(global) clustering coefficient

property cluster_local_clustering_coefficient

local clustering coefficient

class gct.ClusterComparator(clusteringobj1, clusteringobj2)

metrics for two clustering. e.g. nmi, overlap nmi etc. In case that some metrics requires ground truth, make ground truth as the first parameter.

When calculating a metric that does not support overlapping, the overlapped nodes are removed or retrun None

property ground_truth

Alias for for the first Clustering of constructor

property predition

Alias for for the second Clustering of constructor

sklean_nmi()

sklearn normalized_mutual_info_score

sklean_ami()

sklearn adjusted_mutual_info_score

sklean_ars()

sklearn adjusted_rand_score

sklean_completeness()

sklearn completeness_score

GenConvNMI(sync=None, id_remap=None, nmis=None, fnmi=True, risk=None, error=None, fast=None, membership=None, retain_dups=None)

A wrapper for https://github.com/eXascaleInfolab/GenConvNMI

Arguments

Usage: ./gecmi [options] <clusters1> <clusters2> clusters - clusters file in the CNL format, where each line lists space separated ids of the cluster members

-h [ –help ]

produce help message

–input arg

name of the input files

-s [ –sync ]

synchronize the node base omitting the non-matching nodes for the fair evaluation. The node base is selected automatically as a clustering having the least number of nodes.

-i [ –id-remap ]

remap ids allowing arbitrary input ids (non-contiguous ranges), otherwise ids should form a solid range and start from 0 or 1

-n [ –nmis ]

output both NMI [max] and NMI_sqrt

-f [ –fnmi ]

evaluate also FNMI, includes ‘-n’

-r [ –risk ] arg (=0.01)

probability of value being outside

-e [ –error ] arg (=0.01)

admissible error

-a [ –fast ]

apply fast approximate evaluations that are less accurate, but much faster on large networks

-m [–membership] arg (=1)

average expected membership of nodes in the clusters, > 0, typically >= 1

-d [ –retain-dups ]

retain duplicated clusters if any instead of filtering them out (not recommended)

Reference

overlap nmi

Esquivel, Alcides Viamontes, and Martin Rosvall. “Comparing network covers using mutual information.” arXiv preprint arXiv:1202.0425 (2012).

fair nmi

Amelio, Alessia, and Clara Pizzuti. “Is normalized mutual information a fair measure for comparing community detection methods?.” Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ACM, 2015.

OvpNMI(sync=None, allnmi=None, omega=None, membership=None, verbose=None)

A wrapper for https://github.com/eXascaleInfolab/OvpNMI

Compare sets of clusters by their members (nodes) using various measures (NMI, Omega) and considering overlaps

Arguments

Usage: onmi [OPTIONS] clsfile1 clsfile2

-a, –allnmis

output all NMIs (sqrt and sum-denominators, LFK besides the max-denominator) (default=off)

-m, –membership=FLOAT

average expected membership of nodes in the clusters, > 0, typically >= 1 (default=`1’)

-o, –omega

print the Omega measure (can be slow) (default=off)

-v, –verbose

detailed debugging (default=off)

Reference

overlap nmi

McDaid, Aaron F., Derek Greene, and Neil Hurley. “Normalized mutual information to evaluate overlapping community finding algorithms.” arXiv preprint arXiv:1110.2515 (2011).

overlap nmi

Lancichinetti, Andrea, Santo Fortunato, and János Kertész. “Detecting the overlapping and hierarchical community structure in complex networks.” New Journal of Physics 11.3 (2009): 033015.

Omega

Collins, Linda M., and Clyde W. Dent. “Omega: A general formulation of the rand index of cluster recovery suitable for non-disjoint solutions.” Multivariate Behavioral Research 23.2 (1988): 231-242.

xmeasure_nmi(sync=None, all=False, membership=None, detailed=None)

A wrapper for https://github.com/eXascaleInfolab/xmeasures.

Normalized Mutual Information, normalized by either max or also sqrt, avg and min information content denominators.

ATTENTION: This is a standard NMI, which should be used ONLY for the HARD partitioning evaluation (non-overlapping clustering on a single resolution). It penalizes overlapping and multi-resolution structures.

Arguments

Usage: onmi [OPTIONS] clsfile1 clsfile2

-m, –membership=FLOAT

average expected membership of the nodes in the

clusters, > 0, typically >= 1. Used only to facilitate estimation of the nodes number on the containers preallocation if this number is not specified in the file header. (default=`1’)

-n, –nmi

evaluate NMI (Normalized Mutual Information),

applicable only to the non-overlapping clusters (default=off)

-a, –all

evaluate all NMIs using sqrt, avg and min

denominators besides the max one (default=off)

-d, –detailed

detailed (verbose) results output

(default=off)

xmeasure(sync=None, ovp=None, unique=None, omega=None, extended=None, f1=None, kind=False, membership=None, detailed=None)

Evaluating measures are:

  • OI: Omega Index (a fuzzy version of the Adjusted Rand Index, identical to the Fuzzy Rand Index), which yields the same value as Adjusted Rand Index when applied to the non-overlapping clusterings.

  • [M]F1: various [mean] F1 measures of the Greatest (Max) Match including the Average F1-Score (suggested by J. Leskovec) with optional weighting. NOTE: There are 3 matching policies available for each kind of F1. The most representative evaluation is performed by the F1p with combined matching policy (considers both micro and macro weighting).

Arguments

-O, –ovp

evaluate overlapping instead of the

multi-resolution clusters, where max matching for any shared member between R overlapping clusters is 1/R (the member is shared) instead of 1 (the member fully belongs to each [hierarchical sub]group) for the member belonging to R distinct clusters on R resolutions. NOTE: It has no effect for the Omega Index evaluation. (default=off)

-q, –unique

ensure on loading that all cluster members are

unique by removing all duplicates. (default=off)

-m, –membership=FLOAT

average expected membership of the nodes in the

clusters, > 0, typically >= 1. Used only to facilitate estimation of the nodes number on the containers preallocation if this number is not specified in the file header. (default=`1’)

-d, –detailed

detailed (verbose) results output

(default=off)

Omega Index options :

-o, –omega

evaluate Omega Index (a fuzzy version of the

Adjusted Rand Index, identical to the Fuzzy Rand Index and on the non-overlapping clusterings equals to ARI). (default=off)

-x, –extended

evaluate extended (Soft) Omega Index, which

does not excessively penalize distinctly shared nodes. (default=off)

Mean F1 options:

-f, –f1[=ENUM]

evaluate mean F1 of the [weighted] average of

the greatest (maximal) match by F1 or partial probability. NOTE: F1p <= F1h <= F1a, where:

  • p (F1p or Ph): Harmonic mean (F1) of two [weighted] averages of the Partial Probabilities, the most indicative as satisfies the largest number of the Formal Constraints (homogeneity, completeness and size/quantity except the rag bag in some cases);

  • h (F1h): Harmonic mean (F1) of two [weighted] averages of all local F1 (harmonic means of the Precision and Recall of the best matches of the clusters);

  • a (F1a): Arithmetic mean (average) of two [weighted] averages of all local F1, the least discriminative and satisfies the lowest number of the Formal Constraints.

(possible values=”partprob”,”harmonic”, “average” default=`partprob’)

-k, –kind[=ENUM]

kind of the matching policy:

  • w - Weighted by the number of nodes in each cluster

  • u - Unweighed, where each cluster is treated equally

  • c - Combined(w, u) using geometric mean (drops the value not so much as harmonic mean)

(possible values=”weighted”, “unweighed”, “combined” default=`weighted’)

Reference

F1a (Average F1-Score)

Yang, Jaewon, and Jure Leskovec. “Overlapping community detection at scale: a nonnegative matrix factorization approach.” Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013.

Omega

Collins, Linda M., and Clyde W. Dent. “Omega: A general formulation of the rand index of cluster recovery suitable for non-disjoint solutions.” Multivariate Behavioral Research 23.2 (1988): 231-242.