Metrics¶
-
class
gct.
GraphMetrics
(data)¶ metrics for a graph. e.g. density, diameter etc.
The class is here just for convenience. It is not powerful and efficient. One may analyze the graph use other graph libraries (e.g. SNAP, igraph, networkit, etc) using converting functions.
-
__init__
(data)¶ - Parameters
data – a
gct.Dataset
object
-
property
directed
¶ Return True if the graph is directed
-
property
weighted
¶ Return True if the graph is weighted
-
property
num_edges
¶ Return number of edges
-
property
num_vertices
¶ Return number of vertices (nodes)
-
property
density1
¶ Return density of \(\frac{|E|}{|V|}\)
-
property
density
¶ Return density of \(\frac{|E|}{|all\ possible\ edges|}\).
-
property
degrees
¶ return weighted node degrees
-
property
unweighted_degrees
¶ return unweighted node degrees (e.g. number of out edges)
-
-
class
gct.
SNAPGraphMetrics
(data)¶ metrics for a graph using SNAP
-
__init__
(data)¶ - Parameters
data – a
gct.Dataset
object
-
property
degree_histogram
¶ return list[(degree,num_node)]
-
property
degree_in_histogram
¶ return list[(degree,num_node)] of in degree
-
property
degree_out_histogram
¶ return list[(degree,num_node)] of out degree
-
property
node_degrees
¶ return list[(node,degree)]
-
property
num_self_edges
¶ return number of self edges
-
property
scc_distribution
¶ return list[(scc_size, count)] for strongly connected components
-
property
wcc_distribution
¶ return list[(cc_size, count)] for weakly connected components
-
property
edge_bridges
¶ return list[edge] where edge is a bridge.
An edge is a bridge if, when removed, increases the number of connected components.
-
effect_diameter
(n_node=100, isDir=False)¶ Returns the (approximation of the) Effective Diameter (90-th percentile of the distribution of shortest path lengths) of a graph
- Parameters
n_node – number of nodes to sample
isDir – consider direct or not
-
diameter
(n_node=100, isDir=False)¶ Computes the diameter, or ‘longest shortest path’, of a Graph
This diameter is approximate, as it is calculated with an n_node number of random starting nodes.
- Parameters
n_node – number of nodes to sample
isDir – consider direct or not
-
sample_shortest_path
(n_node=100, isDir=False)¶ sample diameter, e.g. ‘shortest path’, of a Graph
- Parameters
n_node – number of nodes to sample
isDir – consider direct or not
-
sample_degree_centrality
(n_node=100)¶ Degree centrality of a node is defined as its degree/(N-1), where N is the number of nodes in the network.
- Parameters
n_node – number of nodes to sample
-
sample_betweenness_centrality
(quality=1, isDir=False)¶ Computes (approximate) Node and Edge Betweenness Centrality based on a sample
- Parameters
quality – Quality of the approximation. 1.0 gives exact betweenness values.
isDir – consider direct or not
-
sample_closeness_centrality
(n_node=100, normalized=True, isDir=False)¶ Returns closeness centrality sample in Graph. Closeness centrality is equal to 1/farness centrality.
- Parameters
n_node – number of nodes to sample
normalized – Output should be normalized (True) or not (False).
isDir – consider direct or not
-
sample_farness_centrality
(n_node=100, normalized=True, isDir=False)¶ Returns farness centrality sample in Graph. Farness centrality of a node is the average shortest path length to all other nodes that reside in the same connected component as the given node.
- Parameters
n_node – number of nodes to sample
normalized – Output should be normalized (True) or not (False).
isDir – consider direct or not
-
page_rank_score
(C=0.85, Eps=0.0001, MaxIter=100)¶ Computes the PageRank score of every node in Graph
- Parameters
C – Damping factor.
Eps – Convergence difference.
MaxIter – Maximum number of iterations.
Computes the Hubs and Authorities score of every node in Graph
return tuple of hubs score and authorrities score :param MaxIter: Maximum number of iterations.
-
sample_node_eccentricity
(n_node=100, normalized=True, isDir=False)¶ Returns node eccentricity, the largest shortest-path distance from the node to any other node in the Graph.
return tuple of hubs score and authorrities score :param MaxIter: Maximum number of iterations.
-
eigenvector_centrality
(Eps=0.0001, MaxIter=100)¶ Computes eigenvector centrality of all nodes in Graph. Eigenvector Centrality of a node N is defined recursively as the average of centrality values of N’s neighbors in the network.
- Parameters
Eps – Convergence difference.
MaxIter – Maximum number of iterations.
-
average_clustering_coefficient
(sample_node=False)¶ Computes the average clustering coefficient as defined in Watts and Strogatz, Collective dynamics of ‘small-world’ networks
- Parameters
sample_node – compute clustering coefficient only for a random sample of SampleNodes nodes. Useful for approximate but quick computations.
-
distribution_clustering_coefficient
(sample_node=False)¶ Computes the distribution of clustering coefficient as defined in Watts and Strogatz, Collective dynamics of ‘small-world’ networks
- Parameters
sample_node – compute clustering coefficient only for a random sample of SampleNodes nodes. Useful for approximate but quick computations.
-
k_core
(K)¶ Returns the K-core of the graph Graph. If the core of order K does not exist, the function returns an empty graph.
-
-
class
gct.
GraphClusterMetrics
(data, clusteringobj)¶ metrics for a clustering of a graph. e.g. modularity.
-
__init__
(data, clusteringobj)¶ - Parameters
data – a
gct.Dataset
objectclusteringobj – a
gct.Clustering
object or refer to the groundtruth parameter ofgct.from_edgelist()
-
property
num_edges
¶ number of edges of the graph
-
property
num_vertices
¶ number of vertices (nodes) of the graph
-
property
node_degrees
¶ degrees for all the nodes
-
property
node_weights
¶ weight sum for each node
-
property
unweighted_degrees
¶ unweighted degrees of nodes
-
property
weighted_degrees
¶ weighted degrees of nodes
-
property
num_clusters
¶ number of partitions
-
property
cluster_indexes
¶ cluster identifiers
-
property
cluster_sizes
¶ return cluster size for each cluster
-
property
cluster_expansions
¶ return expansions for each cluster where expansions is
\[\frac{c_s}{n_s}\]where \(c_s\) is the cluster size of cluster \(s\) and \(n_s\) is the number of vertices for the cluster.
-
property
cluster_cut_ratios
¶ return cut ratio for each cluster which is
\[\frac{c_s}{n_s(n-n_s)}\]where \(c_s\) is the cluster size of cluster \(s\), n is the number of node of graph and \(n_s\) is the number of vertices for the cluster.
-
property
intra_cluster_densities
¶ return internal cluster density for each cluster which is
\[\frac{m_s}{n_s(n-n_s)/2}\]where \(m_s\) is number of edges of cluster \(s\), n is the number of node of graph and \(n_s\) is the number of vertices for the cluster.
-
property
inter_cluster_density
¶ return internal cluster density for the clustring which is
\[\frac{|\{ (u,v) | u \in C_i, v \in C_j, i \neq j \}|}{n(n-1)/2 + \sum_{s=1}^{k} n_s(n-n_s)/2}\]where \(m_s\) is number of edges of cluster \(s\), n is the number of node of graph, \(n_s\) is the number of vertices for the cluster, \(C_i\) is the set of nodes in i-th cluster
-
property
relative_cluster_densities
¶ return relative cluster density for each cluster which is
\[\frac{deg_{intra}}{deg_{intra}+deg_{out}}\]where the degree is weighted degree.
-
property
modularity
¶ return modularity of the clustering for weighted or unweighted graph. For unweighted graph it is
\[Q=1/(2m) \sum_{i,j} (A_{ij}- \frac{k_i k_j}{2m}) 1_{(i=j)}\]where m is the number of edges, A is the adjacency matrix, 1 is indicator function, \(k_i k_j\) is the expected number of random edges between the two nodes.
-
property
conductance
¶ return conductance of a cluster S, for unweighted graph which is
\[\frac{Cut_S}{\min (deg_S, deg_{(V \setminus S}))}\]where S is a cluster
-
property
normalized_cut
¶ normalized cut
-
property
cluster_max_out_degree_fraction
¶ max out degree fraction
-
property
cluster_avg_out_degree_fraction
¶ average out degree fraction
-
property
cluster_flake_out_degree_fraction
¶ Flake out degree fraction
-
property
separability
¶ separability \(m_s/c_s\)
-
property
cluster_clustering_coefficient
¶ (global) clustering coefficient
-
property
cluster_local_clustering_coefficient
¶ local clustering coefficient
-
-
class
gct.
ClusterComparator
(clusteringobj1, clusteringobj2)¶ metrics for two clustering. e.g. nmi, overlap nmi etc. In case that some metrics requires ground truth, make ground truth as the first parameter.
When calculating a metric that does not support overlapping, the overlapped nodes are removed or retrun None
-
property
ground_truth
¶ Alias for for the first Clustering of constructor
-
property
predition
¶ Alias for for the second Clustering of constructor
-
sklean_nmi
()¶ sklearn normalized_mutual_info_score
-
sklean_ami
()¶ sklearn adjusted_mutual_info_score
-
sklean_ars
()¶ sklearn adjusted_rand_score
-
sklean_completeness
()¶ sklearn completeness_score
-
GenConvNMI
(sync=None, id_remap=None, nmis=None, fnmi=True, risk=None, error=None, fast=None, membership=None, retain_dups=None)¶ A wrapper for https://github.com/eXascaleInfolab/GenConvNMI
Arguments
Usage: ./gecmi [options] <clusters1> <clusters2> clusters - clusters file in the CNL format, where each line lists space separated ids of the cluster members
-h [ –help ]
produce help message
–input arg
name of the input files
-s [ –sync ]
synchronize the node base omitting the non-matching nodes for the fair evaluation. The node base is selected automatically as a clustering having the least number of nodes.
-i [ –id-remap ]
remap ids allowing arbitrary input ids (non-contiguous ranges), otherwise ids should form a solid range and start from 0 or 1
-n [ –nmis ]
output both NMI [max] and NMI_sqrt
-f [ –fnmi ]
evaluate also FNMI, includes ‘-n’
-r [ –risk ] arg (=0.01)
probability of value being outside
-e [ –error ] arg (=0.01)
admissible error
-a [ –fast ]
apply fast approximate evaluations that are less accurate, but much faster on large networks
-m [–membership] arg (=1)
average expected membership of nodes in the clusters, > 0, typically >= 1
-d [ –retain-dups ]
retain duplicated clusters if any instead of filtering them out (not recommended)
Reference
overlap nmi
Esquivel, Alcides Viamontes, and Martin Rosvall. “Comparing network covers using mutual information.” arXiv preprint arXiv:1202.0425 (2012).
fair nmi
Amelio, Alessia, and Clara Pizzuti. “Is normalized mutual information a fair measure for comparing community detection methods?.” Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ACM, 2015.
-
OvpNMI
(sync=None, allnmi=None, omega=None, membership=None, verbose=None)¶ A wrapper for https://github.com/eXascaleInfolab/OvpNMI
Compare sets of clusters by their members (nodes) using various measures (NMI, Omega) and considering overlaps
Arguments
Usage: onmi [OPTIONS] clsfile1 clsfile2
-a, –allnmis
output all NMIs (sqrt and sum-denominators, LFK besides the max-denominator) (default=off)
-m, –membership=FLOAT
average expected membership of nodes in the clusters, > 0, typically >= 1 (default=`1’)
-o, –omega
print the Omega measure (can be slow) (default=off)
-v, –verbose
detailed debugging (default=off)
Reference
overlap nmi
McDaid, Aaron F., Derek Greene, and Neil Hurley. “Normalized mutual information to evaluate overlapping community finding algorithms.” arXiv preprint arXiv:1110.2515 (2011).
overlap nmi
Lancichinetti, Andrea, Santo Fortunato, and János Kertész. “Detecting the overlapping and hierarchical community structure in complex networks.” New Journal of Physics 11.3 (2009): 033015.
Omega
Collins, Linda M., and Clyde W. Dent. “Omega: A general formulation of the rand index of cluster recovery suitable for non-disjoint solutions.” Multivariate Behavioral Research 23.2 (1988): 231-242.
-
xmeasure_nmi
(sync=None, all=False, membership=None, detailed=None)¶ A wrapper for https://github.com/eXascaleInfolab/xmeasures.
Normalized Mutual Information, normalized by either max or also sqrt, avg and min information content denominators.
ATTENTION: This is a standard NMI, which should be used ONLY for the HARD partitioning evaluation (non-overlapping clustering on a single resolution). It penalizes overlapping and multi-resolution structures.
Arguments
Usage: onmi [OPTIONS] clsfile1 clsfile2
-m, –membership=FLOAT
- average expected membership of the nodes in the
clusters, > 0, typically >= 1. Used only to facilitate estimation of the nodes number on the containers preallocation if this number is not specified in the file header. (default=`1’)
-n, –nmi
- evaluate NMI (Normalized Mutual Information),
applicable only to the non-overlapping clusters (default=off)
-a, –all
- evaluate all NMIs using sqrt, avg and min
denominators besides the max one (default=off)
-d, –detailed
- detailed (verbose) results output
(default=off)
-
xmeasure
(sync=None, ovp=None, unique=None, omega=None, extended=None, f1=None, kind=False, membership=None, detailed=None)¶ Evaluating measures are:
OI: Omega Index (a fuzzy version of the Adjusted Rand Index, identical to the Fuzzy Rand Index), which yields the same value as Adjusted Rand Index when applied to the non-overlapping clusterings.
[M]F1: various [mean] F1 measures of the Greatest (Max) Match including the Average F1-Score (suggested by J. Leskovec) with optional weighting. NOTE: There are 3 matching policies available for each kind of F1. The most representative evaluation is performed by the F1p with combined matching policy (considers both micro and macro weighting).
Arguments
-O, –ovp
- evaluate overlapping instead of the
multi-resolution clusters, where max matching for any shared member between R overlapping clusters is 1/R (the member is shared) instead of 1 (the member fully belongs to each [hierarchical sub]group) for the member belonging to R distinct clusters on R resolutions. NOTE: It has no effect for the Omega Index evaluation. (default=off)
-q, –unique
- ensure on loading that all cluster members are
unique by removing all duplicates. (default=off)
-m, –membership=FLOAT
- average expected membership of the nodes in the
clusters, > 0, typically >= 1. Used only to facilitate estimation of the nodes number on the containers preallocation if this number is not specified in the file header. (default=`1’)
-d, –detailed
- detailed (verbose) results output
(default=off)
Omega Index options :
-o, –omega
- evaluate Omega Index (a fuzzy version of the
Adjusted Rand Index, identical to the Fuzzy Rand Index and on the non-overlapping clusterings equals to ARI). (default=off)
-x, –extended
- evaluate extended (Soft) Omega Index, which
does not excessively penalize distinctly shared nodes. (default=off)
Mean F1 options:
-f, –f1[=ENUM]
- evaluate mean F1 of the [weighted] average of
the greatest (maximal) match by F1 or partial probability. NOTE: F1p <= F1h <= F1a, where:
p (F1p or Ph): Harmonic mean (F1) of two [weighted] averages of the Partial Probabilities, the most indicative as satisfies the largest number of the Formal Constraints (homogeneity, completeness and size/quantity except the rag bag in some cases);
h (F1h): Harmonic mean (F1) of two [weighted] averages of all local F1 (harmonic means of the Precision and Recall of the best matches of the clusters);
a (F1a): Arithmetic mean (average) of two [weighted] averages of all local F1, the least discriminative and satisfies the lowest number of the Formal Constraints.
(possible values=”partprob”,”harmonic”, “average” default=`partprob’)
-k, –kind[=ENUM]
kind of the matching policy:
w - Weighted by the number of nodes in each cluster
u - Unweighed, where each cluster is treated equally
c - Combined(w, u) using geometric mean (drops the value not so much as harmonic mean)
(possible values=”weighted”, “unweighed”, “combined” default=`weighted’)
Reference
F1a (Average F1-Score)
Yang, Jaewon, and Jure Leskovec. “Overlapping community detection at scale: a nonnegative matrix factorization approach.” Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013.
Omega
Collins, Linda M., and Clyde W. Dent. “Omega: A general formulation of the rand index of cluster recovery suitable for non-disjoint solutions.” Multivariate Behavioral Research 23.2 (1988): 231-242.
-
property