Dataset (aka graph and/or cluster)¶
-
class
gct.
Dataset
(name=None, description='', groundtruthObj=None, edgesObj=None, directed=False, weighted=False, overide=False, additional_meta=None, is_edge_mirrored=False)¶ A dataset represents a graph and, optionally, one or more ground truths. The graph is either directed or undirected, and either weighted or unweighted. Each ground truth is either disjoint or overlapping.
Every dataset has a name. Once created, it is saved on local disk (under $GCT_DATA).
Rather than constructing a dataset directly, one usually creates it with one of the helper functions, e.g.
gct.load_dataset(), gct.from_snap(), gct.from_igraph(), gct.from_networkx(), gct.from_networkit().
There are also built-in graphs that can be loaded directly. Use gct.list_dataset() to find them.
-
class
gct.
Result
(result)¶ The result of running a clustering algorithm. An algorithm may produce one or more clusterings.
-
class
gct.
Clustering
(clusteringobj)¶ A clustering (or partition) of a graph. It may be disjoint or overlapping. The parts of a clustering are also called clusters.
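As a rough illustration of the disjoint/overlapping distinction, the sketch below represents a clustering as a plain dict mapping a cluster id to its member nodes. That layout is an assumption made for this example only; it is not the internal representation of gct.Clustering. The helper reports whether any node belongs to more than one cluster.

```python
from collections import Counter


def is_overlapping(clusters):
    """Return True if any node appears in more than one cluster."""
    membership_count = Counter(
        node for nodes in clusters.values() for node in set(nodes)
    )
    return any(count > 1 for count in membership_count.values())


# Hypothetical clusterings: cluster id -> member nodes.
disjoint = {0: [1, 2, 3], 1: [4, 5]}
overlapping = {0: [1, 2, 3], 1: [3, 4, 5]}  # node 3 is in both clusters

print(is_overlapping(disjoint))
print(is_overlapping(overlapping))
```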
-
gct.
remove_results
(data_pattern='*', run_pattern='*', dry_run=True)¶ remove data and algorithm run results by patterns.
-
gct.
list_dataset
(pattern=None)¶ list datasets matching the pattern. The pattern supports Unix shell-style wildcards. For example:
>>> import gct
>>> for v in gct.list_dataset('*snap*'):
...     print(v)
...
('snap', 'com-DBLP')
('snap', 'com-LiveJournal')
lists two datasets in the format of (category, name), which can be used to invoke
gct.load_dataset()
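To make the wildcard behaviour concrete, here is a small sketch using Python's standard fnmatch module on a hypothetical catalogue of (category, name) pairs. Matching against a combined "category/name" label is an assumption of this example; the page does not say which fields gct compares the pattern against.

```python
from fnmatch import fnmatchcase

# Hypothetical catalogue of (category, name) pairs.
catalogue = [
    ("snap", "com-DBLP"),
    ("snap", "com-LiveJournal"),
    ("sample", "karate"),
]


def match_datasets(pattern, entries):
    """Keep entries whose 'category/name' label matches the wildcard pattern."""
    return [entry for entry in entries if fnmatchcase("/".join(entry), pattern)]


matches = match_datasets("*snap*", catalogue)
for entry in matches:
    print(entry)
```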
-
gct.
load_dataset
(name, cat=None)¶ load a named dataset. If the category is not specified, all available datasets will be searched.
- Parameters
name – name of the dataset
cat – category of the dataset
- Returns
a dataset
- Return type
gct.Dataset
- Raises
Exception
-
gct.
from_edgelist
(name, edgelist, groundtruth=None, directed=False, description='', overide=True)¶ create a graph from an edge list.
- Parameters
name – identifier of the dataset
edgelist – a 2d list (list of lists) or a 2d numpy ndarray in [[src node, target node, weight],…] format, or a dataframe with the columns “src”, “dest”, “weight”. The weight is optional; if it is missing, the graph is unweighted.
groundtruth – None, or a 2d list (list of lists) or a 2d numpy ndarray in [[node, cluster],…] format, or a dataframe with the columns “node”, “cluster”.
directed – whether the graph is directed
description – description of the dataset
overide – When true and the named dataset already exists, it will be deleted
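The input formats above can be sketched as follows. The toy edges, the ground-truth labels, the dataset description, and the helper names is_weighted and create_dataset are all made up for illustration; only the gct.from_edgelist call and its keyword names come from the signature on this page. gct is imported lazily so the data can be inspected without the library installed.

```python
# Edge list in [[src node, target node, weight], ...] format.
edgelist = [
    [0, 1, 1.0],
    [1, 2, 0.5],
    [2, 0, 2.0],
    [2, 3, 1.0],
]

# Ground truth in [[node, cluster], ...] format.
groundtruth = [[0, 0], [1, 0], [2, 0], [3, 1]]


def is_weighted(edges):
    """Treat an edge list as weighted when every row has a third column."""
    return all(len(row) == 3 for row in edges)


def create_dataset(name):
    """Create the dataset from the lists above (requires gct)."""
    import gct  # imported here so the rest of the sketch runs without gct

    return gct.from_edgelist(
        name,
        edgelist,
        groundtruth=groundtruth,
        directed=False,
        description="toy graph",
    )


print(is_weighted(edgelist))
```

Calling create_dataset("toy_graph") would then return the new dataset, assuming gct is installed and $GCT_DATA is writable.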
- Return type
-
gct.
from_networkx
(name, graph, weighted=False, data='weight', default=1, description='', overide=True)¶ create a dataset from a networkx graph
- Parameters
name – identifier of the dataset
graph – a networkx graph
weighted – whether the graph is weighted
data – the name of the edge data which is taken as weights.
default – default weight if networkx edge data is missing.
description – description of the dataset
overide – When true and the named dataset already exists, it will be deleted
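The data and default parameters read like the arguments of networkx's G.edges(data='weight', default=1). The pure-Python sketch below shows that fallback rule on hypothetical edges shaped like the output of G.edges(data=True). Whether gct extracts weights in exactly this way is an assumption; the page only states the intent of the two parameters.

```python
def edge_weights(edges, data="weight", default=1):
    """Yield (u, v, w), reading w from each edge's attribute dict or using the default."""
    for u, v, attrs in edges:
        yield u, v, attrs.get(data, default)


# Hypothetical edges in the (u, v, attribute dict) shape.
edges = [
    ("a", "b", {"weight": 2.5}),
    ("b", "c", {}),  # no weight stored, so the default is used
    ("c", "a", {"weight": 0.5}),
]

weighted_edges = list(edge_weights(edges))
for row in weighted_edges:
    print(row)
```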
- Return type
-
gct.
from_snap
(name, graph, description='', overide=False)¶ create a dataset from a SNAP graph
- Parameters
name – identifier of the dataset
graph – a SNAP graph
description – description of the dataset
overide – When true and the named dataset already exists, it will be deleted
- Return type
-
gct.
from_igraph
(name, graph, data='weight', description='', overide=True)¶ create a dataset from an igraph graph
- Parameters
name – identifier of the dataset
graph – an igraph graph
data – the name of the edge data which is taken as weights. Ignored for unweighted graphs.
description – description of the dataset
overide – When true and the named dataset already exists, it will be deleted
- Return type
-
gct.
to_networkx
(data)¶ convert the dataset to a networkx graph.
- Parameters
data –
gct.Dataset
- Return type
networkx graph
-
gct.
to_igraph
(data)¶ convert the dataset to an igraph graph.
- Parameters
data –
gct.Dataset
- Return type
igraph graph
-
gct.
to_networkit
(data)¶ convert the dataset to a networkit graph.
- Parameters
data –
gct.Dataset
- Return type
networkit graph
-
gct.
to_graph_tool
(data)¶ convert the dataset to a graph-tool graph.
(TBD) graph_tool support weights?
- Parameters
data –
gct.Dataset
- Return type
graph-tool graph
-
gct.
to_coo_adjacency_matrix
(data, simalarity=False, distance_fun=None)¶ convert the dataset to a sparse coo adjacency matrix.
- Parameters
data –
gct.Dataset
- Return type
scipy coo_matrix
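For readers unfamiliar with the COO (coordinate) layout, the sketch below builds the three parallel lists (row, col, data) that such a matrix consists of. Storing each undirected edge in both directions is an assumption of this example, not documented behaviour of gct.to_coo_adjacency_matrix.

```python
def to_coo_triplets(edges, directed=False):
    """Return (row, col, data) lists; undirected edges are stored in both directions."""
    row, col, data = [], [], []
    for src, dst, weight in edges:
        row.append(src)
        col.append(dst)
        data.append(weight)
        if not directed and src != dst:
            row.append(dst)
            col.append(src)
            data.append(weight)
    return row, col, data


# Hypothetical weighted edges (src, dst, weight) on nodes 0..2.
edges = [(0, 1, 1.0), (1, 2, 0.5)]
row, col, data = to_coo_triplets(edges)
print(row, col, data)

# With SciPy installed, the triplets can be passed to
# scipy.sparse.coo_matrix((data, (row, col)), shape=(3, 3)).
```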
-
gct.
as_undirected
(data, newname, description='', overide=False)¶ convert the dataset to an undirected dataset
- Parameters
data –
gct.Dataset
newname – the name of the new dataset
description – description of the dataset
overide – When true and the named dataset already exists, it will be deleted
- Return type
-
gct.
as_unweighted
(data, newname, description='', overide=False)¶ convert the dataset to an unweighted dataset
- Parameters
data –
gct.Dataset
newname – the name of the new dataset
description – description of the dataset
overide – When true and the named dataset already exists, it will be deleted
- Return type
-
gct.
as_unweighted_undirected
(data, newname, description='', overide=False)¶ convert the dataset to an undirected, unweighted dataset
- Parameters
data –
gct.Dataset
newname – the name of the new dataset
description – description of the dataset
overide – When true and the named dataset already exists, it will be deleted
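Conceptually, the conversion can be pictured on a plain edge list: drop the weight column and keep one copy of each unordered node pair. The sketch below does exactly that. It is an illustration of the idea under that assumption, not the code gct runs, and the page does not say how gct treats duplicate or reciprocal edges.

```python
def unweighted_undirected(edges):
    """Drop weights and keep one copy of each unordered node pair."""
    seen = set()
    result = []
    for src, dst, *_ in edges:
        pair = (min(src, dst), max(src, dst))
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Hypothetical directed, weighted edges: (0, 1) and (1, 0) collapse into one edge.
directed_weighted = [(0, 1, 2.0), (1, 0, 0.5), (1, 2, 1.0)]
simple_edges = unweighted_undirected(directed_weighted)
print(simple_edges)
```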
- Return type
-
gct.
local_graph_exists
(name)¶ check if dataset ‘name’ exists locally.
- Parameters
name – name of a dataset
- Return type
bool
-
gct.
load_local_graph
(name)¶ load a local dataset
- Parameters
name – name of a dataset
-
gct.
list_local_graph
()¶ list all local datasets
-
gct.
remove_local_graph
(name, rm_graph_data=True, rm_clustering_result=True)¶ remove a local dataset
- Parameters
name – name of a dataset
rm_graph_data – remove local graph data
rm_clustering_result – remove clustering results associated with the graph.
-
gct.
list_clustering_result
(dataset_name)¶ list clustering results associated with the dataset
- Parameters
dataset_name – name of a dataset
-
gct.
list_all_clustering_results
(print_format=False)¶ list clustering results associated with all available datasets
-
gct.
remove_data
(data_pattern='*', with_results=True, dry_run=True)¶ remove data (and algorithm run results) by data name pattern.
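The dry_run flag suggests a preview-then-delete workflow. The sketch below shows that pattern on a hypothetical in-memory dict standing in for the datasets on disk; the helper remove_by_pattern and the dataset names are invented for illustration, and what gct.remove_data prints or returns in dry-run mode is not specified on this page.

```python
from fnmatch import fnmatchcase


def remove_by_pattern(store, pattern="*", dry_run=True):
    """Return the matching names; delete them from store only when dry_run is False."""
    matched = [name for name in store if fnmatchcase(name, pattern)]
    if not dry_run:
        for name in matched:
            del store[name]
    return matched


# Hypothetical in-memory stand-in for the datasets on disk.
store = {"lfr_small": "...", "lfr_large": "...", "karate": "..."}

preview = remove_by_pattern(store, "lfr_*")  # dry run: nothing is deleted
size_after_preview = len(store)
removed = remove_by_pattern(store, "lfr_*", dry_run=False)  # actually deletes
print(preview, size_after_preview, sorted(store))
```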
-
gct.
generate_undirected_unweighted_random_graph_LFR
(name, N, k, maxk, mu, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)¶ Lancichinetti-Fortunato-Radicchi benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files
Parameters:
-N [number of nodes]
-k [average degree]
-maxk [maximum degree]
-mu [mixing parameter]
-t1 [minus exponent for the degree sequence]
-t2 [minus exponent for the community size distribution]
-minc [minimum for the community sizes]
-maxc [maximum for the community sizes]
-on [number of overlapping nodes]
-om [number of memberships of the overlapping nodes]
-C [average clustering coefficient]
-N, -k, -maxk, -mu have to be specified. For the others, the program can use default values:
t1=2, t2=1, on=0, om=0; minc and maxc will be chosen close to the degree sequence extremes.
If you don’t specify -C, the rewiring process for raising the average clustering coefficient will not be performed.
If you set a parameter twice, the latter value will be used.
-------------------- Other options --------------------
To have a random network use: -rand
Using this option will set mu=0 and minc=maxc=N, i.e. there will be only one community.
Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external degree/total degree is bounded above (below) by the mixing parameter.
-------------------- Examples --------------------
Example1: ./benchmark -N 1000 -k 15 -maxk 50 -mu 0.1 -minc 20 -maxc 50 -C 0.7
Example2: ./benchmark -f flags.dat -t1 3
- Reference
Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
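The parameter list above uses the command-line flags of the original benchmark, while the Python function takes keyword arguments with the same names (per its signature). As a sketch, the helper below translates the flags of Example1 into keyword arguments. The helper names flags_to_kwargs and generate are invented for illustration, and the int/float conversion is an assumption about what the wrapper accepts.

```python
def flags_to_kwargs(command):
    """Turn '-N 1000 -k 15 ...' style flags into a dict of keyword arguments."""
    tokens = command.split()
    kwargs = {}
    for flag, value in zip(tokens[::2], tokens[1::2]):
        number = float(value)
        # Keep whole numbers as int (e.g. N, k) and the rest as float (e.g. mu).
        kwargs[flag.lstrip("-")] = int(number) if number.is_integer() else number
    return kwargs


params = flags_to_kwargs("-N 1000 -k 15 -maxk 50 -mu 0.1 -minc 20 -maxc 50 -C 0.7")
print(params)


def generate(name):
    """Call the generator with the parsed parameters (requires gct)."""
    import gct  # imported here so the parsing above runs without gct

    return gct.generate_undirected_unweighted_random_graph_LFR(name, **params)
```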
-
gct.
generate_directed_unweighted_random_graph_LFR
(name, N, k=None, maxk=None, mu=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)¶ Lancichinetti-Fortunato-Radicchi benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files
Parameters:
-N [number of nodes]
-k [average in-degree]
-maxk [maximum in-degree]
-mu [mixing parameter]
-t1 [minus exponent for the degree sequence]
-t2 [minus exponent for the community size distribution]
-minc [minimum for the community sizes]
-maxc [maximum for the community sizes]
-on [number of overlapping nodes]
-om [number of memberships of the overlapping nodes]
-N, -k, -maxk, -mu have to be specified. For the others, the program can use default values:
t1=2, t2=1, on=0, om=0; minc and maxc will be chosen close to the degree sequence extremes.
If you set a parameter twice, the latter value will be used.
-------------------- Other options --------------------
To have a random network use: -rand
Using this option will set mu=0 and minc=maxc=N, i.e. there will be only one community.
Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external in-degree/total in-degree is bounded above (below) by the mixing parameter.
-------------------- Examples --------------------
Example1: ./benchmark -N 1000 -k 15 -maxk 50 -mu 0.1 -minc 20 -maxc 50
Example2: ./benchmark -f flags.dat -t1 3
- Reference
Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
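The ratio of external in-degree to total in-degree mentioned above can be computed per node as in the sketch below. The toy arcs and community labels are hypothetical, and the example assumes disjoint communities (one label per node); it shows the quantity that the mixing parameter bounds, not how the benchmark generates graphs.

```python
def in_degree_mixing(edges, community):
    """Per node: in-edges arriving from other communities / all in-edges."""
    external, total = {}, {}
    for src, dst in edges:
        total[dst] = total.get(dst, 0) + 1
        if community[src] != community[dst]:
            external[dst] = external.get(dst, 0) + 1
    return {node: external.get(node, 0) / total[node] for node in total}


# Hypothetical toy graph: arcs (src, dst) and one community label per node.
edges = [(0, 1), (2, 1), (3, 1), (1, 0), (3, 2), (2, 3)]
community = {0: "A", 1: "A", 2: "B", 3: "B"}

ratios = in_degree_mixing(edges, community)
print(ratios)
```

Node 1 receives one arc from its own community and two from the other one, so its ratio is 2/3; the remaining nodes only receive arcs from inside their community.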
-
gct.
generate_undirected_weighted_random_graph_LFR
(name, N, k=None, maxk=None, mut=None, muw=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)¶ Lancichinetti-Fortunato-Radicchi benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files
Parameters:
-N [number of nodes]
-k [average degree]
-maxk [maximum degree]
-mut [mixing parameter for the topology]
-muw [mixing parameter for the weights]
-beta [exponent for the weight distribution]
-t1 [minus exponent for the degree sequence]
-t2 [minus exponent for the community size distribution]
-minc [minimum for the community sizes]
-maxc [maximum for the community sizes]
-on [number of overlapping nodes]
-om [number of memberships of the overlapping nodes]
-C [average clustering coefficient]
-N, -k, -maxk, -muw have to be specified. For the others, the program can use default values:
t1=2, t2=1, on=0, om=0, beta=1.5, mut=muw; minc and maxc will be chosen close to the degree sequence extremes.
If you don’t specify -C, the rewiring process for raising the average clustering coefficient will not be performed.
If you set a parameter twice, the latter value will be used.
-------------------- Other options --------------------
To have a random network use: -rand
Using this option will set muw=0, mut=0, and minc=maxc=N, i.e. there will be only one community.
Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external degree/total degree is bounded above (below) by the mixing parameter.
-------------------- Examples --------------------
Example1: ./benchmark -N 1000 -k 15 -maxk 50 -muw 0.1 -minc 20 -maxc 50
Example2: ./benchmark -f flags.dat -t1 3
- Reference
Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
-
gct.
generate_directed_weighted_random_graph_LFR
(name, N, k=None, maxk=None, mut=None, muw=None, beta=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)¶ Lancichinetti-Fortunato-Radicchi benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files
Parameters:
-N [number of nodes]
-k [average in-degree]
-maxk [maximum in-degree]
-mut [mixing parameter for the topology]
-muw [mixing parameter for the weights]
-beta [exponent for the weight distribution]
-t1 [minus exponent for the degree sequence]
-t2 [minus exponent for the community size distribution]
-minc [minimum for the community sizes]
-maxc [maximum for the community sizes]
-on [number of overlapping nodes]
-om [number of memberships of the overlapping nodes]
-N, -k, -maxk, -muw have to be specified. For the others, the program can use default values:
t1=2, t2=1, on=0, om=0, beta=1.5, mut=muw; minc and maxc will be chosen close to the degree sequence extremes.
If you set a parameter twice, the latter value will be used.
-------------------- Other options --------------------
To have a random network use: -rand
Using this option will set muw=0, mut=0, and minc=maxc=N, i.e. there will be only one community.
Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external in-degree/total in-degree is bounded above (below) by the mixing parameter.
-------------------- Examples --------------------
Example1: ./benchmark -N 1000 -k 15 -maxk 50 -muw 0.1 -minc 20 -maxc 50
Example2: ./benchmark -f flags.dat -t1 3
- Reference
Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
-
gct.
generate_undirected_unweighted_hier_random_graph_LFR
(name, N, k=None, maxk=None, mu1=None, mu2=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, minC=None, maxC=None, seed=None, overide=False)¶ Lancichinetti-Fortunato-Radicchi benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files
Parameters:
-N [number of nodes]
-k [average degree]
-maxk [maximum degree]
-t1 [minus exponent for the degree sequence]
-t2 [minus exponent for the community size distribution]
-minc [minimum for the micro community sizes]
-maxc [maximum for the micro community sizes]
-on [number of overlapping nodes]
-om [number of memberships of the overlapping nodes]
-minC [minimum for the macro community size]
-maxC [maximum for the macro community size]
-mu1 [mixing parameter for the macro communities (see Readme file)]
-mu2 [mixing parameter for the micro communities (see Readme file)]
-------------------- Examples --------------------
Example2: ./hbenchmark -f flags.dat
./hbenchmark -N 10000 -k 20 -maxk 50 -mu2 0.3 -minc 20 -maxc 50 -minC 100 -maxC 1000 -mu1 0.1
- Reference
Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
-
gct.
generate_random_ovp_graph_LFR
(name, N, k=None, maxk=None, mut=None, muw=None, beta=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, a=0, weighted=False, seed=None, overide=False)¶ Extended version of the Lancichinetti-Fortunato-Radicchi Benchmark for Weighted Overlapping networks to evaluate clustering algorithms using generated ground-truth communities.
Refer to https://github.com/eXascaleInfolab/LFR-Benchmark_UndirWeightOvp
Parameters:
-N [number of nodes]
-k [average degree]
-maxk [maximum degree]
-mut [mixing parameter for the topology]
-muw [mixing parameter for the weights]
-beta [exponent for the weight distribution]
-t1 [minus exponent for the degree sequence]
-t2 [minus exponent for the community size distribution]
-minc [minimum for the community sizes]
-maxc [maximum for the community sizes]
-on [number of overlapping nodes]
-om [number of memberships of the overlapping nodes]
-C [average clustering coefficient]
-cnl [output communities as strings of nodes (input format for NMI evaluation)]
-name [base name for the output files]. It is used for the network, communities and statistics; file extensions are added automatically:
.nsa - network, represented by space/tab separated arcs
.nse - network, represented by space/tab separated edges
{.cnl, .nmc} - communities, represented by node lists (‘.cnl’) if ‘-cnl’ is used, otherwise as node memberships in communities (‘.nmc’)
.nst - network statistics
-seed [file name of the random seed, default: seed.txt]
-a [{0, 1} yield directed network (1 - arcs) rather than undirected (0 - edges), default: 0 - edges]
- Reference
Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
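The file-naming rule described under -name can be summarised in a few lines. The helper below is hypothetical and only encodes the extensions listed above: it assumes that -a 1 selects the arc file (.nsa) and that -cnl selects the node-list file (.cnl). Whether gct exposes these files to the caller is not stated on this page.

```python
def expected_output_files(base, arcs=False, cnl=False):
    """List the file names the benchmark would derive from a base name."""
    network = base + (".nsa" if arcs else ".nse")
    communities = base + (".cnl" if cnl else ".nmc")
    return [network, communities, base + ".nst"]


files = expected_output_files("bench", arcs=True, cnl=True)
print(files)
```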