Dataset (aka graph and/or cluster)

class gct.Dataset(name=None, description='', groundtruthObj=None, edgesObj=None, directed=False, weighted=False, overide=False, additional_meta=None, is_edge_mirrored=False)

A dataset represents a graph and, optionally, one or more ground truths. The graph is either directed or undirected, and either weighted or unweighted. A ground truth is either disjoint or overlapping.

Every dataset has a name. Once created, it is saved on local disk (under $GCT_DATA).

Rather than constructing a dataset directly, one usually creates it with a helper such as gct.load_dataset(), gct.from_snap(), gct.from_igraph(), gct.from_networkx(), or gct.from_networkit().

There are also built-in graphs that can be loaded directly. Use gct.list_dataset() to find them.

class gct.Result(result)

The result of running a clustering algorithm. An algorithm may produce one or multiple clusterings.

class gct.Clustering(clusteringobj)

A clustering (or partition) of a graph. It may be disjoint or overlapping. The parts of a clustering are also called clusters.

gct.remove_results(data_pattern='*', run_pattern='*', dry_run=True)

Remove data and algorithm run results by pattern.

gct.list_dataset(pattern=None)

List the datasets matching the pattern. The pattern supports Unix shell-style wildcards. For example:

>>> import gct
>>> for v in gct.list_dataset('*snap*'):
...     print (v)
... 
('snap', 'com-DBLP')
('snap', 'com-LiveJournal')

The call above lists two datasets in (category, name) format; these values can be passed to gct.load_dataset().
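To make the wildcard semantics concrete, the sketch below applies Python's standard fnmatch module to a small hand-written catalogue of (category, name) tuples. The catalogue, the helper name match_datasets, and the choice to match the pattern against a "category/name" string are all assumptions made for illustration; the string that gct actually matches against may differ.

```python
from fnmatch import fnmatch

# Hypothetical catalogue; the real list comes from gct.list_dataset().
CATALOGUE = [
    ("snap", "com-DBLP"),
    ("snap", "com-LiveJournal"),
    ("sample", "karate"),
]


def match_datasets(pattern=None, catalogue=CATALOGUE):
    """Return the (category, name) pairs whose 'category/name' string matches."""
    if pattern is None:
        return list(catalogue)
    return [(cat, name) for cat, name in catalogue
            if fnmatch(f"{cat}/{name}", pattern)]


matched = match_datasets("*snap*")
for entry in matched:
    print(entry)
```

In fnmatch patterns, * matches any run of characters, ? matches a single character, and [seq] matches any character in seq.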

gct.load_dataset(name, cat=None)

Load a named dataset. If the category is not specified, all available datasets are searched.

Parameters
  • name – name of the dataset

  • cat – category of the dataset

Return type

gct.Dataset

Raises

Exception

gct.from_edgelist(name, edgelist, groundtruth=None, directed=False, description='', overide=True)

Create a dataset from an edge list.

Parameters
  • name – identifier of the dataset

  • edgelist – a 2D list (list of lists) or a 2D numpy ndarray in [[src node, target node, weight], …] format, or a dataframe with the columns “src”, “dest”, “weight”. The weight is optional; if it is missing, the graph is unweighted.

  • groundtruth – None, a 2D list (list of lists) or a 2D numpy ndarray in [[node, cluster], …] format, or a dataframe with the columns “node”, “cluster”.

  • directed – whether the graph is directed

  • description – description of the dataset

  • overide – When true and the named dataset already exists, it will be deleted

Return type

gct.Dataset
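The input formats described above can be sketched with plain Python lists. The graph, the dataset name "two_triangles", and the helpers overlapping_nodes and build_dataset are hypothetical and exist only for this example; the gct call itself uses the signature documented above and is wrapped in a function so that the data preparation can be read and run without gct installed.

```python
def overlapping_nodes(groundtruth):
    """Return the nodes that appear in more than one cluster of a [[node, cluster], ...] list."""
    clusters_of = {}
    for node, cluster in groundtruth:
        clusters_of.setdefault(node, set()).add(cluster)
    return sorted(n for n, cs in clusters_of.items() if len(cs) > 1)


# Two triangles joined by one edge, as [src node, target node, weight] rows.
edgelist = [
    [0, 1, 1.0], [1, 2, 1.0], [0, 2, 1.0],
    [3, 4, 2.0], [4, 5, 2.0], [3, 5, 2.0],
    [2, 3, 0.5],
]

# Disjoint ground truth as [node, cluster] rows: one cluster per triangle.
groundtruth = [[0, 0], [1, 0], [2, 0], [3, 1], [4, 1], [5, 1]]


def build_dataset(name="two_triangles"):
    """Create the dataset with gct (requires gct to be installed)."""
    import gct  # imported lazily so the helpers above work without gct
    return gct.from_edgelist(name, edgelist, groundtruth=groundtruth,
                             directed=False,
                             description="two joined triangles",
                             overide=True)
```

Calling overlapping_nodes(groundtruth) before build_dataset() is a cheap way to confirm whether a ground truth is disjoint (empty result) or overlapping.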

gct.from_networkx(name, graph, weighted=False, data='weight', default=1, description='', overide=True)

Create a dataset from a networkx graph.

Parameters
  • name – identifier of the dataset

  • graph – a networkx graph

  • weighted – whether the graph is weighted

  • data – the name of the edge data which is taken as weights.

  • default – default weight if networkx edge data is missing.

  • description – description of the dataset

  • overide – When true and the named dataset already exists, it will be deleted

Return type

gct.Dataset
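The interplay of the data and default parameters can be pictured with plain Python. The sketch below is an assumption about how the edge attributes are read, modelled on the (u, v, attribute dict) triples that networkx yields from G.edges(data=True); it is not gct's own code, and the helper name weighted_edges is hypothetical.

```python
def weighted_edges(edges, data="weight", default=1):
    """Turn (u, v, attrs) triples into [u, v, weight] rows.

    The weight is attrs[data] when that key is present, otherwise `default`.
    """
    return [[u, v, attrs.get(data, default)] for u, v, attrs in edges]


# Same shape as networkx's G.edges(data=True): (u, v, attribute dict).
edges = [
    ("a", "b", {"weight": 2.5}),
    ("b", "c", {}),            # no weight stored, so the default is used
    ("a", "c", {"cost": 7}),   # a value stored under a different key
]

rows = weighted_edges(edges)
rows_by_cost = weighted_edges(edges, data="cost", default=0)
```

With the default arguments only the first edge keeps its stored value; selecting data="cost" instead picks up the third edge's attribute and fills the others with 0.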

gct.from_snap(name, graph, description='', overide=False)

Create a dataset from a SNAP graph.

Parameters
  • name – identifier of the dataset

  • graph – a SNAP graph

  • description – description of the dataset

  • overide – When true and the named dataset already exists, it will be deleted

Return type

gct.Dataset

gct.from_igraph(name, graph, data='weight', description='', overide=True)

Create a dataset from an igraph graph.

Parameters
  • name – identifier of the dataset

  • graph – an igraph graph

  • data – the name of the edge attribute that is taken as weights. Ignored for unweighted graphs.

  • description – description of the dataset

  • overide – When true and the named dataset already exists, it will be deleted

Return type

gct.Dataset

gct.to_networkx(data)

Convert the dataset to a networkx graph.

Parameters

data – a gct.Dataset

Return type

networkx graph

gct.to_igraph(data)

Convert the dataset to an igraph graph.

Parameters

data – a gct.Dataset

Return type

igraph graph

gct.to_networkit(data)

Convert the dataset to a networkit graph.

Parameters

data – a gct.Dataset

Return type

networkit graph

gct.to_graph_tool(data)

Convert the dataset to a graph-tool graph.

(TBD) Does graph_tool support weights?

Parameters

data – a gct.Dataset

Return type

graph-tool graph

gct.to_coo_adjacency_matrix(data, simalarity=False, distance_fun=None)

Convert the dataset to a sparse COO adjacency matrix.

Parameters

data – a gct.Dataset

Return type

scipy coo_matrix
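For readers unfamiliar with the COO (coordinate) layout, the sketch below builds the three parallel lists (row, col, data) that such a matrix stores, starting from an edge list. It uses only the standard library. The helper name to_coo_triplets, the rule that a missing weight counts as 1, and the decision to store undirected edges in both directions are assumptions of this example, not documented gct behaviour.

```python
def to_coo_triplets(edgelist, directed=False):
    """Return (row, col, data) lists for the adjacency matrix of an edge list.

    Each edge is [src, dest] or [src, dest, weight]; a missing weight counts as 1.
    """
    row, col, data = [], [], []
    for edge in edgelist:
        src, dest = edge[0], edge[1]
        weight = edge[2] if len(edge) > 2 else 1
        entries = [(src, dest)]
        if not directed and src != dest:
            entries.append((dest, src))  # mirror the edge for undirected graphs
        for r, c in entries:
            row.append(r)
            col.append(c)
            data.append(weight)
    return row, col, data


row, col, data = to_coo_triplets([[0, 1, 0.5], [1, 2]])
```

With SciPy installed, scipy.sparse.coo_matrix((data, (row, col)), shape=(n, n)) turns these lists into a sparse matrix for a graph with n nodes.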

gct.as_undirected(data, newname, description='', overide=False)

Convert the dataset to an undirected dataset.

Parameters
  • data – a gct.Dataset

  • newname – the name of the new dataset

  • description – description of the new dataset

  • overide – When true and the named dataset already exists, it will be deleted

Return type

gct.Dataset
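As a rough picture of what dropping edge direction involves, the sketch below collapses directed [src, dest] rows into unique undirected pairs. It covers the unweighted case only; how gct combines the weights of reciprocal edges is not stated here, so the helper undirected_edges is purely illustrative and hypothetical.

```python
def undirected_edges(edgelist):
    """Collapse directed [src, dest] rows into unique undirected pairs."""
    seen = set()
    result = []
    for src, dest in edgelist:
        # Order each pair so that (1, 0) and (0, 1) map to the same key.
        key = (src, dest) if src <= dest else (dest, src)
        if key not in seen:
            seen.add(key)
            result.append(list(key))
    return result


pairs = undirected_edges([[1, 0], [0, 1], [1, 2], [2, 1], [2, 3]])
```

The five directed rows reduce to three undirected edges, because each reciprocal pair is counted once.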

gct.as_unweighted(data, newname, description='', overide=False)

Convert the dataset to an unweighted dataset.

Parameters
  • data – a gct.Dataset

  • newname – the name of the new dataset

  • description – description of the new dataset

  • overide – When true and the named dataset already exists, it will be deleted

Return type

gct.Dataset

gct.as_unweighted_undirected(data, newname, description='', overide=False)

Convert the dataset to an undirected, unweighted dataset.

Parameters
  • data – a gct.Dataset

  • newname – the name of the new dataset

  • description – description of the new dataset

  • overide – When true and the named dataset already exists, it will be deleted

Return type

gct.Dataset

gct.local_graph_exists(name)

Check whether the dataset ‘name’ exists locally.

Parameters

name – name of a dataset

Return type

bool

gct.load_local_graph(name)

Load a local dataset.

Parameters

name – name of a dataset

gct.list_local_graph()

List all local datasets.

gct.remove_local_graph(name, rm_graph_data=True, rm_clustering_result=True)

Remove a local dataset.

Parameters
  • name – name of a dataset

  • rm_graph_data – remove local graph data

  • rm_clustering_result – remove clustering results associated with the graph.

gct.list_clustering_result(dataset_name)

List the clustering results associated with the dataset.

Parameters

dataset_name – name of a dataset

gct.list_all_clustering_results(print_format=False)

List the clustering results associated with all available datasets.

gct.remove_data(data_pattern='*', with_results=True, dry_run=True)

Remove data (and algorithm run results) by data name pattern.

gct.generate_undirected_unweighted_random_graph_LFR(name, N, k, maxk, mu, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)

Lancichinetti-Fortunato-Radicchi Benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files

Parameters

-N     [number of nodes]
-k     [average degree]
-maxk  [maximum degree]
-mu    [mixing parameter]
-t1    [minus exponent for the degree sequence]
-t2    [minus exponent for the community size distribution]
-minc  [minimum for the community sizes]
-maxc  [maximum for the community sizes]
-on    [number of overlapping nodes]
-om    [number of memberships of the overlapping nodes]
-C     [average clustering coefficient]

-N, -k, -maxk, -mu have to be specified. For the others, the program can use default values:

t1=2, t2=1, on=0, om=0, minc and maxc will be chosen close to the degree sequence extremes.

If you don’t specify -C the rewiring process for raising the average clustering coefficient will not be performed

If you set a parameter twice, the latter one will be taken.

Other options

To have a random network use: -rand

Using this option will set mu=0 and minc=maxc=N, i.e. there will be only one community.

Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external degree/total degree is bounded above (below) by the mixing parameter.

Examples

Example1: ./benchmark -N 1000 -k 15 -maxk 50 -mu 0.1 -minc 20 -maxc 50 -C 0.7

Example2: ./benchmark -f flags.dat -t1 3

Reference

Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
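The flag descriptions above follow the command line of the original benchmark program. The sketch below shows one way keyword parameters could be mapped onto such flags, omitting anything left as None so that the program's own defaults apply. Whether gct invokes the benchmark this way is an assumption; the helper lfr_flags is hypothetical and only illustrates the correspondence between the Python parameters and the flags.

```python
def lfr_flags(N, k, maxk, mu, **optional):
    """Build a benchmark argument list from keyword parameters.

    N, k, maxk and mu are required; optional parameters left as None are
    omitted so that the program falls back to its own defaults.
    """
    params = {"N": N, "k": k, "maxk": maxk, "mu": mu, **optional}
    args = []
    for flag, value in params.items():
        if value is not None:
            args += [f"-{flag}", str(value)]
    return args


args = lfr_flags(1000, 15, 50, 0.1, t1=None, minc=20, maxc=50, C=0.7)
command = " ".join(["./benchmark"] + args)
print(command)
```

The resulting command line reproduces Example1. Using the signature documented above, the corresponding Python call would be gct.generate_undirected_unweighted_random_graph_LFR("lfr_demo", N=1000, k=15, maxk=50, mu=0.1, minc=20, maxc=50, C=0.7), where the dataset name "lfr_demo" is made up for this example.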

gct.generate_directed_unweighted_random_graph_LFR(name, N, k=None, maxk=None, mu=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)

Lancichinetti-Fortunato-Radicchi Benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files

Parameters

-N     [number of nodes]
-k     [average in-degree]
-maxk  [maximum in-degree]
-mu    [mixing parameter]
-t1    [minus exponent for the degree sequence]
-t2    [minus exponent for the community size distribution]
-minc  [minimum for the community sizes]
-maxc  [maximum for the community sizes]
-on    [number of overlapping nodes]
-om    [number of memberships of the overlapping nodes]

-N, -k, -maxk, -mu have to be specified. For the others, the program can use default values:

t1=2, t2=1, on=0, om=0, minc and maxc will be chosen close to the degree sequence extremes.

If you set a parameter twice, the latter one will be taken.

Other options

To have a random network use: -rand

Using this option will set mu=0 and minc=maxc=N, i.e. there will be only one community.

Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external in-degree/total in-degree is bounded above (below) by the mixing parameter.

Examples

Example1: ./benchmark -N 1000 -k 15 -maxk 50 -mu 0.1 -minc 20 -maxc 50

Example2: ./benchmark -f flags.dat -t1 3

Reference

Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.

gct.generate_undirected_weighted_random_graph_LFR(name, N, k=None, maxk=None, mut=None, muw=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)

Lancichinetti-Fortunato-Radicchi Benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files

Parameters

-N     [number of nodes]
-k     [average degree]
-maxk  [maximum degree]
-mut   [mixing parameter for the topology]
-muw   [mixing parameter for the weights]
-beta  [exponent for the weight distribution]
-t1    [minus exponent for the degree sequence]
-t2    [minus exponent for the community size distribution]
-minc  [minimum for the community sizes]
-maxc  [maximum for the community sizes]
-on    [number of overlapping nodes]
-om    [number of memberships of the overlapping nodes]
-C     [average clustering coefficient]

-N, -k, -maxk, -muw have to be specified. For the others, the program can use default values:

t1=2, t2=1, on=0, om=0, beta=1.5, mut=muw, minc and maxc will be chosen close to the degree sequence extremes.

If you don’t specify -C the rewiring process for raising the average clustering coefficient will not be performed

If you set a parameter twice, the latter one will be taken.

Other options

To have a random network use: -rand

Using this option will set muw=0, mut=0, and minc=maxc=N, i.e. there will be only one community.

Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external degree/total degree is bounded above (below) by the mixing parameter.

Examples

Example1: ./benchmark -N 1000 -k 15 -maxk 50 -muw 0.1 -minc 20 -maxc 50

Example2: ./benchmark -f flags.dat -t1 3

Reference

Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
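The default rules stated above for the weighted benchmark (t1=2, t2=1, on=0, om=0, beta=1.5, mut=muw) can be written out as a small resolver. This is only a sketch of the documented rules: in practice the defaults are presumably applied by the benchmark program itself, and the helper resolve_weighted_defaults is hypothetical.

```python
def resolve_weighted_defaults(muw, mut=None, beta=None, t1=None, t2=None,
                              on=None, om=None):
    """Fill in the documented defaults for parameters left as None."""
    return {
        "muw": muw,
        "mut": muw if mut is None else mut,  # mut defaults to muw
        "beta": 1.5 if beta is None else beta,
        "t1": 2 if t1 is None else t1,
        "t2": 1 if t2 is None else t2,
        "on": 0 if on is None else on,
        "om": 0 if om is None else om,
    }


resolved = resolve_weighted_defaults(muw=0.1, t1=3)
```

Here only muw and t1 are given, so mut follows muw and every other parameter takes its documented default. The defaults for minc and maxc are left out because they depend on the generated degree sequence.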

gct.generate_directed_weighted_random_graph_LFR(name, N, k=None, maxk=None, mut=None, muw=None, beta=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, seed=None, overide=False)

Lancichinetti-Fortunato-Radicchi Benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files

Parameters

-N     [number of nodes]
-k     [average in-degree]
-maxk  [maximum in-degree]
-mut   [mixing parameter for the topology]
-muw   [mixing parameter for the weights]
-beta  [exponent for the weight distribution]
-t1    [minus exponent for the degree sequence]
-t2    [minus exponent for the community size distribution]
-minc  [minimum for the community sizes]
-maxc  [maximum for the community sizes]
-on    [number of overlapping nodes]
-om    [number of memberships of the overlapping nodes]

-N, -k, -maxk, -muw have to be specified. For the others, the program can use default values:

t1=2, t2=1, on=0, om=0, beta=1.5, mut=muw, minc and maxc will be chosen close to the degree sequence extremes.

If you set a parameter twice, the latter one will be taken.

Other options

To have a random network use: -rand

Using this option will set muw=0, mut=0, and minc=maxc=N, i.e. there will be only one community.

Use option -sup (-inf) if you want to produce a benchmark whose distribution of the ratio of external in-degree/total in-degree is bounded above (below) by the mixing parameter.

Examples

Example1: ./benchmark -N 1000 -k 15 -maxk 50 -muw 0.1 -minc 20 -maxc 50

Example2: ./benchmark -f flags.dat -t1 3

Reference

Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.

gct.generate_undirected_unweighted_hier_random_graph_LFR(name, N, k=None, maxk=None, mu1=None, mu2=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, minC=None, maxC=None, seed=None, overide=False)

Lancichinetti-Fortunato-Radicchi Benchmark generator. Original from https://sites.google.com/site/andrealancichinetti/files

Parameters

-N     [number of nodes]
-k     [average degree]
-maxk  [maximum degree]
-t1    [minus exponent for the degree sequence]
-t2    [minus exponent for the community size distribution]
-minc  [minimum for the micro community sizes]
-maxc  [maximum for the micro community sizes]
-on    [number of overlapping nodes]
-om    [number of memberships of the overlapping nodes]
-minC  [minimum for the macro community size]
-maxC  [maximum for the macro community size]
-mu1   [mixing parameter for the macro communities (see Readme file)]
-mu2   [mixing parameter for the micro communities (see Readme file)]

Examples

Example1: ./hbenchmark -N 10000 -k 20 -maxk 50 -mu2 0.3 -minc 20 -maxc 50 -minC 100 -maxC 1000 -mu1 0.1

Example2: ./hbenchmark -f flags.dat

Reference

Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.

gct.generate_random_ovp_graph_LFR(name, N, k=None, maxk=None, mut=None, muw=None, beta=None, t1=None, t2=None, minc=None, maxc=None, on=None, om=None, C=None, a=0, weighted=False, seed=None, overide=False)

Extended version of the Lancichinetti-Fortunato-Radicchi Benchmark for Weighted Overlapping networks to evaluate clustering algorithms using generated ground-truth communities.

Refer to https://github.com/eXascaleInfolab/LFR-Benchmark_UndirWeightOvp

Parameters

-N     [number of nodes]
-k     [average degree]
-maxk  [maximum degree]
-mut   [mixing parameter for the topology]
-muw   [mixing parameter for the weights]
-beta  [exponent for the weight distribution]
-t1    [minus exponent for the degree sequence]
-t2    [minus exponent for the community size distribution]
-minc  [minimum for the community sizes]
-maxc  [maximum for the community sizes]
-on    [number of overlapping nodes]
-om    [number of memberships of the overlapping nodes]
-C     [average clustering coefficient]
-cnl   [output communities as strings of nodes (input format for NMI evaluation)]
-name  [base name for the output files]. It is used for the network, communities and statistics; file extensions are added automatically:
         .nsa - network, represented by space/tab separated arcs
         .nse - network, represented by space/tab separated edges
         {.cnl, .nmc} - communities, represented by node lists (‘.cnl’) if ‘-cnl’ is used, otherwise as node memberships in communities (‘.nmc’)
         .nst - network statistics
-seed  [file name of the random seed, default: seed.txt]
-a     [{0, 1} yield directed network (1 - arcs) rather than undirected (0 - edges), default: 0 - edges]

Reference

Lancichinetti, Andrea, and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” Physical Review E 80.1 (2009): 016118.
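The file-extension rules given for the -name option can be summarised in a few lines of Python. The pairing of .nsa with -a 1 (arcs) and .nse with -a 0 (edges) is inferred from the descriptions above rather than stated outright, and the helper lfr_output_files is hypothetical.

```python
def lfr_output_files(name, a=0, cnl=False):
    """List the output file names derived from a base name.

    a=1 selects arcs (.nsa) instead of edges (.nse); cnl=True selects
    node lists (.cnl) instead of node memberships (.nmc).
    """
    network_ext = ".nsa" if a == 1 else ".nse"
    community_ext = ".cnl" if cnl else ".nmc"
    return [name + network_ext, name + community_ext, name + ".nst"]


files = lfr_output_files("demo", a=1, cnl=True)
default_files = lfr_output_files("demo")
```

A directed run with -cnl therefore yields an arcs file, a node-list file and a statistics file, while the defaults yield an edges file, a membership file and a statistics file.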