Welcome to Geo2DR’s Docs!¶
Geo2DR is a Python library for constructing methods capable of learning distributed representations of graphs. Here, embeddings of substructure patterns (walks, graphlets, trees, etc.) and whole graphs are learned by exploiting the distributional hypothesis used in statistical language modelling. Geo2DR makes the various algorithms used for inducing discrete substructures, creating corpora, and training neural language models available under a single, easy-to-use library. This allows the rapid recreation of existing methods, as well as the construction of completely novel, unpublished methods, in a robust and reliable manner.
Note
This is an actively developing project and much more documentation is planned (you can view the raw rst file to see the upcoming tutorials and reference materials to be added in the coming weeks). It's quite tough going handling this project alone, so please bear with me!
Installation¶
The current version of Geo2DR, or more specifically geometric2dr, relies on functionality in other packages which needs to be installed in advance. These packages can also be installed after geometric2dr, but either way they are necessary for the library to work.
Note
We do not recommend installation as the root user on your system Python.
Please set up a virtual environment with virtualenv. This can be done with a command such as virtualenv -p python3 venv, which creates a new directory venv containing a self-contained Python 3 instance. The rest of this document assumes a virtual environment is being used and activated.
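If you have not used virtual environments before, the following is a minimal sketch of the typical sequence on a bash-like shell (the directory name venv is just a convention):

pip install virtualenv        # install virtualenv itself, if not already present
virtualenv -p python3 venv    # create a self-contained python3 environment in ./venv
source venv/bin/activate      # activate it for the current shell session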
geometric2dr's two main dependencies are PyNauty and PyTorch. PyNauty is used for efficient hash or certificate generation of topological structures used in some decomposition algorithms. PyTorch is the machine learning and numerical framework we use to define our neural language models and related utilities.
Installing PyNauty dependency¶
PyNauty is reliant on Nauty, a C library, and therefore requires some building and symbolic linking as part of its installation; we have tried to condense this here into as few steps as possible. To build PyNauty for Geo2DR one needs:
- Python 3.6+
- The most recent version of Nauty
- An ANSI C compiler
Download PyNauty’s compressed tar sources and unpack it. That should create a directory like pynauty-X.Y.Z/ containing the source files.
Also download the most recent sources of nauty. That file should be something like nautyXYZ.tar.gz.
Change to the directory pynauty-X.Y.Z/ replacing X.Y.Z with the actual version of pynauty:
cd pynauty-X.Y.Z
Unpack nautyXYZ.tar.gz inside the pynauty-X.Y.Z/ directory. Create a symbolic link named nauty to nauty’s source directory replacing XYZ with the actual version of Nauty and build it:
ln -s nautyXYZ nauty
make pynauty
To install PyNauty into the virtual environment, call the following in the pynauty folder while the virtual environment is activated:
make virtenv-ins
To uninstall, simply use the corresponding make command:
make virtenv-unins
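Putting the above together, the whole build might look something like the following sketch, with X.Y.Z and XYZ standing in for the actual PyNauty and Nauty versions you downloaded:

tar -xzf pynauty-X.Y.Z.tar.gz   # unpack the PyNauty sources
cd pynauty-X.Y.Z
tar -xzf ../nautyXYZ.tar.gz     # unpack Nauty inside the pynauty-X.Y.Z/ directory
ln -s nautyXYZ nauty            # symbolic link expected by the PyNauty build
make pynauty                    # build PyNauty against Nauty
make virtenv-ins                # install into the activated virtual environment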
Installing PyTorch dependency¶
PyTorch is installed based on the available hardware of the machine (GPU or CPU); please follow the appropriate pip installation instructions on the official PyTorch website.
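For example, a typical CPU-only installation can be as simple as the line below, but please use the exact command generated by the selector on the PyTorch website for your OS, package manager, and CUDA setup:

pip install torch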
Installing Geo2DR¶
geometric2dr can be installed either from PyPI or from the sources in the GitHub repository. PyPI installation is done through pip:
pip install geometric2dr
Installation from sources¶
Geo2DR can be installed into the virtual environment from the source folder to get its latest features. If one wishes to simply use Geo2DR's modules, one can use the standard pip source install from the project root folder containing setup.py:
pip install .
If you wish to modify some of the source code in geometric2dr for your own task, you can make a developer install. It is slightly slower, but immediately picks up any changes made to the source code:
pip install -e .
A quick primer on distributed representations of graphs¶

Here we provide a brief and simplified primer on learning distributed representations of graphs. This will not fully describe the intricacies of existing methods, but covers a conceptual framework common to almost all distributed representations of graphs, particularly for learning representations of substructure patterns and whole graphs. The figure above is a diagrammatic representation of this conceptual framework.
Given a set of graphs \(\mathbb{G} = \{ \mathcal{G}_1, \mathcal{G}_2, ..., \mathcal{G}_n \}\) one can induce discrete substructure patterns such as shortest paths, rooted subgraphs, graphlets, etc. using the side effects of algorithms such as the Floyd-Warshall algorithm or the Weisfeiler-Lehman graph isomorphism test. This can be used to produce pattern frequency vectors \(\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n \}\) describing the occurrence frequency of substructure patterns over a shared vocabulary \(\mathbb{V}\), where \(\mathbb{V}\) is the set of unique substructure patterns induced across all of the graphs in the dataset \(\mathbb{G}\).
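As a small illustration of this step, the sketch below builds pattern frequency vectors over a shared vocabulary from toy induced-pattern lists using plain Python and numpy; the pattern identifiers are hypothetical stand-ins for the output of any of the decomposition algorithms above.

import numpy as np
from collections import Counter

# Hypothetical decomposition output: one list of induced substructure
# pattern identifiers per graph in the dataset G.
patterns_per_graph = [
    ["wl_0_a", "wl_1_ab", "wl_1_ab"],   # patterns induced in G_1
    ["wl_0_a", "wl_0_b", "wl_1_ab"],    # patterns induced in G_2
    ["wl_0_b", "wl_1_bb", "wl_1_bb"],   # patterns induced in G_3
]

# Shared vocabulary V: the unique patterns induced across all graphs
vocabulary = sorted({p for patterns in patterns_per_graph for p in patterns})
index = {pattern: i for i, pattern in enumerate(vocabulary)}

# Pattern frequency matrix X: one row per graph, one column per pattern in V
X = np.zeros((len(patterns_per_graph), len(vocabulary)))
for gidx, patterns in enumerate(patterns_per_graph):
    for pattern, count in Counter(patterns).items():
        X[gidx, index[pattern]] = count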
Classically one may directly use these pattern frequency vectors within standard machine learning methods that take vector inputs to perform some task. This is the approach taken by a variety of graph kernels. Unfortunately, as the graphs of \(\mathbb{G}\) and the substructure patterns induced become more complex through size or specificity, the number of induced patterns increases dramatically. This, in turn, causes the pattern frequency vectors of \(\mathbf{X}\) to become extremely sparse and high-dimensional. The high specificity of the patterns and the sparsity of the pattern frequency vectors cause a phenomenon known as diagonal dominance across the kernel matrices, wherein each graph becomes ever more similar to itself and dissimilar from others, degrading the classification performance [1].
To address this issue it is possible to learn dense, low-dimensional distributed representations of graphs that are inductively biased to be similar when they contain similar substructure patterns, and dissimilar when they do not. To achieve this, one constructs a corpus dataset \(\mathcal{D}\) detailing the target-context relationship between a graph and its induced substructures (as in our example), or between a substructure pattern and other substructure patterns. In the simplest form, for graph-level representation learning one can implement \(\mathcal{D}\) as tuples of graph and substructure pattern, \((\mathcal{G}_i, p_j) \in \mathcal{D}\) if \(p_j \in \mathbb{V}\) and \(p_j \in \mathcal{G}_i\).
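Continuing the toy example above, the corpus \(\mathcal{D}\) for graph-level learning is then simply the list of (graph, pattern) co-occurrence pairs:

# One (target, context) pair per pattern occurrence: the graph index is
# the target and each substructure pattern occurring in it is a context.
corpus_D = [
    (gidx, pattern)
    for gidx, patterns in enumerate(patterns_per_graph)
    for pattern in patterns
]
# e.g. [(0, "wl_0_a"), (0, "wl_1_ab"), (0, "wl_1_ab"), (1, "wl_0_a"), ...]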
The corpus is utilised with a method that incorporates Harris' distributional hypothesis [2] to learn the distributed representations of graphs. Skipgram, CBOW, PV-DM, and PV-DBOW [3, 4] are a few examples of neural embedding methods that incorporate this inductive bias, and all are present in the Geo2DR library. In skipgram with negative sampling, as used in Graph2Vec [5], the distributed representations can be learned by optimizing

\[\max_{\Phi, \mathcal{S}} \sum_{(\mathcal{G}_i, p_j) \in \mathcal{D}} \left[ \log \sigma \left( \Phi_i \cdot \mathcal{S}_{p_j} \right) + \sum_{n=1}^{k} \mathbb{E}_{t_N \sim P_D} \log \sigma \left( -\Phi_i \cdot \mathcal{S}_{t_N} \right) \right]\]

over the corpus observations, where \(\sigma\) is the logistic sigmoid, \(\Phi \in \mathbb{R}^{|\mathbb{G}| \times d}\) is the \(d\)-dimensional matrix of graph embeddings we desire for the graph dataset \(\mathbb{G}\), and \(\Phi_i\) is the embedding of \(\mathcal{G}_i \in \mathbb{G}\). Similarly, \(\mathcal{S} \in \mathbb{R}^{|\mathbb{V}| \times d}\) contains the \(d\)-dimensional embeddings of the substructure patterns in the vocabulary \(\mathbb{V}\), so \(\mathcal{S}_p\) is the vector embedding corresponding to substructure pattern \(p\). The embeddings of the substructure patterns are also tuned but ultimately not used, as we are interested in the graph embeddings in \(\Phi\). \(k\) is the number of negative samples, with \(t_N\) being a sampled context pattern drawn according to the empirical unigram distribution \(P_D (p) = \frac{|\{p \mid \forall \mathcal{G}_i \in \mathbb{G}, (\mathcal{G}_i, p) \in \mathcal{D}\}|}{|\mathcal{D}|}\).
The optimization of the above utility function creates the desired distributed representations of the targets in \(\Phi\), in this case graph-level embeddings. These may be used as input for any downstream machine learning task and method that takes vector inputs. The distributed representations benefit from having much lower dimensionality than the pattern frequency vectors, in other words \(|\mathbb{V}| \gg d\), from being non-sparse, and from being inductively biased via the distributional hypothesis in an unsupervised manner. For more in-depth reading we recommend [1-5].
- Yanardag, P. and Vishwanathan, S. "Deep Graph Kernels." In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pp. 1365-1374, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3664-2. doi: 10.1145/2783258.2783417.
- Harris, Z. S. "Distributional structure." WORD, 10(2-3):146-162, 1954. doi: 10.1080/00437956.1954.11659520.
- Mikolov, T., Chen, K., Corrado, G., and Dean, J. "Efficient estimation of word representations in vector space." In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.
- Le, Q. and Mikolov, T. "Distributed representations of sentences and documents." In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML '14, pp. II-1188-II-1196. JMLR.org, 2014.
- Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., and Jaiswal, S. "graph2vec: Learning distributed representations of graphs." CoRR, abs/1707.05005, 2017.
Introduction with example¶
Geo2DR is a Python library conceived to quickly construct systems capable of learning distributed representations of graphs. To do so, the library contains various graph decomposition algorithms and neural language models that serve as the main components in a two-step design methodology for learning distributed representations. Also included are data preprocessing tools for common benchmark datasets, and efficient corpus data/dataloader implementations for interfacing with the included neural language models written in PyTorch. The library thereby provides a unified set of tools for quickly constructing systems capable of learning distributed representations of graphs and their substructures. Considerable care was taken to produce outputs compatible with other libraries useful for learning distributed representations, such as Gensim and custom PyTorch or TensorFlow implementations, and with network analysis tools such as Gephi and NetworkX.
On this page we will cover some of the common design principles underpinning popular systems for learning distributed representations of graphs. Then we proceed to give a practical example of obtaining data and building Narayanan et al.'s Graph2Vec system for learning graph-level embeddings.
Basic theory and methodology¶

Popular systems such as Deep Graph Kernels (Yanardag and Vishwanathan, 2015), Graph2Vec (Narayanan et al., 2017), and Anonymous Walk Embeddings (AWE) (Ivanov and Burnaev, 2018) are all methods which learn distributed representations of arbitrarily sized graphs. Such systems can be characterised by a common pipeline described below (and in the figure above).
- Decomposition of graphs into descriptive substructures: each graph in the dataset of graphs is reduced to a set of induced substructure patterns according to some decomposition algorithm. Examples include finding all the rooted subgraphs of a graph using the Weisfeiler-Lehman algorithm (Shervashidze et al., 2011), as done in Graph2Vec, or the shortest paths, as in Deep Graph Kernels, to name a few. The union of the sets of substructure patterns induced in each graph across the dataset defines a common "vocabulary" that can describe a graph in relation to other graphs based on its induced substructure patterns.
- Learning distributed vector representations: the distributional hypothesis (Harris, 1954) posits that words which are used and occur within similar contexts have similar semantic meanings. In a similar way we may say that a graph is contextualized by its substructure patterns, producing a new dataset of (graph, induced_subgraph_pattern) pairs. Embedding methods which exploit the distributional hypothesis, such as skipgram (Mikolov et al., 2013), can then be used to learn fixed-size distributed vector embeddings of each substructure pattern or graph in an unsupervised manner.
Once the distributed vector representations have been learned for each of the graphs in a dataset, the graph embeddings may be used in any downstream application such as graph classification, regression, etc.
A full example¶
To start we need to download a dataset. We will use the well-known benchmark dataset MUTAG, downloaded from the TU Dortmund Graph Kernel Benchmark website. Let us save the unpacked archive into a folder called original_data/, so the dataset as downloaded will be the directory original_data/MUTAG.
Geo2DR assumes graphs are saved in GEXF (Graph Exchange XML Format), because it is compatible with network analysis software such as Gephi and NetworkX, and it is useful to be able to study a single graph in a single file in isolation. We will transform the dataset of graphs into a corresponding GEXF dataset inside another directory with the following code:
from geometric2dr.data import DortmundGexf
gexifier = DortmundGexf("MUTAG", "original_data/", "data/")
gexifier.format_dataset()
This will result in the following dataset format:
data/MUTAG/ : folder containing individual gexf files of each graph. A graph is denoted by the graph ID used in the original data; in this case graph 0 would be data/MUTAG/0.gexf
data/MUTAG.Labels : a plain-text file with each line containing a graph's file name and its classification
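Should you want to read those labels yourself (for example, for the classification step at the end of this walkthrough), a few lines of plain Python suffice; this sketch simply assumes each line holds a file name followed by an integer label, as described above.

# Parse the .Labels file into a {file_name: class_label} dictionary
labels = {}
with open("data/MUTAG.Labels") as fh:
    for line in fh:
        fname, label = line.split()
        labels[fname] = int(label)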
Given the formatted data we can now induce substructure patterns across the graph files (in this case all of those in the MUTAG folder). Here we will induce rooted subgraphs up to depth 3 using the Weisfeiler-Lehman node relabeling algorithm outlined in Shervashidze et al.:
from geometric2dr.decomposition.weisfeiler_lehman_patterns import wl_corpus
import geometric2dr.embedding_methods.utils as utils
dataset_path = "data/MUTAG"
wl_depth = 3
graph_files = utils.get_files(dataset_path, ".gexf")
wl_corpus(graph_files, wl_depth)
The wl_corpus() function induces rooted subgraph patterns across the list of gexf files in graph_files, and for each graph builds a document which describes the induced patterns within it (more on this can be found in the data formatting tutorial). These documents have a special extension specific to each decomposition algorithm, or one set by the user; in this case the extension will be .wld3 for Weisfeiler-Lehman decomposition to depth 3. Having permanent files generated as a side effect of the graph decomposition process is useful for later study, and also if we want to use the same induced patterns in the upcoming step of learning distributed representations of the graphs.
To learn distributed representations we need to construct a new target-context dataset. In Graph2Vec a graph is contextualised by the substructure patterns within it, and the PV-DBOW architecture with negative sampling is used to directly learn graph-level embeddings. Hence we use the PVDBOWInMemoryCorpus, which is an extension of a PyTorch dataset. This can interface with a standard PyTorch dataloader to load the data into a Skipgram model, which we train in a simple loop using a recognizable PyTorch.nn workflow.
import torch
import torch.optim as optim
from torch.utils.data import DataLoader

from geometric2dr.embedding_methods.pvdbow_data_reader import PVDBOWInMemoryCorpus
from geometric2dr.embedding_methods.skipgram import Skipgram

# Instantiate corpus dataset, dataloader and skipgram architecture
corpus = PVDBOWInMemoryCorpus(dataset_path, ".wld3") # generates the target-context dataset
dataloader = DataLoader(corpus, batch_size=1000, shuffle=False, collate_fn=corpus.collate)
skipgram = Skipgram(num_targets=corpus.num_graphs, vocab_size=corpus.num_subgraphs, emb_dimension=32)

# Set torch device, optimizer, and run a training loop
if torch.cuda.is_available():
    device = torch.device("cuda")
    skipgram.cuda()
else:
    device = torch.device("cpu")

optimizer = optim.SGD(skipgram.parameters(), lr=0.1)

for epoch in range(100):
    print("### Epoch: " + str(epoch))
    running_loss = 0.0
    for i, sample_batched in enumerate(dataloader):
        if len(sample_batched[0]) > 1:
            pos_target = sample_batched[0].to(device)
            pos_context = sample_batched[1].to(device)
            neg_context = sample_batched[2].to(device)

            optimizer.zero_grad()
            loss = skipgram.forward(pos_target, pos_context, neg_context) # the loss is integrated into the forward function
            loss.backward()
            optimizer.step()

            running_loss = running_loss * 0.9 + loss.item() * 0.1
    print(" Loss: " + str(running_loss))

final_graph_embeddings = skipgram.target_embeddings.weight.cpu().data.numpy()
And we have our graph embeddings! As this is such a common setup, Geo2DR also comes with a number of Trainer classes which build the corpus datasets and loaders, train the neural language models, and save their outputs. All of the above code can be replaced with this short trainer:
from geometric2dr.embedding_methods.pvdbow_trainer import InMemoryTrainer
# Instantiate a PV-DBOW trainer to learn distributed reps directly.
trainer = InMemoryTrainer(corpus_dir=dataset_path, extension=".wld3", output_fh="graph_embeddings.json",
emb_dimension=32, batch_size=1000, epochs=100, initial_lr=0.1,
min_count=0)
trainer.train()
final_graph_embeddings = trainer.skipgram.give_target_embeddings()
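As a quick sanity check, the learned embeddings can be fed into any downstream classifier. The sketch below uses the convenience function cross_val_accuracy from geometric2dr.embedding_methods.classify (documented further down this page), assuming the trainer above saved its embeddings to graph_embeddings.json:

from geometric2dr.embedding_methods.classify import cross_val_accuracy

# 10-fold cross-validated SVM accuracy over the saved graph embeddings
mean_acc, std = cross_val_accuracy(corpus_dir=dataset_path, extension=".wld3",
                                   embedding_fname="graph_embeddings.json",
                                   class_labels_fname="data/MUTAG.Labels")
print("Mean accuracy: %s (+/- %s)" % (mean_acc, std))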
Geo2DR implements a variety of graph decomposition algorithms (such as Weisfeiler-Lehman, anonymous walks, graphlets) and learning models which exploit the distributional hypothesis (such as skipgram with noise contrastive sampling, PV-DM). This enables the quick recreation of existing systems such as Graph2Vec or AWE, but also the creation of new combinations leading to novel systems capable of learning distributed representations. This enables deeper studies into how we can build better representations of graphs, and more reliable comparative analyses on the same codebase.
License¶
MIT License
Copyright (c) 2019-2020 Paul Scherer
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
geometric2dr.data¶
geometric2dr.data.dortmund_formatter¶
Gexifier for TU Dortmund graph kernel based datasets.
class geometric2dr.data.dortmund_formatter.DortmundGexf(dataset, path_to_dataset, output_dir_for_graph_files)
A class which reads TU Dortmund style datasets and processes them into a corresponding set of .gexf graphs and an associated .Labels file.
This class helps turn datasets from the format in which TU graph kernel datasets are distributed into GEXF datasets, which Geo2DR can work with.
It reads the DS_A.txt, DS_graph_indicator.txt, and DS_graph_labels.txt to create a folder of graphs in GEXF format and a graph-id to graph-classification label file.
The saved format will be: dataset_name/dataset_name/<name>.gexf – a folder containing individual gexf files of each graph, and dataset_name/dataset_name.Labels – a file mapping each gexf file to its integer class label.
See tu_gexifier for a more basic, script-based version. This class version also contains various metadata about the dataset which may be useful for downstream decomposition algorithms and other analyses.
Parameters: - dataset (str) – string name of directory containing dataset, e.g. "MUTAG".
- path_to_dataset (str) – path to directory containing directory of dataset.
- output_dir_for_graph_files (str) – path to where new dataset and labels file will be saved.
geometric2dr.decomposition¶
geometric2dr.decomposition.anonymous_walk_patterns¶
This module contains algorithms useful for inducing anonymous walks in graphs.
The algorithms are adapted from the original source code by Ivanov and Burnaev 2018 “Anonymous Walk Embeddings” [1]. Original reference implementation can be found in: https://github.com/nd7141/AWE
geometric2dr.decomposition.anonymous_walk_patterns.all_paths(paths, steps, keep_last=False)
Get all possible anonymous walks of length up to steps.
geometric2dr.decomposition.anonymous_walk_patterns.all_paths_edges(paths, steps, keep_last=True)
Get all possible anonymous walks of length up to steps, using edge labels.
geometric2dr.decomposition.anonymous_walk_patterns.all_paths_edges_nodes(paths, steps, keep_last=True)
Get all possible anonymous walks of length up to steps, using edge-node labels.
geometric2dr.decomposition.anonymous_walk_patterns.all_paths_nodes(paths, steps, keep_last=True)
Get all possible anonymous walks of length up to steps, using node labels.
geometric2dr.decomposition.anonymous_walk_patterns.anonymous_walk_node(graph, rw_graph, node, steps, label_setting=None)
Creates anonymous walk from a node.
geometric2dr.decomposition.anonymous_walk_patterns.anonymous_walks(graph, neighborhood_size, walk_ids, awe_length, label_setting)
Generates anonymous walks for a given graph and input hyperparameters.
geometric2dr.decomposition.anonymous_walk_patterns.awe_corpus(corpus_dir, awe_length, label_setting, neighborhood_size=10, saving_graph_docs=True)
Induces anonymous walks up to a supplied length across a set of graphs.
This function extracts anonymous walks of awe_length across the graphs in a dataset into a vocabulary. The function saves a graphdoc for each graph, which records string identifiers of the anonymous walks present in each graph.
The main use case for this function is for the user to supply a path to the directory which contains the .gexf files of each graph in a dataset. The decomposition function will produce a file with the extension .awe_<awe_length>_<label_setting> for each .gexf file, which contains the anonymous walk patterns specified by a string hash controlled by the vocabulary of patterns across the list of graphs being studied.
Parameters: - corpus_dir (str) – path to folder containing gexf graph files on which the substructure patterns should be induced
- awe_length (int) – desired length of anonymous walk patterns
- label_setting (str) – information of node/edge labels that should be used; this can be None, ‘nodes’, ‘edges’, ‘edges_nodes’
- neighborhood_size (int (default=10)) – the number of anonymous walks to take from a source node
- saving_graph_docs (bool (default=True)) – boolean value which dictates whether graph documents will be generated and saved.
Returns: None – Induces the anonymous walks of awe_length across the list of gexf file paths supplied in corpus_dir. Graph "documents" with the extension .awe_<awe_length>_<label_setting> are created for each of the gexf files in the same directory, containing string identifiers of the patterns induced in each graph.
Return type: None
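For reference, a typical invocation mirrors the other *_corpus functions in this module; this sketch reuses the tutorial's dataset path and illustrative hyperparameter values:

from geometric2dr.decomposition.anonymous_walk_patterns import awe_corpus

# Induce anonymous walks of length 10, without node/edge label information
awe_corpus("data/MUTAG", awe_length=10, label_setting=None)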
geometric2dr.decomposition.anonymous_walk_patterns.create_random_walk_graph(nx_graph)
Generate a random walk graph equivalent of a graph, as described in Anonymous Walk Embeddings.
geometric2dr.decomposition.anonymous_walk_patterns.load_graph(file_handle)
Loads a nx graph object and a corresponding numpy adjacency matrix given a GEXF file graph.
Parameters: file_handle (str) – path to gexf file Returns: - graph (networkx graph) – Networkx graph instance of input gexf file
- adj_matrix (numpy ndarray) – The adjacency matrix of the graph
geometric2dr.decomposition.anonymous_walk_patterns.random_step_node(rw_graph, node)
Moves one step from the current node according to the probabilities of the outgoing edges. Returns the next node.
geometric2dr.decomposition.anonymous_walk_patterns.random_walk_node(rw_graph, node, steps)
Creates anonymous walk from a node for arbitrary steps. Returns a tuple with consequent nodes.
geometric2dr.decomposition.anonymous_walk_patterns.random_walk_with_label_edges(graph, rw_graph, node, steps)
Creates anonymous walk from a node for arbitrary steps with usage of edge labels. Returns a tuple with consequent nodes.
geometric2dr.decomposition.graphlet_patterns¶
Graphlet based graph decomposition algorithm to create graph documents. Inspired and adapted from Yanardag and Vishwanathan “Deep Graph Kernels” [2].
geometric2dr.decomposition.graphlet_patterns.get_graphlet(window, num_graphlets)
Compute the Nauty certificate of the graphlet within a window of the adjacency matrix.
This function takes the upper triangle of an n x n matrix and computes its hash certificate using Nauty. Given the parameters, this usually involves computing the graphlet of num_graphlets size.
This is used for comparison with a bank of known certificates as loaded in get_maps().
Parameters: - window (numpy ndarray) – submatrix inside the adjacency matrix of a graph
- num_graphlets (int) – the size of the graphlets to extract
Returns: cert – certificate of the graphlet produced by finding its canonical representation with Nauty.
Return type: byte str
geometric2dr.decomposition.graphlet_patterns.get_maps(num_graphlets)
Load the certificates created by the canonical representations of graphlets into a canonical maps dictionary, for a quick isomorphism check when we extract graphlets from graphs in the user's dataset.
Parameters: num_graphlets (int) – The number of nodes in the graphlet pattern, aka the size of a graphlet pattern Returns: canonical_map – A dict of Nauty certificates of the canonical representations of graphlets of the size num_graphlets. Return type: dict
geometric2dr.decomposition.graphlet_patterns.graphlet_corpus(corpus_dir, num_graphlets, samplesize)
Function which induces the graphlet patterns over the graphs inside a given directory.
The main use case is for the user to input a path containing the individual graphs of the dataset in gexf format. The decomposition algorithm will induce graphlet patterns for the graphs, recording the dataset/"global" vocabulary of patterns within a dictionary. The graph and its associated patterns (by IDs given through our hash function) are saved into a graph document file which contains a line-delimited list of all the substructure pattern ids.
Parameters: - corpus_dir (str) – path to directory containing graph files of dataset in .gexf format.
- num_graphlets (int) – the size of the graphlet patterns to be extracted
- samplesize (int) – the number of samples to take from a graph.
Returns: - corpus (list of lists of str) – a list of lists, with each inner list containing all the samplesize*(1+ num_graphlets) patterns in one graph of the dataset
- vocabulary (list) – a set of the unique graphlet pattern ids
- prob_map (dict) – a map {gidx: {graphlet_idx: normalized_prob}} of normalized probabilities of a graphlet pattern appearing in a graph based on counts made in generation
- num_graphs (int) – the number of graphs in the dataset
- graph_map (dict) – a map {gidx: {graphlet_idx: count}} of the number of times a certain graphlet pattern appeared in a graph for each graph gidx in the dataset
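A sketch of typical usage, with illustrative values for the graphlet size and sample count:

from geometric2dr.decomposition.graphlet_patterns import graphlet_corpus

# Induce graphlets of 7 nodes, taking 25 samples per graph
corpus, vocabulary, prob_map, num_graphs, graph_map = graphlet_corpus(
    "data/MUTAG", num_graphlets=7, samplesize=25)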
geometric2dr.decomposition.graphlet_patterns.load_graph(file_handle)
Load the gexf format file as a nx_graph and an adjacency matrix.
Parameters: file_handle (str) – path to gexf file Returns: - graph (networkx graph) – Networkx graph instance of input gexf file
- adj_matrix (numpy ndarray) – The adjacency matrix of the graph
geometric2dr.decomposition.graphlet_patterns.save_graphlet_document(gexf_fh, gidx, num_graphlets, samplesize, cooccurence_corpus)
Saves the induced graphlet patterns into the dataset folder; it is only used in conjunction with graphlet_corpus().
Parameters: - gexf_fh (str) – path to gexf file of graph
- gidx (int) – integer id of the graph, typically matches number in gexf_fh, if using TU Dortmund benchmark datasets
- num_graphlets (int) – size of graphlet patterns induced
- samplesize (int) – number of graphlet patterns sampled from graph
- cooccurrence_corpus (list) – list of graphlet patterns induced under cooccurrence rules, in this case all graphlets immediately adjacent to an induced graphlet pattern.
Returns: None – The decomposition algorithm will induce graphlet patterns for the graphs, recording the dataset/"global" vocabulary of patterns within a dictionary. The graph and its associated patterns (by IDs given through our hash function) are saved into a graph document file which contains a line-delimited list of all the substructure pattern ids.
Return type: None
geometric2dr.decomposition.shortest_path_patterns¶
Shortest path based decomposition algorithm to create graph documents. Inspired and adapted from Yanardag and Vishwanathan “Deep Graph Kernels” [2].
geometric2dr.decomposition.shortest_path_patterns.load_graph(file_handle)
Load the gexf format file as a nx_graph and an adjacency matrix.
Parameters: file_handle (str) – path to gexf file Returns: - graph (networkx graph) – Networkx graph instance of input gexf file
- adj_matrix (numpy ndarray) – The adjacency matrix of the graph
geometric2dr.decomposition.shortest_path_patterns.save_sp_doc(gexf_fh, gidx, coocurrence_corpus)
Saves a shortest path graph doc with the extension .spp.
Parameters: - gexf_fh (str) – path to the gexf file of the graph
- gidx (int) – int id of the gexf file (note: deprecated and will be removed in the next version)
- cooccurrence_corpus (list) – list of lists containing cooccurring shortest paths (i.e. those starting from the same starting node)
Returns: None – Saves the induced shortest path patterns into a graph document for the graph specified in the gexf_fh
Return type: None
geometric2dr.decomposition.shortest_path_patterns.sp_corpus(corpus_dir)
Function which induces the shortest path patterns over the graphs inside a given directory.
The main use case for this script is for the user to supply a path to the directory containing the .gexf graph files of a dataset. The decomposition function will produce a .spp file for each .gexf file which contains the shortest path patterns of a graph given the vocabulary of patterns across the dataset.
Parameters: corpus_dir (str) – path to directory containing graph files of dataset in .gexf format Returns: - corpus (list of lists of str) – a list of lists, with each inner list containing all the shortest path patterns in one graph of the dataset
- vocabulary (list) – a set of the unique shortest path pattern ids
- prob_map (dict) – a map {gidx: {path: normalized_prob}} of normalized probabilities of a shortest path pattern appearing in a graph based on counts made in generation
- num_graphs (int) – the number of graphs in the dataset
- graph_map (dict) – a map {gidx: {path: count}} of the number of times a certain shortest path pattern appeared in a graph for each graph gidx in the dataset
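As with the other decomposition functions, a single call both writes the .spp graph documents and returns the corpus data structures; a sketch reusing the tutorial's dataset path:

from geometric2dr.decomposition.shortest_path_patterns import sp_corpus

corpus, vocabulary, prob_map, num_graphs, graph_map = sp_corpus("data/MUTAG")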
geometric2dr.decomposition.weisfeiler_lehman_patterns¶
Functions for inducing the Weisfeiler-Lehman graph decomposition algorithm via node relabeling, as described in Shervashidze et al. [3].
Based on the implementation available in the original source code of Graph2Vec [4], https://github.com/MLDroid/graph2vec_tf (which has no license), and adapted for Geo2DR.
geometric2dr.decomposition.weisfeiler_lehman_patterns.initial_relabel(graph, node_label_attr_name='Label')
The initial relabeling of the graphs in the dataset.
Taking the attributed label of the node as stated by the attribute in the gexf file, it gives new labels to these node types (without regard for neighbours); hence it really is just a relabeling of the initial labels into our own "0+<newlabel>" format (to be used with the WL relabeling scheme afterwards).
Parameters: - graph (networkx graph) – a networkx graph
- node_label_attr_name (str) – string literal of the attribute name used as a label in the gexf file. NB: "label" in the gexf refers to the node ID, whereas "Label" refers to the dataset's node label.
Returns: graph – the same nx graph but with a new “relabel” attribute with the 0th wlk-h entry label
Return type: networkx graph
geometric2dr.decomposition.weisfeiler_lehman_patterns.save_wl_doc(fname, max_h, graph=None)
Saves the induced rooted subgraph patterns.
Saves the subgraph sentences in the format <center> <context> <context> .... In other words, we save the relabelings of nodes from the WL algorithm into a text document which can be fed into our skipgram architecture.
Parameters: - fname (str) – path/filename of the graph
- max_h (int) – the highest WL iteration wlk_h specified (i.e. the depth of the rooted subgraphs)
- graph (networkx graph) – the nx graph of the filename
Returns: None – The rooted subgraph patterns are saved into a text file in the format <center> <context> <context> <context> ….
Return type: None
geometric2dr.decomposition.weisfeiler_lehman_patterns.wl_corpus(fnames, max_h, node_label_attr_name='Label')
Induce rooted subgraph patterns using the WL node relabeling algorithm, given a list of gexf files, and save the corresponding graph document files.
Given a set of graphs from the dataset, a maximum h for WL, and the label attribute name used in the gexf files, we initially relabel the original labels into a compliant relabeling (a Caesar shift for ease), then perform max_h iterations of the WL relabeling algorithm (1979) to create new labels which are compressed versions of the rooted subgraphs for each node in the graph. These are all present in the nx graph object's nodes as attributes, with the original label under 'Label' and our subsequent relabelings in the 'relabel' attribute.
The main use case is for the user to input a path containing the individual graphs of the dataset in gexf format. The decomposition algorithm will induce substructure patterns for the graphs, recording the dataset/"global" vocabulary of patterns within a dictionary. The graph and its associated patterns (by IDs given through our hash function) are saved into a <graphid>.wld<depth> file which contains a line-delimited list of all the substructure pattern ids.
Parameters: - fnames (list) – list of gexf file paths for the graphs in the dataset
- max_h (int) – the maximum depth of rooted subgraph pattern to induce across the dataset of graphs
- node_label_attr_name (str) – string literal of the attribute name used as a label in the gexf file. NB: "label" in the gexf refers to the node ID, whereas "Label" refers to the dataset's node label
Returns: - corpus (list of lists of str) – a list of lists, with each inner list containing all the rooted subgraph patterns in one graph of the dataset
- vocabulary (list) – a set of the unique rooted subgraph pattern ids
- prob_map (dict) – a map {gidx: {wl_pattern: normalized_prob}} of normalized probabilities of a rooted subgraph pattern appearing in a graph based on counts made in generation
- num_graphs (int) – the number of graphs in the dataset
- graph_map (dict) – a map {gidx: {wl_pattern: count}} of the number of times a certain rooted subgraph pattern appeared in a graph for each graph gidx in the dataset
- None (None) – The rooted subgraph patterns are also saved into a text file for each graph in ‘fnames’ in the format <center> <context> <context> <context> ….
geometric2dr.decomposition.weisfeiler_lehman_patterns.wl_relabel(graph, it)
Runs an iteration of the WL relabeling algorithm on the graph, using the global label_to_compressed_label_map.
Parameters: - graph (networkx graph) – a networkx graph from the dataset which we want to relabel (it must have been initially relabeled, i.e. have the graph.nodes[node]['relabel'] attribute)
- it (int) – an int signifying the iteration of the WL relabeling algorithm.
Returns: graph – the input nx graph with more labels in the “relabel” attribute
Return type: networkx graph
geometric2dr.decomposition.random_walk_patterns¶
Functions for inducing random walk patterns within the graphs as in Gartner et al.
Inspired by the implementation of anonymous walks by Sergey Ivanov and Evgeny Burnaev [1]
geometric2dr.decomposition.random_walk_patterns.create_random_walk_graph(nx_graph)
Generate a random walk graph equivalent of a graph, as described in Anonymous Walk Embeddings.
geometric2dr.decomposition.random_walk_patterns.load_graph(file_handle)
Loads a nx graph object and a corresponding numpy adjacency matrix given a GEXF file graph.
Parameters: file_handle (str) – path to gexf file Returns: - graph (networkx graph) – Networkx graph instance of input gexf file
- adj_matrix (numpy ndarray) – The adjacency matrix of the graph
geometric2dr.decomposition.random_walk_patterns.random_step_node(rw_graph, node)
Moves one step from the current node according to the probabilities of the outgoing edges. Returns the next node.
geometric2dr.decomposition.random_walk_patterns.random_walk_with_label_nodes(graph, rw_graph, node, steps)
Creates anonymous walk from a node for arbitrary steps with usage of node labels. Returns a tuple with consequent nodes.
geometric2dr.decomposition.random_walk_patterns.random_walks_graph(nx_graph, walk_length, neighborhood_size=0)
Perform a random walk with cooccurring "neighbouring" walks starting from the same node, for all the nodes in the graph.
geometric2dr.decomposition.random_walk_patterns.rw_corpus(corpus_dir, walk_length, neighborhood_size, saving_graph_docs=True)
Induces random walks up to a desired length across the input set of graphs.
Parameters: - corpus_dir (str) – path to directory containing graph files
- walk_length (int) – desired length of the random walk
- neighborhood_size (int) – number of cooccuring walks to find per walk source (node)
- saving_graph_docs (bool) – boolean on whether the graph documents should be generated or not
Returns: None – Induces the random walks of walk_length across the list of graphs files supplied in corpus_dir
Return type: None
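A sketch of a typical call, with illustrative hyperparameter values:

from geometric2dr.decomposition.random_walk_patterns import rw_corpus

# Induce random walks of length 10, with 25 cooccurring walks per source node
rw_corpus("data/MUTAG", walk_length=10, neighborhood_size=25)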
geometric2dr.decomposition.random_walk_patterns.save_rw_doc(gexf_fh, coocurrence_corpus, walk_length)
Saves a random walk graph document for a graph.
Parameters: - gexf_fh (str) – path to the gexf file of the graph
- cooccurrence_corpus (list) – list of lists containing cooccurring random walks (i.e. those starting from the same starting node)
- walk_length (int) – length of the walk
Returns: None – Saves the induced random walk patterns into a graph document for the graph specified in the gexf_fh
Return type: None
References¶
[1] Ivanov, S. and Burnaev, E. "Anonymous Walk Embeddings." Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2186-2195, 2018.
[2] Yanardag, P. and Vishwanathan, S. "Deep Graph Kernels." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pp. 1365-1374, 2015.
[3] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. "Weisfeiler-Lehman Graph Kernels." Journal of Machine Learning Research, 12:2539-2561, 2011.
[4] Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., and Jaiswal, S. "graph2vec: Learning Distributed Representations of Graphs." CoRR, abs/1707.05005, 2017.
[5] Mikolov, T., Chen, K., Corrado, G., and Dean, J. "Efficient Estimation of Word Representations in Vector Space." ICLR 2013 Workshop Track Proceedings, 2013.
[6] Le, Q. and Mikolov, T. "Distributed Representations of Sentences and Documents." Proceedings of the 31st International Conference on Machine Learning, ICML '14, 2014.
geometric2dr.embedding_methods¶
geometric2dr.embedding_methods.cbow¶
CBOW model with negative sampling as in Mikolov et al. [5].
It is used with the corpus classes in cbow_data_reader, which handle the data reading and loading. This allows construction of full CBOW based systems. It is one of the choices of neural language model for recreating DGK [2]-like systems.
class geometric2dr.embedding_methods.cbow.Cbow(num_targets, vocab_size, embedding_dimension)
Bases: torch.nn.modules.module.Module
PyTorch implementation of the CBOW architecture with negative sampling, as in Mikolov et al. [5].
This is used in DGK models, for example, to learn embeddings of substructures for downstream graph kernel definitions.
Parameters: - num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
- vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
- embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns: self – a torch.nn.Module of the CBOW model
Return type: Cbow
forward(pos_target, pos_contexts, pos_negatives)
Forward pass in network.
Parameters: - pos_target (torch.Long) – index of target embedding
- pos_contexts (torch.Long) – indices of context embeddings
- pos_negatives (torch.Long) – indices of negatives
Returns: the negative sampling loss
Return type: torch.float
geometric2dr.embedding_methods.cbow_data_reader¶
Data_reader module containing corpus construction utilities for CBOW models
class geometric2dr.embedding_methods.cbow_data_reader.CbowCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)
Bases: torch.utils.data.dataset.Dataset
Class which represents all of the graph documents in a graph dataset and serves contexts for CBOW models. This version keeps the entire corpus, with negatives, in memory, which requires a larger initial creation time but is much quicker at loading during training.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the CBOW with negative sampling model.
Return type: CbowCorpus
add_file(full_graph_path)
This method is used to add new graphs into the corpus, for inductive learning on new unseen graphs.
Parameters: full_graph_path (str) – path to the graph document to be made part of the new corpus Returns: The new graph and its substructure patterns are made part of the corpus Return type: None
getNegatives(target, size)
Given a target, find size negative samples by index.
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
scan_and_load_corpus()
Gets the list of graph file paths, gives them number ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph_ids.
scan_corpus(min_count)
Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs, which in turn get used by the skipgram architectures.
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary. Returns: (Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map Return type: dict
geometric2dr.embedding_methods.cbow_trainer¶
Module containing class definitions of trainers for CBOW models [5], which are partly used by Deep Graph Kernels [2].
class geometric2dr.embedding_methods.cbow_trainer.Trainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)
Handles corpus construction, CBOW initialization, and training.
Parameters: - corpus_dir (str) – path to directory containing graph files
- extension (str) – extension used in graph documents produced after the decomposition stage
- max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
- window_size (int) – the number of cooccurring context subgraph patterns to use
- output_fh (str) – the path to the file where embeddings should be saved
- emb_dimension (int (default=128)) – the desired dimension of the embeddings
- batch_size (int (default=32)) – the desired batch size
- epochs (int (default=100)) – the desired number of epochs for which the network should be trained
- initial_lr (float (default=1e-3)) – the initial learning rate
- min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A Trainer instance Return type: Trainer
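A sketch of how this trainer might be used, mirroring the PV-DBOW trainer from the tutorial above; the hyperparameter values are illustrative, and we assume the same train() entry point as the other trainers:

from geometric2dr.embedding_methods.cbow_trainer import Trainer

trainer = Trainer(corpus_dir="data/MUTAG", extension=".wld3", max_files=0,
                  window_size=5, output_fh="cbow_embeddings.json",
                  emb_dimension=32, batch_size=256, epochs=100,
                  initial_lr=0.001, min_count=1)
trainer.train()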
geometric2dr.embedding_methods.classify¶
Module containing various functions for classification (on top of the learned embeddings), mainly useful as convenience functions for common benchmark classification methods.
geometric2dr.embedding_methods.classify.cross_val_accuracy(corpus_dir, extension, embedding_fname, class_labels_fname, cv=10, mode=None)
Performs 10-fold (default) cross validation and returns the mean accuracy and associated standard deviation.
Parameters: - corpus_dir (str) – folder containing graphdoc files
- extension (str) – extension of the graphdoc files
- embedding_fname (str) – file containing embeddings
- class_labels_fname (str) – files containing labels of each graph
- cv (int) – integer stating number of folds and therefore experiments to carry out
Returns: tuple – tuple containing the mean accuracy and standard deviation of performing cv-fold cross validation. This gives a better picture of the expected performance in a Monte Carlo fashion, instead of presenting just the best performance.
Return type: (acc, std)
geometric2dr.embedding_methods.classify.cross_val_accuracy_rbf_bag_of_words(P, y_ids, cv=10)
Performs cv Monte Carlo repetitions of 10-fold cross validation on the given dataset matrix; returns the overall mean accuracy and associated standard deviation. The terminology and method name will be updated in a future version to address the overloaded term and the generalizability of the function.
Parameters: - P (numpy ndarray) – an obs x num_features matrix representing the dataset
- y_ids (numpy ndarray) – numpy 1 x obs array of class labels for the rows of P
- cv (int (default=10)) – (overloaded term) the number of Monte Carlo repetitions of the SVM evaluation over 10-fold CV
Returns: tuple – tuple containing the mean accuracy and standard deviation of performing 10-fold cross validation cv times. This gives a better picture of the expected performance in a Monte Carlo fashion, instead of presenting just the best performance.
Return type: (acc, std)
geometric2dr.embedding_methods.classify.linear_svm_classify(X_train, X_test, Y_train, Y_test)
Utility function for quickly performing a scikit-learn GridSearchCV over a linear SVM with 10-fold cross validation, given the train/test splits.
Parameters: - X_train (numpy ndarray) – training feature vectors
- X_test (numpy ndarray) – testing feature vectors
- Y_train (numpy ndarray) – training set labels
- Y_test (numpy ndarray) – test set labels
Returns: tuple with accuracy, precision, recall, fbeta_score as applicable
Return type: tuple
geometric2dr.embedding_methods.classify.perform_classification(corpus_dir, extension, embedding_fname, class_labels_fname)
Performs classification over the graph files of the dataset, given that they have corresponding embeddings in the saved embedding file and class labels.
Parameters: - corpus_dir (str) – folder containing graphdoc files
- extension (str) – extension of the graphdoc files
- embedding_fname (str) – file containing embeddings
- class_labels_fname (str) – files containing labels of each graph
Returns: tuple with accuracy, precision, recall, fbeta_score as applicable
Return type: tuple
geometric2dr.embedding_methods.classify.rbf_svm_classify(X_train, X_test, Y_train, Y_test)
Utility function for quickly performing a scikit-learn GridSearchCV over an RBF kernel SVM with 10-fold cross validation, given the train/test splits.
Parameters: - X_train (numpy ndarray) – training feature vectors
- X_test (numpy ndarray) – testing feature vectors
- Y_train (numpy ndarray) – training set labels
- Y_test (numpy ndarray) – test set labels
Returns: tuple with accuracy, precision, recall, fbeta_score as applicable
Return type: tuple
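For instance, given a feature matrix X (e.g. the final_graph_embeddings from the tutorial) and a corresponding label array y (assumed here), the split-based utilities can be used as follows:

from sklearn.model_selection import train_test_split
from geometric2dr.embedding_methods.classify import linear_svm_classify

# Hold out 10% of the graphs for testing and grid-search a linear SVM
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.1)
acc, precision, recall, fbeta = linear_svm_classify(X_train, X_test, Y_train, Y_test)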
geometric2dr.embedding_methods.pvdbow_data_reader¶
Data_reader module containing corpus construction utilities for PVDBOW (skipgram) models.
This module describes the classes which handle graph corpora and datasets, which can be loaded into PyTorch dataloaders.
class geometric2dr.embedding_methods.pvdbow_data_reader.PVDBOWCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0)
Bases: torch.utils.data.dataset.Dataset
Class which represents the target-context dataset created over the graph documents for PVDBOW models. In this version the __getitem__ function loads individual target-context pairs from the hard drive. As a result, it is quick to set up and memory efficient, but may perform more slowly during training.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn graph-level embeddings.
Return type: PVDBOWCorpus
add_file(full_graph_path)
This method is used to add new graphs into the corpus, for inductive learning on new unseen graphs.
Parameters: full_graph_path (str) – path to the graph document to be made part of the new corpus Returns: The new graph and its substructure patterns are made part of the corpus Return type: None
getNegatives(target, size)
Given a target, find size negative samples by index.
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
scan_and_load_corpus()
Gets the list of graph file paths, gives them number ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph_ids for batching.
scan_corpus(min_count)
Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs, which in turn get used by the skipgram architectures.
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary. Returns: (Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map Return type: dict
class geometric2dr.embedding_methods.pvdbow_data_reader.PVDBOWInMemoryCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0)
Bases: torch.utils.data.dataset.Dataset
Class which represents the target-context dataset created over the graph documents for PVDBOW models. This version keeps the entire corpus, with negatives, in memory, which requires a larger initial creation time but has a much quicker __getitem__ computation.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn graph-level embeddings.
Return type: PVDBOWInMemoryCorpus
add_file(full_graph_path)
This method is used to add new graphs into the corpus, for inductive learning on new unseen graphs.
Parameters: full_graph_path (str) – path to the graph document to be made part of the new corpus Returns: The new graph and its substructure patterns are made part of the corpus Return type: None
getNegatives(target, size)
Given a target, find size negative samples by index.
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
scan_and_load_corpus()
Gets the list of graph file paths, gives them number ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph_ids.
scan_corpus(min_count)
Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs, which in turn get used by the skipgram architectures.
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary. Returns: (Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map Return type: dict
geometric2dr.embedding_methods.pvdbow_trainer¶
Module containing class definitions of trainers for PVDBOW models [6], which are partly used by Deep Graph Kernels [2]
Author: Paul Scherer
class geometric2dr.embedding_methods.pvdbow_trainer.InMemoryTrainer(corpus_dir, extension, max_files, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)
Handles corpus construction (in-memory version), PVDBOW initialization, and training.
Parameters: - corpus_dir (str) – path to directory containing graph files
- extension (str) – extension used in graph documents produced after the decomposition stage
- max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
- output_fh (str) – the path to the file where embeddings should be saved
- emb_dimension (int (default=128)) – the desired dimension of the embeddings
- batch_size (int (default=32)) – the desired batch size
- epochs (int (default=100)) – the desired number of epochs for which the network should be trained
- initial_lr (float (default=1e-3)) – the initial learning rate
- min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A trainer instance which has the dataset stored in memory for fast access Return type: InMemoryTrainer
class geometric2dr.embedding_methods.pvdbow_trainer.Trainer(corpus_dir, extension, max_files, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)
Handles corpus construction (hard drive version), PVDBOW (skipgram) initialization, and training.
Parameters: - corpus_dir (str) – path to directory containing graph files
- extension (str) – extension used in graph documents produced after the decomposition stage
- max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
- output_fh (str) – the path to the file where embeddings should be saved
- emb_dimension (int (default=128)) – the desired dimension of the embeddings
- batch_size (int (default=32)) – the desired batch size
- epochs (int (default=100)) – the desired number of epochs for which the network should be trained
- initial_lr (float (default=1e-3)) – the initial learning rate
- min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A Trainer instance Return type: Trainer
geometric2dr.embedding_methods.pvdm¶
PVDM model, originally introduced in the doc2vec paper by Le and Mikolov (2014) [6]. Used by the AWE-DD model of Anonymous Walk Embeddings by Ivanov and Burnaev (2018) [1].
It is used with the corpus classes in pvdm_data_reader, which handle the data reading and loading. This allows construction of full PVDM based systems. It is one of the choices of neural language model for recreating AWE [1]-like systems.
class geometric2dr.embedding_methods.pvdm.PVDM(num_targets, vocab_size, embedding_dimension)
Bases: torch.nn.modules.module.Module
PyTorch implementation of the PVDM, as in Le and Mikolov [6].
Parameters: - num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
- vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
- embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns: self – a torch.nn.Module of the PVDM model
Return type: PVDM
forward(pos_graph_emb, pos_context_target, pos_contexts, pos_negatives)[source]¶ Forward pass through the network
Parameters: - pos_graph_emb (torch.Long) – index of target graph embedding
- pos_context_target (torch.Long) – index of target subgraph pattern embedding
- pos_contexts (torch.Long) – indices of context subgraph patterns around the target subgraph embedding
- pos_negatives (torch.Long) – indices of negatives
Returns: the negative sampling loss
Return type: torch.float
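To make the roles of the four index tensors concrete, the following is a minimal, self-contained sketch of a PVDM-style module with negative sampling. It illustrates the general technique rather than reproducing the library's implementation; in particular, averaging the graph vector with the mean context vector is one common design choice and is assumed here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PVDMSketch(nn.Module):
    # Illustrative PVDM with negative sampling (not the library's exact code)
    def __init__(self, num_targets, vocab_size, embedding_dimension):
        super().__init__()
        self.graph_embeddings = nn.Embedding(num_targets, embedding_dimension)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dimension)
        self.target_embeddings = nn.Embedding(vocab_size, embedding_dimension)

    def forward(self, pos_graph_emb, pos_context_target, pos_contexts, pos_negatives):
        g = self.graph_embeddings(pos_graph_emb)                  # (B, D) graph vectors
        ctx = self.context_embeddings(pos_contexts).mean(dim=1)   # (B, D) mean of context patterns
        h = (g + ctx) / 2                                         # combined predictor vector

        t = self.target_embeddings(pos_context_target)            # (B, D) true target patterns
        pos_loss = F.logsigmoid(torch.sum(h * t, dim=1))          # positive term

        negs = self.target_embeddings(pos_negatives)              # (B, K, D) noise patterns
        neg_score = torch.bmm(negs, h.unsqueeze(2)).squeeze(2)    # (B, K)
        neg_loss = F.logsigmoid(-neg_score).sum(dim=1)            # negative term

        return -(pos_loss + neg_loss).mean()                      # negative sampling loss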
geometric2dr.embedding_methods.pvdm_data_reader¶
Data_reader module containing corpus construction utilities for PVDM models
class geometric2dr.embedding_methods.pvdm_data_reader.PVDMCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]¶
Bases: torch.utils.data.dataset.Dataset
Class which represents all of the graph documents in a graph dataset and serves context for PVDM models. This version keeps the entire corpus with negatives in memory, which requires a longer initial creation time but is much quicker at loading during training.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the PVDM with negative sampling model.
Return type: PVDMCorpus
add_file(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning on new, unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be made part of the new corpus
Returns: The new graph and its substructure patterns are made part of the corpus
Return type: None
getNegatives(target, size)[source]¶ Given a target pattern, finds size negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
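Negative samples are conventionally drawn from a smoothed unigram distribution over the pattern vocabulary, as popularised by word2vec. The sketch below shows that general technique with the usual 3/4 power smoothing; the function names and the resample-on-collision behaviour are illustrative assumptions, not a transcription of the library's logic.

import numpy as np

def build_noise_distribution(pattern_counts, power=0.75):
    # word2vec-style noise distribution: unigram counts raised to the 3/4 power
    probs = np.asarray(pattern_counts, dtype=np.float64) ** power
    return probs / probs.sum()

def sample_negatives(noise_probs, target, size, rng=None):
    # Draw `size` pattern ids, redrawing any that collide with the target id
    rng = np.random.default_rng() if rng is None else rng
    negatives = rng.choice(len(noise_probs), size=size, p=noise_probs)
    while target in negatives:
        negatives[negatives == target] = rng.choice(len(noise_probs), p=noise_probs)
    return negatives.tolist()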
scan_and_load_corpus()[source]¶ Gets the list of graph file paths, gives them numeric ids in a map, and calls scan_corpus; also makes available a list of shuffled graph_ids
scan_corpus(min_count)[source]¶ Maps the graph files to a subgraph alphabet, from which new_ids are created for the subgraphs; these are in turn used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type: dict
geometric2dr.embedding_methods.pvdm_trainer¶
A trainer class which facilitates training of the embedding methods with the set hyperparameters.
Author: Paul Scherer 2020
class geometric2dr.embedding_methods.pvdm_trainer.PVDM_Trainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶
Handles corpus construction, PVDM initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- window_size : int
- the number of co-occurring context subgraph patterns to use
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A PVDM_Trainer instance
Return type: PVDM_Trainer
geometric2dr.embedding_methods.skipgram¶
General skipgram model with negative sampling, originally introduced in the word2vec paper by Mikolov et al. [5]. Used by DGK [2] and Graph2Vec [4] to learn substructure and graph-level embeddings.
It is used by the SkipgramCorpus and PVDBOWCorpus classes to build complete Skipgram and PVDBOW systems respectively. SkipgramCorpus and PVDBOWCorpus are found in the skipgram_data_reader and pvdbow_data_reader modules respectively.
Author: Paul Scherer
class geometric2dr.embedding_methods.skipgram.Skipgram(num_targets, vocab_size, embedding_dimension)[source]¶
Bases: torch.nn.modules.module.Module
PyTorch implementation of the skipgram with negative sampling as in Mikolov et al. [5]
Based on the inputs it can be used as the skipgram described in the original word2vec paper [5], or as doc2vec (PV-DBOW) in Le and Mikolov [6].
Parameters: - num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
- vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
- embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns: self – a torch.nn.Module of the Skipgram model
Return type: Skipgram
forward(pos_target, pos_context, neg_context)[source]¶ Forward pass through the network
Parameters: - pos_target (torch.Long) – index of target embedding
- pos_context (torch.Long) – index of context embedding
- neg_context (torch.Long) – indices of negative samples
Returns: the negative sampling loss
Return type: torch.float
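Because forward returns the scalar negative sampling loss, a training step reduces to calling it on batches of index tensors and backpropagating. A hedged sketch of one such step follows; the batch here is randomly generated for illustration (in practice the corpora in skipgram_data_reader supply real index batches), the tensor shapes are assumptions, and if the embedding tables are sparse a sparse-aware optimiser such as torch.optim.SparseAdam would be needed instead.

import torch
from geometric2dr.embedding_methods.skipgram import Skipgram

model = Skipgram(num_targets=5000, vocab_size=5000, embedding_dimension=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical batch: 32 (target, context) pairs with 10 negatives each
pos_target = torch.randint(0, 5000, (32,))
pos_context = torch.randint(0, 5000, (32,))
neg_context = torch.randint(0, 5000, (32, 10))

optimizer.zero_grad()
loss = model(pos_target, pos_context, neg_context)  # negative sampling loss
loss.backward()
optimizer.step()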
geometric2dr.embedding_methods.skipgram_data_reader¶
Data_reader module containing corpus construction utilities for skipgram-based models (i.e. PVDBOW as well).
class geometric2dr.embedding_methods.skipgram_data_reader.InMemorySkipgramCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]¶
Bases: torch.utils.data.dataset.Dataset
Corpus which feeds positions of subgraphs, contextualised by “co-occurring” patterns as defined by the different decomposition algorithms. Designed to support negative sampling. This version keeps the entire corpus with negatives in memory, which requires a longer initial creation time but has a much quicker __getitem__ computation.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn substructure pattern embeddings.
Return type: InMemorySkipgramCorpus
add_file(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning on new, unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be made part of the new corpus
Returns: The new graph and its substructure patterns are made part of the corpus
Return type: None
getNegatives(target, size)[source]¶ Given a target pattern, finds size negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
scan_and_load_corpus()[source]¶ Gets the list of graph file paths, gives them numeric ids in a map, and calls scan_corpus; also makes available a list of shuffled graph_ids
scan_corpus(min_count)[source]¶ Maps the graph files to a subgraph alphabet, from which new_ids are created for the subgraphs; these are in turn used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type: dict
class geometric2dr.embedding_methods.skipgram_data_reader.SkipgramCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]¶
Bases: torch.utils.data.dataset.Dataset
Corpus which feeds positions of subgraphs, contextualised by “co-occurring” patterns as defined by the different decomposition algorithms. Designed to support negative sampling. In this version the __getitem__ function loads individual target-context pairs from the hard drive. As a result, it is quick to set up and memory efficient, but may be slower during training.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn substructure pattern embeddings.
Return type: SkipgramCorpus
add_file(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning on new, unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be made part of the new corpus
Returns: The new graph and its substructure patterns are made part of the corpus
Return type: None
getNegatives(target, size)[source]¶ Given a target pattern, finds size negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
scan_and_load_corpus()[source]¶ Gets the list of graph file paths, gives them numeric ids in a map, and calls scan_corpus; also makes available a list of shuffled graph_ids for batching
scan_corpus(min_count)[source]¶ Maps the graph files to a subgraph alphabet, from which new_ids are created for the subgraphs; these are in turn used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type: dict
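Since both corpus classes subclass torch.utils.data.Dataset, they compose directly with a standard DataLoader. A minimal sketch, assuming a hypothetical corpus directory and that each item yields target, context, and negative indices (the exact item structure and any custom collate function in the library may differ):

from torch.utils.data import DataLoader
from geometric2dr.embedding_methods.skipgram_data_reader import SkipgramCorpus

corpus = SkipgramCorpus(corpus_dir="data/MUTAG_wld2/", extension=".wld2",
                        max_files=0, min_count=1, window_size=1)
loader = DataLoader(corpus, batch_size=32, shuffle=True)

for batch in loader:
    # batch bundles target, context, and negative-sample indices;
    # the precise unpacking depends on the corpus __getitem__ (assumption)
    pass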
geometric2dr.embedding_methods.skipgram_trainer¶
Module containing class definitions of trainers for skipgram models, which are partly used by Deep Graph Kernels [2]
Author: Paul Scherer
class geometric2dr.embedding_methods.skipgram_trainer.InMemoryTrainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶
Handles corpus construction (in-memory version), skipgram initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- window_size : int
- the number of co-occurring context subgraph patterns to use
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A trainer instance which has the dataset stored in memory for fast access
Return type: InMemoryTrainer
class geometric2dr.embedding_methods.skipgram_trainer.Trainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶
Handles corpus construction (hard drive version), skipgram initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- window_size : int
- the number of co-occurring context subgraph patterns to use
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A Trainer instance
Return type: Trainer
geometric2dr.embedding_methods.utils¶
General purpose utilities for I/O
Currently includes:
- functions for getting all files in a directory with a given extension
- saving graph embeddings into a JSON format
- generating a dictionary matching graph files with classification labels
geometric2dr.embedding_methods.utils.get_class_labels(graph_files, class_labels_fname)[source]¶ Given the list of graph files (as in get_files) and the path of the associated class labels file, returns the list of labels associated with each graph file in graph_files
Parameters: - graph_files (list) – list of paths to graph_files
- class_labels_fname (str) – path to class labels file (.Labels typically) with file names in graph_files
Returns: labels – list of class labels corresponding to the graph files in graph_files
Return type: list
geometric2dr.embedding_methods.utils.get_class_labels_tuples(graph_files, class_labels_fname)[source]¶ Returns a list of tuples associating each of the graph files with its classification label
Parameters: - graph_files (list) – list of paths to graph_files
- class_labels_fname (str) – path to class labels file (.Labels typically) with file names in graph_files
Returns: labels – list of tuples (base_name_of_graph_file, class_label)
Return type: list
geometric2dr.embedding_methods.utils.get_files(dname, extension, max_files=0)[source]¶ Returns a sorted list of strings naming all the files with the given extension
Parameters: - dname (str) – directory with files
- extension (str) – string denoting which extension should be matched in search for files
- max_files (int (default=0)) – the maximum number of files to get, the default of 0 means all files
Returns: all_files – list of all files matching extension inside the directory dname
Return type: list
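A short sketch of how get_files and get_class_labels compose; the directory layout, extension, and label file name are hypothetical:

from geometric2dr.embedding_methods.utils import get_files, get_class_labels

graph_files = get_files("data/MUTAG/", extension=".gexf", max_files=0)
labels = get_class_labels(graph_files, "data/MUTAG/MUTAG.Labels")
# graph_files[i] is now paired with labels[i] for downstream classification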
geometric2dr.embedding_methods.utils.get_kernel_matrix_row_idx_with_class(corpus, extension, graph_files, class_labels_fname)[source]¶ Returns two arrays: the first is a list of integers, each referencing a row in a kernel matrix and thereby a kernel vector corresponding to one of the graphs in the dataset; the second is a list of class labels whose values are the classifications of the graphs at the same indices in the first
Parameters: - corpus (corpus) – a corpus instance (such as SkipgramCorpus)
- extension (str) – extension of graph document under study
- graph_files (list) – list of paths to graph file
- class_labels_fname (str) – path to graph class label file
Returns: kernel_row_x_id, kernel_row_y_id – the first is a list of integers, each referencing a row in a kernel matrix and thereby a kernel vector corresponding to one of the graphs in the dataset; the second is a list of class labels whose values are the classifications of the graphs at the same indices in the first
Return type: tuple
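These indices are intended for kernel-method pipelines in the style of Deep Graph Kernels [2], where a kernel matrix over graphs is built from learned embeddings and fed to a precomputed-kernel SVM. A hedged sketch of that downstream use, with stand-in embeddings, indices, and labels in place of trained ones:

import numpy as np
from sklearn.svm import SVC

final_embeddings = np.random.rand(188, 128)   # stand-in for trained graph embeddings
K = final_embeddings @ final_embeddings.T     # linear kernel between graphs

kernel_row_x_id = np.arange(188)              # stand-in row indices (assumed form)
y = np.random.randint(0, 2, 188)              # stand-in class labels

clf = SVC(kernel="precomputed")
clf.fit(K[np.ix_(kernel_row_x_id, kernel_row_x_id)], y)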
geometric2dr.embedding_methods.utils.save_graph_embeddings(corpus, final_embeddings, opfname)[source]¶ Saves the trained embeddings of a corpus into a dictionary and saves this into a json file on the path given by opfname
Parameters: - corpus (corpus) – any corpus class such as PVDBOWCorpus
- final_embeddings (numpy ndarray) – matrix of target embeddings to be saved
- opfname (str) – path to file where embeddings should be saved in json format (extension optional in Unix)
Returns: embeddings will be saved into path denoted by opfname
Return type: None
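Because the embeddings are serialised as JSON, they can be reloaded later without the library. A minimal sketch, assuming the saved file maps graph names to embedding lists (the exact key scheme is an assumption):

import json
import numpy as np

with open("mutag_embeddings.json") as fh:
    embedding_map = json.load(fh)  # assumed structure: {graph_name: [floats], ...}

names = sorted(embedding_map)
X = np.array([embedding_map[n] for n in names])  # (num_graphs, dim) matrix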
geometric2dr.embedding_methods.utils.save_subgraph_embeddings(corpus, final_embeddings, opfname)[source]¶ Save the embeddings along with a map to the patterns and the corpus
Parameters: - corpus (corpus) – a corpus class such as SkipgramCorpus
- final_embeddings (numpy ndarray) – matrix of target embeddings to be saved
- opfname (str) – path to file where embeddings should be saved in json format
Returns: embeddings will be saved into path denoted by opfname
Return type: None
References¶
[1] Sergey Ivanov, Evgeny Burnaev. “Anonymous Walk Embeddings”. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2186-2195, 2018.
[2] Yanardag, Pinar & Vishwanathan, S.V.N. “Deep Graph Kernels”. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[3] Shervashidze, Nino & Schweitzer, Pascal & van Leeuwen, Erik Jan & Mehlhorn, Kurt & Borgwardt, Karsten. “Weisfeiler-Lehman Graph Kernels”. Journal of Machine Learning Research, 1-48, 2010.
[4] Narayanan, Annamalai & Mahinthan, Chandramohan & Venkatesan, Rajasekar & Chen, Lihui & Liu, Yang & Jaiswal, Shantanu. “graph2vec: Learning Distributed Representations of Graphs”, 2017.
[5] Mikolov, Tomas & Corrado, G.S. & Chen, Kai & Dean, Jeffrey. “Efficient Estimation of Word Representations in Vector Space”, 1-12, 2013.
[6] Quoc Le and Tomas Mikolov. “Distributed Representations of Sentences and Documents”. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, 2014.
If the code or specific examples of existing systems are useful to your work, please cite the original works and also consider citing Geo2DR.
@misc{geometric2dr,
title={Learning distributed representations of graphs with Geo2DR},
author={Paul Scherer and Pietro Lio},
year={2020},
eprint={2003.05926},
archivePrefix={arXiv},
primaryClass={cs.LG}
}