geometric2dr.embedding_methods

geometric2dr.embedding_methods.cbow

CBOW model with negative sampling as in Mikolov et al. [5].

It is used with the corpus classes in cbow_data_reader, which handle the data reading and loading. This allows construction of full CBOW-based systems. It is one of the choices of neural language model for recreating DGK [2]-style systems.

class geometric2dr.embedding_methods.cbow.Cbow(num_targets, vocab_size, embedding_dimension)[source]

Bases: torch.nn.modules.module.Module

Pytorch implementation of the CBOW architecture with negative sampling as in Mikolov et al. [5]

This is used, for example, in DGK models to learn embeddings of substructures for downstream graph kernel definitions.

Parameters:
  • num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
  • vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
  • embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns:

self – a torch.nn.Module of the CBOW model

Return type:

Cbow

forward(pos_target, pos_contexts, pos_negatives)[source]

Forward pass in network

Parameters:
  • pos_target (torch.Long) – index of target embedding
  • pos_contexts (torch.Long) – indices of context embeddings
  • pos_negatives (torch.Long) – indices of negatives
Returns:

the negative sampling loss

Return type:

torch.float

give_target_embeddings()[source]

Return the target embeddings as a numpy matrix

Returns:Numpy vocab_size x emb_dimension matrix of substructure pattern embeddings
Return type:numpy ndarray
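
A minimal usage sketch (the batch size, window size, number of negatives, and tensor shapes below are illustrative assumptions, and we assume forward returns a scalar loss):

    import torch
    from geometric2dr.embedding_methods.cbow import Cbow

    # Illustrative sizes: 100 substructure patterns, 32-dimensional embeddings
    model = Cbow(num_targets=100, vocab_size=100, embedding_dimension=32)

    batch_size, window_size, num_negatives = 4, 2, 10  # assumed shapes
    pos_target = torch.randint(0, 100, (batch_size,))                    # target pattern ids
    pos_contexts = torch.randint(0, 100, (batch_size, 2 * window_size))  # context ids
    pos_negatives = torch.randint(0, 100, (batch_size, num_negatives))   # negative sample ids

    loss = model(pos_target, pos_contexts, pos_negatives)  # negative sampling loss
    loss.backward()

    embeddings = model.give_target_embeddings()  # vocab_size x emb_dimension numpy matrix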

geometric2dr.embedding_methods.cbow_data_reader

Data_reader module containing corpus construction utilities for CBOW models

class geometric2dr.embedding_methods.cbow_data_reader.CbowCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]

Bases: torch.utils.data.dataset.Dataset

Class which represents all of the graph documents in a graph dataset and serves context for CBOW models. This version keeps the entire corpus with negatives in memory, which requires a larger initial creation time but is much quicker at loading during training.

Parameters:
  • corpus_dir (str) – path to folder with graph document files created in decomposition stage
  • extension (str) – extension of the graph document files from which the corpus should be built
  • max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
  • min_count (int (default=0)) – the minimum number of times a substructure pattern should occur across the corpus to be included in the vocabulary
  • window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns:

self – A corpus dataset that can be used with the CBOW with negative sampling model.

Return type:

CbowCorpus

add_file(full_graph_path)[source]

Adds a new graph to the corpus, enabling inductive learning on previously unseen graphs

Parameters:full_graph_path (str) – path to the graph document to be made part of the corpus
Returns:The new graph and its substructure patterns are made part of the corpus
Return type:None

getNegatives(target, size)[source]

Given a target pattern, finds size negative samples by index

Parameters:
  • target (int) – internal int id of the subgraph pattern
  • size (int) – number of negative samples to find
Returns:

response – list of negative samples by internal int id

Return type:

[int]

pre_load_corpus()[source]

Constructs and loads an entire context-pair dataset into memory

scan_and_load_corpus()[source]

Gets the list of graph file paths, assigns them numeric ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph ids

scan_corpus(min_count)[source]

Maps the graph files to a subgraph alphabet, creating new internal ids for the subgraph patterns, which are in turn used by the skipgram architectures

Parameters:min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns:(Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type:dict
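
A sketch of building a corpus and iterating over it with a standard PyTorch DataLoader (the directory is a hypothetical output folder of the decomposition stage; depending on how __getitem__ batches targets, contexts, and negatives, a custom collate_fn may be required):

    from torch.utils.data import DataLoader
    from geometric2dr.embedding_methods.cbow_data_reader import CbowCorpus

    # "graph_docs/" is a placeholder path to the graph documents
    corpus = CbowCorpus(corpus_dir="graph_docs/", extension=".wld2",
                        max_files=0, min_count=1, window_size=2)
    loader = DataLoader(corpus, batch_size=32, shuffle=True)
    for batch in loader:
        ...  # feed targets, contexts, and negatives to a Cbow model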

geometric2dr.embedding_methods.cbow_trainer

Module containing class definitions of trainers for CBOW models [5], which are partly used by Deep Graph Kernels [2]

class geometric2dr.embedding_methods.cbow_trainer.Trainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]

Handles corpus construction, CBOW initialization and training.

Parameters:
  • corpus_dir (str) – path to directory containing graph files
  • extension (str) – extension used in graph documents produced after decomposition stage
  • max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
  • window_size (int) – the number of co-occurring context subgraph patterns to use
  • output_fh (str) – the path to the file where embeddings should be saved
  • emb_dimension (int (default=128)) – the desired dimension of the embeddings
  • batch_size (int (default=32)) – the desired batch size
  • epochs (int (default=100)) – the desired number of epochs for which the network should be trained
  • initial_lr (float (default=1e-3)) – the initial learning rate
  • min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns:self – A Trainer instance
Return type:Trainer
train()[source]

Train the network with the settings used to initialise the Trainer
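
A typical end-to-end sketch under assumed paths (the corpus directory, extension, and output file name are placeholders):

    from geometric2dr.embedding_methods.cbow_trainer import Trainer

    trainer = Trainer(corpus_dir="graph_docs/", extension=".wld2", max_files=0,
                      window_size=2, output_fh="cbow_embeddings.json",
                      emb_dimension=128, batch_size=32, epochs=100,
                      initial_lr=1e-3, min_count=1)
    trainer.train()  # trained substructure embeddings are written to output_fh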

geometric2dr.embedding_methods.classify

Module containing various functions for classification on top of the learned embeddings, mainly providing convenience functions for common benchmark classification methods

geometric2dr.embedding_methods.classify.cross_val_accuracy(corpus_dir, extension, embedding_fname, class_labels_fname, cv=10, mode=None)[source]

Performs cv-fold (default 10) cross validation and returns the mean accuracy and associated standard deviation

Parameters:
  • corpus_dir (str) – folder containing graphdoc files
  • extension (str) – extension of the graphdoc files
  • embedding_fname (str) – file containing embeddings
  • class_labels_fname (str) – files containing labels of each graph
  • cv (int) – integer stating number of folds and therefore experiments to carry out
Returns:

tuple – tuple containing the mean accuracy and standard deviation of performing 10-fold cross validation 10 times. This gives a better picture of the typical expected performance in a Monte Carlo fashion instead of presenting just the best performance.

Return type:

(acc, std)
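
As a usage sketch (all file names are hypothetical placeholders; the labels file follows the .Labels convention mentioned in embedding_methods.utils):

    from geometric2dr.embedding_methods.classify import cross_val_accuracy

    acc, std = cross_val_accuracy(corpus_dir="graph_docs/", extension=".wld2",
                                  embedding_fname="cbow_embeddings.json",
                                  class_labels_fname="dataset.Labels", cv=10)
    print("Accuracy: %.4f +/- %.4f" % (acc, std))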

geometric2dr.embedding_methods.classify.cross_val_accuracy_rbf_bag_of_words(P, y_ids, cv=10)[source]

Runs cv (default 10) Monte Carlo repetitions of 10-fold cross validation on the given dataset matrix and returns the overall mean accuracy and associated standard deviation. Terminology and method name will be updated in a future version to address the overloaded term and the generalizability of the function.

Parameters:
  • P (numpy ndarray) – an obs x num_features matrix representing the dataset
  • y_ids (numpy ndarray) – numpy 1 x obs array of class labels for the rows of P
  • cv (int (default=10)) – number of Monte Carlo restarts of the SVM evaluation over 10-fold CV (an overloaded use of the term)
Returns:

tuple – tuple containing the mean accuracy and standard deviation of performing 10-fold cross validation cv times. This gives a better picture of the typical expected performance in a Monte Carlo fashion instead of presenting just the best performance.

Return type:

(acc, std)

geometric2dr.embedding_methods.classify.linear_svm_classify(X_train, X_test, Y_train, Y_test)[source]

Utility function for quickly performing a Scikit-learn GridSearchCV over a linear SVM with 10-fold cross validation, given the train/test splits

Parameters:
  • X_train (numpy ndarray) – training feature vectors
  • X_test (numpy ndarray) – testing feature vectors
  • Y_train (numpy ndarray) – training set labels
  • Y_test (numpy ndarray) – test set labels
Returns:

tuple with accuracy, precision, recall, fbeta_score as applicable

Return type:

tuple

geometric2dr.embedding_methods.classify.perform_classification(corpus_dir, extension, embedding_fname, class_labels_fname)[source]

Perform classification over the graph files of the dataset, given that they have corresponding embeddings in the saved embedding file and class labels

Parameters:
  • corpus_dir (str) – folder containing graphdoc files
  • extension (str) – extension of the graphdoc files
  • embedding_fname (str) – file containing embeddings
  • class_labels_fname (str) – files containing labels of each graph
Returns:

tuple with accuracy, precision, recall, fbeta_score as applicable

Return type:

tuple

geometric2dr.embedding_methods.classify.rbf_svm_classify(X_train, X_test, Y_train, Y_test)[source]

Utility function for quickly performing a Scikit-learn GridSearchCV over an RBF-kernel SVM with 10-fold cross validation, given the train/test splits

Parameters:
  • X_train (numpy ndarray) – training feature vectors
  • X_test (numpy ndarray) – testing feature vectors
  • Y_train (numpy ndarray) – training set labels
  • Y_test (numpy ndarray) – test set labels
Returns:

tuple with accuracy, precision, recall, fbeta_score as applicable

Return type:

tuple
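
A quick sketch on synthetic data (the features and labels are randomly generated stand-ins, and we assume the returned tuple unpacks in the documented order):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from geometric2dr.embedding_methods.classify import rbf_svm_classify

    # Synthetic stand-ins for graph-level embeddings and binary labels
    X = np.random.rand(200, 128)
    y = np.random.randint(0, 2, 200)
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.1)
    acc, precision, recall, fbeta = rbf_svm_classify(X_train, X_test, Y_train, Y_test)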

geometric2dr.embedding_methods.pvdbow_data_reader

Data_reader module containing corpus construction utilities for PVDBOW (skipgram) models.

This module describes the classes which handle graph corpora and datasets that can be loaded into PyTorch dataloaders.

class geometric2dr.embedding_methods.pvdbow_data_reader.PVDBOWCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0)[source]

Bases: torch.utils.data.dataset.Dataset

Class which represents the target-context dataset created over the graph documents for PVDBOW models. In this version the __getitem__ function loads individual target-context pairs from the hard drive. As a result, it is quick to set up and memory efficient, but may be slower at training time.

Parameters:
  • corpus_dir (str) – path to folder with graph document files created in decomposition stage
  • extension (str) – extension of the graph document files from which the corpus should be built
  • max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
  • min_count (int (default=0)) – the minimum number of times a substructure pattern should occur across the corpus to be included in the vocabulary
Returns:

self – A corpus dataset that can be used with the skipgram with negative sampling model to learn graph-level embeddings.

Return type:

PVDBOWCorpus

add_file(full_graph_path)[source]

Adds a new graph to the corpus, enabling inductive learning on previously unseen graphs

Parameters:full_graph_path (str) – path to the graph document to be made part of the corpus
Returns:The new graph and its substructure patterns are made part of the corpus
Return type:None

getNegatives(target, size)[source]

Given a target pattern, finds size negative samples by index

Parameters:
  • target (int) – internal int id of the subgraph pattern
  • size (int) – number of negative samples to find
Returns:

response – list of negative samples by internal int id

Return type:

[int]

scan_and_load_corpus()[source]

Gets the list of graph file paths, assigns them numeric ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph ids for batching

scan_corpus(min_count)[source]

Maps the graph files to a subgraph alphabet, creating new internal ids for the subgraph patterns, which are in turn used by the skipgram architectures

Parameters:min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns:(Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type:dict
class geometric2dr.embedding_methods.pvdbow_data_reader.PVDBOWInMemoryCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0)[source]

Bases: torch.utils.data.dataset.Dataset

Class which represents the target-context dataset created over the graph documents for PVDBOW models. This version keeps the entire corpus with negatives in memory which requires a larger initial creation time but has a much quicker __getitem__ computation.

Parameters:
  • corpus_dir (str) – path to folder with graph document files created in decomposition stage
  • extension (str) – extension of the graph document files from which the corpus should be built
  • max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
  • min_count (int (default=0)) – the minimum number of times a substructure pattern should occur across the corpus to be included in the vocabulary
Returns:

self – A corpus dataset that can be used with the skipgram with negative sampling model to learn graph-level embeddings.

Return type:

PVDBOWInMemoryCorpus

add_file(full_graph_path)[source]

Adds a new graph to the corpus, enabling inductive learning on previously unseen graphs

Parameters:full_graph_path (str) – path to the graph document to be made part of the corpus
Returns:The new graph and its substructure patterns are made part of the corpus
Return type:None

getNegatives(target, size)[source]

Given a target pattern, finds size negative samples by index

Parameters:
  • target (int) – internal int id of the subgraph pattern
  • size (int) – number of negative samples to find
Returns:

response – list of negative samples by internal int id

Return type:

[int]

pre_load_corpus()[source]

Constructs and loads an entire context-pair dataset into memory

scan_and_load_corpus()[source]

Gets the list of graph file paths, assigns them numeric ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph ids

scan_corpus(min_count)[source]

Maps the graph files to a subgraph alphabet, creating new internal ids for the subgraph patterns, which are in turn used by the skipgram architectures

Parameters:min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns:(Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type:dict
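
A sketch contrasting the two corpus flavours (the directory path is a placeholder): PVDBOWCorpus trades setup time for memory, while PVDBOWInMemoryCorpus preloads everything for faster iteration.

    from torch.utils.data import DataLoader
    from geometric2dr.embedding_methods.pvdbow_data_reader import (
        PVDBOWCorpus, PVDBOWInMemoryCorpus)

    disk_corpus = PVDBOWCorpus(corpus_dir="graph_docs/", extension=".wld2",
                               max_files=0, min_count=0)          # lazy, memory efficient
    fast_corpus = PVDBOWInMemoryCorpus(corpus_dir="graph_docs/", extension=".wld2",
                                       max_files=0, min_count=0)  # preloaded, faster __getitem__
    loader = DataLoader(fast_corpus, batch_size=32, shuffle=True)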

geometric2dr.embedding_methods.pvdbow_trainer

Module containing class definitions of trainers for PVDBOW models [6], which are partly used by Deep Graph Kernels [2]

Author: Paul Scherer

class geometric2dr.embedding_methods.pvdbow_trainer.InMemoryTrainer(corpus_dir, extension, max_files, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]

Handles corpus construction (in-memory version), PVDBOW initialization and training.

Parameters:
  • corpus_dir (str) – path to directory containing graph files
  • extension (str) – extension used in graph documents produced after decomposition stage
  • max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
  • output_fh (str) – the path to the file where embeddings should be saved
  • emb_dimension (int (default=128)) – the desired dimension of the embeddings
  • batch_size (int (default=32)) – the desired batch size
  • epochs (int (default=100)) – the desired number of epochs for which the network should be trained
  • initial_lr (float (default=1e-3)) – the initial learning rate
  • min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns:self – A trainer instance which has the dataset stored in memory for fast access
Return type:InMemoryTrainer
train()[source]

Train the network with the settings used to initialise the Trainer
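
For instance (the paths are placeholders):

    from geometric2dr.embedding_methods.pvdbow_trainer import InMemoryTrainer

    trainer = InMemoryTrainer(corpus_dir="graph_docs/", extension=".wld2",
                              max_files=0, output_fh="graph_embeddings.json",
                              emb_dimension=128, batch_size=32, epochs=100,
                              initial_lr=1e-3, min_count=0)
    trainer.train()  # graph-level embeddings are written to output_fh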

class geometric2dr.embedding_methods.pvdbow_trainer.Trainer(corpus_dir, extension, max_files, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]

Handles corpus construction (hard drive version), PVDBOW (skipgram) initialization and training.

Parameters:
  • corpus_dir (str) – path to directory containing graph files
  • extension (str) – extension used in graph documents produced after decomposition stage
  • max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
  • output_fh (str) – the path to the file where embeddings should be saved
  • emb_dimension (int (default=128)) – the desired dimension of the embeddings
  • batch_size (int (default=32)) – the desired batch size
  • epochs (int (default=100)) – the desired number of epochs for which the network should be trained
  • initial_lr (float (default=1e-3)) – the initial learning rate
  • min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns:self – A Trainer instance
Return type:Trainer
train()[source]

Train the network with the settings used to initialise the Trainer

geometric2dr.embedding_methods.pvdm

PVDM model originally introduced in the doc2vec paper by Le and Mikolov (2014) [6]. Used by the AWE-DD model of Anonymous Walk Embeddings by Ivanov and Burnaev (2018) [1].

It is used with the corpus classes in pvdm_data_reader, which handle the data reading and loading. This allows construction of full PVDM-based systems. It is one of the choices of neural language model for recreating AWE [1]-style systems.

class geometric2dr.embedding_methods.pvdm.PVDM(num_targets, vocab_size, embedding_dimension)[source]

Bases: torch.nn.modules.module.Module

PyTorch implementation of PVDM as in Le and Mikolov [6]

Parameters:
  • num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
  • vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
  • embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns:

self – a torch.nn.Module of the PVDM model

Return type:

PVDM

forward(pos_graph_emb, pos_context_target, pos_contexts, pos_negatives)[source]

Forward pass in network

Parameters:
  • pos_graph_emb (torch.Long) – index of target graph embedding
  • pos_context_target (torch.Long) – index of target subgraph pattern embedding
  • pos_contexts (torch.Long) – indices of context subgraph patterns around the target subgraph embedding
  • pos_negatives (torch.Long) – indices of negatives
Returns:

the negative sampling loss

Return type:

torch.float

give_target_embeddings()[source]

Return the target embeddings as a numpy matrix

Returns:Numpy num_target x emb_dimension matrix of target graph embeddings
Return type:numpy ndarray
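
A minimal forward-pass sketch (all sizes and tensor shapes are illustrative assumptions, and we assume the returned loss is a scalar):

    import torch
    from geometric2dr.embedding_methods.pvdm import PVDM

    # Illustrative sizes: 50 graphs, a vocabulary of 100 subgraph patterns
    model = PVDM(num_targets=50, vocab_size=100, embedding_dimension=32)

    batch_size, window_size, num_negatives = 4, 2, 10  # assumed shapes
    pos_graph_emb = torch.randint(0, 50, (batch_size,))        # graph ids
    pos_context_target = torch.randint(0, 100, (batch_size,))  # target pattern ids
    pos_contexts = torch.randint(0, 100, (batch_size, 2 * window_size))
    pos_negatives = torch.randint(0, 100, (batch_size, num_negatives))

    loss = model(pos_graph_emb, pos_context_target, pos_contexts, pos_negatives)
    loss.backward()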

geometric2dr.embedding_methods.pvdm_data_reader

Data_reader module containing corpus construction utilities for PVDM models

class geometric2dr.embedding_methods.pvdm_data_reader.PVDMCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]

Bases: torch.utils.data.dataset.Dataset

Class which represents all of the graph documents in a graph dataset and serves context for PVDM models. This version keeps the entire corpus with negatives in memory, which requires a larger initial creation time but is much quicker at loading during training.

Parameters:
  • corpus_dir (str) – path to folder with graph document files created in decomposition stage
  • extension (str) – extension of the graph document files from which the corpus should be built
  • max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
  • min_count (int (default=0)) – the minimum number of times a substructure pattern should occur across the corpus to be included in the vocabulary
  • window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns:

self – A corpus dataset that can be used with the PVDM with negative sampling model.

Return type:

PVDMCorpus

add_file(full_graph_path)[source]

Adds a new graph to the corpus, enabling inductive learning on previously unseen graphs

Parameters:full_graph_path (str) – path to the graph document to be made part of the corpus
Returns:The new graph and its substructure patterns are made part of the corpus
Return type:None

getNegatives(target, size)[source]

Given a target pattern, finds size negative samples by index

Parameters:
  • target (int) – internal int id of the subgraph pattern
  • size (int) – number of negative samples to find
Returns:

response – list of negative samples by internal int id

Return type:

[int]

pre_load_corpus()[source]

Constructs and loads an entire context-pair dataset into memory

scan_and_load_corpus()[source]

Gets the list of graph file paths, assigns them numeric ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph ids

scan_corpus(min_count)[source]

Maps the graph files to a subgraph alphabet, creating new internal ids for the subgraph patterns, which are in turn used by the skipgram architectures

Parameters:min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns:(Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type:dict

geometric2dr.embedding_methods.pvdm_trainer

A trainer class which facilitates training of the embedding methods with the set hyperparameters.

Author: Paul Scherer 2020

class geometric2dr.embedding_methods.pvdm_trainer.PVDM_Trainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]

Handles corpus construction, PVDM initialization and training.

Parameters:
  • corpus_dir (str) – path to directory containing graph files
  • extension (str) – extension used in graph documents produced after decomposition stage
  • max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
  • window_size (int) – the number of co-occurring context subgraph patterns to use
  • output_fh (str) – the path to the file where embeddings should be saved
  • emb_dimension (int (default=128)) – the desired dimension of the embeddings
  • batch_size (int (default=32)) – the desired batch size
  • epochs (int (default=100)) – the desired number of epochs for which the network should be trained
  • initial_lr (float (default=1e-3)) – the initial learning rate
  • min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns:self – A PVDM_Trainer instance
Return type:PVDM_Trainer
train()[source]

Train the network with the settings used to initialise the PVDM_Trainer
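
For instance (the paths are placeholders):

    from geometric2dr.embedding_methods.pvdm_trainer import PVDM_Trainer

    trainer = PVDM_Trainer(corpus_dir="graph_docs/", extension=".wld2", max_files=0,
                           window_size=4, output_fh="pvdm_graph_embeddings.json",
                           emb_dimension=128, batch_size=32, epochs=100,
                           initial_lr=1e-3, min_count=1)
    trainer.train()  # graph-level embeddings are written to output_fh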

geometric2dr.embedding_methods.skipgram

General Skipgram model with negative sampling, originally introduced in the word2vec paper by Mikolov et al. [5]. Used by DGK [2] and Graph2Vec [4] to learn substructure and graph level embeddings.

It is used by the SkipgramCorpus and PVDBOWCorpus to build complete Skipgram and PVDBOW systems respectively. SkipgramCorpus and PVDBOWCorpus are found in the skipgram_data_reader and pvdbow_data_reader modules respectively.

Author: Paul Scherer

class geometric2dr.embedding_methods.skipgram.Skipgram(num_targets, vocab_size, embedding_dimension)[source]

Bases: torch.nn.modules.module.Module

Pytorch implementation of the skipgram with negative sampling as in Mikolov et al. [5]

Based on the inputs it can be used as the skipgram described in the original Word2Vec paper [5], or as Doc2Vec (PV-DBOW) in Le and Mikolov [6]

Parameters:
  • num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
  • vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
  • embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns:

self – a torch.nn.Module of the Skipgram model

Return type:

Skipgram

forward(pos_target, pos_context, neg_context)[source]

Forward pass in network

Parameters:
  • pos_target (torch.Long) – index of target embedding
  • pos_context (torch.Long) – index of context embedding
  • neg_context (torch.Long) – index of negative
Returns:

the negative sampling loss

Return type:

torch.float
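
A minimal forward-pass sketch (the batch size, number of negatives, and tensor shapes are illustrative assumptions, and we assume a scalar loss):

    import torch
    from geometric2dr.embedding_methods.skipgram import Skipgram

    # Illustrative sizes: a vocabulary of 100 substructure patterns
    model = Skipgram(num_targets=100, vocab_size=100, embedding_dimension=32)

    batch_size, num_negatives = 4, 10  # assumed shapes
    pos_target = torch.randint(0, 100, (batch_size,))                 # target ids
    pos_context = torch.randint(0, 100, (batch_size,))                # context ids
    neg_context = torch.randint(0, 100, (batch_size, num_negatives))  # negative ids

    loss = model(pos_target, pos_context, neg_context)
    loss.backward()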

geometric2dr.embedding_methods.skipgram_data_reader

Data_reader module containing corpus construction utilities for Skipgram-based models (i.e. PVDBOW as well).

class geometric2dr.embedding_methods.skipgram_data_reader.InMemorySkipgramCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]

Bases: torch.utils.data.dataset.Dataset

Corpus which feeds positions of subgraphs, contextualised by “co-occurring” patterns as defined by the different decomposition algorithms. Designed to support negative sampling. This version keeps the entire corpus with negatives in memory, which requires a larger initial creation time but has a much quicker __getitem__ computation.

Parameters:
  • corpus_dir (str) – path to folder with graph document files created in decomposition stage
  • extension (str) – extension of the graph document files from which the corpus should be built
  • max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
  • min_count (int (default=0)) – the minimum number of times a substructure pattern should occur across the corpus to be included in the vocabulary
  • window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns:

self – A corpus dataset that can be used with the skipgram with negative sampling model to learn substructure pattern embeddings.

Return type:

InMemorySkipgramCorpus

add_file(full_graph_path)[source]

Adds a new graph to the corpus, enabling inductive learning on previously unseen graphs

Parameters:full_graph_path (str) – path to the graph document to be made part of the corpus
Returns:The new graph and its substructure patterns are made part of the corpus
Return type:None

getNegatives(target, size)[source]

Given a target pattern, finds size negative samples by index

Parameters:
  • target (int) – internal int id of the subgraph pattern
  • size (int) – number of negative samples to find
Returns:

response – list of negative samples by internal int id

Return type:

[int]

preload_corpus()[source]

Constructs and loads an entire context-pair dataset into memory

scan_and_load_corpus()[source]

Gets the list of graph file paths, assigns them numeric ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph ids

scan_corpus(min_count)[source]

Maps the graph files to a subgraph alphabet, creating new internal ids for the subgraph patterns, which are in turn used by the skipgram architectures

Parameters:min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns:(Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type:dict
class geometric2dr.embedding_methods.skipgram_data_reader.SkipgramCorpus(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]

Bases: torch.utils.data.dataset.Dataset

Corpus which feeds positions of subgraphs, contextualised by “co-occurring” patterns as defined by the different decomposition algorithms. Designed to support negative sampling. In this version the __getitem__ function loads individual target-context pairs from the hard drive. As a result, it is quick to set up and memory efficient, but may be slower at training time.

Parameters:
  • corpus_dir (str) – path to folder with graph document files created in decomposition stage
  • extension (str) – extension of the graph document files from which the corpus should be built
  • max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
  • min_count (int (default=0)) – the minimum number of times a substructure pattern should occur across the corpus to be included in the vocabulary
  • window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns:

self – A corpus dataset that can be used with the skipgram with negative sampling model to learn substructure pattern embeddings.

Return type:

SkipgramCorpus

add_file(full_graph_path)[source]

Adds a new graph to the corpus, enabling inductive learning on previously unseen graphs

Parameters:full_graph_path (str) – path to the graph document to be made part of the corpus
Returns:The new graph and its substructure patterns are made part of the corpus
Return type:None

getNegatives(target, size)[source]

Given a target pattern, finds size negative samples by index

Parameters:
  • target (int) – internal int id of the subgraph pattern
  • size (int) – number of negative samples to find
Returns:

response – list of negative samples by internal int id

Return type:

[int]

scan_and_load_corpus()[source]

Gets the list of graph file paths, assigns them numeric ids in a map, and calls scan_corpus; it also makes available a list of shuffled graph ids for batching

scan_corpus(min_count)[source]

Maps the graph files to a subgraph alphabet, creating new internal ids for the subgraph patterns, which are in turn used by the skipgram architectures

Parameters:min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns:(Optional) self._subgraph_to_id_map – dictionary of substructure pattern to int id map
Return type:dict

geometric2dr.embedding_methods.skipgram_trainer

Module containing class definitions of trainers for skipgram models, which are partly used by Deep Graph Kernels

Author: Paul Scherer

class geometric2dr.embedding_methods.skipgram_trainer.InMemoryTrainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]

Handles corpus construction (in-memory version), skipgram initialization and training.

Parameters:
  • corpus_dir (str) – path to directory containing graph files
  • extension (str) – extension used in graph documents produced after decomposition stage
  • max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
  • window_size (int) – the number of co-occurring context subgraph patterns to use
  • output_fh (str) – the path to the file where embeddings should be saved
  • emb_dimension (int (default=128)) – the desired dimension of the embeddings
  • batch_size (int (default=32)) – the desired batch size
  • epochs (int (default=100)) – the desired number of epochs for which the network should be trained
  • initial_lr (float (default=1e-3)) – the initial learning rate
  • min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns:self – A trainer instance which has the dataset stored in memory for fast access
Return type:InMemoryTrainer
train()[source]

Train the network with the settings used to initialise the Trainer

class geometric2dr.embedding_methods.skipgram_trainer.Trainer(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]

Handles corpus construction (hard drive version), skipgram initialization and training.

Parameters:
  • corpus_dir (str) – path to directory containing graph files
  • extension (str) – extension used in graph documents produced after decomposition stage
  • max_files (int) – the maximum number of graph files to consider; the default of 0 uses all files
  • window_size (int) – the number of co-occurring context subgraph patterns to use
  • output_fh (str) – the path to the file where embeddings should be saved
  • emb_dimension (int (default=128)) – the desired dimension of the embeddings
  • batch_size (int (default=32)) – the desired batch size
  • epochs (int (default=100)) – the desired number of epochs for which the network should be trained
  • initial_lr (float (default=1e-3)) – the initial learning rate
  • min_count (int (default=1)) – the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns:self – A Trainer instance
Return type:Trainer
train()[source]

Train the network with the settings used to initialise the Trainer
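
For instance (the paths are placeholders):

    from geometric2dr.embedding_methods.skipgram_trainer import Trainer

    trainer = Trainer(corpus_dir="graph_docs/", extension=".wld2", max_files=0,
                      window_size=2, output_fh="substructure_embeddings.json",
                      emb_dimension=128, batch_size=32, epochs=100,
                      initial_lr=1e-3, min_count=1)
    trainer.train()  # substructure pattern embeddings are written to output_fh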

geometric2dr.embedding_methods.utils

General purpose utilities for I/O

Currently includes:
  • functions for getting all files in a directory with a given extension
  • saving graph embeddings into a JSON format
  • generating a dictionary matching graph files with classification labels

geometric2dr.embedding_methods.utils.get_class_labels(graph_files, class_labels_fname)[source]

Given the list of graph files (as in get_files) and the path of the associated class labels file, returns the list of labels associated with each graph file in graph_files

Parameters:
  • graph_files (list) – list of paths to graph_files
  • class_labels_fname (str) – path to class labels file (.Labels typically) with file names in graph_files
Returns:

labels – list of class labels corresponding to the graph files in graph_files

Return type:

list

geometric2dr.embedding_methods.utils.get_class_labels_tuples(graph_files, class_labels_fname)[source]

Returns a list of tuples associating each of the graph files with its classification label

Parameters:
  • graph_files (list) – list of paths to graph_files
  • class_labels_fname (str) – path to class labels file (.Labels typically) with file names in graph_files
Returns:

labels – list of tuples (base_name_of_graph_file, class_label)

Return type:

list

geometric2dr.embedding_methods.utils.get_files(dname, extension, max_files=0)[source]

Returns a sorted list of strings naming all the files with the given extension

Parameters:
  • dname (str) – directory with files
  • extension (str) – string denoting which extension should be matched in search for files
  • max_files (int (default=0)) – the maximum number of files to get, the default of 0 means all files
Returns:

all_files – list of all files matching extension inside the directory dname

Return type:

list
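
For example (the directory and labels file names are placeholders):

    from geometric2dr.embedding_methods.utils import get_files, get_class_labels

    graph_files = get_files(dname="graph_docs/", extension=".wld2", max_files=0)
    labels = get_class_labels(graph_files, "dataset.Labels")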

geometric2dr.embedding_methods.utils.get_kernel_matrix_row_idx_with_class(corpus, extension, graph_files, class_labels_fname)[source]

Returns two lists: the first is a list of integers, each referencing a row of a kernel matrix and thereby a kernel vector corresponding to one of the graphs in the dataset; the second is a list of class labels whose value at each index is the classification of the graph referenced at the same index in the first

Parameters:
  • corpus (corpus) – a corpus instance (such as SkipgramCorpus)
  • extension (str) – extension of graph document under study
  • graph_files (list) – list of paths to graph file
  • class_labels_fname (str) – path to graph class label file
Returns:

kernel_row_x_id, kernel_row_y_id – the first is a list of integers, each referencing a row of a kernel matrix and thereby a kernel vector corresponding to one of the graphs in the dataset; the second is a list of class labels whose value at each index is the classification of the graph referenced at the same index in the first

Return type:

tuple

geometric2dr.embedding_methods.utils.save_graph_embeddings(corpus, final_embeddings, opfname)[source]

Saves the trained embeddings of a corpus into a dictionary and writes it as a json file to the path given by opfname

Parameters:
  • corpus (corpus) – any corpus class such as PVDBOWCorpus
  • final_embeddings (numpy ndarray) – matrix of target embeddings to be saved
  • opfname (str) – path to file where embeddings should be saved in json format (extension optional in Unix)
Returns:

embeddings will be saved into path denoted by opfname

Return type:

None

geometric2dr.embedding_methods.utils.save_subgraph_embeddings(corpus, final_embeddings, opfname)[source]

Save the embeddings along with a map to the patterns and the corpus

Parameters:
  • corpus (corpus) – a corpus class such as SkipgramCorpus
  • final_embeddings (numpy ndarray) – matrix of target embeddings to be saved
  • opfname (str) – path to file where embeddings should be saved in json format
Returns:

embeddings will be saved into path denoted by opfname

Return type:

None

References

[1] Sergey Ivanov and Evgeny Burnaev. "Anonymous Walk Embeddings". Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2186-2195, 2018.
[2] P. Yanardag and S.V.N. Vishwanathan. "Deep Graph Kernels". KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[3] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten Borgwardt. "Weisfeiler-Lehman Graph Kernels". Journal of Machine Learning Research, 12:2539-2561, 2011.
[4] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. "graph2vec: Learning Distributed Representations of Graphs", 2017.
[5] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space", 2013.
[6] Quoc Le and Tomas Mikolov. "Distributed Representations of Sentences and Documents". Proceedings of the 31st International Conference on Machine Learning, 2014.