geometric2dr.embedding_methods¶
geometric2dr.embedding_methods.cbow¶
CBOW model with negative sampling as in Mikolov et al. [5].
It is used with the corpus classes in cbow_data_reader, which handle the data reading and loading. This allows construction of full CBOW based systems. It is one of the choices of neural language model for recreating DGK-like [2] systems.
-
class
geometric2dr.embedding_methods.cbow.
Cbow
(num_targets, vocab_size, embedding_dimension)[source]¶ Bases:
torch.nn.modules.module.Module
PyTorch implementation of the CBOW architecture with negative sampling as in Mikolov et al. [5].
This is used in DGK models, for example, to learn embeddings of substructures for downstream graph kernel definitions.
Parameters: - num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
- vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
- embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns: self – a torch.nn.Module of the CBOW model
Return type: Cbow
-
forward
(pos_target, pos_contexts, pos_negatives)[source]¶ Forward pass in network
Parameters: - pos_target (torch.Long) – index of target embedding
- pos_contexts (torch.Long) – indices of context embeddings
- pos_negatives (torch.Long) – indices of negatives
Returns: the negative sampling loss
Return type: torch.float
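To make the "negative sampling loss" returned by forward more concrete, here is a minimal pure-Python sketch for a single training example. It is not the library's code: the function names are hypothetical, and it assumes the context embeddings are averaged into one hidden vector (as in the original word2vec CBOW) before being scored against the target and the negatives.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cbow_negative_sampling_loss(target_vec, context_vecs, negative_vecs):
    """Loss for one example (hypothetical helper, not the geometric2dr API).

    target_vec    -- output embedding of the true target pattern
    context_vecs  -- input embeddings of the context patterns
    negative_vecs -- output embeddings of the sampled negatives
    """
    dim = len(target_vec)
    # Assumption: the CBOW hidden state is the average of the context embeddings.
    hidden = [sum(v[i] for v in context_vecs) / len(context_vecs) for i in range(dim)]
    positive_term = log_sigmoid(dot(target_vec, hidden))
    negative_term = sum(log_sigmoid(-dot(n, hidden)) for n in negative_vecs)
    return -(positive_term + negative_term)


# A target aligned with its contexts should score a lower loss than a misaligned one.
contexts = [[2.0, 0.0], [4.0, 0.0]]
loss_good = cbow_negative_sampling_loss([1.0, 0.0], contexts, [[-1.0, 0.0]])
loss_bad = cbow_negative_sampling_loss([-1.0, 0.0], contexts, [[1.0, 0.0]])
```

In the actual model the vectors would be rows of learned embedding matrices selected by the pos_target, pos_contexts and pos_negatives indices, and the loss would be aggregated over a batch.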
geometric2dr.embedding_methods.cbow_data_reader¶
Data_reader module containing corpus construction utilities for CBOW models
-
class
geometric2dr.embedding_methods.cbow_data_reader.
CbowCorpus
(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]¶ Bases:
torch.utils.data.dataset.Dataset
Class which represents all of the graph documents in a graph dataset and serves contexts for CBOW models. This version keeps the entire corpus with negatives in memory, which requires a longer initial creation time but is much quicker at loading during training.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the CBOW with negative sampling model.
Return type: -
add_file
(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning of new unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be added to the corpus
Returns: the new graph and its substructure patterns are made part of the corpus
Return type: None
-
getNegatives
(target, size)[source]¶ Given target find a size number of negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
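The contract of getNegatives (return size ids, none of which is the target) can be illustrated with a small sketch. This is an assumption-laden stand-in: the function name, the vocab_size argument and the uniform sampling are all hypothetical, and the library may instead draw negatives from a frequency-based distribution.

```python
import random


def sample_negatives(target, size, vocab_size, rng=None):
    """Return `size` pattern ids drawn uniformly from the vocabulary, never `target`.

    Hypothetical helper for illustration; uniform sampling is an assumption.
    """
    if vocab_size < 2:
        raise ValueError("need at least two patterns to sample a negative")
    if rng is None:
        rng = random.Random()
    negatives = []
    while len(negatives) < size:
        candidate = rng.randrange(vocab_size)
        if candidate != target:  # reject the positive example itself
            negatives.append(candidate)
    return negatives


negatives = sample_negatives(target=3, size=5, vocab_size=10, rng=random.Random(0))
```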
-
scan_and_load_corpus
()[source]¶ Gets the list of graph file paths, assigns each a numeric id in a map, and calls scan_corpus. Also makes available a list of shuffled graph_ids.
-
scan_corpus
(min_count)[source]¶ Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs which in turn get used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary mapping each substructure pattern to its int id
Return type: dict
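The effect of min_count on the vocabulary can be sketched as follows. The helper name and the in-memory list-of-lists input are assumptions made for illustration; the library reads graph document files from corpus_dir instead, and its exact id assignment order is not documented here.

```python
from collections import Counter


def build_subgraph_vocabulary(graph_documents, min_count):
    """Map each sufficiently frequent pattern to an integer id.

    graph_documents -- iterable of pattern lists, one list per graph (assumed format)
    min_count       -- patterns seen fewer times than this are dropped
    """
    counts = Counter()
    for patterns in graph_documents:
        counts.update(patterns)
    # Sorting is only here to make the id assignment deterministic.
    kept = sorted(p for p, c in counts.items() if c >= min_count)
    return {pattern: idx for idx, pattern in enumerate(kept)}


docs = [["a", "b", "a"], ["a", "c"], ["b", "d"]]
vocab = build_subgraph_vocabulary(docs, min_count=2)
```

With min_count=2 only the patterns "a" (three occurrences) and "b" (two occurrences) survive, which shrinks vocab_size for the downstream model.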
geometric2dr.embedding_methods.cbow_trainer¶
Module containing class definitions of trainers for CBOW models [5], which are partly used by Deep Graph Kernels [2]
-
class
geometric2dr.embedding_methods.cbow_trainer.
Trainer
(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶ Handles corpus construction, CBOW initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- window_size : int
- the number of co-occurring context subgraph patterns to use
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A Trainer instance Return type: Trainer
geometric2dr.embedding_methods.classify¶
Module containing various functions for classification (on top of the learned embeddings) mainly useful for providing convenience functions on common benchmark classification methods
-
geometric2dr.embedding_methods.classify.
cross_val_accuracy
(corpus_dir, extension, embedding_fname, class_labels_fname, cv=10, mode=None)[source]¶ Performs 10 (default) fold cross validation, returns the mean accuracy and associated standard deviation
Parameters: - corpus_dir (str) – folder containing graphdoc files
- extension (str) – extension of the graphdoc files
- embedding_fname (str) – file containing embeddings
- class_labels_fname (str) – files containing labels of each graph
- cv (int) – integer stating number of folds and therefore experiments to carry out
Returns: tuple – tuple containing the mean accuracy and standard deviation obtained by performing 10 fold cross validation 10 times. This gives a better picture of typically expected performance, in a Monte Carlo fashion, instead of presenting only the best performance.
Return type: (acc, std)
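The repeated cross-validation protocol described above can be sketched without any machine learning dependency. Everything named here is hypothetical: evaluate_fold stands in for fitting and scoring a classifier (the library uses scikit-learn SVMs), and the majority-class baseline is only a toy evaluator.

```python
import random
import statistics


def k_fold_indices(num_samples, k, rng):
    """Yield (train, test) index lists for one shuffled k-fold split."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def repeated_cv_accuracy(evaluate_fold, num_samples, k=10, repeats=10, seed=0):
    """Return (mean, std) of the per-repeat mean accuracies."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(repeats):
        scores = [evaluate_fold(tr, te) for tr, te in k_fold_indices(num_samples, k, rng)]
        run_means.append(statistics.mean(scores))
    return statistics.mean(run_means), statistics.pstdev(run_means)


# Toy evaluator: predict the majority training label for every test item.
labels = [i % 2 for i in range(40)]


def majority_baseline(train, test):
    train_labels = [labels[i] for i in train]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == guess for i in test) / len(test)


acc, std = repeated_cv_accuracy(majority_baseline, num_samples=len(labels))
```

Reporting the spread across repeats, rather than a single best run, is what the Returns description refers to as the Monte Carlo view of performance.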
-
geometric2dr.embedding_methods.classify.
cross_val_accuracy_rbf_bag_of_words
(P, y_ids, cv=10)[source]¶ Runs cv Monte Carlo repetitions of 10 fold cross validation on the given dataset matrix and returns the overall mean accuracy and associated standard deviation. Terminology and the method name will be updated in a future version to address the overloaded term and the generalizability of the function.
Parameters: - P (numpy ndarray) – an obs x num_features matrix representing the dataset
- y_ids (numpy ndarray) – numpy 1 x obs array of class labels for the rows of P
- cv (int (default=10)) – overloaded term: the number of Monte Carlo restarts of the SVM evaluation over 10 fold CV
Returns: tuple – tuple containing the mean accuracy and standard deviation obtained by performing 10 fold cross validation 10 times. This gives a better picture of typically expected performance, in a Monte Carlo fashion, instead of presenting only the best performance.
Return type: (acc, std)
-
geometric2dr.embedding_methods.classify.
linear_svm_classify
(X_train, X_test, Y_train, Y_test)[source]¶ Utility function for quickly performing Scikit Learn GridSearchCV over a linear SVM with 10 fold CrossVal given the train test splits
Parameters: - X_train (numpy ndarray) – training feature vectors
- X_test (numpy ndarray) – testing feature vectors
- Y_train (numpy ndarray) – training set labels
- Y_test (numpy ndarray) – test set labels
Returns: tuple with accuracy, precision, recall, fbeta_score as applicable
Return type: tuple
-
geometric2dr.embedding_methods.classify.
perform_classification
(corpus_dir, extension, embedding_fname, class_labels_fname)[source]¶ Perform classification over the graph files of dataset given they have corresponding embeddings in the saved embedding file and class labels
Parameters: - corpus_dir (str) – folder containing graphdoc files
- extension (str) – extension of the graphdoc files
- embedding_fname (str) – file containing embeddings
- class_labels_fname (str) – files containing labels of each graph
Returns: tuple with accuracy, precision, recall, fbeta_score as applicable
Return type: tuple
-
geometric2dr.embedding_methods.classify.
rbf_svm_classify
(X_train, X_test, Y_train, Y_test)[source]¶ Utility function for quickly performing Scikit Learn GridSearchCV over a rbf kernel SVM with 10 fold CrossVal given the train test splits
Parameters: - X_train (numpy ndarray) – training feature vectors
- X_test (numpy ndarray) – testing feature vectors
- Y_train (numpy ndarray) – training set labels
- Y_test (numpy ndarray) – test set labels
Returns: tuple with accuracy, precision, recall, fbeta_score as applicable
Return type: tuple
geometric2dr.embedding_methods.pvdbow_data_reader¶
Data_reader module containing corpus construction utilities for PVDBOW (skipgram) models.
This module describes the classes which handle graph corpora and datasets which can be loaded into PyTorch dataloaders.
-
class
geometric2dr.embedding_methods.pvdbow_data_reader.
PVDBOWCorpus
(corpus_dir=None, extension='.wld2', max_files=0, min_count=0)[source]¶ Bases:
torch.utils.data.dataset.Dataset
Class which represents the target-context dataset created over the graph documents for PVDBOW models. In this version the __getitem__ function loads individual target-context pairs from the hard-drive. As a result, it is quick to set up and memory efficient but may perform slower in training time.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn graph-level embeddings.
Return type: -
add_file
(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning of new unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be added to the corpus
Returns: the new graph and its substructure patterns are made part of the corpus
Return type: None
-
getNegatives
(target, size)[source]¶ Given target find a size number of negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
-
scan_and_load_corpus
()[source]¶ Gets the list of graph file paths, assigns each a numeric id in a map, and calls scan_corpus. Also makes available a list of shuffled graph_ids for batching.
-
scan_corpus
(min_count)[source]¶ Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs which in turn get used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary mapping each substructure pattern to its int id
Return type: dict
-
class
geometric2dr.embedding_methods.pvdbow_data_reader.
PVDBOWInMemoryCorpus
(corpus_dir=None, extension='.wld2', max_files=0, min_count=0)[source]¶ Bases:
torch.utils.data.dataset.Dataset
Class which represents the target-context dataset created over the graph documents for PVDBOW models. This version keeps the entire corpus with negatives in memory which requires a larger initial creation time but has a much quicker __getitem__ computation.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn graph-level embeddings.
Return type: -
add_file
(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning of new unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be added to the corpus
Returns: the new graph and its substructure patterns are made part of the corpus
Return type: None
-
getNegatives
(target, size)[source]¶ Given target find a size number of negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
-
scan_and_load_corpus
()[source]¶ Gets the list of graph file paths, assigns each a numeric id in a map, and calls scan_corpus. Also makes available a list of shuffled graph_ids.
-
scan_corpus
(min_count)[source]¶ Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs which in turn get used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary mapping each substructure pattern to its int id
Return type: dict
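The trade-off between the hard-drive and in-memory PVDBOW corpus classes can be illustrated with two toy dataset-like classes. This is only a sketch: the class names, the one-pair-per-line file format and the byte-offset index are assumptions, and neither class subclasses torch.utils.data.Dataset so that the example runs with the standard library alone.

```python
import os
import tempfile


class OnDiskPairs:
    """Reads one 'target context' line from disk on every __getitem__ call."""

    def __init__(self, path):
        self.path = path
        self.offsets = []  # only byte offsets are held in memory
        with open(path, "rb") as fh:
            while True:
                pos = fh.tell()
                if not fh.readline():
                    break
                self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.path, "rb") as fh:
            fh.seek(self.offsets[idx])
            target, context = fh.readline().split()
        return int(target), int(context)


class InMemoryPairs:
    """Parses every pair up front, so __getitem__ is a plain list lookup."""

    def __init__(self, path):
        with open(path) as fh:
            self.pairs = [tuple(int(t) for t in line.split()) for line in fh if line.strip()]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pairs.txt")
    with open(path, "w") as fh:
        fh.write("0 5\n1 7\n2 9\n")
    lazy, eager = OnDiskPairs(path), InMemoryPairs(path)
    lazy_items = [lazy[i] for i in range(len(lazy))]
    eager_items = [eager[i] for i in range(len(eager))]
```

Both variants return identical items; they differ only in whether the cost is paid once at construction or on every access during training.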
geometric2dr.embedding_methods.pvdbow_trainer¶
Module containing class definitions of trainers for PVDBOW models [6], which are partly used by Deep Graph Kernels [2]
Author: Paul Scherer
-
class
geometric2dr.embedding_methods.pvdbow_trainer.
InMemoryTrainer
(corpus_dir, extension, max_files, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶ Handles corpus construction (in-memory version), PVDBOW initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A trainer instance which has the dataset stored in memory for fast access Return type: InMemoryTrainer
-
class
geometric2dr.embedding_methods.pvdbow_trainer.
Trainer
(corpus_dir, extension, max_files, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶ Handles corpus construction (hard drive version), PVDBOW (skipgram) initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A Trainer instance Return type: Trainer
geometric2dr.embedding_methods.pvdm¶
PVDM model originally introduced in the doc2vec paper by Le and Mikolov (2014) [6]. Used by the AWE-DD model of Anonymous Walk Embeddings by Ivanov and Burnaev (2018) [1].
It is used with the corpus classes in pvdm_data_reader, which handle the data reading and loading. This allows construction of full PVDM based systems. It is one of the choices of neural language model for recreating AWE-like [1] systems.
-
class
geometric2dr.embedding_methods.pvdm.
PVDM
(num_targets, vocab_size, embedding_dimension)[source]¶ Bases:
torch.nn.modules.module.Module
PyTorch implementation of the PVDM as in Le and Mikolov [6].
Parameters: - num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
- vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
- embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns: self – a torch.nn.Module of the PVDM model
Return type: -
forward
(pos_graph_emb, pos_context_target, pos_contexts, pos_negatives)[source]¶ Forward pass in network
Parameters: - pos_graph_emb (torch.Long) – index of target graph embedding
- pos_context_target (torch.Long) – index of target subgraph pattern embedding
- pos_contexts (torch.Long) – indices of context subgraph patterns around the target subgraph embedding
- pos_negatives (torch.Long) – indices of negatives
Returns: the negative sampling loss
Return type: torch.float
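The role of the extra pos_graph_emb input can be sketched for a single example. This is an illustrative assumption rather than the library's code: the function name is hypothetical, and the graph embedding is averaged together with the context embeddings (the averaging variant of PV-DM; concatenation is another option described by Le and Mikolov).

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pvdm_negative_sampling_loss(graph_vec, context_vecs, target_vec, negative_vecs):
    """Loss for one example (hypothetical helper, averaging variant assumed)."""
    inputs = [graph_vec] + list(context_vecs)
    dim = len(graph_vec)
    # The graph embedding acts as one more "context" contributing to the hidden state.
    hidden = [sum(v[i] for v in inputs) / len(inputs) for i in range(dim)]

    def score(out_vec):
        return sum(o * h for o, h in zip(out_vec, hidden))

    positive_term = log_sigmoid(score(target_vec))
    negative_term = sum(log_sigmoid(-score(n)) for n in negative_vecs)
    return -(positive_term + negative_term)


contexts = [[0.0, 0.0], [0.0, 0.0]]
target, negatives = [2.0, 0.0], [[-2.0, 0.0]]
loss_aligned = pvdm_negative_sampling_loss([3.0, 0.0], contexts, target, negatives)
loss_opposed = pvdm_negative_sampling_loss([-3.0, 0.0], contexts, target, negatives)
```

Because the contexts are zero in this toy case, the difference between the two losses comes entirely from the graph embedding, which is the quantity PVDM is trained to learn.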
geometric2dr.embedding_methods.pvdm_data_reader¶
Data_reader module containing corpus construction utilities for PVDM models
-
class
geometric2dr.embedding_methods.pvdm_data_reader.
PVDMCorpus
(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]¶ Bases:
torch.utils.data.dataset.Dataset
Class which represents all of the graph documents in a graph dataset and serves contexts for PVDM models. This version keeps the entire corpus with negatives in memory, which requires a longer initial creation time but is much quicker at loading during training.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the PVDM with negative sampling model.
Return type: -
add_file
(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning of new unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be added to the corpus
Returns: the new graph and its substructure patterns are made part of the corpus
Return type: None
-
getNegatives
(target, size)[source]¶ Given target find a size number of negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
-
scan_and_load_corpus
()[source]¶ Gets the list of graph file paths, assigns each a numeric id in a map, and calls scan_corpus. Also makes available a list of shuffled graph_ids.
-
scan_corpus
(min_count)[source]¶ Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs which in turn get used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary mapping each substructure pattern to its int id
Return type: dict
geometric2dr.embedding_methods.pvdm_trainer¶
A trainer class which facilitates training of the embedding methods with the set hyperparameters.
Author: Paul Scherer 2020
-
class
geometric2dr.embedding_methods.pvdm_trainer.
PVDM_Trainer
(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶ Handles corpus construction, PVDM initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- window_size : int
- the number of co-occurring context subgraph patterns to use
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A PVDM_Trainer instance Return type: PVDM_Trainer
geometric2dr.embedding_methods.skipgram¶
General Skipgram model with negative sampling, originally introduced in the word2vec paper by Mikolov et al. [5]. Used by DGK [2] and Graph2Vec [4] to learn substructure and graph level embeddings.
It is used with SkipgramCorpus and PVDBOWCorpus to build complete Skipgram and PVDBOW systems respectively. SkipgramCorpus and PVDBOWCorpus are found in the skipgram_data_reader and pvdbow_data_reader modules respectively.
Author: Paul Scherer
-
class
geometric2dr.embedding_methods.skipgram.
Skipgram
(num_targets, vocab_size, embedding_dimension)[source]¶ Bases:
torch.nn.modules.module.Module
PyTorch implementation of the skipgram with negative sampling as in Mikolov et al. [5].
Depending on the inputs, it can be used as the skipgram described in the original Word2Vec paper [5], or as Doc2Vec (PV-DBOW) in Le and Mikolov [6].
Parameters: - num_targets (int) – The number of targets to embed. Typically the number of substructure patterns, but can be repurposed to be number of graphs.
- vocab_size (int) – The size of the vocabulary; the number of unique substructure patterns
- embedding_dimension (int) – The desired dimensionality of the embeddings.
Returns: self – a torch.nn.Module of the Skipgram model
Return type: -
forward
(pos_target, pos_context, neg_context)[source]¶ Forward pass in network
Parameters: - pos_target (torch.Long) – index of target embedding
- pos_context (torch.Long) – index of context embedding
- neg_context (torch.Long) – index of negative
Returns: the negative sampling loss
Return type: torch.float
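How the same Skipgram module serves both uses comes down to what the training pairs contain. The sketch below builds both kinds of (target, context) pairs from toy graph documents. The helper names are hypothetical, and treating neighbouring list positions as "co-occurring" is an assumption; in the library, co-occurrence is defined by the decomposition algorithm.

```python
def skipgram_pairs(graph_documents, window_size=1):
    """Targets are pattern ids; contexts are nearby patterns in the same document."""
    pairs = []
    for patterns in graph_documents:
        for i, target in enumerate(patterns):
            lo = max(0, i - window_size)
            hi = min(len(patterns), i + window_size + 1)
            pairs.extend((target, patterns[j]) for j in range(lo, hi) if j != i)
    return pairs


def pvdbow_pairs(graph_documents):
    """Targets are graph ids; contexts are all patterns contained in that graph."""
    return [
        (graph_id, pattern)
        for graph_id, patterns in enumerate(graph_documents)
        for pattern in patterns
    ]


docs = [[0, 1, 2], [2, 3]]  # two graphs over a vocabulary of four pattern ids
sg = skipgram_pairs(docs)
pv = pvdbow_pairs(docs)
```

In the first case num_targets equals the vocabulary size (four patterns), yielding substructure embeddings; in the second it equals the number of graphs (two), yielding graph-level embeddings.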
geometric2dr.embedding_methods.skipgram_data_reader¶
Data_reader module containing corpus construction utilities for Skipgram based (i.e. PVDBOW as well) models.
-
class
geometric2dr.embedding_methods.skipgram_data_reader.
InMemorySkipgramCorpus
(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]¶ Bases:
torch.utils.data.dataset.Dataset
Corpus which feeds positions of subgraphs, contextualised by “co-occurring” patterns as defined by the different decomposition algorithms. Designed to support negative sampling. This version keeps the entire corpus with negatives in memory, which requires a longer initial creation time but has a much quicker __getitem__ computation.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn substructure pattern embeddings.
Return type: -
add_file
(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning of new unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be added to the corpus
Returns: the new graph and its substructure patterns are made part of the corpus
Return type: None
-
getNegatives
(target, size)[source]¶ Given target find a size number of negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
-
scan_and_load_corpus
()[source]¶ Gets the list of graph file paths, assigns each a numeric id in a map, and calls scan_corpus. Also makes available a list of shuffled graph_ids.
-
scan_corpus
(min_count)[source]¶ Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs which in turn get used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary mapping each substructure pattern to its int id
Return type: dict
-
class
geometric2dr.embedding_methods.skipgram_data_reader.
SkipgramCorpus
(corpus_dir=None, extension='.wld2', max_files=0, min_count=0, window_size=1)[source]¶ Bases:
torch.utils.data.dataset.Dataset
Corpus which feeds positions of subgraphs, contextualised by “co-occurring” patterns as defined by the different decomposition algorithms. Designed to support negative sampling. In this version the __getitem__ function loads individual target-context pairs from the hard drive. As a result, it is quick to set up and memory efficient but may be slower during training.
Parameters: - corpus_dir (str) – path to folder with graph document files created in decomposition stage
- extension (str) – extension of the graph document files from which the corpus should be built
- max_files (int (default=0)) – the maximum number of files to include. Useful for debugging or other artificial scenarios. The default of 0 includes all files with matching extension
- window_size (int (default=1)) – The number of context substructure patterns to be considered for every target. This needs to be greater than 0.
Returns: self – A corpus dataset that can be used with the skipgram with negative sampling model to learn substructure pattern embeddings.
Return type: -
add_file
(full_graph_path)[source]¶ This method is used to add new graphs into the corpus for inductive learning of new unseen graphs
Parameters: full_graph_path (str) – path to the graph document to be added to the corpus
Returns: the new graph and its substructure patterns are made part of the corpus
Return type: None
-
getNegatives
(target, size)[source]¶ Given target find a size number of negative samples by index
Parameters: - target (int) – internal int id of the subgraph pattern
- size (int) – number of negative samples to find
Returns: response – list of negative samples by internal int id
Return type: [int]
-
scan_and_load_corpus
()[source]¶ Gets the list of graph file paths, assigns each a numeric id in a map, and calls scan_corpus. Also makes available a list of shuffled graph_ids for batching.
-
scan_corpus
(min_count)[source]¶ Maps the graph files to a subgraph alphabet from which we create new_ids for the subgraphs which in turn get used by the skipgram architectures
Parameters: min_count (int) – The minimum number of times a subgraph pattern should appear across the graphs in order to be considered part of the vocabulary.
Returns: (Optional) self._subgraph_to_id_map – dictionary mapping each substructure pattern to its int id
Return type: dict
geometric2dr.embedding_methods.skipgram_trainer¶
Module containing class definitions of trainers for skipgram models, which are partly used by Deep Graph Kernels
Author: Paul Scherer
-
class
geometric2dr.embedding_methods.skipgram_trainer.
InMemoryTrainer
(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶ Handles corpus construction (in-memory version), skipgram initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A trainer instance which has the dataset stored in memory for fast access Return type: InMemoryTrainer
-
class
geometric2dr.embedding_methods.skipgram_trainer.
Trainer
(corpus_dir, extension, max_files, window_size, output_fh, emb_dimension=128, batch_size=32, epochs=100, initial_lr=0.001, min_count=1)[source]¶ Handles corpus construction (hard drive version), skipgram initialization and training.
- corpus_dir : str
- path to directory containing graph files
- extension : str
- extension used in graph documents produced after decomposition stage
- max_files : int
- the maximum number of graph files to consider, default of 0 uses all files
- output_fh : str
- the path to the file where embeddings should be saved
- emb_dimension : int (default=128)
- the desired dimension of the embeddings
- batch_size : int (default=32)
- the desired batch size
- epochs : int (default=100)
- the desired number of epochs for which the network should be trained
- initial_lr : float (default=1e-3)
- the initial learning rate
- min_count : int (default=1)
- the minimum number of times a pattern should occur across the dataset to be considered part of the substructure pattern vocabulary
Returns: self – A Trainer instance Return type: Trainer
geometric2dr.embedding_methods.utils¶
General purpose utilities for I/O
Currently includes:
- Functions for getting all files in a directory with a given extension
- Saving graph embeddings into a JSON format
- Generating a dictionary matching graph files with classification labels
-
geometric2dr.embedding_methods.utils.
get_class_labels
(graph_files, class_labels_fname)[source]¶ Given the list of graph files (as in get_files) and path of the associated class labels returns the list of labels associated with each graph file in graph_files
Parameters: - graph_files (list) – list of paths to graph_files
- class_labels_fname (str) – path to class labels file (.Labels typically) with file names in graph_files
Returns: labels – list of class labels corresponding to the graph files in graph_files
Return type: list
-
geometric2dr.embedding_methods.utils.
get_class_labels_tuples
(graph_files, class_labels_fname)[source]¶ Returns list of tuples associating each of the graph files to their classification labels
Parameters: - graph_files (list) – list of paths to graph_files
- class_labels_fname (str) – path to class labels file (.Labels typically) with file names in graph_files
Returns: labels – list of tuples (base_name_of_graph_file, class_label)
Return type: list
-
geometric2dr.embedding_methods.utils.
get_files
(dname, extension, max_files=0)[source]¶ Returns a list of strings which are all the files with the given extension in a sorted manner
Parameters: - dname (str) – directory with files
- extension (str) – string denoting which extension should be matched in search for files
- max_files (int (default=0)) – the maximum number of files to get, the default of 0 means all files
Returns: all_files – list of all files matching extension inside the directory dname
Return type: list
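The documented behaviour of get_files (sorted matches, with max_files=0 meaning no limit) can be reproduced with the standard library. The function below is a hypothetical stand-in written from the description above, not the library's implementation; in particular, returning full paths rather than bare names is an assumption.

```python
import os
import tempfile


def list_files_with_extension(dname, extension, max_files=0):
    """Return sorted paths in `dname` ending with `extension`.

    A max_files of 0 (the default) returns every match.
    """
    matches = sorted(
        os.path.join(dname, name)
        for name in os.listdir(dname)
        if name.endswith(extension)
    )
    return matches[:max_files] if max_files > 0 else matches


with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.wld2", "a.wld2", "c.txt"):
        open(os.path.join(tmp, name), "w").close()
    all_names = [os.path.basename(p) for p in list_files_with_extension(tmp, ".wld2")]
    first_only = [
        os.path.basename(p) for p in list_files_with_extension(tmp, ".wld2", max_files=1)
    ]
```

Sorting matters because the position of each file in the returned list is what later ties a graph to its integer id and its class label.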
-
geometric2dr.embedding_methods.utils.
get_kernel_matrix_row_idx_with_class
(corpus, extension, graph_files, class_labels_fname)[source]¶ Returns two lists. The first is a list of integers, each referencing a row in a kernel matrix and thereby a kernel vector corresponding to one of the graphs in the dataset. The second is a list of class labels, where each value is the classification of the graph at the same index in the first list.
Parameters: - corpus (corpus) – a corpus instance (such as SkipgramCorpus)
- extension (str) – extension of graph document under study
- graph_files (list) – list of paths to graph file
- class_labels_fname (str) – path to graph class label file
Returns: kernel_row_x_id, kernel_row_y_id. The first is a list of integers, each referencing a row in a kernel matrix and thereby a kernel vector corresponding to one of the graphs in the dataset. The second is a list of class labels, where each value is the classification of the graph at the same index in the first list.
Return type: tuple
-
geometric2dr.embedding_methods.utils.
save_graph_embeddings
(corpus, final_embeddings, opfname)[source]¶ Saves the trained embeddings of a corpus into a dictionary and saves this into a json file on the path given by opfname
Parameters: - corpus (corpus) – any corpus class such as PVDBOWCorpus
- final_embeddings (numpy ndarray) – matrix of target embeddings to be saved
- opfname (str) – path to file where embeddings should be saved in json format (extension optional in Unix)
Returns: embeddings will be saved into path denoted by opfname
Return type: None
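The general idea of storing embeddings as a JSON dictionary can be sketched as follows. The helper name, the explicit names argument and the exact key format are assumptions for illustration; the library derives the keys from the corpus object, and its on-disk layout may differ.

```python
import json
import os
import tempfile


def save_embeddings_as_json(names, embeddings, opfname):
    """Write {name: embedding vector} to `opfname` as JSON (hypothetical helper)."""
    if len(names) != len(embeddings):
        raise ValueError("one embedding row is required per name")
    # Convert every value to a plain float so that numpy scalars also serialise.
    payload = {name: [float(x) for x in row] for name, row in zip(names, embeddings)}
    with open(opfname, "w") as fh:
        json.dump(payload, fh)


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "embeddings.json")
    save_embeddings_as_json(["g0.wld2", "g1.wld2"], [[0.1, 0.2], [0.3, 0.4]], out_path)
    with open(out_path) as fh:
        restored = json.load(fh)
```

Keying by file name rather than by row index keeps the saved file usable even if the corpus is later rebuilt with a different file order.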
-
geometric2dr.embedding_methods.utils.
save_subgraph_embeddings
(corpus, final_embeddings, opfname)[source]¶ Save the embeddings along with a map to the patterns and the corpus
Parameters: - corpus (corpus) – a corpus class such as SkipgramCorpus
- final_embeddings (numpy ndarray) – matrix of target embeddings to be saved
- opfname (str) – path to file where embeddings should be saved in json format
Returns: embeddings will be saved into path denoted by opfname
Return type: None
References¶
[1] | Sergey Ivanov, Evgeny Burnaev. “Anonymous Walk Embeddings”. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2186-2195, 2018. |
[2] | (1, 2, 3, 4, 5)
|
[3] | Shervashidze, Nino & Schweitzer, Pascal & van Leeuwen, Erik Jan & Mehlhorn, Kurt & Borgwardt, Karsten. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research. 1. 1-48., 2010 |
[4] | Narayanan, Annamalai & Mahinthan, Chandramohan & Venkatesan, Rajasekar & Chen, Lihui & Liu, Yang & Jaiswal, Shantanu. “graph2vec: Learning Distributed Representations of Graphs”, 2017 |
[5] | Mikolov, Tomas & Corrado, G. S. & Chen, Kai & Dean, Jeffrey. Efficient Estimation of Word Representations in Vector Space. 1-12., 2013 |
[6] | Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning - Volume 32., 2014 |