dgl.data

The dgl.data package contains datasets hosted by DGL and also utilities for downloading, processing, saving and loading data from external resources.

Base Dataset Class

class dgl.data.DGLDataset(name, url=None, raw_dir=None, save_dir=None, hash_key=(), force_reload=False, verbose=False)[source]

The base class for creating graph datasets in DGL. It defines a template for a DGL dataset, and the following steps are executed automatically:

  1. Check whether a dataset cache exists on disk (that is, the dataset has already been processed and stored) by invoking has_cache(). If it does, go to step 5.

  2. Call download() to download the data.

  3. Call process() to process the data.

  4. Call save() to save the processed dataset on disk, then go to step 6.

  5. Call load() to load the processed dataset from disk.

  6. Done.

Users can override these functions with their own data processing logic.
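The control flow above can be sketched in plain Python without DGL. The class below is a hypothetical stand-in (SketchDataset and its calls log are not part of DGL), and the way force_reload bypasses the cache is an assumption rather than something stated in this section.

```python
class SketchDataset:
    """Hypothetical stand-in that mimics the documented call order."""

    def __init__(self, cached=False, force_reload=False):
        self.cached = cached      # pretend state of the on-disk cache
        self.calls = []           # records which hooks ran, in order
        self._run(force_reload)

    def _run(self, force_reload):
        # Step 1: use the cache if it exists (assumed: unless a reload is forced).
        if self.has_cache() and not force_reload:
            self.load()           # step 5
        else:
            self.download()       # step 2
            self.process()        # step 3
            self.save()           # step 4

    def has_cache(self):
        self.calls.append("has_cache")
        return self.cached

    def download(self):
        self.calls.append("download")

    def process(self):
        self.calls.append("process")

    def save(self):
        self.calls.append("save")
        self.cached = True

    def load(self):
        self.calls.append("load")


first_run = SketchDataset(cached=False).calls
cached_run = SketchDataset(cached=True).calls
print(first_run)   # ['has_cache', 'download', 'process', 'save']
print(cached_run)  # ['has_cache', 'load']
```

A real subclass would put its own logic in the five hooks and leave the ordering to the base class.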

Parameters
  • name (str) – Name of the dataset

  • url (str) – URL to download the raw dataset from

  • raw_dir (str) – Directory that will store the downloaded data, or that already stores the input data. Default: ~/.dgl/

  • save_dir (str) – Directory to save the processed dataset. Default: same as raw_dir

  • hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values. Default: (), the corresponding hash value is 'f9065fa7'.

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information
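To illustrate how hash_key can separate the caches of two instances of the same dataset class, here is a minimal sketch. The helper settings_hash is hypothetical, and the choice of SHA-1 over the tuple's string form, truncated to eight hex characters, is an assumption made for illustration; this section does not say which hash function DGL uses.

```python
import hashlib


def settings_hash(hash_key=()):
    """Hypothetical helper: short, deterministic digest of a settings tuple."""
    digest = hashlib.sha1(str(hash_key).encode("utf-8")).hexdigest()
    return digest[:8]  # eight hex characters, like the 'f9065fa7' example


# Two instances built with different settings get different cache suffixes.
with_reverse = settings_hash(("my_dataset", True))
without_reverse = settings_hash(("my_dataset", False))
print(with_reverse, without_reverse)
```

Appending such a value to the cache file name keeps differently configured instances from overwriting each other's processed data.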

url

The URL to download the dataset

Type

str

name

The dataset name

Type

str

raw_dir

Raw file directory that contains the input data folder

Type

str

raw_path

Directory that contains the input data files. Default: os.path.join(self.raw_dir, self.name)

Type

str

save_dir

Directory to save the processed dataset

Type

str

save_path

File path to save the processed dataset

Type

str

verbose

Whether to print information

Type

bool

hash

Hash value for the dataset and the setting.

Type

str

abstract __getitem__(idx)[source]

Gets the data object at index.

abstract __len__()[source]

The number of examples in the dataset.
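Because a dataset only has to define __getitem__ and __len__, it behaves like a read-only sequence. The sketch below uses a hypothetical ToyDataset (plain strings stand in for graphs) to show that Python's sequence protocol then makes a for-loop over the dataset work without an explicit __iter__.

```python
class ToyDataset:
    """Hypothetical dataset; strings stand in for graph objects."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, idx):
        # Raising IndexError past the end is what stops a for-loop.
        return self._items[idx]

    def __len__(self):
        return len(self._items)


dataset = ToyDataset(["g0", "g1", "g2"])
seen = [item for item in dataset]  # falls back to __getitem__(0), (1), ...
print(len(dataset), dataset[1], seen)
```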

download()[source]

Override to implement your own logic for downloading data.

It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.

has_cache()[source]

Override to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.

load()[source]

Override to implement your own logic for loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict object.

process()[source]

Override to implement your own logic for processing the input data.

save()[source]

Override to implement your own logic for saving the processed dataset into files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs into files and dgl.data.utils.save_info to save extra information into files.
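The has_cache(), save() and load() hooks fit together as a cache round trip. The following is a sketch only: pickle stands in for the recommended dgl.data.utils helpers so that it runs without DGL, and the class name, file name and info dict are all hypothetical.

```python
import os
import pickle
import tempfile


class CachedInfo:
    """Hypothetical dataset that caches a plain dict of extra information."""

    def __init__(self, save_dir):
        self.save_path = os.path.join(save_dir, "info.pkl")
        self.info = None

    def has_cache(self):
        # A cache exists once the processed file is on disk.
        return os.path.exists(self.save_path)

    def process(self):
        self.info = {"num_classes": 2}

    def save(self):
        with open(self.save_path, "wb") as f:
            pickle.dump(self.info, f)

    def load(self):
        with open(self.save_path, "rb") as f:
            self.info = pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    first = CachedInfo(tmp)
    cache_before = first.has_cache()   # nothing saved yet
    first.process()
    first.save()

    second = CachedInfo(tmp)           # a later run finds the cache
    cache_after = second.has_cache()
    second.load()
    restored = second.info
```

In a real dataset the graphs themselves would go through the graph-specific helpers, with only the side information stored as a dict.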

Node Prediction Datasets

DGL hosted datasets for node classification/regression tasks.

Stanford sentiment treebank dataset

class dgl.data.SSTDataset(mode='train', glove_embed_file=None, vocab_file=None, raw_dir=None, force_reload=False, verbose=False)[source]

Stanford Sentiment Treebank dataset.

    Deprecated since version 0.5.0:
  • trees is deprecated, it is replaced by:

    >>> dataset = SSTDataset()
    >>> for tree in dataset:
    ...     # your code here
    
  • num_vocabs is deprecated, it is replaced by vocab_size.

Each sample is the constituency tree of a sentence. The leaf nodes represent words. Each word is an int value stored in the x feature field. Non-leaf nodes have a special value PAD_WORD in the x field. Each node also has a sentiment annotation with 5 classes (very negative, negative, neutral, positive and very positive). The sentiment label is an int value stored in the y feature field. Official site: http://nlp.stanford.edu/sentiment/index.html
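As a sketch of that node layout, the snippet below encodes a tiny two-word tree with plain lists instead of a DGLGraph. The word ids, the labels and the value chosen for PAD_WORD are made up for illustration.

```python
PAD_WORD = -1  # assumed placeholder; the real constant's value is not given here

# Hypothetical tree: node 0 is the root, nodes 1 and 2 are its leaf children.
children = {0: [1, 2], 1: [], 2: []}
word_ids = {1: 17, 2: 42}          # made-up vocabulary ids for the two leaves
sentiment = {0: 3, 1: 2, 2: 4}     # made-up labels, one per node

nodes = sorted(children)
x = [word_ids.get(n, PAD_WORD) for n in nodes]        # word id or PAD_WORD
y = [sentiment[n] for n in nodes]                     # label on every node
mask = [1 if not children[n] else 0 for n in nodes]   # 1 marks a leaf

print(x, y, mask)
```

Only the leaves carry a real word id, which is why the mask is useful for selecting word positions.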

Statistics:

  • Train examples: 8,544

  • Dev examples: 1,101

  • Test examples: 2,210

  • Number of classes for each node: 5

Parameters
  • mode (str, optional) – Should be one of [‘train’, ‘dev’, ‘test’, ‘tiny’]. Default: train

  • glove_embed_file (str, optional) – The path to pretrained glove embedding file. Default: None

  • vocab_file (str, optional) – Optional vocabulary file. If not given, the default vocabulary file is used. Default: None

  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

vocab

Vocabulary of the dataset

Type

OrderedDict

trees

A list of DGLGraph objects

Type

list

num_classes

Number of classes for each node

Type

int

pretrained_emb

Pretrained GloVe embedding with respect to the vocabulary.

Type

Tensor

vocab_size

The size of the vocabulary

Type

int

num_vocabs

The size of the vocabulary

Type

int

Notes

All the samples will be loaded and preprocessed in the memory first.

Examples

>>> # get dataset
>>> train_data = SSTDataset()
>>> dev_data = SSTDataset(mode='dev')
>>> test_data = SSTDataset(mode='test')
>>> tiny_data = SSTDataset(mode='tiny')
>>>
>>> len(train_data)
8544
>>> train_data.num_classes
5
>>> glove_embed = train_data.pretrained_emb
>>> train_data.vocab_size
19536
>>> train_data[0]
Graph(num_nodes=71, num_edges=70,
  ndata_schemes={'x': Scheme(shape=(), dtype=torch.int64), 'y': Scheme(shape=(), dtype=torch.int64), 'mask': Scheme(shape=(), dtype=torch.int64)}
  edata_schemes={})
>>> for tree in train_data:
...     input_ids = tree.ndata['x']
...     labels = tree.ndata['y']
...     mask = tree.ndata['mask']
...     # your code here
__getitem__(idx)[source]

Get graph by index

Parameters

idx (int) – Item index

Returns

graph structure, word id for each node, node labels and masks.

  • ndata['x']: word id of the node

  • ndata['y']: label of the node

  • ndata['mask']: 1 if the node is a leaf, otherwise 0

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset.

Karate club dataset

class dgl.data.KarateClubDataset[source]

Karate Club dataset for Node Classification

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

    >>> dataset = KarateClubDataset()
    >>> g = dataset[0]
    

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002. Official website: http://konect.cc/networks/ucidata-zachary/

Karate Club dataset statistics:

  • Nodes: 34

  • Edges: 156

  • Number of Classes: 2

num_classes

Number of node classes

Type

int

data

A list of dgl.DGLGraph objects

Type

list

Examples

>>> dataset = KarateClubDataset()
>>> num_classes = dataset.num_classes
>>> g = dataset[0]
>>> labels = g.ndata['label']
__getitem__(idx)[source]

Get graph object

Parameters

idx (int) – Item index, KarateClubDataset has only one graph object

Returns

graph structure and labels.

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Citation network dataset

class dgl.data.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True)[source]

Cora citation network dataset.

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    
  • feat is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> feat = graph.ndata['feat']
    

Nodes represent papers and edges represent citation relationships. Each node has a predefined feature with 1433 dimensions. The dataset is designed for the node classification task: predicting the category of a given paper.

Statistics:

  • Nodes: 2708

  • Edges: 10556

  • Number of Classes: 7

  • Label split:

    • Train: 140

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

num_classes

Number of label classes

Type

int

graph

Graph structure

Type

networkx.DiGraph

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

labels

Ground truth labels of each node

Type

numpy.ndarray

features

Node features

Type

Tensor

Notes

The node feature is row-normalized.
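Row normalization can be read as scaling each node's feature vector so that its entries sum to one. That reading (division by the row sum, with all-zero rows left untouched) is an assumption; the note above does not spell out the exact scheme. A minimal pure-Python sketch:

```python
def row_normalize(features):
    """Divide every row by its sum; rows that sum to zero are kept as-is."""
    normalized = []
    for row in features:
        total = sum(row)
        if total == 0:
            normalized.append(list(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


raw = [[1.0, 1.0, 2.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
feat = row_normalize(raw)
print(feat)
```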

Examples

>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, CoraGraphDataset has only one graph object

Returns

graph structure, node features and labels.

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

  • ndata['feat']: node feature

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True)[source]

Citeseer citation network dataset.

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    
  • feat is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> feat = graph.ndata['feat']
    

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 3703 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.

Statistics:

  • Nodes: 3327

  • Edges: 9228

  • Number of Classes: 6

  • Label Split:

    • Train: 120

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

num_classes

Number of label classes

Type

int

graph

Graph structure

Type

networkx.DiGraph

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

labels

Ground truth labels of each node

Type

numpy.ndarray

features

Node features

Type

Tensor

Notes

The node feature is row-normalized.

The Citeseer dataset contains some isolated nodes. These isolated nodes are added as zero vectors at the corresponding positions.
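One way to picture that padding: if features are only available for the non-isolated nodes, the missing rows are filled with zeros so that row i still belongs to node i. The node ids and feature values below are made up, and pad_isolated is a hypothetical helper.

```python
def pad_isolated(num_nodes, known, dim):
    """Return one row per node id, using zeros where no features are known."""
    return [list(known.get(node, [0.0] * dim)) for node in range(num_nodes)]


# Nodes 1 and 3 are isolated and therefore have no entry in `known`.
known = {0: [0.5, 0.5], 2: [1.0, 0.0], 4: [0.0, 1.0]}
feat = pad_isolated(num_nodes=5, known=known, dim=2)
print(feat)
```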

Examples

>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, CiteseerGraphDataset has only one graph object

Returns

graph structure, node features and labels.

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

  • ndata['feat']: node feature

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True)[source]

Pubmed citation network dataset.

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    
  • feat is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> feat = graph.ndata['feat']
    

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 500 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.

Statistics:

  • Nodes: 19717

  • Edges: 88651

  • Number of Classes: 3

  • Label Split:

    • Train: 60

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

num_classes

Number of label classes

Type

int

graph

Graph structure

Type

networkx.DiGraph

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

labels

Ground truth labels of each node

Type

numpy.ndarray

features

Node features

Type

Tensor

Notes

The node feature is row-normalized.

Examples

>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, PubmedGraphDataset has only one graph object

Returns

graph structure, node features and labels.

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

  • ndata['feat']: node feature

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

CoraFull dataset

class dgl.data.CoraFullDataset(raw_dir=None, force_reload=False, verbose=False)[source]

CORA-Full dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = CoraFullDataset()
>>> graph = dataset[0]

Extended Cora dataset. Nodes represent papers and edges represent citations.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

  • Nodes: 19,793

  • Edges: 126,842 (note that the original dataset has 65,311 edges, but DGL adds the reverse edges and removes the duplicates, so the number differs)

  • Number of Classes: 70

  • Node feature size: 8,710
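The edge-count note can be reproduced on a toy graph. In this sketch (the edge list is made up), adding reverse edges and removing duplicates does not simply double the count, because an edge whose reverse is already present contributes nothing new.

```python
def add_reverse_and_dedupe(edges):
    """Return the sorted set of directed edges including every reverse edge."""
    both_directions = set(edges)
    both_directions.update((dst, src) for src, dst in edges)
    return sorted(both_directions)


# (1, 0) is already the reverse of (0, 1), and (1, 2) appears twice.
original = [(0, 1), (1, 0), (1, 2), (1, 2), (2, 3)]
symmetric = add_reverse_and_dedupe(original)
print(len(original), len(symmetric))  # 5 6
```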

Parameters
  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = CoraFullDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

RDF datasets

class dgl.data.AIFBDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

AIFB dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = AIFBDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = AIFBDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = AIFBDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
    

AIFB DataSet is a Semantic Web (RDF) dataset used as a benchmark in data mining. It records the organizational structure of AIFB at the University of Karlsruhe.

AIFB dataset statistics:

  • Nodes: 7262

  • Edges: 48810 (including reverse edges)

  • Target Category: Personen

  • Number of Classes: 4

  • Label Split:

    • Train: 140

    • Test: 36

Parameters
  • print_every (int) – Print a preprocessing log message every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor
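The deprecated train_idx and test_idx can be recovered from the boolean masks by collecting the positions that are set. The sketch below shows the idea with plain lists; the mask values are made up, and with tensors the same result comes from the nonzero call shown in the deprecation note.

```python
def mask_to_indices(mask):
    """Return the positions whose mask entry is truthy."""
    return [i for i, flag in enumerate(mask) if flag]


train_mask = [1, 0, 0, 1, 0, 1]
test_mask = [0, 1, 0, 0, 1, 0]
train_idx = mask_to_indices(train_mask)
test_idx = mask_to_indices(test_mask)
print(train_idx, test_idx)  # [0, 3, 5] [1, 4]
```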

Examples

>>> dataset = dgl.data.rdf.AIFBDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, AIFBDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

class dgl.data.MUTAGDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

MUTAG dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = MUTAGDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = MUTAGDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = MUTAGDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    

Mutag dataset statistics:

  • Nodes: 27163

  • Edges: 148100 (including reverse edges)

  • Target Category: d

  • Number of Classes: 2

  • Label Split:

    • Train: 272

    • Test: 68

Parameters
  • print_every (int) – Print a preprocessing log message every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

Examples

>>> dataset = dgl.data.rdf.MUTAGDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, MUTAGDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

class dgl.data.BGSDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

BGS dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = BGSDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = BGSDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = BGSDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    

BGS namespace convention: http://data.bgs.ac.uk/(ref|id)/<Major Concept>/<Sub Concept>/INSTANCE. We ignored all literal nodes and the relations connecting them in the output graph. We also ignored the relation used to mark whether a term is CURRENT or DEPRECATED.
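The namespace convention can be checked with a regular expression. This is only a sketch: the URI below is a made-up example and parse_bgs_uri is a hypothetical helper, not part of DGL.

```python
import re

BGS_PATTERN = re.compile(
    r"^http://data\.bgs\.ac\.uk/(?:ref|id)/"
    r"(?P<major>[^/]+)/(?P<sub>[^/]+)/(?P<instance>[^/]+)$"
)


def parse_bgs_uri(uri):
    """Split a BGS URI into (major concept, sub concept, instance) or None."""
    match = BGS_PATTERN.match(uri)
    if match is None:
        return None
    return match.group("major"), match.group("sub"), match.group("instance")


parsed = parse_bgs_uri("http://data.bgs.ac.uk/id/Lexicon/NamedRockUnit/XYZ")
print(parsed)  # ('Lexicon', 'NamedRockUnit', 'XYZ')
```

The first two components together give a category name of the form used for the target category, such as Lexicon/NamedRockUnit.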

BGS dataset statistics:

  • Nodes: 94806

  • Edges: 672884 (including reverse edges)

  • Target Category: Lexicon/NamedRockUnit

  • Number of Classes: 2

  • Label Split:

    • Train: 117

    • Test: 29

Parameters
  • print_every (int) – Print a preprocessing log message every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

Examples

>>> dataset = dgl.data.rdf.BGSDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, BGSDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

class dgl.data.AMDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

AM dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = AMDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = AMDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = AMDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    

Namespace convention:

  • Instance: http://purl.org/collections/nl/am/<type>-<id>

  • Relation: http://purl.org/collections/nl/am/<name>

We ignored all literal nodes and the relations connecting them in the output graph.
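As a sketch of how the instance convention can be taken apart, the helper below splits an instance URI into its type and id. The URI is a made-up example, parse_am_instance is hypothetical, and the code assumes that relation names contain no hyphen, which this section does not guarantee.

```python
AM_PREFIX = "http://purl.org/collections/nl/am/"


def parse_am_instance(uri):
    """Split an AM instance URI into (type, id); return None if it does not fit."""
    if not uri.startswith(AM_PREFIX):
        return None
    local_name = uri[len(AM_PREFIX):]
    node_type, sep, node_id = local_name.partition("-")
    if not sep or not node_type or not node_id:
        return None  # no '<type>-<id>' structure, e.g. a relation name
    return node_type, node_id


parsed = parse_am_instance("http://purl.org/collections/nl/am/proxy-12345")
print(parsed)  # ('proxy', '12345')
```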

AM dataset statistics:

  • Nodes: 881680

  • Edges: 5668682 (including reverse edges)

  • Target Category: proxy

  • Number of Classes: 11

  • Label Split:

    • Train: 802

    • Test: 198

Parameters
  • print_every (int) – Print a preprocessing log message every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

Examples

>>> dataset = dgl.data.rdf.AMDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, AMDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

Amazon Co-Purchase dataset

class dgl.data.AmazonCoBuyComputerDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Computer’ part of the AmazonCoBuy dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = AmazonCoBuyComputerDataset()
>>> graph = dataset[0]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

  • Nodes: 13,752

  • Edges: 491,722 (note that the original dataset has 245,778 edges, but DGL adds the reverse edges and removes the duplicates, so the number differs)

  • Number of classes: 10

  • Node feature size: 767

Parameters
  • raw_dir (str) – Directory that the raw data is downloaded to, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = AmazonCoBuyComputerDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

class dgl.data.AmazonCoBuyPhotoDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Photo’ part of the AmazonCoBuy dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = AmazonCoBuyPhotoDataset()
>>> graph = dataset[0]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics

  • Nodes: 7,650

  • Edges: 238,163 (the original dataset has 119,043 edges; DGL adds the reverse edges and removes duplicates, so the count differs)

  • Number of classes: 8

  • Node feature size: 745

Parameters
  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = AmazonCoBuyPhotoDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

Coauthor dataset

class dgl.data.CoauthorCSDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Computer Science (CS)’ part of the Coauthor dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = CoauthorCSDataset()
>>> graph = dataset[0]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Nodes are authors, and two authors are connected by an edge if they co-authored a paper; node features represent the keywords of each author’s papers, and class labels indicate each author’s most active fields of study.
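
To make the edge definition concrete, here is a minimal sketch of how a co-authorship edge list could be derived from a paper-to-authors table. The paper IDs and author names are hypothetical, and this is not the preprocessing used to build the dataset.

```python
from itertools import combinations

# Hypothetical paper -> author-list mapping (illustrative only).
papers = {
    "p1": ["alice", "bob"],
    "p2": ["bob", "carol", "dave"],
    "p3": ["alice", "bob"],
}


def coauthor_edges(paper_authors):
    """Return the undirected author pairs that share at least one paper."""
    edges = set()
    for authors in paper_authors.values():
        # Every pair of distinct authors on the same paper becomes one edge.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return sorted(edges)


edges = coauthor_edges(papers)
print(edges)
```

Repeated collaborations (here p1 and p3) collapse into a single edge because the pairs are collected in a set.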

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

  • Nodes: 18,333

  • Edges: 163,788 (the original dataset has 81,894 edges; DGL adds the reverse edges and removes duplicates, so the count differs)

  • Number of classes: 15

  • Node feature size: 6,805

Parameters
  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = CoauthorCSDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

class dgl.data.CoauthorPhysicsDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Physics’ part of the Coauthor dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = CoauthorPhysicsDataset()
>>> graph = dataset[0]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Nodes are authors, and two authors are connected by an edge if they co-authored a paper; node features represent the keywords of each author’s papers, and class labels indicate each author’s most active fields of study.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics

  • Nodes: 34,493

  • Edges: 495,924 (the original dataset has 247,962 edges; DGL adds the reverse edges and removes duplicates, so the count differs)

  • Number of classes: 5

  • Node feature size: 8,415

Parameters
  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = CoauthorPhysicsDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

Protein-Protein Interaction dataset

class dgl.data.PPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]

Protein-Protein Interaction dataset for inductive node classification

    Deprecated since version 0.5.0:
  • labels is deprecated, it is replaced by:

    >>> dataset = PPIDataset()
    >>> for g in dataset:
    ...     labels = g.ndata['label']
    ...
    >>>
    
  • features is deprecated, it is replaced by:

    >>> dataset = PPIDataset()
    >>> for g in dataset:
    ...     features = g.ndata['feat']
    ...
    >>>
    

A toy Protein-Protein Interaction network dataset. The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels. 20 graphs for training, 2 for validation and 2 for testing.

Reference: http://snap.stanford.edu/graphsage/

Statistics:

  • Train examples: 20

  • Valid examples: 2

  • Test examples: 2

Parameters
  • mode (str) – Must be one of ('train', 'valid', 'test'). Default: 'train'

  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_labels

Number of labels for each node

Type

int

labels

Node labels

Type

Tensor

features

Node features

Type

Tensor

Examples

>>> dataset = PPIDataset(mode='valid')
>>> num_labels = dataset.num_labels
>>> for g in dataset:
...     feat = g.ndata['feat']
...     label = g.ndata['label']
...     # your code here
>>>
__getitem__(item)[source]

Get the item-th sample.

Parameters

item (int) – The sample index.

Returns

graph structure, node features and node labels.

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()[source]

Return number of samples in this dataset.

Reddit dataset

class dgl.data.RedditDataset(self_loop=False, raw_dir=None, force_reload=False, verbose=False)[source]

Reddit dataset for community detection (node classification)

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    
  • num_labels is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> num_classes = dataset.num_classes
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • features is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> features = graph.ndata['feat']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    

This is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. The authors sampled 50 large communities and built a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. We use the first 20 days for training and the remaining days for testing (with 30% used for validation).

Reference: http://snap.stanford.edu/graphsage/

Statistics

  • Nodes: 232,965

  • Edges: 114,615,892

  • Node feature size: 602

  • Number of training samples: 153,431

  • Number of validation samples: 23,831

  • Number of test samples: 55,703
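
As a quick consistency check, the figures above can be combined in plain Python: the three split sizes add up to the node count, and the edge-to-node ratio matches the stated average degree of 492. The sketch also shows how a boolean mask such as train_mask maps to node IDs; the five-node mask is a made-up example, and no DGL call is involved.

```python
# Figures copied from the statistics listed above.
num_nodes = 232_965
num_edges = 114_615_892
split_sizes = {"train": 153_431, "val": 23_831, "test": 55_703}

# The three split sizes should add up to the number of nodes.
total = sum(split_sizes.values())

# Average degree implied by the edge and node counts.
avg_degree = num_edges / num_nodes


def mask_to_indices(mask):
    """Convert a boolean mask into the list of selected node IDs."""
    return [i for i, flag in enumerate(mask) if flag]


# Toy mask over five nodes (hypothetical values).
toy_train_mask = [True, False, True, True, False]
toy_train_idx = mask_to_indices(toy_train_mask)
print(total, round(avg_degree), toy_train_idx)
```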

Parameters
  • self_loop (bool) – Whether to load the dataset with self-loop connections. Default: False

  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node

Type

int

graph

Graph of the dataset

Type

dgl.DGLGraph

num_labels

Number of classes for each node

Type

int

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

features

Node features

Type

Tensor

labels

Node labels

Type

Tensor

Examples

>>> data = RedditDataset()
>>> g = data[0]
>>> num_classes = data.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]

Get graph by index

Parameters

idx (int) – Item index

Returns

graph structure, node labels, node features and splitting masks:

  • ndata['label']: node label

  • ndata['feat']: node feature

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset

Symmetric Stochastic Block Model Mixture dataset

class dgl.data.SBMMixtureDataset(n_graphs, n_nodes, n_communities, k=2, avg_deg=3, pq='Appendix_C', rng=None)[source]

Symmetric Stochastic Block Model Mixture

Reference: Appendix C of Supervised Community Detection with Hierarchical Graph Neural Networks

Parameters
  • n_graphs (int) – Number of graphs.

  • n_nodes (int) – Number of nodes.

  • n_communities (int) – Number of communities.

  • k (int, optional) – Multiplier. Default: 2

  • avg_deg (int, optional) – Average degree. Default: 3

  • pq (list of pair of nonnegative float or str, optional) – Random densities. This parameter is for future extension, for now it’s always using the default value. Default: Appendix_C

  • rng (numpy.random.RandomState, optional) – Random number generator. If not given, it’s numpy.random.RandomState() with seed=None, which reads data from /dev/urandom (or the Windows analogue) if available, or seeds from the clock otherwise. Default: None

Raises

RuntimeError is raised if pq is not a list or string.
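
For readers unfamiliar with the model, the sketch below samples one small symmetric stochastic block model graph in plain Python. The function name, the round-robin community assignment, and the values p=0.5 and q=0.05 are assumptions made for illustration; the densities this dataset actually uses come from Appendix C of the referenced paper and are not reproduced here.

```python
import random


def sample_symmetric_sbm(n_nodes, n_communities, p, q, rng):
    """Sample an undirected SBM graph.

    Nodes are assigned to communities round-robin; an edge is drawn with
    probability p inside a community and q across communities.
    """
    community = [i % n_communities for i in range(n_nodes)]
    edges = []
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            prob = p if community[u] == community[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return community, edges


rng = random.Random(0)  # fixed seed so the sample is reproducible
community, edges = sample_symmetric_sbm(20, 2, p=0.5, q=0.05, rng=rng)
print(len(edges))
```

Passing the generator in explicitly mirrors the role of the rng parameter above: the same seed yields the same graph.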

Examples

>>> from torch.utils.data import DataLoader
>>> data = SBMMixtureDataset(n_graphs=16, n_nodes=10000, n_communities=2)
>>> dataloader = DataLoader(data, batch_size=1, collate_fn=data.collate_fn)
>>> for graph, line_graph, graph_degrees, line_graph_degrees, pm_pd in dataloader:
...     pass  # your code here
__getitem__(idx)[source]

Get one example by index

Parameters

idx (int) – Item index

Returns

  • graph (dgl.DGLGraph) – The original graph

  • line_graph (dgl.DGLGraph) – The line graph of graph

  • graph_degree (numpy.ndarray) – In degrees for each node in graph

  • line_graph_degree (numpy.ndarray) – In degrees for each node in line_graph

  • pm_pd (numpy.ndarray) – Edge indicator matrices Pm and Pd
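
The relationship between graph, line_graph and the two degree arrays can be illustrated without DGL. In the sketch below each directed edge of the original graph becomes a node of the line graph, and edge i points to edge j when i ends where j starts. The helper names and the backtracking flag are illustrative assumptions; whether the dataset keeps or drops backtracking pairs is not stated here.

```python
def line_graph_edges(edges, backtracking=True):
    """Build line-graph edges: edge i -> edge j when i ends where j starts."""
    lg_edges = []
    for i, (u, v) in enumerate(edges):
        for j, (x, y) in enumerate(edges):
            if v != x:
                continue
            if not backtracking and y == u:
                continue  # skip an edge j that immediately reverses edge i
            lg_edges.append((i, j))
    return lg_edges


def in_degrees(num_nodes, edges):
    """Count incoming edges per node."""
    deg = [0] * num_nodes
    for _, dst in edges:
        deg[dst] += 1
    return deg


# Toy bidirected path 0 - 1 - 2, stored as four directed edges.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
lg = line_graph_edges(edges, backtracking=False)
graph_degree = in_degrees(3, edges)
line_graph_degree = in_degrees(len(edges), lg)
print(lg, graph_degree, line_graph_degree)
```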

__len__()[source]

Number of graphs in the dataset.

collate_fn(x)[source]

The collate function for dataloader

Parameters

x (tuple) –

a batch of data that contains:

  • graph: dgl.DGLGraph

    The original graph

  • line_graph: dgl.DGLGraph

    The line graph of graph

  • graph_degree: numpy.ndarray

    In degrees for each node in graph

  • line_graph_degree: numpy.ndarray

    In degrees for each node in line_graph

  • pm_pd: numpy.ndarray

    Edge indicator matrices Pm and Pd

Returns

  • g_batch (dgl.DGLGraph) – Batched graphs

  • lg_batch (dgl.DGLGraph) – Batched line graphs

  • degg_batch (numpy.ndarray) – A batch of in degrees for each node in g_batch

  • deglg_batch (numpy.ndarray) – A batch of in degrees for each node in lg_batch

  • pm_pd_batch (numpy.ndarray) – A batch of edge indicator matrices Pm and Pd

Fraud dataset

class dgl.data.FraudDataset(name, raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True)[source]

Fraud node prediction dataset.

The dataset includes two multi-relational graphs extracted from Yelp and Amazon where nodes represent fraudulent reviews or fraudulent reviewers.

It was first proposed in a CIKM'20 paper <https://arxiv.org/pdf/2008.08692.pdf> and has been used by a recent WWW'21 paper <https://ponderly.github.io/pub/PCGNN_WWW2021.pdf> as a benchmark. Another paper <https://arxiv.org/pdf/2104.01404.pdf> also uses the dataset as an example to study non-homophilous graphs. The dataset is built upon industrial data and has rich relational information and unique properties such as class imbalance and feature inconsistency, which make it a good instance for investigating how GNNs perform on real-world noisy graphs. These graphs are bidirected and contain no self-loops.

Reference: <https://github.com/YingtongDou/CARE-GNN>

Parameters
  • name (str) – Name of the dataset

  • raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/

  • random_seed (int) – Specifying the random seed in splitting the dataset. Default: 717

  • train_size (float) – Fraction of the dataset used for training. Default: 0.7

  • val_size (float) – Fraction of the dataset used for validation; the test set takes the remaining (1 - train_size - val_size). Default: 0.1

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.
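
How random_seed, train_size and val_size interact can be sketched in plain Python. The helper below is hypothetical: it shuffles node IDs with a fixed seed and cuts them into three boolean masks. DGL's actual splitting procedure may differ (for example in how unlabeled nodes are treated), so treat this only as an illustration of the parameters.

```python
import random


def make_split_masks(num_nodes, train_size=0.7, val_size=0.1, seed=717):
    """Shuffle node IDs with a fixed seed and cut them into three masks."""
    ids = list(range(num_nodes))
    random.Random(seed).shuffle(ids)
    n_train = round(num_nodes * train_size)
    n_val = round(num_nodes * val_size)
    train_mask = [False] * num_nodes
    val_mask = [False] * num_nodes
    test_mask = [False] * num_nodes
    for i in ids[:n_train]:
        train_mask[i] = True
    for i in ids[n_train:n_train + n_val]:
        val_mask[i] = True
    # Everything left over, i.e. 1 - train_size - val_size, goes to the test set.
    for i in ids[n_train + n_val:]:
        test_mask[i] = True
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = make_split_masks(100)
print(sum(train_mask), sum(val_mask), sum(test_mask))
```

Reusing the same seed reproduces the same split, which is the purpose of exposing random_seed as a parameter.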

num_classes

Number of label classes

Type

int

graph

Graph structure, etc.

Type

dgl.DGLGraph

seed

Random seed in splitting the dataset.

Type

int

train_size

Training set size of the dataset.

Type

float

val_size

Validation set size of the dataset

Type

float

Examples

>>> dataset = FraudDataset('yelp')
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
__getitem__(idx)[source]

Get graph object

Parameters

idx (int) – Item index

Returns

graph structure, node features, node labels and masks

  • ndata['feature']: node features

  • ndata['label']: node labels

  • ndata['train_mask']: mask of training set

  • ndata['val_mask']: mask of validation set

  • ndata['test_mask']: mask of testing set

Return type

dgl.DGLGraph

__len__()[source]

number of data examples

class dgl.data.FraudYelpDataset(raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True)[source]

Fraud Yelp Dataset

The Yelp dataset includes hotel and restaurant reviews filtered (spam) and recommended (legitimate) by Yelp. A spam review detection task can be conducted, which is a binary classification task. 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370> are taken as the raw node features. Reviews are nodes in the graph, and three relations are:

  1. R-U-R: it connects reviews posted by the same user

  2. R-S-R: it connects reviews under the same product with the same star rating (1-5 stars)

  3. R-T-R: it connects two reviews under the same product posted in the same month.
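
Each relation above connects reviews that share some attribute. A minimal sketch of that construction for R-U-R, using a made-up review-to-user table (this is not the dataset's preprocessing code):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical review -> user assignments (illustrative only).
review_user = {0: "u1", 1: "u2", 2: "u1", 3: "u1", 4: "u2"}


def same_key_edges(item_to_key):
    """Connect every pair of items that share the same key."""
    groups = defaultdict(list)
    for item, key in item_to_key.items():
        groups[key].append(item)
    edges = []
    for members in groups.values():
        edges.extend(combinations(sorted(members), 2))
    return sorted(edges)


rur_edges = same_key_edges(review_user)
print(rur_edges)
```

The same helper would cover R-S-R and R-T-R by keying each review on a (product, star rating) or (product, month) pair instead of the user.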

Statistics:

  • Nodes: 45,954

  • Edges:

    • R-U-R: 98,630

    • R-T-R: 1,147,232

    • R-S-R: 6,805,486

  • Classes:

    • Positive (spam): 6,677

    • Negative (legitimate): 39,277

  • Positive-Negative ratio: 1 : 5.9

  • Node feature size: 32
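
The class imbalance figures can be recomputed directly from the counts above, which is useful when choosing class weights or a sampling strategy. The variable names are illustrative.

```python
# Class counts copied from the statistics listed above.
num_positive = 6_677   # spam
num_negative = 39_277  # legitimate
num_nodes = num_positive + num_negative

# Negatives per positive, i.e. the x in the "1 : x" ratio.
neg_per_pos = num_negative / num_positive

# Fraction of nodes that are positive.
positive_fraction = num_positive / num_nodes
print(num_nodes, round(neg_per_pos, 1), round(positive_fraction, 3))
```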

Parameters
  • raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/

  • random_seed (int) – Specifying the random seed in splitting the dataset. Default: 717

  • train_size (float) – Fraction of the dataset used for training. Default: 0.7

  • val_size (float) – Fraction of the dataset used for validation; the test set takes the remaining (1 - train_size - val_size). Default: 0.1

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

Examples

>>> dataset = FraudYelpDataset()
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
__getitem__(idx)

Get graph object

Parameters

idx (int) – Item index

Returns

graph structure, node features, node labels and masks

  • ndata['feature']: node features

  • ndata['label']: node labels

  • ndata['train_mask']: mask of training set

  • ndata['val_mask']: mask of validation set

  • ndata['test_mask']: mask of testing set

Return type

dgl.DGLGraph

__len__()

number of data examples

class dgl.data.FraudAmazonDataset(raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True)[source]

Fraud Amazon Dataset

The Amazon dataset includes product reviews under the Musical Instruments category. Users with more than 80% helpful votes are labelled as benign entities and users with less than 20% helpful votes are labelled as fraudulent entities. A fraudulent user detection task can be conducted on the Amazon dataset, which is a binary classification task. 25 handcrafted features from <https://arxiv.org/pdf/2005.10150.pdf> are taken as the raw node features.

Users are nodes in the graph, and three relations are:

  1. U-P-U: it connects users reviewing at least one same product

  2. U-S-U: it connects users having at least one same star rating within one week

  3. U-V-U: it connects users with top 5% mutual review text similarities (measured by TF-IDF) among all users.

Statistics:

  • Nodes: 11,944

  • Edges:

    • U-P-U: 351,216

    • U-S-U: 7,132,958

    • U-V-U: 2,073,474

  • Classes:

    • Positive (fraudulent): 821

    • Negative (benign): 7,818

    • Unlabeled: 3,305

  • Positive-Negative ratio: 1 : 9.5

  • Node feature size: 25

Parameters
  • raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/

  • random_seed (int) – Specifying the random seed in splitting the dataset. Default: 717

  • train_size (float) – Fraction of the dataset used for training. Default: 0.7

  • val_size (float) – Fraction of the dataset used for validation; the test set takes the remaining (1 - train_size - val_size). Default: 0.1

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

Examples

>>> dataset = FraudAmazonDataset()
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
__getitem__(idx)

Get graph object

Parameters

idx (int) – Item index

Returns

graph structure, node features, node labels and masks

  • ndata['feature']: node features

  • ndata['label']: node labels

  • ndata['train_mask']: mask of training set

  • ndata['val_mask']: mask of validation set

  • ndata['test_mask']: mask of testing set

Return type

dgl.DGLGraph

__len__()

number of data examples

Edge Prediction Datasets

DGL hosted datasets for edge classification/regression and link prediction tasks.

Knowledge graph dataset

class dgl.data.FB15k237Dataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

FB15k237 link prediction dataset.

    Deprecated since version 0.5.0:
  • train is deprecated, it is replaced by:

    >>> dataset = FB15k237Dataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.edata['train_mask']
    >>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(train_idx)
    >>> rel = graph.edata['etype'][train_idx]
    
  • valid is deprecated, it is replaced by:

    >>> dataset = FB15k237Dataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.edata['val_mask']
    >>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(val_idx)
    >>> rel = graph.edata['etype'][val_idx]
    
  • test is deprecated, it is replaced by:

    >>> dataset = FB15k237Dataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.edata['test_mask']
    >>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(test_idx)
    >>> rel = graph.edata['etype'][test_idx]
    

FB15k-237 is a subset of FB15k where inverse relations are removed. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.

FB15k237 dataset statistics:

  • Nodes: 14541

  • Number of relation types: 237

  • Number of reversed relation types: 237

  • Label Split:

    • Train: 272115

    • Valid: 17535

    • Test: 20466
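
The "reversed relation types" entry reflects that every edge gets a reverse edge with its own relation type. One common way to encode this, assumed here for illustration and not confirmed as DGL's exact scheme, is to offset the relation ID of the reverse edge by the number of original relation types:

```python
def add_reverse_triplets(triplets, num_rels):
    """Append (dst, rel + num_rels, src) for every (src, rel, dst)."""
    reversed_triplets = [(dst, rel + num_rels, src) for src, rel, dst in triplets]
    return triplets + reversed_triplets


# Toy triplets over 2 relation types (hypothetical data).
triplets = [(0, 0, 1), (1, 1, 2)]
all_triplets = add_reverse_triplets(triplets, num_rels=2)
print(all_triplets)
```

Under this encoding the number of relation IDs doubles, which is consistent with an equal count of original and reversed relation types.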

Parameters
  • reverse (bool) – Whether to add reverse edges. Default: True

  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_nodes

Number of nodes

Type

int

num_rels

Number of relation types

Type

int

train

A numpy array of triplets (src, rel, dst) for the training graph

Type

numpy.ndarray

valid

A numpy array of triplets (src, rel, dst) for the validation graph

Type

numpy.ndarray

test

A numpy array of triplets (src, rel, dst) for the test graph

Type

numpy.ndarray

Examples

>>> import torch as th
>>> dataset = FB15k237Dataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>> test_mask = g.edata['test_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges,
...                           relabel_nodes=False)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g (training edges plus validation edges)
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges,
...                         relabel_nodes=False)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, FB15k237Dataset has only one graph object

Returns

The graph contains

  • edata['e_type']: edge relation type

  • edata['train_edge_mask']: positive training edge mask

  • edata['val_edge_mask']: positive validation edge mask

  • edata['test_edge_mask']: positive testing edge mask

  • edata['train_mask']: training edge set mask (include reversed training edges)

  • edata['val_mask']: validation edge set mask (include reversed validation edges)

  • edata['test_mask']: testing edge set mask (include reversed testing edges)

  • ndata['ntype']: node type. All 0 in this dataset

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.FB15kDataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

FB15k link prediction dataset.

    Deprecated since version 0.5.0:
  • train is deprecated, it is replaced by:

    >>> dataset = FB15kDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.edata['train_mask']
    >>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(train_idx)
    >>> rel = graph.edata['etype'][train_idx]
    
  • valid is deprecated, it is replaced by:

    >>> dataset = FB15kDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.edata['val_mask']
    >>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(val_idx)
    >>> rel = graph.edata['etype'][val_idx]
    
  • test is deprecated, it is replaced by:

    >>> dataset = FB15kDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.edata['test_mask']
    >>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(test_idx)
    >>> rel = graph.edata['etype'][test_idx]
    

The FB15k dataset was introduced in Translating Embeddings for Modeling Multi-relational Data. It is a subset of Freebase which contains 14,951 entities with 1,345 different relations. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.

FB15k dataset statistics:

  • Nodes: 14,951

  • Number of relation types: 1,345

  • Number of reversed relation types: 1,345

  • Label Split:

    • Train: 483142

    • Valid: 50000

    • Test: 59071

Parameters
  • reverse (bool) – Whether to add reverse edges. Default: True

  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_nodes

Number of nodes

Type

int

num_rels

Number of relation types

Type

int

train

A numpy array of triplets (src, rel, dst) for the training graph

Type

numpy.ndarray

valid

A numpy array of triplets (src, rel, dst) for the validation graph

Type

numpy.ndarray

test

A numpy array of triplets (src, rel, dst) for the test graph

Type

numpy.ndarray

Examples

>>> import torch as th
>>> dataset = FB15kDataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges,
...                           relabel_nodes=False)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g (training edges plus validation edges)
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges,
...                         relabel_nodes=False)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
>>>
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, FB15kDataset has only one graph object

Returns

The graph contains

  • edata['e_type']: edge relation type

  • edata['train_edge_mask']: positive training edge mask

  • edata['val_edge_mask']: positive validation edge mask

  • edata['test_edge_mask']: positive testing edge mask

  • edata['train_mask']: training edge set mask (include reversed training edges)

  • edata['val_mask']: validation edge set mask (include reversed validation edges)

  • edata['test_mask']: testing edge set mask (include reversed testing edges)

  • ndata['ntype']: node type. All 0 in this dataset

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.WN18Dataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

WN18 link prediction dataset.

    Deprecated since version 0.5.0:
  • train is deprecated, it is replaced by:

    >>> dataset = WN18Dataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.edata['train_mask']
    >>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(train_idx)
    >>> rel = graph.edata['etype'][train_idx]
    
  • valid is deprecated, it is replaced by:

    >>> dataset = WN18Dataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.edata['val_mask']
    >>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(val_idx)
    >>> rel = graph.edata['etype'][val_idx]
    
  • test is deprecated, it is replaced by:

    >>> dataset = WN18Dataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.edata['test_mask']
    >>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
    >>> src, dst = graph.edges(test_idx)
    >>> rel = graph.edata['etype'][test_idx]
    

The WN18 dataset was introduced in Translating Embeddings for Modeling Multi-relational Data. It includes the full 18 relations scraped from WordNet for roughly 41,000 synsets. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.

WN18 dataset statistics:

  • Nodes: 40943

  • Number of relation types: 18

  • Number of reversed relation types: 18

  • Label Split:

    • Train: 141442

    • Valid: 5000

    • Test: 5000

Parameters
  • reverse (bool) – Whether to add reverse edges. Default: True

  • raw_dir (str) – Directory that will store the downloaded data, or that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_nodes

Number of nodes

Type

int

num_rels

Number of relation types

Type

int

train

A numpy array of triplets (src, rel, dst) for the training graph

Type

numpy.ndarray

valid

A numpy array of triplets (src, rel, dst) for the validation graph

Type

numpy.ndarray

test

A numpy array of triplets (src, rel, dst) for the test graph

Type

numpy.ndarray

Examples

>>> import torch as th
>>> dataset = WN18Dataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges,
...                           relabel_nodes=False)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g (training edges plus validation edges)
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges,
...                         relabel_nodes=False)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
>>>
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, WN18Dataset has only one graph object

Returns

The graph contains

  • edata['e_type']: edge relation type

  • edata['train_edge_mask']: positive training edge mask

  • edata['val_edge_mask']: positive validation edge mask

  • edata['test_edge_mask']: positive testing edge mask

  • edata['train_mask']: training edge set mask (include reversed training edges)

  • edata['val_mask']: validation edge set mask (include reversed validation edges)

  • edata['test_mask']: testing edge set mask (include reversed testing edges)

  • ndata['ntype']: node type. All 0 in this dataset

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

BitcoinOTC dataset

class dgl.data.BitcoinOTCDataset(raw_dir=None, force_reload=False, verbose=False)[source]

BitcoinOTC dataset for fraud detection

This is a who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users’ reputation to prevent transactions with fraudulent and risky users.

Official website: https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html

Bitcoin OTC dataset statistics:

  • Nodes: 5,881

  • Edges: 35,592

  • Range of edge weight: -10 to +10

  • Percentage of positive edges: 89%

Parameters
  • raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

graphs

A list of DGLGraph objects

Type

list

is_temporal

Indicates whether the graphs are temporal graphs

Type

bool

Raises

UserWarning – If the raw data is changed in the remote server by the author.

Examples

>>> dataset = BitcoinOTCDataset()
>>> len(dataset)
136
>>> for g in dataset:
...     # get edge feature
...     edge_weights = g.edata['h']
...     # your code here
>>>
__getitem__(item)[source]

Get graph by index

Parameters

item (int) – Item index

Returns

The graph contains:

  • edata['h'] : edge weights

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

ICEWS18 dataset

class dgl.data.ICEWS18Dataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]

ICEWS18 dataset for temporal graph

Integrated Crisis Early Warning System (ICEWS18)

Event data consists of coded interactions between socio-political actors (i.e., cooperative or hostile actions between individuals, groups, sectors and nation states). This dataset consists of events from 1/1/2018 to 10/31/2018 (24-hour time granularity).

Reference:

Statistics:

  • Train examples: 240

  • Valid examples: 30

  • Test examples: 34

  • Nodes per graph: 23033
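
The split sizes above are consistent with one graph per day over the stated date range. The short check below only uses the dates and counts quoted in this section; it is a sanity check on those numbers, not a description of how the dataset is built internally.

```python
from datetime import date

# Date range and split sizes as stated in the description above.
first_day = date(2018, 1, 1)
last_day = date(2018, 10, 31)
num_days = (last_day - first_day).days + 1  # count both endpoints

split_sizes = {"train": 240, "valid": 30, "test": 34}
total_graphs = sum(split_sizes.values())

print(num_days, total_graphs)  # 304 304
```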

Parameters
  • mode (str) – Which split to load. Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’

  • raw_dir (str) – Directory to which the raw data is downloaded, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

is_temporal

Whether the dataset contains temporal graphs

Type

bool

Examples

>>> # get train, valid, test set
>>> train_data = ICEWS18Dataset()
>>> valid_data = ICEWS18Dataset(mode='valid')
>>> test_data = ICEWS18Dataset(mode='test')
>>>
>>> train_size = len(train_data)
>>> for g in train_data:
...     e_feat = g.edata['rel_type']
...     # your code here
...
>>>
__getitem__(idx)[source]

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • edata['rel_type']: edge type

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

GDELT dataset

class dgl.data.GDELTDataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]

GDELT dataset for event-based temporal graph

The Global Database of Events, Language, and Tone (GDELT) dataset. It contains events that happened all over the world (e.g., every protest held anywhere in Russia on a given day is collapsed into a single entry). This dataset consists of events collected from 1/1/2018 to 1/31/2018 (15-minute time granularity).

Reference:

Statistics:

  • Train examples: 2,304

  • Valid examples: 288

  • Test examples: 384
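
The split sizes above add up to the number of 15-minute intervals in January 2018. The check below uses only the dates, granularity and counts quoted in this section; treating the range as covering the whole of 1/31/2018 is an assumption.

```python
from datetime import datetime, timedelta

start = datetime(2018, 1, 1)
end = datetime(2018, 2, 1)  # exclusive end, so all of 1/31/2018 is covered (assumption)
step = timedelta(minutes=15)

num_intervals = (end - start) // step  # timedelta // timedelta gives an int
split_sizes = {"train": 2304, "valid": 288, "test": 384}
total_graphs = sum(split_sizes.values())

print(num_intervals, total_graphs)  # 2976 2976
```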

Parameters
  • mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’

  • raw_dir (str) – Directory to which the raw data is downloaded, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

start_time

Start time of the temporal graph

Type

int

end_time

End time of the temporal graph

Type

int

is_temporal

Whether the dataset contains temporal graphs

Type

bool

Examples

>>> # get train, valid, test dataset
>>> train_data = GDELTDataset()
>>> valid_data = GDELTDataset(mode='valid')
>>> test_data = GDELTDataset(mode='test')
>>>
>>> # length of train set
>>> train_size = len(train_data)
>>>
>>> for g in train_data:
...     e_feat = g.edata['rel_type']
...     # your code here
...
>>>
__getitem__(t)[source]

Get the graph containing the events before time t + self.start_time

Parameters

t (int) – Time, its value must be in range [0, self.end_time - self.start_time]

Returns

The graph contains:

  • edata['rel_type']: edge type

Return type

dgl.DGLGraph
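
To illustrate the time-offset indexing described above, here is a small, library-free sketch. The class ToyTemporalEvents and its event tuples are hypothetical stand-ins, and treating "before" as inclusive of time t + start_time is an assumption, not something this reference states.

```python
class ToyTemporalEvents:
    """Hypothetical stand-in for a time-indexed event dataset."""

    def __init__(self, events):
        # Each event is a (timestamp, source, relation, destination) tuple.
        self.events = sorted(events)
        self.start_time = self.events[0][0]
        self.end_time = self.events[-1][0]

    def __len__(self):
        # Valid offsets t are 0 .. end_time - start_time.
        return self.end_time - self.start_time + 1

    def __getitem__(self, t):
        if not 0 <= t < len(self):
            raise IndexError("t must be in [0, end_time - start_time]")
        limit = self.start_time + t
        # Assumption: events at exactly `limit` are included.
        return [event[1:] for event in self.events if event[0] <= limit]


events = [(10, 0, 3, 1), (10, 1, 2, 2), (11, 2, 3, 0), (13, 0, 1, 2)]
data = ToyTemporalEvents(events)

print(len(data))  # 4
print(data[0])    # [(0, 3, 1), (1, 2, 2)]
print(data[1])    # [(0, 3, 1), (1, 2, 2), (2, 3, 0)]
```

The index t is an offset from start_time rather than an absolute timestamp, which is why the valid range starts at 0.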

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

Graph Prediction Datasets

DGL hosted datasets for graph classification/regression tasks.

QM7b dataset

class dgl.data.QM7bDataset(raw_dir=None, force_reload=False, verbose=False)[source]

QM7b dataset for graph property prediction (regression)

This dataset consists of 7,211 molecules with 14 regression targets. Nodes represent atoms and edges represent bonds. The edge feature ‘h’ stores the corresponding entry of the Coulomb matrix.
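
For readers unfamiliar with the Coulomb matrix, the sketch below computes it from nuclear charges and coordinates using the definition common in the quantum-chemistry literature: 0.5 * Z_i^2.4 on the diagonal and Z_i * Z_j / |R_i - R_j| off the diagonal. The function name and the toy inputs are hypothetical, and whether the stored edge values follow exactly this convention (units, ordering) is an assumption.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a nested list (assumed standard definition)."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy system: two unit charges placed 2.0 length units apart.
charges = [1.0, 1.0]
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
cm = coulomb_matrix(charges, coords)

print(cm)  # [[0.5, 0.5], [0.5, 0.5]]
```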

Reference: http://quantum-machine.org/datasets/

Statistics:

  • Number of graphs: 7,211

  • Number of regression targets: 14

  • Average number of nodes: 15

  • Average number of edges: 245

  • Edge feature size: 1

Parameters
  • raw_dir (str) – Directory to which the raw data is downloaded, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

num_labels

Number of labels for each graph, i.e. number of prediction tasks

Type

int

Raises

UserWarning – If the raw data is changed in the remote server by the author.

Examples

>>> data = QM7bDataset()
>>> data.num_labels
14
>>>
>>> # iterate over the dataset
>>> for g, label in data:
...     edge_feat = g.edata['h']  # get edge feature
...     # your code here...
...
>>>
__getitem__(idx)[source]

Get graph and label by index

Parameters

idx (int) – Item index

Returns

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

QM9 dataset

class dgl.data.QM9Dataset(label_keys, cutoff=5.0, raw_dir=None, force_reload=False, verbose=False)[source]

QM9 dataset for graph property prediction (regression)

This dataset consists of 130,831 molecules with 12 regression targets. Nodes correspond to atoms and edges correspond to close atom pairs.

This dataset differs from QM9EdgeDataset in the following aspects:
  1. Edges in this dataset are purely distance-based.

  2. It only provides atoms’ coordinates and atomic numbers as node features.

  3. It only provides 12 regression targets.

Reference:

Statistics:

  • Number of graphs: 130,831

  • Number of regression targets: 12

Keys     Property                      Description                                 Unit
mu       \(\mu\)                       Dipole moment                               \(\textrm{D}\)
alpha    \(\alpha\)                    Isotropic polarizability                    \({a_0}^3\)
homo     \(\epsilon_{\textrm{HOMO}}\)  Highest occupied molecular orbital energy   \(\textrm{eV}\)
lumo     \(\epsilon_{\textrm{LUMO}}\)  Lowest unoccupied molecular orbital energy  \(\textrm{eV}\)
gap      \(\Delta \epsilon\)           Gap between \(\epsilon_{\textrm{HOMO}}\) and \(\epsilon_{\textrm{LUMO}}\)  \(\textrm{eV}\)
r2       \(\langle R^2 \rangle\)       Electronic spatial extent                   \({a_0}^2\)
zpve     \(\textrm{ZPVE}\)             Zero point vibrational energy               \(\textrm{eV}\)
U0       \(U_0\)                       Internal energy at 0K                       \(\textrm{eV}\)
U        \(U\)                         Internal energy at 298.15K                  \(\textrm{eV}\)
H        \(H\)                         Enthalpy at 298.15K                         \(\textrm{eV}\)
G        \(G\)                         Free energy at 298.15K                      \(\textrm{eV}\)
Cv       \(c_{\textrm{v}}\)            Heat capacity at 298.15K                    \(\frac{\textrm{cal}}{\textrm{mol K}}\)

Parameters
  • label_keys (list) – Names of the regression property, which should be a subset of the keys in the table above.

  • cutoff (float) – Cutoff distance for interatomic interactions, i.e. two atoms are connected in the corresponding graph if the distance between them is no larger than this. Default: 5.0 Angstrom

  • raw_dir (str) – Directory to which the raw data is downloaded, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.
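
The cutoff rule described for the cutoff parameter can be sketched without any graph library: two atoms are connected when their distance is no larger than the cutoff. The helper cutoff_edges and the toy coordinates are hypothetical; emitting each pair in both directions and omitting self-loops are assumptions about the edge convention.

```python
import math
from itertools import permutations


def cutoff_edges(coords, cutoff=5.0):
    """Return directed edges (i, j) for atom pairs at most `cutoff` apart."""
    return [
        (i, j)
        for i, j in permutations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) <= cutoff
    ]


# Three atoms on a line; only the first two are within 5.0 of each other.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (9.0, 0.0, 0.0)]

print(cutoff_edges(coords, cutoff=5.0))  # [(0, 1), (1, 0)]
```

Raising the cutoff adds edges between more distant atoms, so the same molecule yields a denser graph.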

num_labels

Number of labels for each graph, i.e. number of prediction tasks

Type

int

Raises

UserWarning – If the raw data is changed in the remote server by the author.

Examples

>>> data = QM9Dataset(label_keys=['mu', 'gap'], cutoff=5.0)
>>> data.num_labels
2
>>>
>>> # iterate over the dataset
>>> for g, label in data:
...     R = g.ndata['R'] # get coordinates of each atom
...     Z = g.ndata['Z'] # get atomic numbers of each atom
...     # your code here...
>>>
__getitem__(idx)[source]

Get graph and label by index

Parameters

idx (int) – Item index

Returns

  • dgl.DGLGraph – The graph contains:

    • ndata['R']: the coordinates of each atom

    • ndata['Z']: the atomic number

  • Tensor – Property values of molecular graphs

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

QM9Edge dataset

class dgl.data.QM9EdgeDataset(label_keys=None, raw_dir=None, force_reload=False, verbose=True)[source]

QM9Edge dataset for graph property prediction (regression)

This dataset consists of 130,831 molecules with 19 regression targets. Nodes correspond to atoms and edges correspond to bonds.

This dataset differs from QM9Dataset in the following aspects:
  1. It includes the bonds in a molecule in the edges of the corresponding graph while the edges in QM9Dataset are purely distance-based.

  2. It provides edge features, and node features in addition to the atoms’ coordinates and atomic numbers.

  3. It provides 7 additional regression targets (19 instead of 12).

This class is built based on a preprocessed version of the dataset, and we provide the preprocessing details here.

Reference:

Statistics:

  • Number of graphs: 130,831.

  • Number of regression targets: 19.

Node attributes:

  • pos: the 3D coordinates of each atom.

  • attr: the 11D atom features.

Edge attributes:

  • edge_attr: the 4D bond features.

Regression targets:

Keys     Property                      Description                                 Unit
mu       \(\mu\)                       Dipole moment                               \(\textrm{D}\)
alpha    \(\alpha\)                    Isotropic polarizability                    \({a_0}^3\)
homo     \(\epsilon_{\textrm{HOMO}}\)  Highest occupied molecular orbital energy   \(\textrm{eV}\)
lumo     \(\epsilon_{\textrm{LUMO}}\)  Lowest unoccupied molecular orbital energy  \(\textrm{eV}\)
gap      \(\Delta \epsilon\)           Gap between \(\epsilon_{\textrm{HOMO}}\) and \(\epsilon_{\textrm{LUMO}}\)  \(\textrm{eV}\)
r2       \(\langle R^2 \rangle\)       Electronic spatial extent                   \({a_0}^2\)
zpve     \(\textrm{ZPVE}\)             Zero point vibrational energy               \(\textrm{eV}\)
U0       \(U_0\)                       Internal energy at 0K                       \(\textrm{eV}\)
U        \(U\)                         Internal energy at 298.15K                  \(\textrm{eV}\)
H        \(H\)                         Enthalpy at 298.15K                         \(\textrm{eV}\)
G        \(G\)                         Free energy at 298.15K                      \(\textrm{eV}\)
Cv       \(c_{\textrm{v}}\)            Heat capacity at 298.15K                    \(\frac{\textrm{cal}}{\textrm{mol K}}\)
U0_atom  \(U_0^{\textrm{ATOM}}\)       Atomization energy at 0K                    \(\textrm{eV}\)
U_atom   \(U^{\textrm{ATOM}}\)         Atomization energy at 298.15K               \(\textrm{eV}\)
H_atom   \(H^{\textrm{ATOM}}\)         Atomization enthalpy at 298.15K             \(\textrm{eV}\)
G_atom   \(G^{\textrm{ATOM}}\)         Atomization free energy at 298.15K          \(\textrm{eV}\)
A        \(A\)                         Rotational constant                         \(\textrm{GHz}\)
B        \(B\)                         Rotational constant                         \(\textrm{GHz}\)
C        \(C\)                         Rotational constant                         \(\textrm{GHz}\)

Parameters
  • label_keys (list) – Names of the regression property, which should be a subset of the keys in the table above. If not provided, it will load all the labels.

  • raw_dir (str) – Directory to which the raw data is downloaded, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False.

  • verbose (bool) – Whether to print out progress information. Default: True.

num_labels

Number of labels for each graph, i.e. number of prediction tasks

Type

int

Raises

UserWarning – If the raw data is changed in the remote server by the author.

Examples

>>> data = QM9EdgeDataset(label_keys=['mu', 'alpha'])
>>> data.num_labels
2
>>> # iterate over the dataset
>>> for graph, labels in data:
...     print(graph) # get information of each graph
...     print(labels) # get labels of the corresponding graph
...     # your code here...
>>>
__getitem__(idx)[source]

Get graph and label by index

Parameters

idx (int) – Item index

Returns

  • dgl.DGLGraph – The graph contains:

    • ndata['pos']: the coordinates of each atom

    • ndata['attr']: the features of each atom

    • edata['edge_attr']: the features of each bond

  • Tensor – Property values of molecular graphs

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

Mini graph classification dataset

class dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v, seed=0, save_graph=True, force_reload=False, verbose=False)[source]

The synthetic graph classification dataset class.

The dataset contains 8 different types of graphs.

  • class 0 : cycle graph

  • class 1 : star graph

  • class 2 : wheel graph

  • class 3 : lollipop graph

  • class 4 : hypercube graph

  • class 5 : grid graph

  • class 6 : clique graph

  • class 7 : circular ladder graph

Parameters
  • num_graphs (int) – Number of graphs in this dataset.

  • min_num_v (int) – Minimum number of nodes for graphs

  • max_num_v (int) – Maximum number of nodes for graphs

  • seed (int, default is 0) – Random seed for data generation

num_graphs

Number of graphs

Type

int

min_num_v

The minimum number of nodes

Type

int

max_num_v

The maximum number of nodes

Type

int

num_classes

The number of classes

Type

int

Examples

>>> data = MiniGCDataset(100, 16, 32, seed=0)

The dataset instance is an iterable

>>> len(data)
100
>>> g, label = data[64]
>>> g
Graph(num_nodes=20, num_edges=82,
      ndata_schemes={}
      edata_schemes={})
>>> label
tensor(5)

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=356, num_edges=1060,
      ndata_schemes={}
      edata_schemes={})
__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

The graph and its label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.

TU dataset

class dgl.data.TUDataset(name, raw_dir=None, force_reload=False, verbose=False)[source]

TUDataset contains lots of graph kernel datasets for graph classification.

Parameters

name (str) – Dataset name, such as ENZYMES, DD, COLLAB, or MUTAG. It can be any dataset name listed on https://chrsmrrs.github.io/datasets/docs/datasets/.

max_num_node

Maximum number of nodes

Type

int

num_labels

Number of classes

Type

int

Notes

IMPORTANT: Some of the datasets contain duplicate edges, e.g. the edges in IMDB-BINARY are all duplicated. DGL faithfully keeps the duplicates as in the original data. Other frameworks, such as PyTorch Geometric, remove the duplicates by default. You can remove the duplicate edges with dgl.to_simple().
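
To make the effect of duplicate edges concrete, the library-free sketch below removes repeated (source, destination) pairs from a toy edge list. The helper remove_duplicate_edges is hypothetical and only illustrates the idea; for real graphs, use dgl.to_simple() as noted above.

```python
def remove_duplicate_edges(edges):
    """Keep the first occurrence of each (src, dst) pair, preserving order."""
    return list(dict.fromkeys(edges))


# Toy edge list in which every edge except the last appears twice.
edges = [(0, 1), (0, 1), (1, 0), (1, 0), (1, 2)]

print(remove_duplicate_edges(edges))  # [(0, 1), (1, 0), (1, 2)]
```

Note that (0, 1) and (1, 0) are different directed edges, so both are kept.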

Examples

>>> data = TUDataset('DD')

The dataset instance is an iterable

>>> len(data)
1178
>>> g, label = data[1024]
>>> g
Graph(num_nodes=88, num_edges=410,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'node_labels': Scheme(shape=(1,), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
>>> label
tensor([1])

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=9539, num_edges=47382,
      ndata_schemes={'node_labels': Scheme(shape=(1,), dtype=torch.int64), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})

Notes

Graphs may have node labels, node attributes, edge labels, and edge attributes, depending on the dataset.

Labels are mapped to \(\lbrace 0,\cdots,n-1 \rbrace\) where \(n\) is the number of labels (some datasets have raw labels \(\lbrace -1, 1 \rbrace\) which will be mapped to \(\lbrace 0, 1 \rbrace\)). In previous versions, the minimum label was added so that \(\lbrace -1, 1 \rbrace\) was mapped to \(\lbrace 0, 2 \rbrace\).
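
The two label conventions described above can be compared with a small sketch. Both helper functions are hypothetical; in particular, mapping labels by the rank of their sorted distinct values is an assumption about how the \(\lbrace 0,\cdots,n-1 \rbrace\) mapping is obtained, and the second helper shifts labels so that the smallest becomes 0, which reproduces the \(\lbrace 0, 2 \rbrace\) result quoted for previous versions.

```python
def map_labels(raw_labels):
    """Map raw labels to 0..n-1 by the rank of each distinct value (assumed rule)."""
    index = {label: i for i, label in enumerate(sorted(set(raw_labels)))}
    return [index[label] for label in raw_labels]


def shift_by_minimum(raw_labels):
    """Shift labels so that the smallest raw label becomes 0."""
    low = min(raw_labels)
    return [label - low for label in raw_labels]


raw = [-1, 1, 1, -1]

print(map_labels(raw))        # [0, 1, 1, 0]
print(shift_by_minimum(raw))  # [0, 2, 2, 0]
```

With the shifted labels, a classifier sized by the largest label would need three outputs for a two-class problem, which the contiguous mapping avoids.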

__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

The graph, with node features stored in the feat field and node labels in node_label if available, and the graph's label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.

class dgl.data.LegacyTUDataset(name, use_pandas=False, hidden_size=10, max_allow_node=None, raw_dir=None, force_reload=False, verbose=False)[source]

LegacyTUDataset contains lots of graph kernel datasets for graph classification.

Parameters
  • name (str) – Dataset name, such as ENZYMES, DD, COLLAB, or MUTAG. It can be any dataset name listed on https://chrsmrrs.github.io/datasets/docs/datasets/.

  • use_pandas (bool) – NumPy’s file-reading function is slow on large files; using pandas can be faster. Default: False

  • hidden_size (int) – Some datasets do not contain node features. In that case, constant node features of size hidden_size are used instead. Default: 10

  • max_allow_node (int) – Remove graphs that contain more than max_allow_node nodes. Default: None

max_num_node

Maximum number of nodes

Type

int

num_labels

Number of classes

Type

int

Examples

>>> data = LegacyTUDataset('DD')

The dataset instance is an iterable

>>> len(data)
1178
>>> g, label = data[1024]
>>> g
Graph(num_nodes=88, num_edges=410,
      ndata_schemes={'feat': Scheme(shape=(89,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
>>> label
tensor(1)

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=9539, num_edges=47382,
      ndata_schemes={'feat': Scheme(shape=(89,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})

Notes

LegacyTUDataset uses the provided node features by default. If no features are provided, it uses one-hot encoded node labels instead. If node labels are not provided either, it uses a constant node feature.
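
The fallback order in the note above can be sketched as a small function. The name choose_node_features, its arguments, and the constant value 1.0 are hypothetical choices for illustration; only the order of preference (provided features, then one-hot labels, then a constant of size hidden_size) comes from the note.

```python
def choose_node_features(num_nodes, node_attrs=None, node_labels=None, hidden_size=10):
    """Pick node features following the documented fallback order."""
    if node_attrs is not None:
        return node_attrs  # 1. use the provided features
    if node_labels is not None:
        # 2. one-hot encode the node labels
        num_classes = max(node_labels) + 1
        return [
            [1.0 if c == label else 0.0 for c in range(num_classes)]
            for label in node_labels
        ]
    # 3. constant features (the value 1.0 is an assumption)
    return [[1.0] * hidden_size for _ in range(num_nodes)]


print(choose_node_features(2, node_labels=[0, 2]))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(choose_node_features(2, hidden_size=3))
# [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```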

__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

The graph, with node features stored in the feat field and node labels in node_label if available, and the graph's label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.

Graph isomorphism network dataset

class dgl.data.GINDataset(name, self_loop, degree_as_nlabel=False, raw_dir=None, force_reload=False, verbose=False)[source]

Dataset class for the paper "How Powerful Are Graph Neural Networks?".

This is adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.

The class provides an interface for nine datasets used in the paper along with the paper-specific settings. The datasets are 'MUTAG', 'COLLAB', 'IMDBBINARY', 'IMDBMULTI', 'NCI1', 'PROTEINS', 'PTC', 'REDDITBINARY', 'REDDITMULTI5K'.

If degree_as_nlabel is set to False, then ndata['label'] stores the provided node label, otherwise ndata['label'] stores the node in-degrees.

For graphs that have node attributes, ndata['attr'] stores the node attributes. For graphs that have no attribute, ndata['attr'] stores the corresponding one-hot encoding of ndata['label'].
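
The two rules above (in-degrees used as labels, and one-hot labels used as attributes) can be sketched on a toy graph with plain Python. The helper names are hypothetical, and ordering the one-hot columns by sorted distinct label value is an assumption.

```python
def in_degrees(num_nodes, edges):
    """Count incoming edges per node for a list of (src, dst) pairs."""
    degrees = [0] * num_nodes
    for _src, dst in edges:
        degrees[dst] += 1
    return degrees


def one_hot(labels):
    """One-hot encode labels; columns follow the sorted distinct values."""
    column = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [
        [1.0 if column[label] == c else 0.0 for c in range(len(column))]
        for label in labels
    ]


# A path graph 0 - 1 - 2 stored as directed edges in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
labels = in_degrees(3, edges)  # plays the role of ndata['label']
attrs = one_hot(labels)        # plays the role of ndata['attr']

print(labels)  # [1, 2, 1]
print(attrs)   # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```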

Parameters
  • name (str) – dataset name, one of ('MUTAG', 'COLLAB', 'IMDBBINARY', 'IMDBMULTI', 'NCI1', 'PROTEINS', 'PTC', 'REDDITBINARY', 'REDDITMULTI5K')

  • self_loop (bool) – Add a self-loop to every node if True

  • degree_as_nlabel (bool) – Use node degrees as node labels and features if True

Examples

>>> data = GINDataset(name='MUTAG', self_loop=False)

The dataset instance is an iterable

>>> len(data)
188
>>> g, label = data[128]
>>> g
Graph(num_nodes=13, num_edges=26,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(7,), dtype=torch.float32)}
      edata_schemes={})
>>> label
tensor(1)

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=330, num_edges=748,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(7,), dtype=torch.float32)}
      edata_schemes={})
__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

The graph and its label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.

Fake news dataset

class dgl.data.FakeNewsDataset(name, feature_name, raw_dir=None)[source]

Fake News Graph Classification dataset.

The dataset is composed of two sets of tree-structured fake/real news propagation graphs extracted from Twitter. Unlike most benchmark datasets for graph classification, the graphs in this dataset are directed, tree-structured graphs in which the root node represents the news and the leaf nodes are Twitter users who retweeted it. In addition, the node features are the users' historical tweets encoded with different pretrained language models:

  • bert: the 768-dimensional node feature composed of Twitter user historical tweets encoded by bert-as-service

  • content: the 310-dimensional node feature composed of a 300-dimensional “spacy” vector plus a 10-dimensional “profile” vector

  • profile: the 10-dimensional node feature composed of ten Twitter user profile attributes.

  • spacy: the 300-dimensional node feature composed of Twitter user historical tweets encoded by the spaCy word2vec encoder.

Reference: <https://github.com/safe-graph/GNN-FakeNews>

Note: this dataset is for academic use only, and commercial use is prohibited.

Statistics:

Politifact:

  • Graphs: 314

  • Nodes: 41,054

  • Edges: 40,740

  • Classes:

    • Fake: 157

    • Real: 157

  • Node feature size:

    • bert: 768

    • content: 310

    • profile: 10

    • spacy: 300

Gossipcop:

  • Graphs: 5,464

  • Nodes: 314,262

  • Edges: 308,798

  • Classes:

    • Fake: 2,732

    • Real: 2,732

  • Node feature size:

    • bert: 768

    • content: 310

    • profile: 10

    • spacy: 300
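
Because each propagation graph is a tree, a graph with n nodes has n - 1 edges, so the total number of edges should equal the total number of nodes minus the number of graphs. The check below uses only the statistics listed above and also confirms that the content feature size is the sum of the spacy and profile sizes.

```python
# Statistics copied from the lists above.
stats = {
    "politifact": {"graphs": 314, "nodes": 41_054, "edges": 40_740},
    "gossipcop": {"graphs": 5_464, "nodes": 314_262, "edges": 308_798},
}

# A forest of trees has (total nodes - number of trees) edges.
consistent = {
    name: s["nodes"] - s["graphs"] == s["edges"] for name, s in stats.items()
}
print(consistent)  # {'politifact': True, 'gossipcop': True}

feature_sizes = {"bert": 768, "content": 310, "profile": 10, "spacy": 300}
content_matches = (
    feature_sizes["spacy"] + feature_sizes["profile"] == feature_sizes["content"]
)
print(content_matches)  # True
```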

Parameters
  • name (str) – Name of the dataset (gossipcop, or politifact)

  • feature_name (str) – Name of the feature (bert, content, profile, or spacy)

  • raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/

name

Name of the dataset (gossipcop, or politifact)

Type

str

num_classes

Number of label classes

Type

int

num_graphs

Number of graphs

Type

int

graphs

A list of DGLGraph objects

Type

list

labels

Graph labels

Type

Tensor

feature_name

Name of the feature (bert, content, profile, or spacy)

Type

str

feature

Node features

Type

Tensor

train_mask

Mask of training set

Type

Tensor

val_mask

Mask of validation set

Type

Tensor

test_mask

Mask of testing set

Type

Tensor

Examples

>>> dataset = FakeNewsDataset('gossipcop', 'bert')
>>> graph, label = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = dataset.feature
>>> labels = dataset.labels
__getitem__(i)[source]

Get graph and label by index

Parameters

i (int) – Item index

Returns

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

Utilities

  • utils.get_download_dir() – Get the absolute path to the download directory.

  • utils.download(url[, path, overwrite, …]) – Download a given URL.

  • utils.check_sha1(filename, sha1_hash) – Check whether the sha1 hash of the file content matches the expected hash.

  • utils.extract_archive(file, target_dir[, …]) – Extract archive file.

  • utils.split_dataset(dataset[, frac_list, …]) – Split dataset into training, validation and test set.

  • utils.load_labels(filename) – Load label dict from file.

  • utils.save_info(path, info) – Save dataset-related information to disk.

  • utils.load_info(path) – Load dataset-related information from disk.
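
As an illustration of what a SHA-1 check on a downloaded file involves, here is a minimal sketch built only on the standard library. The function sha1_matches is a hypothetical stand-in and is not the implementation of utils.check_sha1; reading the file in chunks is a design choice made here to keep memory use low for large downloads.

```python
import hashlib
import os
import tempfile


def sha1_matches(filename, sha1_hash, chunk_size=1 << 20):
    """Return True if the SHA-1 hex digest of the file equals `sha1_hash`."""
    sha1 = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in chunks so that large files are never fully held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest() == sha1_hash


# Demonstration on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello graphs")
    path = tmp.name

expected = hashlib.sha1(b"hello graphs").hexdigest()
ok = sha1_matches(path, expected)
bad = sha1_matches(path, "0" * 40)
os.remove(path)

print(ok, bad)  # True False
```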

class dgl.data.utils.Subset(dataset, indices)[source]

Subset of a dataset at specified indices

Code adapted from PyTorch.

Parameters
  • dataset – dataset[i] should return the ith datapoint

  • indices (list) – List of datapoint indices to construct the subset
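
The behaviour described above (index remapping onto an underlying dataset) can be sketched in a few lines. ToySubset is a hypothetical stand-in written for illustration and is not the library class; the toy dataset of (graph name, label) tuples is likewise made up.

```python
class ToySubset:
    """Minimal view of `dataset` restricted to the given indices."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, item):
        # Position `item` in the subset maps to indices[item] in the dataset.
        return self.dataset[self.indices[item]]

    def __len__(self):
        return len(self.indices)


# Any object supporting dataset[i] works; a list of tuples is used here.
dataset = [("g0", 0), ("g1", 1), ("g2", 0), ("g3", 1)]
subset = ToySubset(dataset, [3, 1])

print(len(subset))  # 2
print(subset[0])    # ('g3', 1)
```

Because the subset only stores indices, no datapoints are copied.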

__getitem__(item)[source]

Get the datapoint indexed by item

Returns

datapoint

Return type

tuple

__len__()[source]

Get subset size

Returns

Number of datapoints in the subset

Return type

int