dgl.data¶
The dgl.data
package contains datasets hosted by DGL and also utilities
for downloading, processing, saving and loading data from external resources.
Base Dataset Class¶
class dgl.data.DGLDataset(name, url=None, raw_dir=None, save_dir=None, hash_key=(), force_reload=False, verbose=False)[source]¶
The basic DGL dataset for creating graph datasets. This class defines a basic template for a DGL dataset, in which the following steps are executed automatically:
1. Check whether there is a dataset cache on disk (already processed and stored) by invoking has_cache(). If true, go to step 5.
2. Call download() to download the data.
3. Call process() to process the data.
4. Call save() to save the processed dataset on disk and go to step 6.
5. Call load() to load the processed dataset from disk.
6. Done.
Users can overwrite these functions with their own data processing logic.
- Parameters
name (str) – Name of the dataset
url (str) – Url to download the raw dataset
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
save_dir (str) – Directory to save the processed dataset. Default: same as raw_dir
hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on disk) of the same dataset class by comparing the hash values. Default: (), whose corresponding hash value is 'f9065fa7'.
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information
raw_path¶ Directory that contains the input data files. Default: os.path.join(self.raw_dir, self.name)
- Type
str
download()[source]¶ Overwrite to realize your own logic of downloading data.
It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.
has_cache()[source]¶ Overwrite to realize your own logic of deciding whether there exists a cached dataset.
Returns False by default.
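To make the template concrete, here is a minimal sketch of a custom dataset built on DGLDataset (the MyDataset name and the toy graph are illustrative only; a real implementation would read files from self.raw_dir):

import dgl
import torch
from dgl.data import DGLDataset

class MyDataset(DGLDataset):
    """A hypothetical dataset holding a single toy graph."""
    def __init__(self, raw_dir=None, force_reload=False, verbose=False):
        super().__init__(name='my_dataset', raw_dir=raw_dir,
                         force_reload=force_reload, verbose=verbose)

    def process(self):
        # Build the graph(s); normally this parses files in self.raw_dir.
        src, dst = torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3])
        g = dgl.graph((src, dst), num_nodes=4)
        g.ndata['feat'] = torch.randn(4, 8)
        self._graphs = [g]

    def __getitem__(self, idx):
        return self._graphs[idx]

    def __len__(self):
        return len(self._graphs)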
Node Prediction Datasets¶
DGL hosted datasets for node classification/regression tasks.
Stanford sentiment treebank dataset¶
class dgl.data.SSTDataset(mode='train', glove_embed_file=None, vocab_file=None, raw_dir=None, force_reload=False, verbose=False)[source]¶ Stanford Sentiment Treebank dataset.
Deprecated since version 0.5.0:
trees is deprecated, it is replaced by:
>>> dataset = SSTDataset()
>>> for tree in dataset:
...     # your code here
num_vocabs is deprecated, it is replaced by vocab_size.
Each sample is the constituency tree of a sentence. The leaf nodes represent words. Each word is an int value stored in the x feature field. Non-leaf nodes have a special value PAD_WORD in the x field. Each node also has a sentiment annotation: 5 classes (very negative, negative, neutral, positive and very positive). The sentiment label is an int value stored in the y feature field.
Official site: http://nlp.stanford.edu/sentiment/index.html
Statistics:
Train examples: 8,544
Dev examples: 1,101
Test examples: 2,210
Number of classes for each node: 5
- Parameters
mode (str, optional) – Should be one of ['train', 'dev', 'test', 'tiny']. Default: train
glove_embed_file (str, optional) – The path to the pretrained GloVe embedding file. Default: None
vocab_file (str, optional) – Optional vocabulary file. If not given, the default vocabulary file is used. Default: None
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
vocab¶ Vocabulary of the dataset
- Type
OrderedDict
pretrained_emb¶ Pretrained GloVe embedding with respect to the vocabulary.
- Type
Tensor
Notes
All the samples will be loaded and preprocessed in memory first.
Examples
>>> # get dataset
>>> train_data = SSTDataset()
>>> dev_data = SSTDataset(mode='dev')
>>> test_data = SSTDataset(mode='test')
>>> tiny_data = SSTDataset(mode='tiny')
>>>
>>> len(train_data)
8544
>>> train_data.num_classes
5
>>> glove_embed = train_data.pretrained_emb
>>> train_data.vocab_size
19536
>>> train_data[0]
Graph(num_nodes=71, num_edges=70,
      ndata_schemes={'x': Scheme(shape=(), dtype=torch.int64), 'y': Scheme(shape=(), dtype=torch.int64), 'mask': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={})
>>> for tree in train_data:
...     input_ids = tree.ndata['x']
...     labels = tree.ndata['y']
...     mask = tree.ndata['mask']
...     # your code here
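As a hedged sketch, the pretrained embedding can be wired into a torch.nn.Embedding for tree models (the GloVe file path below is an assumed local file, not shipped with DGL):

import torch
from dgl.data import SSTDataset

# 'glove.840B.300d.txt' is a hypothetical local path to a GloVe file.
train_data = SSTDataset(glove_embed_file='glove.840B.300d.txt')
emb = torch.nn.Embedding.from_pretrained(train_data.pretrained_emb, freeze=False)
tree = train_data[0]
word_vecs = emb(tree.ndata['x'])  # internal nodes carry the PAD_WORD index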
Karate club dataset¶
class dgl.data.KarateClubDataset[source]¶ Karate Club dataset for Node Classification
Deprecated since version 0.5.0:
data is deprecated, it is replaced by:
>>> dataset = KarateClubDataset()
>>> g = dataset[0]
Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002. Official website: http://konect.cc/networks/ucidata-zachary/
Karate Club dataset statistics:
Nodes: 34
Edges: 156
Number of Classes: 2
data¶ A list of dgl.DGLGraph objects
- Type
list
Examples
>>> dataset = KarateClubDataset()
>>> num_classes = dataset.num_classes
>>> g = dataset[0]
>>> labels = g.ndata['label']
Citation network dataset¶
class dgl.data.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True)[source]¶ Cora citation network dataset.
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = CoraGraphDataset()
>>> graph = dataset[0]
train_mask is deprecated, it is replaced by:
>>> dataset = CoraGraphDataset()
>>> graph = dataset[0]
>>> train_mask = graph.ndata['train_mask']
val_mask is deprecated, it is replaced by:
>>> dataset = CoraGraphDataset()
>>> graph = dataset[0]
>>> val_mask = graph.ndata['val_mask']
test_mask is deprecated, it is replaced by:
>>> dataset = CoraGraphDataset()
>>> graph = dataset[0]
>>> test_mask = graph.ndata['test_mask']
labels is deprecated, it is replaced by:
>>> dataset = CoraGraphDataset()
>>> graph = dataset[0]
>>> labels = graph.ndata['label']
feat is deprecated, it is replaced by:
>>> dataset = CoraGraphDataset()
>>> graph = dataset[0]
>>> feat = graph.ndata['feat']
Nodes represent papers and edges represent citation relationships. Each node has a predefined feature with 1433 dimensions. The dataset is designed for the node classification task: predicting the category of a given paper.
Statistics:
Nodes: 2708
Edges: 10556
Number of Classes: 7
Label split:
Train: 140
Valid: 500
Test: 1000
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
graph¶ Graph structure
- Type
dgl.DGLGraph
train_mask¶ Mask of training nodes
val_mask¶ Mask of validation nodes
test_mask¶ Mask of test nodes
labels¶ Ground truth labels of each node
features¶ Node features
- Type
Tensor
Notes
The node feature is row-normalized.
Examples
>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
__getitem__(idx)[source]¶ Gets the graph object
- Parameters
idx (int) – Item index; CoraGraphDataset has only one graph object
- Returns
graph structure, node features and labels.
ndata['train_mask']: mask for training node set
ndata['val_mask']: mask for validation node set
ndata['test_mask']: mask for test node set
ndata['feat']: node feature
ndata['label']: ground truth labels
- Return type
dgl.DGLGraph
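As a minimal end-to-end sketch, the masks above drive a standard semi-supervised training loop (assuming PyTorch and dgl.nn.GraphConv; the two-layer model and hyperparameters are illustrative):

import dgl
import torch
import torch.nn.functional as F
from dgl.data import CoraGraphDataset
from dgl.nn import GraphConv

dataset = CoraGraphDataset()
g = dgl.add_self_loop(dataset[0])  # avoid zero-in-degree nodes in GraphConv
feat, label = g.ndata['feat'], g.ndata['label']
train_mask, test_mask = g.ndata['train_mask'], g.ndata['test_mask']

class GCN(torch.nn.Module):
    def __init__(self, in_feats, hidden, num_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden)
        self.conv2 = GraphConv(hidden, num_classes)

    def forward(self, g, x):
        return self.conv2(g, F.relu(self.conv1(g, x)))

model = GCN(feat.shape[1], 16, dataset.num_classes)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
    loss = F.cross_entropy(model(g, feat)[train_mask], label[train_mask])
    opt.zero_grad()
    loss.backward()
    opt.step()

# evaluate on the held-out test nodes
test_acc = (model(g, feat).argmax(1)[test_mask] == label[test_mask]).float().mean()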
class dgl.data.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True)[source]¶ Citeseer citation network dataset.
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = CiteseerGraphDataset()
>>> graph = dataset[0]
train_mask is deprecated, it is replaced by:
>>> dataset = CiteseerGraphDataset()
>>> graph = dataset[0]
>>> train_mask = graph.ndata['train_mask']
val_mask is deprecated, it is replaced by:
>>> dataset = CiteseerGraphDataset()
>>> graph = dataset[0]
>>> val_mask = graph.ndata['val_mask']
test_mask is deprecated, it is replaced by:
>>> dataset = CiteseerGraphDataset()
>>> graph = dataset[0]
>>> test_mask = graph.ndata['test_mask']
labels is deprecated, it is replaced by:
>>> dataset = CiteseerGraphDataset()
>>> graph = dataset[0]
>>> labels = graph.ndata['label']
feat is deprecated, it is replaced by:
>>> dataset = CiteseerGraphDataset()
>>> graph = dataset[0]
>>> feat = graph.ndata['feat']
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 3703 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.
Statistics:
Nodes: 3327
Edges: 9228
Number of Classes: 6
Label Split:
Train: 120
Valid: 500
Test: 1000
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
graph¶ Graph structure
- Type
dgl.DGLGraph
train_mask¶ Mask of training nodes
val_mask¶ Mask of validation nodes
test_mask¶ Mask of test nodes
labels¶ Ground truth labels of each node
features¶ Node features
- Type
Tensor
Notes
The node feature is row-normalized.
In the Citeseer dataset, there are some isolated nodes in the graph; their features are filled in as zero vectors at the corresponding positions.
Examples
>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
__getitem__(idx)[source]¶ Gets the graph object
- Parameters
idx (int) – Item index; CiteseerGraphDataset has only one graph object
- Returns
graph structure, node features and labels.
ndata['train_mask']: mask for training node set
ndata['val_mask']: mask for validation node set
ndata['test_mask']: mask for test node set
ndata['feat']: node feature
ndata['label']: ground truth labels
- Return type
dgl.DGLGraph
class dgl.data.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True)[source]¶ Pubmed citation network dataset.
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = PubmedGraphDataset()
>>> graph = dataset[0]
train_mask is deprecated, it is replaced by:
>>> dataset = PubmedGraphDataset()
>>> graph = dataset[0]
>>> train_mask = graph.ndata['train_mask']
val_mask is deprecated, it is replaced by:
>>> dataset = PubmedGraphDataset()
>>> graph = dataset[0]
>>> val_mask = graph.ndata['val_mask']
test_mask is deprecated, it is replaced by:
>>> dataset = PubmedGraphDataset()
>>> graph = dataset[0]
>>> test_mask = graph.ndata['test_mask']
labels is deprecated, it is replaced by:
>>> dataset = PubmedGraphDataset()
>>> graph = dataset[0]
>>> labels = graph.ndata['label']
feat is deprecated, it is replaced by:
>>> dataset = PubmedGraphDataset()
>>> graph = dataset[0]
>>> feat = graph.ndata['feat']
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 500 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.
Statistics:
Nodes: 19717
Edges: 88651
Number of Classes: 3
Label Split:
Train: 60
Valid: 500
Test: 1000
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.
graph¶ Graph structure
- Type
dgl.DGLGraph
train_mask¶ Mask of training nodes
val_mask¶ Mask of validation nodes
test_mask¶ Mask of test nodes
labels¶ Ground truth labels of each node
features¶ Node features
- Type
Tensor
Notes
The node feature is row-normalized.
Examples
>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
__getitem__(idx)[source]¶ Gets the graph object
- Parameters
idx (int) – Item index; PubmedGraphDataset has only one graph object
- Returns
graph structure, node features and labels.
ndata['train_mask']: mask for training node set
ndata['val_mask']: mask for validation node set
ndata['test_mask']: mask for test node set
ndata['feat']: node feature
ndata['label']: ground truth labels
- Return type
dgl.DGLGraph
CoraFull dataset¶
class dgl.data.CoraFullDataset(raw_dir=None, force_reload=False, verbose=False)[source]¶ CORA-Full dataset for node classification task.
Deprecated since version 0.5.0:
data is deprecated, it is replaced by:
>>> dataset = CoraFullDataset()
>>> graph = dataset[0]
Extended Cora dataset. Nodes represent papers and edges represent citations.
Reference: https://github.com/shchur/gnn-benchmark#datasets
Statistics:
Nodes: 19,793
Edges: 126,842 (note that the original dataset has 65,311 edges; DGL adds reverse edges and removes duplicates, hence the different number)
Number of Classes: 70
Node feature size: 8,710
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
Examples
>>> data = CoraFullDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)¶ Get graph by index
- Parameters
idx (int) – Item index
- Returns
The graph contains:
ndata['feat']: node features
ndata['label']: node labels
- Return type
dgl.DGLGraph
__len__()¶ Number of graphs in the dataset
RDF datasets¶
class dgl.data.AIFBDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]¶ AIFB dataset for node classification task
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = AIFBDataset()
>>> graph = dataset[0]
train_idx is deprecated, it can be replaced by:
>>> dataset = AIFBDataset()
>>> graph = dataset[0]
>>> train_mask = graph.nodes[dataset.category].data['train_mask']
>>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
test_idx is deprecated, it can be replaced by:
>>> dataset = AIFBDataset()
>>> graph = dataset[0]
>>> test_mask = graph.nodes[dataset.category].data['test_mask']
>>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
The AIFB dataset is a Semantic Web (RDF) dataset used as a benchmark in data mining. It records the organizational structure of AIFB at the University of Karlsruhe.
AIFB dataset statistics:
Nodes: 7262
Edges: 48810 (including reverse edges)
Target Category: Personen
Number of Classes: 4
Label Split:
Train: 140
Test: 36
- Parameters
print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.
insert_reverse (bool) – If true, add reverse edge and reverse relations to the final graph. Default: True.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
labels¶ All the labels of the entities in predict_category
- Type
Tensor
graph¶ Graph structure
- Type
dgl.DGLGraph
train_idx¶ Entity IDs for training. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
test_idx¶ Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
Examples
>>> dataset = dgl.data.rdf.AIFBDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
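Since the RDF datasets return heterogeneous graphs, a short hedged sketch of inspecting the node and relation types and recovering the training indices (the printed output depends on the dataset contents):

import torch as th
from dgl.data.rdf import AIFBDataset

dataset = AIFBDataset()
g = dataset[0]
category = dataset.predict_category  # 'Personen' for AIFB

print(g.ntypes)                 # all entity types in the RDF graph
print(len(g.canonical_etypes))  # number of relations (incl. reverse)

train_mask = g.nodes[category].data['train_mask']
labels = g.nodes[category].data['labels']
train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()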
class dgl.data.MUTAGDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]¶ MUTAG dataset for node classification task
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = MUTAGDataset()
>>> graph = dataset[0]
train_idx is deprecated, it can be replaced by:
>>> dataset = MUTAGDataset()
>>> graph = dataset[0]
>>> train_mask = graph.nodes[dataset.category].data['train_mask']
>>> train_idx = th.nonzero(train_mask).squeeze()
test_idx is deprecated, it can be replaced by:
>>> dataset = MUTAGDataset()
>>> graph = dataset[0]
>>> test_mask = graph.nodes[dataset.category].data['test_mask']
>>> test_idx = th.nonzero(test_mask).squeeze()
MUTAG dataset statistics:
Nodes: 27163
Edges: 148100 (including reverse edges)
Target Category: d
Number of Classes: 2
Label Split:
Train: 272
Test: 68
- Parameters
print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.
insert_reverse (bool) – If true, add reverse edge and reverse relations to the final graph. Default: True.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
labels¶ All the labels of the entities in predict_category
- Type
Tensor
graph¶ Graph structure
- Type
dgl.DGLGraph
train_idx¶ Entity IDs for training. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
test_idx¶ Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
Examples
>>> dataset = dgl.data.rdf.MUTAGDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
class dgl.data.BGSDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]¶ BGS dataset for node classification task
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = BGSDataset()
>>> graph = dataset[0]
train_idx is deprecated, it can be replaced by:
>>> dataset = BGSDataset()
>>> graph = dataset[0]
>>> train_mask = graph.nodes[dataset.category].data['train_mask']
>>> train_idx = th.nonzero(train_mask).squeeze()
test_idx is deprecated, it can be replaced by:
>>> dataset = BGSDataset()
>>> graph = dataset[0]
>>> test_mask = graph.nodes[dataset.category].data['test_mask']
>>> test_idx = th.nonzero(test_mask).squeeze()
BGS namespace convention: http://data.bgs.ac.uk/(ref|id)/<Major Concept>/<Sub Concept>/INSTANCE. We ignored all literal nodes and the relations connecting them in the output graph. We also ignored the relation used to mark whether a term is CURRENT or DEPRECATED.
BGS dataset statistics:
Nodes: 94806
Edges: 672884 (including reverse edges)
Target Category: Lexicon/NamedRockUnit
Number of Classes: 2
Label Split:
Train: 117
Test: 29
- Parameters
print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.
insert_reverse (bool) – If true, add reverse edge and reverse relations to the final graph. Default: True.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
labels¶ All the labels of the entities in predict_category
- Type
Tensor
graph¶ Graph structure
- Type
dgl.DGLGraph
train_idx¶ Entity IDs for training. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
test_idx¶ Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
Examples
>>> dataset = dgl.data.rdf.BGSDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
class dgl.data.AMDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]¶ AM dataset for node classification task
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = AMDataset()
>>> graph = dataset[0]
train_idx is deprecated, it can be replaced by:
>>> dataset = AMDataset()
>>> graph = dataset[0]
>>> train_mask = graph.nodes[dataset.category].data['train_mask']
>>> train_idx = th.nonzero(train_mask).squeeze()
test_idx is deprecated, it can be replaced by:
>>> dataset = AMDataset()
>>> graph = dataset[0]
>>> test_mask = graph.nodes[dataset.category].data['test_mask']
>>> test_idx = th.nonzero(test_mask).squeeze()
Namespace convention:
Instance: http://purl.org/collections/nl/am/<type>-<id>
Relation: http://purl.org/collections/nl/am/<name>
We ignored all literal nodes and the relations connecting them in the output graph.
AM dataset statistics:
Nodes: 881680
Edges: 5668682 (including reverse edges)
Target Category: proxy
Number of Classes: 11
Label Split:
Train: 802
Test: 198
- Parameters
print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.
insert_reverse (bool) – If true, add reverse edge and reverse relations to the final graph. Default: True.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
labels¶ All the labels of the entities in predict_category
- Type
Tensor
graph¶ Graph structure
- Type
dgl.DGLGraph
train_idx¶ Entity IDs for training. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
test_idx¶ Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.
- Type
Tensor
Examples
>>> dataset = dgl.data.rdf.AMDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
Amazon Co-Purchase dataset¶
class dgl.data.AmazonCoBuyComputerDataset(raw_dir=None, force_reload=False, verbose=False)[source]¶ ‘Computer’ part of the AmazonCoBuy dataset for node classification task.
Deprecated since version 0.5.0:
data is deprecated, it is replaced by:
>>> dataset = AmazonCoBuyComputerDataset()
>>> graph = dataset[0]
Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.
Reference: https://github.com/shchur/gnn-benchmark#datasets
Statistics:
Nodes: 13,752
Edges: 491,722 (note that the original dataset has 245,778 edges; DGL adds reverse edges and removes duplicates, hence the different number)
Number of classes: 10
Node feature size: 767
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
Examples
>>> data = AmazonCoBuyComputerDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)¶ Get graph by index
- Parameters
idx (int) – Item index
- Returns
The graph contains:
ndata['feat']: node features
ndata['label']: node labels
- Return type
dgl.DGLGraph
__len__()¶ Number of graphs in the dataset
class dgl.data.AmazonCoBuyPhotoDataset(raw_dir=None, force_reload=False, verbose=False)[source]¶ ‘Photo’ part of the AmazonCoBuy dataset for node classification task.
Deprecated since version 0.5.0:
data is deprecated, it is replaced by:
>>> dataset = AmazonCoBuyPhotoDataset()
>>> graph = dataset[0]
Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.
Reference: https://github.com/shchur/gnn-benchmark#datasets
Statistics:
Nodes: 7,650
Edges: 238,163 (note that the original dataset has 119,043 edges; DGL adds reverse edges and removes duplicates, hence the different number)
Number of classes: 8
Node feature size: 745
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
Examples
>>> data = AmazonCoBuyPhotoDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)¶ Get graph by index
- Parameters
idx (int) – Item index
- Returns
The graph contains:
ndata['feat']: node features
ndata['label']: node labels
- Return type
dgl.DGLGraph
__len__()¶ Number of graphs in the dataset
Coauthor dataset¶
class dgl.data.CoauthorCSDataset(raw_dir=None, force_reload=False, verbose=False)[source]¶ ‘Computer Science (CS)’ part of the Coauthor dataset for node classification task.
Deprecated since version 0.5.0:
data is deprecated, it is replaced by:
>>> dataset = CoauthorCSDataset()
>>> graph = dataset[0]
Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, who are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active field of study for each author.
Reference: https://github.com/shchur/gnn-benchmark#datasets
Statistics:
Nodes: 18,333
Edges: 163,788 (note that the original dataset has 81,894 edges; DGL adds reverse edges and removes duplicates, hence the different number)
Number of classes: 15
Node feature size: 6,805
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
num_classes¶ Number of classes for each node.
- Type
int
data¶ A list of DGLGraph objects
- Type
list
Examples
>>> data = CoauthorCSDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)¶ Get graph by index
- Parameters
idx (int) – Item index
- Returns
The graph contains:
ndata['feat']: node features
ndata['label']: node labels
- Return type
dgl.DGLGraph
__len__()¶ Number of graphs in the dataset
class dgl.data.CoauthorPhysicsDataset(raw_dir=None, force_reload=False, verbose=False)[source]¶ ‘Physics’ part of the Coauthor dataset for node classification task.
Deprecated since version 0.5.0:
data is deprecated, it is replaced by:
>>> dataset = CoauthorPhysicsDataset()
>>> graph = dataset[0]
Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, who are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active field of study for each author.
Reference: https://github.com/shchur/gnn-benchmark#datasets
Statistics:
Nodes: 34,493
Edges: 495,924 (note that the original dataset has 247,962 edges; DGL adds reverse edges and removes duplicates, hence the different number)
Number of classes: 5
Node feature size: 8,415
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
num_classes¶ Number of classes for each node.
- Type
int
data¶ A list of DGLGraph objects
- Type
list
Examples
>>> data = CoauthorPhysicsDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)¶ Get graph by index
- Parameters
idx (int) – Item index
- Returns
The graph contains:
ndata['feat']: node features
ndata['label']: node labels
- Return type
dgl.DGLGraph
__len__()¶ Number of graphs in the dataset
Protein-Protein Interaction dataset¶
class dgl.data.PPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]¶ Protein-Protein Interaction dataset for inductive node classification
Deprecated since version 0.5.0:
labels is deprecated, it is replaced by:
>>> dataset = PPIDataset()
>>> for g in dataset:
...     labels = g.ndata['label']
features is deprecated, it is replaced by:
>>> dataset = PPIDataset()
>>> for g in dataset:
...     features = g.ndata['feat']
A toy Protein-Protein Interaction network dataset. The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels. 20 graphs are used for training, 2 for validation and 2 for testing.
Reference: http://snap.stanford.edu/graphsage/
Statistics:
Train examples: 20
Valid examples: 2
Test examples: 2
- Parameters
mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
labels¶ Node labels
- Type
Tensor
features¶ Node features
- Type
Tensor
Examples
>>> dataset = PPIDataset(mode='valid')
>>> num_labels = dataset.num_labels
>>> for g in dataset:
...     feat = g.ndata['feat']
...     label = g.ndata['label']
...     # your code here
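A hedged sketch of mini-batching the training graphs (assuming dgl.dataloading.GraphDataLoader, which collates graphs with dgl.batch; the batch size is illustrative):

import torch.nn.functional as F
from dgl.data import PPIDataset
from dgl.dataloading import GraphDataLoader

train_set = PPIDataset(mode='train')
loader = GraphDataLoader(train_set, batch_size=2, shuffle=True)
for batched_graph in loader:
    feat = batched_graph.ndata['feat']    # shape (N, 50)
    label = batched_graph.ndata['label']  # shape (N, 121), multi-label
    # logits = model(batched_graph, feat)
    # loss = F.binary_cross_entropy_with_logits(logits, label.float())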
Reddit dataset¶
class dgl.data.RedditDataset(self_loop=False, raw_dir=None, force_reload=False, verbose=False)[source]¶ Reddit dataset for community detection (node classification)
Deprecated since version 0.5.0:
graph is deprecated, it is replaced by:
>>> dataset = RedditDataset()
>>> graph = dataset[0]
num_labels is deprecated, it is replaced by:
>>> dataset = RedditDataset()
>>> num_classes = dataset.num_classes
train_mask is deprecated, it is replaced by:
>>> dataset = RedditDataset()
>>> graph = dataset[0]
>>> train_mask = graph.ndata['train_mask']
val_mask is deprecated, it is replaced by:
>>> dataset = RedditDataset()
>>> graph = dataset[0]
>>> val_mask = graph.ndata['val_mask']
test_mask is deprecated, it is replaced by:
>>> dataset = RedditDataset()
>>> graph = dataset[0]
>>> test_mask = graph.ndata['test_mask']
features is deprecated, it is replaced by:
>>> dataset = RedditDataset()
>>> graph = dataset[0]
>>> features = graph.ndata['feat']
labels is deprecated, it is replaced by:
>>> dataset = RedditDataset()
>>> graph = dataset[0]
>>> labels = graph.ndata['label']
This is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. The authors sampled 50 large communities and built a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. We use the first 20 days for training and the remaining days for testing (with 30% used for validation).
Reference: http://snap.stanford.edu/graphsage/
Statistics:
Nodes: 232,965
Edges: 114,615,892
Node feature size: 602
Number of training samples: 153,431
Number of validation samples: 23,831
Number of test samples: 55,703
- Parameters
self_loop (bool) – Whether to load the dataset with self-loop connections. Default: False
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
graph¶ Graph of the dataset
- Type
dgl.DGLGraph
train_mask¶ Mask of training nodes
val_mask¶ Mask of validation nodes
test_mask¶ Mask of test nodes
features¶ Node features
- Type
Tensor
labels¶ Node labels
- Type
Tensor
Examples
>>> data = RedditDataset()
>>> g = data[0]
>>> num_classes = data.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]¶ Get graph by index
- Parameters
idx (int) – Item index
- Returns
graph structure, node labels, node features and splitting masks:
ndata['label']: node label
ndata['feat']: node feature
ndata['train_mask']: mask for training node set
ndata['val_mask']: mask for validation node set
ndata['test_mask']: mask for test node set
- Return type
dgl.DGLGraph
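Given Reddit’s size, full-graph training is often impractical. A hedged mini-batch sketch with neighbor sampling (assuming a DGL version providing dgl.dataloading.MultiLayerNeighborSampler and NodeDataLoader; fanouts and batch size are illustrative):

import dgl
import torch
from dgl.data import RedditDataset

data = RedditDataset(self_loop=True)
g = data[0]
train_nids = torch.nonzero(g.ndata['train_mask'], as_tuple=False).squeeze()

sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])  # fanout per layer
dataloader = dgl.dataloading.NodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True, drop_last=False)

for input_nodes, output_nodes, blocks in dataloader:
    x = blocks[0].srcdata['feat']    # input features of the sampled frontier
    y = blocks[-1].dstdata['label']  # labels of the seed nodes
    # forward pass over the sampled blocks goes here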
Symmetric Stochastic Block Model Mixture dataset¶
class dgl.data.SBMMixtureDataset(n_graphs, n_nodes, n_communities, k=2, avg_deg=3, pq='Appendix_C', rng=None)[source]¶ Symmetric Stochastic Block Model Mixture
Reference: Appendix C of Supervised Community Detection with Hierarchical Graph Neural Networks
- Parameters
n_graphs (int) – Number of graphs.
n_nodes (int) – Number of nodes.
n_communities (int) – Number of communities.
k (int, optional) – Multiplier. Default: 2
avg_deg (int, optional) – Average degree. Default: 3
pq (list of pair of nonnegative float or str, optional) – Random densities. This parameter is for future extension, for now it’s always using the default value. Default: Appendix_C
rng (numpy.random.RandomState, optional) – Random number generator. If not given, it’s numpy.random.RandomState() with seed=None, which reads data from /dev/urandom (or the Windows analogue) if available, or seeds from the clock otherwise. Default: None
- Raises
RuntimeError – If pq is not a list or a string.
Examples
>>> data = SBMMixtureDataset(n_graphs=16, n_nodes=10000, n_communities=2)
>>> from torch.utils.data import DataLoader
>>> dataloader = DataLoader(data, batch_size=1, collate_fn=data.collate_fn)
>>> for graph, line_graph, graph_degrees, line_graph_degrees, pm_pd in dataloader:
...     # your code here
__getitem__(idx)[source]¶ Get one example by index
- Parameters
idx (int) – Item index
- Returns
graph (dgl.DGLGraph) – The original graph
line_graph (dgl.DGLGraph) – The line graph of graph
graph_degree (numpy.ndarray) – In-degrees of each node in graph
line_graph_degree (numpy.ndarray) – In-degrees of each node in line_graph
pm_pd (numpy.ndarray) – Edge indicator matrices Pm and Pd
collate_fn(x)[source]¶ The collate function for dataloader
- Parameters
x (tuple) – A batch of data that contains:
graph (dgl.DGLGraph): The original graph
line_graph (dgl.DGLGraph): The line graph of graph
graph_degree (numpy.ndarray): In-degrees of each node in graph
line_graph_degree (numpy.ndarray): In-degrees of each node in line_graph
pm_pd (numpy.ndarray): Edge indicator matrices Pm and Pd
- Returns
g_batch (dgl.DGLGraph) – Batched graphs
lg_batch (dgl.DGLGraph) – Batched line graphs
degg_batch (numpy.ndarray) – A batch of in-degrees of each node in g_batch
deglg_batch (numpy.ndarray) – A batch of in-degrees of each node in lg_batch
pm_pd_batch (numpy.ndarray) – A batch of edge indicator matrices Pm and Pd
Fraud dataset¶
class dgl.data.FraudDataset(name, raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True)[source]¶ Fraud node prediction dataset.
The dataset includes two multi-relational graphs extracted from Yelp and Amazon where nodes represent fraudulent reviews or fraudulent reviewers.
It was first proposed in a CIKM’20 paper (https://arxiv.org/pdf/2008.08692.pdf) and has been used by a recent WWW’21 paper (https://ponderly.github.io/pub/PCGNN_WWW2021.pdf) as a benchmark. Another paper (https://arxiv.org/pdf/2104.01404.pdf) also takes the dataset as an example to study non-homophilous graphs. This dataset is built upon industrial data and has rich relational information and unique properties like class imbalance and feature inconsistency, which make it a good instance to investigate how GNNs perform on real-world noisy graphs. These graphs are bidirected and contain no self-loops.
Reference: https://github.com/YingtongDou/CARE-GNN
- Parameters
name (str) – Name of the dataset
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
random_seed (int) – Specifying the random seed in splitting the dataset. Default: 717
train_size (float) – Fraction of samples used for training. Default: 0.7
val_size (float) – Fraction of samples used for validation; the testing set takes the remaining (1 - train_size - val_size). Default: 0.1
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
graph¶ Graph structure, etc.
- Type
dgl.DGLGraph
Examples
>>> dataset = FraudDataset('yelp')
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
__getitem__(idx)[source]¶ Get graph object
- Parameters
idx (int) – Item index
- Returns
graph structure, node features, node labels and masks
ndata['feature']: node features
ndata['label']: node labels
ndata['train_mask']: mask of training set
ndata['val_mask']: mask of validation set
ndata['test_mask']: mask of testing set
- Return type
dgl.DGLGraph
class dgl.data.FraudYelpDataset(raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True)[source]¶ Fraud Yelp Dataset
The Yelp dataset includes hotel and restaurant reviews filtered (spam) and recommended (legitimate) by Yelp. A spam review detection task can be conducted, which is a binary classification task. 32 handcrafted features from http://dx.doi.org/10.1145/2783258.2783370 are taken as the raw node features. Reviews are nodes in the graph, and the three relations are:
R-U-R: it connects reviews posted by the same user
R-S-R: it connects reviews under the same product with the same star rating (1-5 stars)
R-T-R: it connects two reviews under the same product posted in the same month.
Statistics:
Nodes: 45,954
Edges:
R-U-R: 98,630
R-T-R: 1,147,232
R-S-R: 6,805,486
Classes:
Positive (spam): 6,677
Negative (legitimate): 39,277
Positive-Negative ratio: 1 : 5.9
Node feature size: 32
- Parameters
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
random_seed (int) – Specifying the random seed in splitting the dataset. Default: 717
train_size (float) – Fraction of samples used for training. Default: 0.7
val_size (float) – Fraction of samples used for validation; the testing set takes the remaining (1 - train_size - val_size). Default: 0.1
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
Examples
>>> dataset = FraudYelpDataset()
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
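Because the Yelp graph is multi-relational, a short hedged sketch of inspecting the relation-specific edge sets (the relation names are read from the graph rather than hard-coded):

from dgl.data import FraudYelpDataset

dataset = FraudYelpDataset()
g = dataset[0]
print(g.etypes)  # the three review-review relations
for etype in g.etypes:
    print(etype, g.num_edges(etype))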
__getitem__(idx)¶ Get graph object
- Parameters
idx (int) – Item index
- Returns
graph structure, node features, node labels and masks
ndata['feature']: node features
ndata['label']: node labels
ndata['train_mask']: mask of training set
ndata['val_mask']: mask of validation set
ndata['test_mask']: mask of testing set
- Return type
dgl.DGLGraph
__len__()¶ Number of data examples
class dgl.data.FraudAmazonDataset(raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True)[source]¶ Fraud Amazon Dataset
The Amazon dataset includes product reviews under the Musical Instruments category. Users with more than 80% helpful votes are labelled as benign entities and users with less than 20% helpful votes are labelled as fraudulent entities. A fraudulent user detection task can be conducted on the Amazon dataset, which is a binary classification task. 25 handcrafted features from https://arxiv.org/pdf/2005.10150.pdf are taken as the raw node features.
Users are nodes in the graph, and the three relations are:
1. U-P-U: it connects users reviewing at least one same product
2. U-S-U: it connects users having at least one same star rating within one week
3. U-V-U: it connects users with top 5% mutual review text similarities (measured by TF-IDF) among all users
Statistics:
Nodes: 11,944
Edges:
U-P-U: 351,216
U-S-U: 7,132,958
U-V-U: 2,073,474
Classes:
Positive (fraudulent): 821
Negative (benign): 7,818
Unlabeled: 3,305
Positive-Negative ratio: 1 : 10.5
Node feature size: 25
- Parameters
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
random_seed (int) – Specifying the random seed in splitting the dataset. Default: 717
train_size (float) – Fraction of samples used for training. Default: 0.7
val_size (float) – Fraction of samples used for validation; the testing set takes the remaining (1 - train_size - val_size). Default: 0.1
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
Examples
>>> dataset = FraudAmazonDataset()
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
__getitem__(idx)¶ Get graph object
- Parameters
idx (int) – Item index
- Returns
graph structure, node features, node labels and masks
ndata['feature']: node features
ndata['label']: node labels
ndata['train_mask']: mask of training set
ndata['val_mask']: mask of validation set
ndata['test_mask']: mask of testing set
- Return type
dgl.DGLGraph
__len__()¶ Number of data examples
Edge Prediction Datasets¶
DGL hosted datasets for edge classification/regression and link prediction tasks.
Knowledge graph dataset¶
class dgl.data.FB15k237Dataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]¶ FB15k237 link prediction dataset.
Deprecated since version 0.5.0:
train is deprecated, it is replaced by:
>>> dataset = FB15k237Dataset()
>>> graph = dataset[0]
>>> train_mask = graph.edata['train_mask']
>>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(train_idx)
>>> rel = graph.edata['etype'][train_idx]
valid is deprecated, it is replaced by:
>>> dataset = FB15k237Dataset()
>>> graph = dataset[0]
>>> val_mask = graph.edata['val_mask']
>>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(val_idx)
>>> rel = graph.edata['etype'][val_idx]
test is deprecated, it is replaced by:
>>> dataset = FB15k237Dataset()
>>> graph = dataset[0]
>>> test_mask = graph.edata['test_mask']
>>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(test_idx)
>>> rel = graph.edata['etype'][test_idx]
FB15k-237 is a subset of FB15k where inverse relations are removed. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.
FB15k237 dataset statistics:
Nodes: 14541
Number of relation types: 237
Number of reversed relation types: 237
Label Split:
Train: 272115
Valid: 17535
Test: 20466
- Parameters
reverse (bool) – Whether to add reverse edge. Default True.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
train¶ A numpy array of triplets (src, rel, dst) for the training graph
- Type
numpy.ndarray
valid¶ A numpy array of triplets (src, rel, dst) for the validation graph
- Type
numpy.ndarray
test¶ A numpy array of triplets (src, rel, dst) for the test graph
- Type
numpy.ndarray
Examples
>>> dataset = FB15k237Dataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>> test_mask = g.edata['test_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges, relabel_nodes=False)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges, relabel_nodes=False)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]¶ Gets the graph object
- Parameters
idx (int) – Item index; FB15k237Dataset has only one graph object
- Returns
The graph contains:
edata['e_type']: edge relation type
edata['train_edge_mask']: positive training edge mask
edata['val_edge_mask']: positive validation edge mask
edata['test_edge_mask']: positive testing edge mask
edata['train_mask']: training edge set mask (includes reversed training edges)
edata['val_mask']: validation edge set mask (includes reversed validation edges)
edata['test_mask']: testing edge set mask (includes reversed testing edges)
ndata['ntype']: node type, all 0 in this dataset
- Return type
dgl.DGLGraph
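As a hedged sketch, the edge masks can be turned into (src, rel, dst) training triplets for a link prediction model (assuming PyTorch):

import torch as th
from dgl.data import FB15k237Dataset

dataset = FB15k237Dataset()
g = dataset[0]
train_idx = th.nonzero(g.edata['train_mask'], as_tuple=False).squeeze()
src, dst = g.find_edges(train_idx)
rel = g.edata['etype'][train_idx]
triplets = th.stack([src, rel, dst], dim=1)  # one (src, rel, dst) row per edge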
class dgl.data.FB15kDataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]¶ FB15k link prediction dataset.
Deprecated since version 0.5.0:
train is deprecated, it is replaced by:
>>> dataset = FB15kDataset()
>>> graph = dataset[0]
>>> train_mask = graph.edata['train_mask']
>>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(train_idx)
>>> rel = graph.edata['etype'][train_idx]
valid is deprecated, it is replaced by:
>>> dataset = FB15kDataset()
>>> graph = dataset[0]
>>> val_mask = graph.edata['val_mask']
>>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(val_idx)
>>> rel = graph.edata['etype'][val_idx]
test is deprecated, it is replaced by:
>>> dataset = FB15kDataset()
>>> graph = dataset[0]
>>> test_mask = graph.edata['test_mask']
>>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(test_idx)
>>> rel = graph.edata['etype'][test_idx]
The FB15k dataset was introduced in Translating Embeddings for Modeling Multi-relational Data. It is a subset of Freebase which contains about 14,951 entities with 1,345 different relations. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.
FB15k dataset statistics:
Nodes: 14,951
Number of relation types: 1,345
Number of reversed relation types: 1,345
Label Split:
Train: 483142
Valid: 50000
Test: 59071
- Parameters
reverse (bool) – Whether to add reverse edge. Default True.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
train¶ A numpy array of triplets (src, rel, dst) for the training graph
- Type
numpy.ndarray
valid¶ A numpy array of triplets (src, rel, dst) for the validation graph
- Type
numpy.ndarray
test¶ A numpy array of triplets (src, rel, dst) for the test graph
- Type
numpy.ndarray
Examples
>>> dataset = FB15kDataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges, relabel_nodes=False)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges, relabel_nodes=False)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]¶ Gets the graph object
- Parameters
idx (int) – Item index; FB15kDataset has only one graph object
- Returns
The graph contains:
edata['e_type']: edge relation type
edata['train_edge_mask']: positive training edge mask
edata['val_edge_mask']: positive validation edge mask
edata['test_edge_mask']: positive testing edge mask
edata['train_mask']: training edge set mask (includes reversed training edges)
edata['val_mask']: validation edge set mask (includes reversed validation edges)
edata['test_mask']: testing edge set mask (includes reversed testing edges)
ndata['ntype']: node type, all 0 in this dataset
- Return type
dgl.DGLGraph
class dgl.data.WN18Dataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]¶ WN18 link prediction dataset.
Deprecated since version 0.5.0:
train is deprecated, it is replaced by:
>>> dataset = WN18Dataset()
>>> graph = dataset[0]
>>> train_mask = graph.edata['train_mask']
>>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(train_idx)
>>> rel = graph.edata['etype'][train_idx]
valid is deprecated, it is replaced by:
>>> dataset = WN18Dataset()
>>> graph = dataset[0]
>>> val_mask = graph.edata['val_mask']
>>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(val_idx)
>>> rel = graph.edata['etype'][val_idx]
test is deprecated, it is replaced by:
>>> dataset = WN18Dataset()
>>> graph = dataset[0]
>>> test_mask = graph.edata['test_mask']
>>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
>>> src, dst = graph.find_edges(test_idx)
>>> rel = graph.edata['etype'][test_idx]
The WN18 dataset was introduced in Translating Embeddings for Modeling Multi-relational Data. It includes the full 18 relations scraped from WordNet for roughly 41,000 synsets. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.
WN18 dataset statistics:
Nodes: 40943
Number of relation types: 18
Number of reversed relation types: 18
Label Split:
Train: 141442
Valid: 5000
Test: 5000
- Parameters
reverse (bool) – Whether to add reverse edge. Default True.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
train¶ A numpy array of triplets (src, rel, dst) for the training graph
- Type
numpy.ndarray
valid¶ A numpy array of triplets (src, rel, dst) for the validation graph
- Type
numpy.ndarray
test¶ A numpy array of triplets (src, rel, dst) for the test graph
- Type
numpy.ndarray
Examples
>>> dataset = WN18Dataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges, relabel_nodes=False)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges, relabel_nodes=False)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]¶ Gets the graph object
- Parameters
idx (int) – Item index; WN18Dataset has only one graph object
- Returns
The graph contains:
edata['e_type']: edge relation type
edata['train_edge_mask']: positive training edge mask
edata['val_edge_mask']: positive validation edge mask
edata['test_edge_mask']: positive testing edge mask
edata['train_mask']: training edge set mask (includes reversed training edges)
edata['val_mask']: validation edge set mask (includes reversed validation edges)
edata['test_mask']: testing edge set mask (includes reversed testing edges)
ndata['ntype']: node type, all 0 in this dataset
- Return type
dgl.DGLGraph
BitcoinOTC dataset¶
class dgl.data.BitcoinOTCDataset(raw_dir=None, force_reload=False, verbose=False)[source]¶ BitcoinOTC dataset for fraud detection
This is a who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users’ reputation to prevent transactions with fraudulent and risky users.
Official website: https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html
Bitcoin OTC dataset statistics:
Nodes: 5,881
Edges: 35,592
Range of edge weight: -10 to +10
Percentage of positive edges: 89%
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
- Raises
UserWarning – If the raw data is changed in the remote server by the author.
Examples
>>> dataset = BitcoinOTCDataset()
>>> len(dataset)
136
>>> for g in dataset:
...     # get edge feature
...     edge_weights = g.edata['h']
...     # your code here
ICEWS18 dataset¶
class dgl.data.ICEWS18Dataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]¶ ICEWS18 dataset for temporal graph
Integrated Crisis Early Warning System (ICEWS18)
Event data consists of coded interactions between socio-political actors (i.e., cooperative or hostile actions between individuals, groups, sectors and nation states). This dataset consists of events from 1/1/2018 to 10/31/2018 (24-hour time granularity).
Reference:
Statistics:
Train examples: 240
Valid examples: 30
Test examples: 34
Nodes per graph: 23033
- Parameters
mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
Examples
>>> # get train, valid, test set
>>> train_data = ICEWS18Dataset()
>>> valid_data = ICEWS18Dataset(mode='valid')
>>> test_data = ICEWS18Dataset(mode='test')
>>>
>>> train_size = len(train_data)
>>> for g in train_data:
...     e_feat = g.edata['rel_type']
...     # your code here
GDELT dataset¶
class dgl.data.GDELTDataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]¶ GDELT dataset for event-based temporal graph
The Global Database of Events, Language, and Tone (GDELT) dataset. It contains events that happened all over the world (e.g., every protest held anywhere in Russia on a given day is collapsed to a single entry). This dataset consists of events collected from 1/1/2018 to 1/31/2018 (15-minute time granularity).
Reference:
Statistics:
Train examples: 2,304
Valid examples: 288
Test examples: 384
- Parameters
mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
Examples
>>> # get train, valid, test dataset
>>> train_data = GDELTDataset()
>>> valid_data = GDELTDataset(mode='valid')
>>> test_data = GDELTDataset(mode='test')
>>>
>>> # length of train set
>>> train_size = len(train_data)
>>>
>>> for g in train_data:
...     e_feat = g.edata['rel_type']
...     # your code here
Graph Prediction Datasets¶
DGL hosted datasets for graph classification/regression tasks.
QM7b dataset¶
class dgl.data.QM7bDataset(raw_dir=None, force_reload=False, verbose=False)[source]¶ QM7b dataset for graph property prediction (regression)
This dataset consists of 7,211 molecules with 14 regression targets. Nodes represent atoms and edges represent bonds. The edge feature 'h' stores the corresponding entry of the Coulomb matrix.
Reference: http://quantum-machine.org/datasets/
Statistics:
Number of graphs: 7,211
Number of regression targets: 14
Average number of nodes: 15
Average number of edges: 245
Edge feature size: 1
- Parameters
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False.
- Raises
UserWarning – If the raw data is changed in the remote server by the author.
Examples
>>> data = QM7bDataset()
>>> data.num_labels
14
>>>
>>> # iterate over the dataset
>>> for g, label in data:
...     edge_feat = g.edata['h']  # get edge feature
...     # your code here...
__getitem__(idx)[source]¶ Get graph and label by index
- Parameters
idx (int) – Item index
- Returns
The graph and its label.
- Return type
(dgl.DGLGraph, Tensor)
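A hedged mini-batch sketch for the regression task (assuming dgl.dataloading.GraphDataLoader, which batches the graphs and stacks the labels; the batch size is illustrative):

from dgl.data import QM7bDataset
from dgl.dataloading import GraphDataLoader

data = QM7bDataset()
loader = GraphDataLoader(data, batch_size=32, shuffle=True)
for batched_graph, labels in loader:
    edge_feat = batched_graph.edata['h']  # Coulomb matrix entries
    # labels has shape (batch_size, 14); regress against it here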
QM9 dataset¶
class dgl.data.QM9Dataset(label_keys, cutoff=5.0, raw_dir=None, force_reload=False, verbose=False)[source]¶ QM9 dataset for graph property prediction (regression)
This dataset consists of 130,831 molecules with 12 regression targets. Nodes correspond to atoms and edges correspond to close atom pairs.
This dataset differs from QM9EdgeDataset in the following aspects:
Edges in this dataset are purely distance-based.
It only provides atoms’ coordinates and atomic numbers as node features.
It only provides 12 regression targets.
Reference:
Statistics:
Number of graphs: 130,831
Number of regression targets: 12
Keys | Property | Description | Unit
mu | \(\mu\) | Dipole moment | \(\textrm{D}\)
alpha | \(\alpha\) | Isotropic polarizability | \({a_0}^3\)
homo | \(\epsilon_{\textrm{HOMO}}\) | Highest occupied molecular orbital energy | \(\textrm{eV}\)
lumo | \(\epsilon_{\textrm{LUMO}}\) | Lowest unoccupied molecular orbital energy | \(\textrm{eV}\)
gap | \(\Delta \epsilon\) | Gap between \(\epsilon_{\textrm{HOMO}}\) and \(\epsilon_{\textrm{LUMO}}\) | \(\textrm{eV}\)
r2 | \(\langle R^2 \rangle\) | Electronic spatial extent | \({a_0}^2\)
zpve | \(\textrm{ZPVE}\) | Zero point vibrational energy | \(\textrm{eV}\)
U0 | \(U_0\) | Internal energy at 0K | \(\textrm{eV}\)
U | \(U\) | Internal energy at 298.15K | \(\textrm{eV}\)
H | \(H\) | Enthalpy at 298.15K | \(\textrm{eV}\)
G | \(G\) | Free energy at 298.15K | \(\textrm{eV}\)
Cv | \(c_{\textrm{v}}\) | Heat capacity at 298.15K | \(\frac{\textrm{cal}}{\textrm{mol K}}\)
- Parameters
label_keys (list) – Names of the regression properties, which should be a subset of the keys in the table above.
cutoff (float) – Cutoff distance for interatomic interactions, i.e. two atoms are connected in the corresponding graph if the distance between them is no larger than this. Default: 5.0 Angstrom
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
- Raises
UserWarning – If the raw data is changed in the remote server by the author.
Examples
>>> data = QM9Dataset(label_keys=['mu', 'gap'], cutoff=5.0)
>>> data.num_labels
2
>>>
>>> # iterate over the dataset
>>> for g, label in data:
...     R = g.ndata['R']  # get coordinates of each atom
...     Z = g.ndata['Z']  # get atomic numbers of each atom
...     # your code here...
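To illustrate the cutoff semantics, a sketch that builds distance-based edges from hypothetical atom coordinates; this only illustrates the rule stated above, not necessarily the dataset's internal construction:

>>> import numpy as np
>>> import dgl
>>> R = np.random.rand(4, 3) * 3.0  # hypothetical coordinates of a 4-atom molecule (Angstrom)
>>> dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)  # pairwise distances
>>> mask = (dist <= 5.0) & ~np.eye(4, dtype=bool)  # within cutoff, excluding self-pairs
>>> src, dst = np.nonzero(mask)
>>> g = dgl.graph((src.tolist(), dst.tolist()), num_nodes=4)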
QM9Edge dataset¶
-
class
dgl.data.
QM9EdgeDataset
(label_keys=None, raw_dir=None, force_reload=False, verbose=True)[source]¶ QM9Edge dataset for graph property prediction (regression)
This dataset consists of 130,831 molecules with 19 regression targets. Nodes correspond to atoms and edges correspond to bonds.
- This dataset differs from QM9Dataset in the following aspects:
It includes the bonds in a molecule as the edges of the corresponding graph, while the edges in QM9Dataset are purely distance-based.
It provides edge features and node features in addition to the atoms' coordinates and atomic numbers.
It provides 7 additional regression targets (19 in total, versus 12).
This class is built on a preprocessed version of the dataset, and we provide the preprocessing details here.
Reference:
Statistics:
Number of graphs: 130,831.
Number of regression targets: 19.
Node attributes:
pos: the 3D coordinates of each atom.
attr: the 11D atom features.
Edge attributes:
edge_attr: the 4D bond features.
Regression targets:
Keys | Property | Description | Unit
mu | \(\mu\) | Dipole moment | \(\textrm{D}\)
alpha | \(\alpha\) | Isotropic polarizability | \({a_0}^3\)
homo | \(\epsilon_{\textrm{HOMO}}\) | Highest occupied molecular orbital energy | \(\textrm{eV}\)
lumo | \(\epsilon_{\textrm{LUMO}}\) | Lowest unoccupied molecular orbital energy | \(\textrm{eV}\)
gap | \(\Delta \epsilon\) | Gap between \(\epsilon_{\textrm{HOMO}}\) and \(\epsilon_{\textrm{LUMO}}\) | \(\textrm{eV}\)
r2 | \(\langle R^2 \rangle\) | Electronic spatial extent | \({a_0}^2\)
zpve | \(\textrm{ZPVE}\) | Zero point vibrational energy | \(\textrm{eV}\)
U0 | \(U_0\) | Internal energy at 0K | \(\textrm{eV}\)
U | \(U\) | Internal energy at 298.15K | \(\textrm{eV}\)
H | \(H\) | Enthalpy at 298.15K | \(\textrm{eV}\)
G | \(G\) | Free energy at 298.15K | \(\textrm{eV}\)
Cv | \(c_{\textrm{v}}\) | Heat capacity at 298.15K | \(\frac{\textrm{cal}}{\textrm{mol K}}\)
U0_atom | \(U_0^{\textrm{ATOM}}\) | Atomization energy at 0K | \(\textrm{eV}\)
U_atom | \(U^{\textrm{ATOM}}\) | Atomization energy at 298.15K | \(\textrm{eV}\)
H_atom | \(H^{\textrm{ATOM}}\) | Atomization enthalpy at 298.15K | \(\textrm{eV}\)
G_atom | \(G^{\textrm{ATOM}}\) | Atomization free energy at 298.15K | \(\textrm{eV}\)
A | \(A\) | Rotational constant | \(\textrm{GHz}\)
B | \(B\) | Rotational constant | \(\textrm{GHz}\)
C | \(C\) | Rotational constant | \(\textrm{GHz}\)
- Parameters
label_keys (list) – Names of the regression properties, which should be a subset of the keys in the table above. If not provided, all labels are loaded.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
- Raises
UserWarning – If the raw data is changed in the remote server by the author.
Examples
>>> data = QM9EdgeDataset(label_keys=['mu', 'alpha'])
>>> data.num_labels
2
>>>
>>> # iterate over the dataset
>>> for graph, labels in data:
...     print(graph)   # get information of each graph
...     print(labels)  # get labels of the corresponding graph
...     # your code here...
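Regression targets are often standardized before training. A generic sketch, assuming the PyTorch backend and that each sample yields a per-graph label tensor (iterating the full dataset is slow and shown only for brevity):

>>> import torch
>>> data = QM9EdgeDataset(label_keys=['mu', 'alpha'])
>>> all_labels = torch.stack([labels for _, labels in data])
>>> mean, std = all_labels.mean(dim=0), all_labels.std(dim=0)
>>> normalized = (all_labels - mean) / std  # zero mean, unit variance per target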
-
__getitem__
(idx)[source]¶ Get graph and label by index
- Parameters
idx (int) – Item index
- Returns
dgl.DGLGraph – The graph contains:
ndata['pos']: the coordinates of each atom
ndata['attr']: the features of each atom
edata['edge_attr']: the features of each bond
Tensor – Property values of molecular graphs
Mini graph classification dataset¶
-
class
dgl.data.
MiniGCDataset
(num_graphs, min_num_v, max_num_v, seed=0, save_graph=True, force_reload=False, verbose=False)[source]¶ The synthetic graph classification dataset class.
The dataset contains 8 different types of graphs.
class 0 : cycle graph
class 1 : star graph
class 2 : wheel graph
class 3 : lollipop graph
class 4 : hypercube graph
class 5 : grid graph
class 6 : clique graph
class 7 : circular ladder graph
- Parameters
num_graphs (int) – Number of graphs in this dataset.
min_num_v (int) – Minimum number of nodes for graphs.
max_num_v (int) – Maximum number of nodes for graphs.
seed (int) – Random seed for data generation. Default: 0
save_graph (bool) – Whether to save the generated graphs on disk. Default: True
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False
Examples
>>> data = MiniGCDataset(100, 16, 32, seed=0)
The dataset instance is an iterable
>>> len(data)
100
>>> g, label = data[64]
>>> g
Graph(num_nodes=20, num_edges=82,
      ndata_schemes={}
      edata_schemes={})
>>> label
tensor(5)
Batch the graphs and labels for mini-batch training
>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=356, num_edges=1060,
      ndata_schemes={}
      edata_schemes={})
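The same batching logic can be wrapped in a standard PyTorch DataLoader via a small collate function; a sketch assuming the PyTorch backend:

>>> import dgl
>>> import torch
>>> from torch.utils.data import DataLoader
>>>
>>> def collate(samples):
...     # samples is a list of (graph, label) pairs from the dataset
...     graphs, labels = zip(*samples)
...     return dgl.batch(graphs), torch.tensor(labels)
...
>>> loader = DataLoader(data, batch_size=16, shuffle=True, collate_fn=collate)
>>> for batched_graphs, batched_labels in loader:
...     pass  # your training code here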
TU dataset¶
-
class
dgl.data.
TUDataset
(name, raw_dir=None, force_reload=False, verbose=False)[source]¶ TUDataset contains lots of graph kernel datasets for graph classification.
- Parameters
name (str) – Dataset name, such as ENZYMES, DD, COLLAB, MUTAG; can be any dataset name listed at https://chrsmrrs.github.io/datasets/docs/datasets/.
Notes
IMPORTANT: Some of the datasets contain duplicate edges, e.g. the edges in IMDB-BINARY are all duplicated. DGL faithfully keeps the duplicates as per the original data. Other frameworks such as PyTorch Geometric remove the duplicates by default. You can remove the duplicate edges with dgl.to_simple(), as shown in the sketch below.
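A minimal sketch of the de-duplication, using IMDB-BINARY since its edges are all duplicated (note that dgl.to_simple() does not copy edge features by default):

>>> import dgl
>>> data = TUDataset('IMDB-BINARY')
>>> g, label = data[0]
>>> simple_g = dgl.to_simple(g)  # merge duplicate edges into a single edge
>>> simple_g.num_edges() <= g.num_edges()
True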
Examples
>>> data = TUDataset('DD')
The dataset instance is an iterable
>>> len(data)
1178
>>> g, label = data[1024]
>>> g
Graph(num_nodes=88, num_edges=410,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'node_labels': Scheme(shape=(1,), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
>>> label
tensor([1])
Batch the graphs and labels for mini-batch training
>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=9539, num_edges=47382,
      ndata_schemes={'node_labels': Scheme(shape=(1,), dtype=torch.int64), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
Notes
Graphs may have node labels, node attributes, edge labels, and edge attributes, varying across datasets.
Labels are mapped to \(\lbrace 0,\cdots,n-1 \rbrace\) where \(n\) is the number of labels (some datasets have raw labels \(\lbrace -1, 1 \rbrace\) which will be mapped to \(\lbrace 0, 1 \rbrace\)). In previous versions, the minimum label was added so that \(\lbrace -1, 1 \rbrace\) was mapped to \(\lbrace 0, 2 \rbrace\).
-
__getitem__
(idx)[source]¶ Get the idx-th sample.
- Parameters
idx (int) – The sample index.
- Returns
Graph with node feature stored in the feat field and node label in the node_label field if available, and its label.
- Return type
(dgl.DGLGraph, Tensor)
-
class
dgl.data.
LegacyTUDataset
(name, use_pandas=False, hidden_size=10, max_allow_node=None, raw_dir=None, force_reload=False, verbose=False)[source]¶ LegacyTUDataset contains lots of graph kernel datasets for graph classification.
- Parameters
name (str) – Dataset name, such as ENZYMES, DD, COLLAB, MUTAG; can be any dataset name listed at https://chrsmrrs.github.io/datasets/docs/datasets/.
use_pandas (bool) – NumPy's file-reading function has performance issues with large files; using pandas can be faster. Default: False
hidden_size (int) – Some datasets don't contain node features; in that case, constant node features with size hidden_size are used instead. Default: 10
max_allow_node (int) – Remove graphs that contain more nodes than max_allow_node. Default: None
Examples
>>> data = LegacyTUDataset('DD')
The dataset instance is an iterable
>>> len(data)
1178
>>> g, label = data[1024]
>>> g
Graph(num_nodes=88, num_edges=410,
      ndata_schemes={'feat': Scheme(shape=(89,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
>>> label
tensor(1)
Batch the graphs and labels for mini-batch training
>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=9539, num_edges=47382,
      ndata_schemes={'feat': Scheme(shape=(89,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
Notes
LegacyTUDataset uses the provided node features by default. If no features are provided, it uses one-hot node labels instead. If node labels are not provided either, it uses constant node features.
-
__getitem__
(idx)[source]¶ Get the idx-th sample.
- Parameters
idx (int) – The sample index.
- Returns
Graph with node feature stored in the feat field and node label in the node_label field if available, and its label.
- Return type
(dgl.DGLGraph, Tensor)
Graph isomorphism network dataset¶
-
class
dgl.data.
GINDataset
(name, self_loop, degree_as_nlabel=False, raw_dir=None, force_reload=False, verbose=False)[source]¶ Dataset Class for How Powerful Are Graph Neural Networks?.
This is adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.
The class provides an interface for nine datasets used in the paper along with the paper-specific settings. The datasets are 'MUTAG', 'COLLAB', 'IMDBBINARY', 'IMDBMULTI', 'NCI1', 'PROTEINS', 'PTC', 'REDDITBINARY', 'REDDITMULTI5K'.
If degree_as_nlabel is set to False, then ndata['label'] stores the provided node labels; otherwise ndata['label'] stores the node in-degrees.
For graphs that have node attributes, ndata['attr'] stores the node attributes. For graphs that have no attributes, ndata['attr'] stores the corresponding one-hot encoding of ndata['label'].
- Parameters
name (str) – Dataset name, one of 'MUTAG', 'COLLAB', 'IMDBBINARY', 'IMDBMULTI', 'NCI1', 'PROTEINS', 'PTC', 'REDDITBINARY', 'REDDITMULTI5K'.
self_loop (bool) – Whether to add self loops to the graphs.
degree_as_nlabel (bool) – Whether to use node in-degrees as node labels. Default: False
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False
Examples
>>> data = GINDataset(name='MUTAG', self_loop=False)
The dataset instance is an iterable
>>> len(data)
188
>>> g, label = data[128]
>>> g
Graph(num_nodes=13, num_edges=26,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(7,), dtype=torch.float32)}
      edata_schemes={})
>>> label
tensor(1)
Batch the graphs and labels for mini-batch training
>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=330, num_edges=748,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(7,), dtype=torch.float32)}
      edata_schemes={})
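To inspect the degree_as_nlabel behavior described above, a brief sketch using IMDBBINARY, which ships without node labels, so degree-derived labels are the natural choice there:

>>> data = GINDataset(name='IMDBBINARY', self_loop=False, degree_as_nlabel=True)
>>> g, label = data[0]
>>> node_labels = g.ndata['label']  # node labels derived from node in-degrees
>>> node_attrs = g.ndata['attr']    # one-hot encoding of ndata['label']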
Fake news dataset¶
-
class
dgl.data.
FakeNewsDataset
(name, feature_name, raw_dir=None)[source]¶ Fake News Graph Classification dataset.
The dataset is composed of two sets of tree-structured fake/real news propagation graphs extracted from Twitter. Unlike most benchmark datasets for graph classification, the graphs in this dataset are directed trees in which the root node represents a news item and the leaf nodes are Twitter users who retweeted it. The node features encode users' historical tweets with different pretrained language models:
bert: the 768-dimensional node feature composed of Twitter user historical tweets encoded by the bert-as-service
content: the 310-dimensional node feature composed of a 300-dimensional “spacy” vector plus a 10-dimensional “profile” vector
profile: the 10-dimensional node feature composed of ten Twitter user profile attributes.
spacy: the 300-dimensional node feature composed of Twitter user historical tweets encoded by the spaCy word2vec encoder.
Reference: https://github.com/safe-graph/GNN-FakeNews
Note: this dataset is for academic use only, and commercial use is prohibited.
Statistics:
Politifact:
Graphs: 314
Nodes: 41,054
Edges: 40,740
Classes:
Fake: 157
Real: 157
Node feature size:
bert: 768
content: 310
profile: 10
spacy: 300
Gossipcop:
Graphs: 5,464
Nodes: 314,262
Edges: 308,798
Classes:
Fake: 2,732
Real: 2,732
Node feature size:
bert: 768
content: 310
profile: 10
spacy: 300
- Parameters
name (str) – Name of the sub-dataset, 'politifact' or 'gossipcop'.
feature_name (str) – Name of the node feature to load, one of 'bert', 'content', 'profile' and 'spacy'.
raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/
-
labels
¶ Graph labels
- Type
Tensor
-
feature
¶ Node features
- Type
Tensor
-
train_mask
¶ Mask of training set
- Type
Tensor
-
val_mask
¶ Mask of validation set
- Type
Tensor
-
test_mask
¶ Mask of testing set
- Type
Tensor
Examples
>>> dataset = FakeNewsDataset('gossipcop', 'bert')
>>> graph, label = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = dataset.feature
>>> labels = dataset.labels
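The split masks are index-aligned with the graphs, so the training subset can be recovered with torch.nonzero; a sketch assuming the PyTorch backend:

>>> import torch
>>> dataset = FakeNewsDataset('gossipcop', 'bert')
>>> train_idx = torch.nonzero(dataset.train_mask, as_tuple=True)[0]
>>> train_samples = [dataset[int(i)] for i in train_idx]  # list of (graph, label) pairs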
-
__getitem__
(i)[source]¶ Get graph and label by index
- Parameters
i (int) – Item index
- Returns
The i-th graph and its label.
- Return type
(dgl.DGLGraph, Tensor)
Utilities¶
dgl.data.utils.get_download_dir – Get the absolute path to the download directory.
dgl.data.utils.download – Download a given URL.
dgl.data.utils.check_sha1 – Check whether the sha1 hash of the file content matches the expected hash.
dgl.data.utils.extract_archive – Extract archive file.
dgl.data.utils.split_dataset – Split dataset into training, validation and test set.
dgl.data.utils.load_labels – Load label dict from file.
dgl.data.utils.save_info – Save dataset related information into disk.
dgl.data.utils.load_info – Load dataset related information from disk.
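A minimal sketch of how a few of these utilities fit together, assuming the dgl.data.utils module paths listed above: split_dataset partitions any dataset by fractions, and save_info/load_info round-trip a metadata dict through disk:

>>> from dgl.data import MiniGCDataset
>>> from dgl.data.utils import split_dataset, save_info, load_info
>>>
>>> data = MiniGCDataset(100, 16, 32)
>>> train_set, val_set, test_set = split_dataset(data, frac_list=[0.8, 0.1, 0.1], shuffle=True)
>>> len(train_set)
80
>>>
>>> save_info('/tmp/minigc_info.pkl', {'num_graphs': len(data)})
>>> load_info('/tmp/minigc_info.pkl')['num_graphs']
100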