ZINCDataset

class dgl.data.ZINCDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: dgl.data.dgl_dataset.DGLBuiltinDataset

ZINC dataset for the graph regression task.

The dataset is a 12K-graph subset of the 250K ZINC molecular graphs, used to regress a molecular property known as constrained solubility. In each molecular graph, the node features are the types of heavy atoms and the edge features are the types of bonds between them. Each graph contains 9-37 nodes and 16-84 edges.

Reference: https://arxiv.org/pdf/2003.00982.pdf

Statistics:

  • Train examples: 10,000

  • Valid examples: 1,000

  • Test examples: 1,000

  • Average number of nodes: 23.16

  • Average number of edges: 39.83

  • Number of atom types: 28

  • Number of bond types: 4

Parameters
  • mode (str, optional) – Dataset split to load. Must be one of “train”, “valid”, or “test”. Default: “train”.

  • raw_dir (str) – Directory that contains the raw input data, or into which the data is downloaded. Default: “~/.dgl/”.

  • force_reload (bool) – Whether to reload the dataset. Default: False.

  • verbose (bool) – Whether to print out progress information. Default: False.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
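Because mode selects a single split, a common pattern is to build one dataset object per split. The following is a minimal sketch of that pattern, not part of the DGL API: load_splits and load_zinc_splits are hypothetical helper names, and the only assumption is that the dataset class accepts the mode keyword described above.

```python
# Hypothetical helpers (not provided by DGL) for loading every split at once.
MODES = ("train", "valid", "test")


def load_splits(dataset_cls, **kwargs):
    """Instantiate dataset_cls once per mode and return a dict keyed by mode.

    Extra keyword arguments (for example raw_dir or transform) are passed
    unchanged to every constructor call.
    """
    return {mode: dataset_cls(mode=mode, **kwargs) for mode in MODES}


def load_zinc_splits(**kwargs):
    """Convenience wrapper that applies load_splits to ZINCDataset."""
    # Imported lazily so load_splits stays usable without DGL installed.
    from dgl.data import ZINCDataset

    return load_splits(ZINCDataset, **kwargs)
```

With DGL installed, load_zinc_splits() would fetch the data on first use and return a dict with the keys "train", "valid" and "test". According to the statistics above, the splits should contain 10,000, 1,000 and 1,000 examples respectively.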

num_atom_types

Number of atom types.

Type

int

num_bond_types

Number of bond types.

Type

int

Examples

>>> from dgl.data import ZINCDataset
>>> training_set = ZINCDataset(mode="train")
>>> training_set.num_atom_types
28
>>> len(training_set)
10000
>>> graph, label = training_set[0]
>>> graph
Graph(num_nodes=29, num_edges=64,
    ndata_schemes={'feat': Scheme(shape=(), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(), dtype=torch.int64)})
__getitem__(idx)[source]

Get one example by index.

Parameters

idx (int) – The sample index.

Returns

  • dgl.DGLGraph – Each graph contains:

    • ndata['feat']: Types of heavy atoms as node features

    • edata['feat']: Types of bonds as edge features

  • Tensor – Constrained solubility as graph label
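To illustrate how the returned pair might be consumed, the sketch below counts how often each atom-type index occurs in ndata['feat']. The helper name atom_type_counts is hypothetical. It assumes only that each item is a (graph, label) pair and that the node feature tensor offers tolist(), as PyTorch tensors do.

```python
from collections import Counter


def atom_type_counts(dataset, max_graphs=None):
    """Count occurrences of each atom-type index across the dataset.

    max_graphs optionally limits the scan to the first few graphs.
    """
    total = len(dataset)
    n = total if max_graphs is None else min(max_graphs, total)
    counts = Counter()
    for i in range(n):
        graph, _label = dataset[i]
        # ndata['feat'] holds one integer atom type per node;
        # tolist() converts it to plain Python ints.
        counts.update(graph.ndata["feat"].tolist())
    return counts
```

Called on a ZINCDataset instance, the result would map atom-type indices to node counts, with at most num_atom_types distinct keys.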

__len__()[source]

The number of examples in the dataset.
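
Together, __len__ and __getitem__ allow plain index-based iteration. As a minimal sketch, the hypothetical helper below uses both to compute the mean and population standard deviation of the graph labels, which is one way to prepare target normalization for regression. It assumes each label can be converted with float(), which holds for single-element tensors.

```python
import math


def label_mean_std(dataset):
    """Return (mean, population std) of the labels in a (graph, label) dataset."""
    # Index-based loop relies only on __len__ and __getitem__.
    labels = [float(dataset[i][1]) for i in range(len(dataset))]
    mean = sum(labels) / len(labels)
    variance = sum((y - mean) ** 2 for y in labels) / len(labels)
    return mean, math.sqrt(variance)
```

The statistics would normally be computed on the training split only and then reused for the validation and test splits.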