OnDiskDataset

class dgl.graphbolt.OnDiskDataset(path: str, include_original_edge_id: bool = False, force_preprocess: bool | None = None, auto_cast_to_optimal_dtype: bool = True)

Bases: Dataset

An on-disk dataset that reads the graph topology, feature data, and training/validation/test sets from disk.

Because memory is limited, data that is too large to fit into RAM remains on disk, while the rest resides in RAM once OnDiskDataset is initialized. Users can control this behavior through the in_memory field in the YAML file. All paths in the YAML file are relative to the dataset directory.

A full example of the YAML file follows:

dataset_name: graphbolt_test
graph:
  nodes:
    - type: paper # could be omitted for homogeneous graph.
      num: 1000
    - type: author
      num: 1000
  edges:
    - type: author:writes:paper # could be omitted for homogeneous graph.
      format: csv # Can be csv only.
      path: edge_data/author-writes-paper.csv
    - type: paper:cites:paper
      format: csv
      path: edge_data/paper-cites-paper.csv
feature_data:
  - domain: node
    type: paper # could be omitted for homogeneous graph.
    name: feat
    format: numpy
    in_memory: false # If not specified, default to true.
    path: node_data/paper-feat.npy
  - domain: edge
    type: "author:writes:paper"
    name: feat
    format: numpy
    in_memory: false
    path: edge_data/author-writes-paper-feat.npy
tasks:
  - name: "edge_classification"
    num_classes: 10
    train_set:
      - type: paper # could be omitted for homogeneous graph.
        data: # multiple data sources could be specified.
          - name: seeds
            format: numpy # Can be numpy or torch.
            in_memory: true # If not specified, default to true.
            path: set/paper-train-seeds.npy
          - name: labels
            format: numpy
            path: set/paper-train-labels.npy
    validation_set:
      - type: paper
        data:
          - name: seeds
            format: numpy
            path: set/paper-validation-seeds.npy
          - name: labels
            format: numpy
            path: set/paper-validation-labels.npy
    test_set:
      - type: paper
        data:
          - name: seeds
            format: numpy
            path: set/paper-test-seeds.npy
          - name: labels
            format: numpy
            path: set/paper-test-labels.npy
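
Because every path in the YAML file is relative to the dataset directory and in_memory defaults to true, it can help to list how each feature entry would be resolved before loading. The sketch below is not part of GraphBolt: resolve_feature_entries is a hypothetical helper, my_dataset is a made-up directory name, and the metadata dict is a hand-written stand-in adapted from the feature_data section above (parsing the YAML itself is left out, and in_memory is omitted from the second entry to show the default).

```python
from pathlib import Path

# Hand-written stand-in adapted from the `feature_data` section of the YAML above.
metadata = {
    "feature_data": [
        {
            "domain": "node",
            "type": "paper",
            "name": "feat",
            "format": "numpy",
            "in_memory": False,
            "path": "node_data/paper-feat.npy",
        },
        {
            "domain": "edge",
            "type": "author:writes:paper",
            "name": "feat",
            "format": "numpy",
            # `in_memory` is omitted here on purpose to show the default.
            "path": "edge_data/author-writes-paper-feat.npy",
        },
    ]
}


def resolve_feature_entries(dataset_dir, metadata):
    """Hypothetical helper: pair each feature with its resolved path and in_memory flag."""
    root = Path(dataset_dir)
    resolved = []
    for entry in metadata.get("feature_data", []):
        resolved.append(
            {
                "key": (entry["domain"], entry.get("type"), entry["name"]),
                # Paths in the YAML are relative to the dataset directory.
                "path": root / entry["path"],
                # The documented default is true when the field is absent.
                "in_memory": entry.get("in_memory", True),
            }
        )
    return resolved


entries = resolve_feature_entries("my_dataset", metadata)
for item in entries:
    print(item["key"], item["path"].as_posix(), item["in_memory"])
```

Entries whose flag comes out false are the ones expected to stay on disk after initialization.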

Parameters:

path (str) – The path to the dataset directory that contains the YAML file.
include_original_edge_id (bool, optional) – Whether to include the original edge id in the FusedCSCSamplingGraph.
force_preprocess (bool, optional) – Whether to force preprocessing and reloading of the on-disk dataset.
auto_cast_to_optimal_dtype (bool, optional) – Whether to cast the tensors in the dataset to the smallest possible dtypes, which reduces storage requirements and may improve performance. Default is True.
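
To make the idea behind auto_cast_to_optimal_dtype concrete, the sketch below picks the narrowest signed integer width that can hold a given value range. It only illustrates the principle: smallest_int_dtype is a hypothetical helper, and the rule GraphBolt actually applies may differ.

```python
# Signed integer widths, from narrowest to widest.
_INT_BITS = (8, 16, 32, 64)


def smallest_int_dtype(min_value, max_value):
    """Hypothetical helper: name of the narrowest signed int dtype covering the range."""
    for bits in _INT_BITS:
        low = -(2 ** (bits - 1))
        high = 2 ** (bits - 1) - 1
        if low <= min_value and max_value <= high:
            return f"int{bits}"
    raise OverflowError("value range does not fit in a 64-bit signed integer")


# Node IDs for a graph with 1000 nodes range over 0..999.
print(smallest_int_dtype(0, 999))
# IDs beyond the 32-bit signed range need 64 bits.
print(smallest_int_dtype(0, 3_000_000_000))
```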

load(tasks: List[str] | None = None)

Load the dataset.

Parameters:

tasks (List[str] = None) – The names of the tasks to load. A single task can be given as either a string or a List[str]; multiple tasks must be given as a List[str].
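
The rule for tasks (a string or a list for one task, a list for several) can be sketched as a small normalization step. The helpers normalize_tasks and select_tasks below are hypothetical and written only for illustration; they are not GraphBolt's internal code, and treating None as "load every task" is an assumption.

```python
def normalize_tasks(tasks):
    """Hypothetical helper: turn None, one name, or a list of names into a list (or None)."""
    if tasks is None:
        return None  # Assumption: None means "load every task".
    if isinstance(tasks, str):
        return [tasks]  # A single task may be given as a bare string.
    return list(tasks)


def select_tasks(available, tasks=None):
    """Keep only the requested task names, preserving the order in `available`."""
    wanted = normalize_tasks(tasks)
    if wanted is None:
        return list(available)
    return [name for name in available if name in wanted]


available = ["node_classification", "link_prediction"]
print(select_tasks(available, "node_classification"))
print(select_tasks(available, ["node_classification", "link_prediction"]))
print(select_tasks(available))
```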

Examples

Loading via a single task name, “node_classification”.

>>> import dgl.graphbolt as gb
>>> dataset = gb.OnDiskDataset(base_dir).load(
...     tasks="node_classification")
>>> len(dataset.tasks)
1
>>> dataset.tasks[0].metadata["name"]
'node_classification'

Loading via a single-element list, [“node_classification”].

>>> dataset = gb.OnDiskDataset(base_dir).load(
...     tasks=["node_classification"])
>>> len(dataset.tasks)
1
>>> dataset.tasks[0].metadata["name"]
'node_classification'

Loading via multiple task names, [“node_classification”, “link_prediction”].

>>> dataset = gb.OnDiskDataset(base_dir).load(
...     tasks=["node_classification", "link_prediction"])
>>> len(dataset.tasks)
2
>>> dataset.tasks[0].metadata["name"]
'node_classification'
>>> dataset.tasks[1].metadata["name"]
'link_prediction'

property all_nodes_set: ItemSet | ItemSetDict

Return the itemset containing all nodes.

property dataset_name: str

Return the dataset name.

property feature: TorchBasedFeatureStore

Return the feature.

property graph: SamplingGraph

Return the graph.

property tasks: List[Task]

Return the tasks.

property yaml_data: Dict

Return the YAML data.
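
The properties above can be combined into a short summary of a loaded dataset. The sketch below relies only on dataset_name, tasks, and the metadata["name"] lookup used in the examples; summarize is a hypothetical helper, and the SimpleNamespace objects are stand-ins so that the snippet runs without a dataset on disk.

```python
from types import SimpleNamespace


def summarize(dataset):
    """Hypothetical helper: one-line description built from documented properties."""
    task_names = [task.metadata["name"] for task in dataset.tasks]
    return f"{dataset.dataset_name}: {len(task_names)} task(s) [{', '.join(task_names)}]"


# Stand-in object exposing the same attribute names as a loaded dataset.
fake_dataset = SimpleNamespace(
    dataset_name="graphbolt_test",
    tasks=[
        SimpleNamespace(metadata={"name": "node_classification"}),
        SimpleNamespace(metadata={"name": "link_prediction"}),
    ],
)

summary = summarize(fake_dataset)
print(summary)
```

With a real dataset, the object returned by load() would be passed in place of fake_dataset.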