OnDiskDataset
- class dgl.graphbolt.OnDiskDataset(path: str, include_original_edge_id: bool = False, force_preprocess: bool | None = None, auto_cast_to_optimal_dtype: bool = True)
Bases: Dataset
An on-disk dataset that reads the graph topology, feature data, and training/validation/test sets from disk.
Data that is too large to fit into RAM remains on disk, while the rest resides in RAM once OnDiskDataset is initialized. This behavior can be controlled through the in_memory field in the YAML file. All paths in the YAML file are relative to the dataset directory.
A full example of the YAML file is as follows:
    dataset_name: graphbolt_test
    graph:
      nodes:
        - type: paper # could be omitted for homogeneous graph.
          num: 1000
        - type: author
          num: 1000
      edges:
        - type: author:writes:paper # could be omitted for homogeneous graph.
          format: csv # Can be csv only.
          path: edge_data/author-writes-paper.csv
        - type: paper:cites:paper
          format: csv
          path: edge_data/paper-cites-paper.csv
    feature_data:
      - domain: node
        type: paper # could be omitted for homogeneous graph.
        name: feat
        format: numpy
        in_memory: false # If not specified, default to true.
        path: node_data/paper-feat.npy
      - domain: edge
        type: "author:writes:paper"
        name: feat
        format: numpy
        in_memory: false
        path: edge_data/author-writes-paper-feat.npy
    tasks:
      - name: "edge_classification"
        num_classes: 10
        train_set:
          - type: paper # could be omitted for homogeneous graph.
            data: # multiple data sources could be specified.
              - name: seeds
                format: numpy # Can be numpy or torch.
                in_memory: true # If not specified, default to true.
                path: set/paper-train-seeds.npy
              - name: labels
                format: numpy
                path: set/paper-train-labels.npy
        validation_set:
          - type: paper
            data:
              - name: seeds
                format: numpy
                path: set/paper-validation-seeds.npy
              - name: labels
                format: numpy
                path: set/paper-validation-labels.npy
        test_set:
          - type: paper
            data:
              - name: seeds
                format: numpy
                path: set/paper-test-seeds.npy
              - name: labels
                format: numpy
                path: set/paper-test-labels.npy
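Because every path in the YAML file is relative to the dataset directory, it can help to check the directory layout before constructing the dataset. The following is a minimal sketch using only the Python standard library; `missing_files` is a hypothetical helper and not part of GraphBolt, and the relative paths are copied by hand from the example above rather than parsed from the YAML file.

```python
import tempfile
from pathlib import Path


def missing_files(dataset_dir, relative_paths):
    """Return the relative paths that do not exist under dataset_dir.

    Hypothetical helper for illustration; not a GraphBolt API.
    """
    root = Path(dataset_dir)
    return [p for p in relative_paths if not (root / p).is_file()]


# Build a throwaway dataset directory that contains only one of two files.
with tempfile.TemporaryDirectory() as dataset_dir:
    present = Path(dataset_dir) / "node_data" / "paper-feat.npy"
    present.parent.mkdir(parents=True)
    present.touch()

    result = missing_files(
        dataset_dir,
        ["node_data/paper-feat.npy", "edge_data/paper-cites-paper.csv"],
    )
    print(result)
```

The check resolves each path against the dataset directory, mirroring how the YAML paths are described above. It only verifies that the files exist, not that their format is valid.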
- Parameters:
path (str) – The YAML file path.
include_original_edge_id (bool, optional) – Whether to include the original edge IDs in the FusedCSCSamplingGraph. Default is False.
force_preprocess (bool, optional) – Whether to force the on-disk dataset to be preprocessed again. Default is None.
auto_cast_to_optimal_dtype (bool, optional) – Casts the dtypes of tensors in the dataset into smallest possible dtypes for reduced storage requirements and potentially increased performance. Default is True.
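The idea behind choosing the smallest possible dtype can be illustrated with a short sketch. This is not GraphBolt's implementation, and the exact rules it applies are not described here; `smallest_int_dtype` is a hypothetical helper that only shows the general principle for signed integers, using plain Python.

```python
# Signed integer dtypes, ordered from narrowest to widest.
INT_DTYPES = [("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)]


def smallest_int_dtype(min_value, max_value):
    """Return the name of the narrowest signed dtype holding the range.

    Hypothetical helper for illustration; not a GraphBolt API.
    """
    for name, bits in INT_DTYPES:
        lowest = -(2 ** (bits - 1))
        highest = 2 ** (bits - 1) - 1
        if lowest <= min_value and max_value <= highest:
            return name
    raise OverflowError("value range does not fit in a 64-bit signed integer")


# Node IDs for 1000 nodes range from 0 to 999.
print(smallest_int_dtype(0, 999))
```

A narrower dtype reduces storage in proportion to its width, which is the motivation stated for this parameter.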
- load(tasks: List[str] | None = None)
Load the dataset.
- Parameters:
tasks (List[str], optional) – The names of the tasks to load. A single task may be given either as a string or as a List[str]; multiple tasks must be given as a List[str]. Default is None.
Examples
1. Loading via a single task name "node_classification".
    >>> dataset = gb.OnDiskDataset(base_dir).load(
    ...     tasks="node_classification")
    >>> len(dataset.tasks)
    1
    >>> dataset.tasks[0].metadata["name"]
    "node_classification"
2. Loading via a single task name in a list, ["node_classification"].
    >>> dataset = gb.OnDiskDataset(base_dir).load(
    ...     tasks=["node_classification"])
    >>> len(dataset.tasks)
    1
    >>> dataset.tasks[0].metadata["name"]
    "node_classification"
3. Loading via multiple task names ["node_classification", "link_prediction"].
    >>> dataset = gb.OnDiskDataset(base_dir).load(
    ...     tasks=["node_classification","link_prediction"])
    >>> len(dataset.tasks)
    2
    >>> dataset.tasks[0].metadata["name"]
    "node_classification"
    >>> dataset.tasks[1].metadata["name"]
    "link_prediction"
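The accepted forms of the tasks argument can be summarized with a small sketch. `normalize_tasks` is a hypothetical helper written for illustration, not GraphBolt code. Passing None through unchanged is an assumption of this sketch, since the behavior for None is not spelled out above.

```python
def normalize_tasks(tasks=None):
    """Turn a task selection into a list of names, or None if unset.

    Hypothetical helper for illustration; not a GraphBolt API.
    """
    if tasks is None:
        # Assumption: None means no explicit selection was made.
        return None
    if isinstance(tasks, str):
        # A bare string names exactly one task.
        return [tasks]
    return list(tasks)


print(normalize_tasks("node_classification"))
print(normalize_tasks(["node_classification", "link_prediction"]))
```

Checking for a string first matters because a string is itself iterable, so calling `list` on it directly would split the name into characters.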
- property all_nodes_set: ItemSet | ItemSetDict
Return the itemset containing all nodes.
- property feature: TorchBasedFeatureStore
Return the feature store.
- property graph: SamplingGraph
Return the graph.
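As a rough illustration of how the properties and tasks fit together, the sketch below collects a few facts from a loaded dataset. `summarize_dataset` is a hypothetical helper, and it relies only on the attributes documented on this page (tasks, metadata["name"], graph, feature). To keep the snippet runnable without DGL installed, it is exercised here on a stand-in object rather than a real OnDiskDataset.

```python
from types import SimpleNamespace


def summarize_dataset(dataset):
    """Collect task names and the types of the graph and feature store.

    Hypothetical helper for illustration; not a GraphBolt API.
    """
    return {
        "task_names": [task.metadata["name"] for task in dataset.tasks],
        "graph_type": type(dataset.graph).__name__,
        "feature_type": type(dataset.feature).__name__,
    }


# Stand-in with the same attribute names as a loaded dataset.
fake_task = SimpleNamespace(metadata={"name": "node_classification"})
fake_dataset = SimpleNamespace(
    tasks=[fake_task], graph=object(), feature=object()
)
summary = summarize_dataset(fake_dataset)
print(summary)
```

With DGL available, the same function could presumably be called on the result of `gb.OnDiskDataset(base_dir).load()`, where `gb` is `dgl.graphbolt` and `base_dir` is the dataset location used in the examples above.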