mask_nodes_by_property

dgl.data.utils.mask_nodes_by_property(property_values, part_ratios, random_seed=None)

Provide the split masks for a node split with distributional shift based on a given node property, as proposed in "Evaluating Robustness and Uncertainty of Graph Models Under Structural Distributional Shifts".

The split distinguishes in-distribution (ID) and out-of-distribution (OOD) subsets of nodes. The ID subset includes training, validation and testing parts, while the OOD subset includes only validation and testing parts. The function sorts the nodes in ascending order of their property values, splits them into 5 non-intersecting parts, and creates 5 associated node mask arrays:

  • 3 for the ID nodes: 'in_train_mask', 'in_valid_mask', 'in_test_mask',

  • and 2 for the OOD nodes: 'out_valid_mask', 'out_test_mask'.
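The sort-and-split procedure described above can be sketched in plain NumPy. This is an illustrative reconstruction from the description on this page, not DGL's actual implementation: the function name `sketch_property_split`, the rounding rule for part sizes, and the use of boolean NumPy arrays for the masks are all assumptions.

```python
import numpy as np

MASK_NAMES = [
    "in_train_mask", "in_valid_mask", "in_test_mask",
    "out_valid_mask", "out_test_mask",
]


def sketch_property_split(property_values, part_ratios, random_seed=None):
    """Hypothetical sketch of a property-based ID/OOD split (not DGL code)."""
    property_values = np.asarray(property_values)
    num_nodes = len(property_values)

    # Assumed rounding rule: round each part, let the last part absorb the error.
    part_sizes = np.round(num_nodes * np.asarray(part_ratios)).astype(int)
    part_sizes[-1] = num_nodes - part_sizes[:-1].sum()

    rng = np.random.RandomState(random_seed)
    # Random initial order, so nodes with equal property values are ordered randomly.
    permutation = rng.permutation(num_nodes)
    # A stable sort preserves that random order among ties.
    order = permutation[np.argsort(property_values[permutation], kind="stable")]

    # Shuffle the ID block (first three parts) so that the ID train/valid/test
    # parts are drawn from the same range of property values.
    num_in = part_sizes[:3].sum()
    order[:num_in] = rng.permutation(order[:num_in])

    # Cut the ordered node indices into 5 consecutive, non-intersecting parts.
    boundaries = np.cumsum(part_sizes)[:-1]
    masks = {}
    for name, part in zip(MASK_NAMES, np.split(order, boundaries)):
        mask = np.zeros(num_nodes, dtype=bool)
        mask[part] = True
        masks[name] = mask
    return masks


values = np.arange(1000) / 1000.0
masks = sketch_property_split(values, [0.3, 0.1, 0.1, 0.3, 0.2], random_seed=0)
print([int(masks[name].sum()) for name in MASK_NAMES])  # [300, 100, 100, 300, 200]
```

In this sketch the OOD parts hold the nodes with the largest property values, which is what produces the distributional shift between the ID and OOD subsets.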

Parameters:
  • property_values (numpy ndarray) – The node property (float) values by which the dataset will be split. The length of the array must be equal to the number of nodes in the graph.

  • part_ratios (list) – A list of 5 ratios for the ID training, ID validation, ID testing, OOD validation, and OOD testing parts, in that order. The values in the list must sum to one.

  • random_seed (int, optional) – Random seed to fix for the initial permutation of nodes. It is used to create a random order for the nodes that have the same property values or belong to the ID subset. (default: None)
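The constraints stated for these parameters can be checked before calling the function. The helper below is a hypothetical sketch: `check_split_args` is not part of DGL, and the floating-point tolerance used for the sum check is an assumption.

```python
import math

import numpy as np


def check_split_args(property_values, part_ratios, num_nodes):
    """Hypothetical pre-flight check of the documented constraints (not DGL code)."""
    if len(property_values) != num_nodes:
        raise ValueError("property_values must have one entry per node")
    if len(part_ratios) != 5:
        raise ValueError("part_ratios must contain exactly 5 ratios")
    # Compare with a tolerance, since sums of floats are rarely exactly 1.0.
    if not math.isclose(sum(part_ratios), 1.0, abs_tol=1e-8):
        raise ValueError("part_ratios must sum to one")


# Passes silently for arguments that satisfy the documented constraints.
check_split_args(np.random.uniform(size=1000), [0.3, 0.1, 0.1, 0.3, 0.2], 1000)
```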

Returns:

split_masks – A Python dict storing the mask names as keys and the corresponding node mask arrays as values.

Return type:

dict

Examples

>>> import numpy as np
>>> import dgl
>>> num_nodes = 1000
>>> property_values = np.random.uniform(size=num_nodes)
>>> part_ratios = [0.3, 0.1, 0.1, 0.3, 0.2]
>>> split_masks = dgl.data.utils.mask_nodes_by_property(property_values, part_ratios)
>>> print('in_valid_mask' in split_masks)
True