MovieLensDataset
class dgl.data.MovieLensDataset(name, valid_ratio, test_ratio=None, raw_dir=None, force_reload=None, verbose=None, transform=None, random_state=0)
Bases: dgl.data.dgl_dataset.DGLDataset
MovieLens dataset for edge prediction tasks. The raw datasets are extracted from MovieLens <https://grouplens.org/datasets/movielens/>, introduced by Movielens unplugged: experiences with an occasionally connected recommender system <https://dl.acm.org/doi/10.1145/604045.604094>.
The datasets consist of user ratings for movies and incorporate additional user/movie information in the form of features. The nodes represent users and movies, and the edges store ratings that users assign to movies.
Statistics:
MovieLens-100K (ml-100k)
Users: 943
Movies: 1,682
Ratings: 100,000 (1, 2, 3, 4, 5)
MovieLens-1M (ml-1m)
Users: 6,040
Movies: 3,706
Ratings: 1,000,209 (1, 2, 3, 4, 5)
MovieLens-10M (ml-10m)
Users: 69,878
Movies: 10,677
Ratings: 10,000,054 (0.5, 1, 1.5, ..., 4.5, 5.0)
Parameters
name (str) – Dataset name. One of "ml-100k", "ml-1m", "ml-10m".
valid_ratio (float) – Ratio of validation samples out of the whole dataset. Should be in (0.0, 1.0).
test_ratio (float, optional) – Ratio of testing samples out of the whole dataset. Should be in (0.0, 1.0), and its sum with valid_ratio should be in (0.0, 1.0) as well. This parameter is ignored when name is "ml-100k", since its testing samples are pre-specified. Default: None
raw_dir (str, optional) – Raw file directory to download/store the data. Default: ~/.dgl/
force_reload (bool, optional) – Whether to re-download (if the dataset has not been downloaded) and re-process the dataset. Default: False
verbose (bool, optional) – Whether to print progress information. Default: True
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
random_state (int, optional) – Random seed used for random dataset split. Default: 0
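The ratio constraints listed above can be illustrated with a minimal sketch in plain Python. The helper name check_split_ratios and the constant VALID_NAMES are hypothetical and not part of DGL; the code only mirrors the documented ranges and makes no claim about how the library itself validates its arguments.

```python
# Hypothetical names for illustration; none of this is DGL API.
VALID_NAMES = ("ml-100k", "ml-1m", "ml-10m")


def check_split_ratios(name, valid_ratio, test_ratio=None):
    """Check the documented constraints on name, valid_ratio and test_ratio."""
    if name not in VALID_NAMES:
        raise ValueError(f"name must be one of {VALID_NAMES}, got {name!r}")
    if not 0.0 < valid_ratio < 1.0:
        raise ValueError("valid_ratio should be in (0.0, 1.0)")
    # For "ml-100k" the testing samples are pre-specified, so test_ratio
    # is not checked here.
    if name != "ml-100k" and test_ratio is not None:
        if not 0.0 < test_ratio < 1.0:
            raise ValueError("test_ratio should be in (0.0, 1.0)")
        if not 0.0 < valid_ratio + test_ratio < 1.0:
            raise ValueError("valid_ratio + test_ratio should be in (0.0, 1.0)")
    return True


# Usage: a valid pair passes, while a pair whose sum exceeds 1.0 is rejected.
check_split_ratios("ml-1m", valid_ratio=0.1, test_ratio=0.1)
try:
    check_split_ratios("ml-1m", valid_ratio=0.6, test_ratio=0.5)
except ValueError as err:
    print(err)
```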
Notes
- When name is "ml-100k", test_ratio is ignored, and the training ratio is equal to 1 - valid_ratio.
- When name is "ml-1m" or "ml-10m", test_ratio takes effect, and the training ratio is equal to 1 - valid_ratio - test_ratio.
- The number of edges is doubled to form an undirected (bidirected) graph structure.
Examples
>>> from dgl.data import MovieLensDataset
>>> dataset = MovieLensDataset(name='ml-100k', valid_ratio=0.2)
>>> g = dataset[0]
>>> g
Graph(num_nodes={'movie': 1682, 'user': 943},
      num_edges={('movie', 'movie-user', 'user'): 100000, ('user', 'user-movie', 'movie'): 100000},
      metagraph=[('movie', 'user', 'movie-user'), ('user', 'movie', 'user-movie')])

>>> # get ratings of edges in the training graph.
>>> rate = g.edges['user-movie'].data['rate'] # or rate = g.edges['movie-user'].data['rate']
>>> rate
tensor([5., 5., 3., ..., 3., 3., 5.])

>>> # get train, valid and test mask of edges
>>> train_mask = g.edges['user-movie'].data['train_mask']
>>> valid_mask = g.edges['user-movie'].data['valid_mask']
>>> test_mask = g.edges['user-movie'].data['test_mask']

>>> # get train, valid and test ratings
>>> train_ratings = rate[train_mask]
>>> valid_ratings = rate[valid_mask]
>>> test_ratings = rate[test_mask]

>>> # get input features of users
>>> g.nodes["user"].data["feat"] # or g.nodes["movie"].data["feat"] for movie nodes
tensor([[0.4800, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
        [1.0600, 1.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
        [0.4600, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
        ...,
        [0.4000, 0.0000, 1.0000, ..., 0.0000, 0.0000, 0.0000],
        [0.9600, 1.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
        [0.4400, 0.0000, 1.0000, ..., 0.0000, 0.0000, 0.0000]])
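
The split arithmetic and the edge doubling described in the Notes can also be sketched in plain Python, without downloading any data. The helpers training_ratio and double_edges are hypothetical names introduced for illustration only; they restate the documented behavior and are not taken from the DGL implementation.

```python
def training_ratio(name, valid_ratio, test_ratio=None):
    """Training ratio as stated in the Notes (hypothetical helper, not DGL API)."""
    if name == "ml-100k":
        # test_ratio is ignored because the testing samples are pre-specified.
        return 1.0 - valid_ratio
    # For "ml-1m" and "ml-10m", test_ratio must be given.
    return 1.0 - valid_ratio - test_ratio


def double_edges(user_movie_edges):
    """Pair each (user, movie) edge with a reversed (movie, user) edge."""
    forward = list(user_movie_edges)
    reverse = [(movie, user) for user, movie in forward]
    return forward, reverse


# Toy edge list with made-up node IDs, not real MovieLens data.
forward, reverse = double_edges([(0, 10), (0, 11), (1, 10)])
print(len(forward), len(reverse))  # both directions hold the same number of edges
print(round(training_ratio("ml-1m", valid_ratio=0.1, test_ratio=0.1), 6))
```

In the example graph above, the same pairing shows up as 100,000 'user-movie' edges and 100,000 'movie-user' edges.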