Tianyi-Lab/Agentic_MovieLens
Source: Hugging Face. Updated 2026-04-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Tianyi-Lab/Agentic_MovieLens
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: mixed
  data_files:
  - split: train
    path: data_mixed/train-*
- config_name: movies
  data_files: metadata/movies/*.parquet
- config_name: users
  data_files: metadata/users/*.parquet
dataset_info:
  config_name: mixed
  features:
  - name: user_id
    dtype: int64
  - name: movie_id
    dtype: int64
  - name: rating
    dtype: float64
  - name: reasoning
    dtype: string
  splits:
  - name: train
    num_bytes: 12879921379
    num_examples: 100000000
  download_size: 5726140888
  dataset_size: 12879921379
---
# Dataset Card for Agentic_MovieLens
## Dataset Description
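This dataset contains model-generated movie ratings, each paired with a reasoning string, along with movie and user metadata. Each rating record holds a `user_id`, a `movie_id`, a `rating`, and a `reasoning` field. The `train` split has 100M records, covering 10,000 users (IDs 0-9999) and 10,000 movies (IDs 1-10000).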
## Usage
### 1. Rating & Reasoning data
The dataset is sorted by `user_id` and then `movie_id` to support efficient queries. To get started quickly, use the following helper class:
```python
from datasets import load_dataset


class MovieMatrix:
    def __init__(self, dataset_name="Tianyi-Lab/Agentic_MovieLens"):
        # Load in standard mode to enable memory mapping (instant access)
        self.ds = load_dataset(dataset_name, split="train")
        self.COLS = 10000  # Total movies per user

    def get_interaction(self, user_id: int, movie_id: int):
        """
        Retrieves interaction in O(1) time using matrix indexing.
        """
        # 1. Validate IDs
        if not (0 <= user_id <= 9999):
            raise ValueError("User ID must be 0-9999")
        if not (1 <= movie_id <= 10000):
            raise ValueError("Movie ID must be 1-10000")
        # 2. Calculate Index: (Row * Width) + Column
        # Note: movie_id is 1-based, so we subtract 1 to get 0-based offset
        index = (user_id * self.COLS) + (movie_id - 1)
        # 3. Direct Access
        return self.ds[index]


# --- Usage ---
matrix = MovieMatrix()
# Instant lookup
data = matrix.get_interaction(user_id=42, movie_id=500)
print(data)
```
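The index arithmetic in `get_interaction` can also be inverted, or extended to cover all rows of one user. The following is a minimal sketch, assuming the layout described above holds (user IDs 0-9999, movie IDs 1-10000, rows sorted by user and then by movie). The names `to_index`, `from_index`, and `user_row_range` are hypothetical helpers, not part of the dataset or the `datasets` library.

```python
N_USERS = 10_000   # assumed: user_id ranges over 0..9999
N_MOVIES = 10_000  # assumed: movie_id ranges over 1..10000


def to_index(user_id: int, movie_id: int) -> int:
    """Flat row index of a (user_id, movie_id) pair in the sorted table."""
    if not 0 <= user_id < N_USERS:
        raise ValueError("User ID must be 0-9999")
    if not 1 <= movie_id <= N_MOVIES:
        raise ValueError("Movie ID must be 1-10000")
    # Row-major layout: one block of N_MOVIES rows per user.
    return user_id * N_MOVIES + (movie_id - 1)


def from_index(index: int) -> tuple[int, int]:
    """Inverse of to_index: recover (user_id, movie_id) from a flat row index."""
    if not 0 <= index < N_USERS * N_MOVIES:
        raise ValueError("Index out of range")
    user_id, offset = divmod(index, N_MOVIES)
    return user_id, offset + 1  # movie_id is 1-based


def user_row_range(user_id: int) -> range:
    """Contiguous range of row indices holding every rating of one user."""
    start = to_index(user_id, 1)
    return range(start, start + N_MOVIES)
```

With a loaded `Dataset` named `ds`, `ds[to_index(42, 500)]` would return the same row as `get_interaction(42, 500)`, and `ds.select(user_row_range(42))` would give all rows for user 42, provided the layout assumption holds.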
By default, this loads the 100M-record dataset generated by Qwen. To access the dataset generated by a mix of models, load the `mixed` configuration:
```python
ds = load_dataset("Tianyi-Lab/Agentic_MovieLens", "mixed", split="train")
```
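The mixed dataset is constructed from the following models. User ranges are half-open (start inclusive, end exclusive), which matches the record counts at 10,000 movies per user: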
| Users | Model | Records |
| :--- | :--- | :--- |
| 0-1250 | Gemini Flash | 12.5M |
| 1250-1500 | Gemini Pro | 2.5M |
| 1500-2750 | GPT-5 Mini | 12.5M |
| 2750-3000 | GPT-5 | 2.5M |
| 3000-3540 | Claude Haiku 4.5 | 5.4M |
| 3540-4000 | Qwen (default) | 4.6M |
| 4000-5500 | Gemini Flash | 15M |
| 5500-7000 | GPT-5 Mini | 15M |
| 7000-8000 | DeepSeek v3.2 | 10M |
| 8000-10000 | Qwen (default) | 20M |
| **Total** | **7 models** | **100M** |
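To find which model produced the records of a given user in the `mixed` configuration, the table can be encoded as a sorted list of range starts and searched with `bisect`. This is a minimal sketch, assuming the user ranges are half-open `[start, end)`, as the record counts imply. The function name `model_for_user` is hypothetical.

```python
from bisect import bisect_right

# Range starts and model names copied from the table above.
# Assumption: each range is half-open, [start, next_start).
RANGE_STARTS = [0, 1250, 1500, 2750, 3000, 3540, 4000, 5500, 7000, 8000]
RANGE_MODELS = [
    "Gemini Flash",
    "Gemini Pro",
    "GPT-5 Mini",
    "GPT-5",
    "Claude Haiku 4.5",
    "Qwen (default)",
    "Gemini Flash",
    "GPT-5 Mini",
    "DeepSeek v3.2",
    "Qwen (default)",
]
N_USERS = 10_000
MOVIES_PER_USER = 10_000


def model_for_user(user_id: int) -> str:
    """Return the model that generated the mixed-config records of user_id."""
    if not 0 <= user_id < N_USERS:
        raise ValueError("User ID must be 0-9999")
    # bisect_right gives the number of range starts <= user_id.
    return RANGE_MODELS[bisect_right(RANGE_STARTS, user_id) - 1]


def records_per_range() -> list[int]:
    """Record count of each table row, derived from the range widths."""
    ends = RANGE_STARTS[1:] + [N_USERS]
    return [(end - start) * MOVIES_PER_USER for start, end in zip(RANGE_STARTS, ends)]
```

The derived counts from `records_per_range()` can be compared against the Records column as a sanity check on the half-open reading of the ranges.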
### 2. Movie Metadata
Load the movie metadata with the `movies` configuration:
```python
ds_movies = load_dataset("Tianyi-Lab/Agentic_MovieLens", "movies", split="train")
```
### 3. User Metadata
Load the user metadata with the `users` configuration:
```python
ds_users = load_dataset("Tianyi-Lab/Agentic_MovieLens", "users", split="train")
```
The features are defined in `metadata/users/metadata_mappings.json`.
Specifically, each quiz item contains four elements representing the answer to a single question in the main self-report quiz:

- `[0]`: the position of that question in the quiz (question order was shuffled for each user)
- `[1]`: the question ID (see the key for the text of each item)
- `[2]`: the user's response (originally on a 1-100 scale, but rounded to the nearest 10 here for privacy protection)
- `[3]`: the time between question load and answer, in milliseconds

Answers were not included in the dataset if the question was skipped or answered in less than 1000 ms.
User features are defined according to [Statistical "Which Character" Personality Quiz (SWCPQ)](https://openpsychometrics.org/tests/characters/).
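The four-element quiz item described above can be unpacked into named fields for readability. This is a minimal sketch, assuming each item arrives as a four-element sequence in the documented order. The `QuizItem` type, the `parse_quiz_item` function, and the sample values are hypothetical and are not taken from the dataset.

```python
from typing import Any, NamedTuple, Sequence


class QuizItem(NamedTuple):
    position: int      # [0] position of the question in the shuffled quiz
    question_id: Any   # [1] question ID (type not specified in the card)
    response: int      # [2] response, rounded to the nearest 10
    elapsed_ms: int    # [3] time between question load and answer, in ms


def parse_quiz_item(item: Sequence[Any]) -> QuizItem:
    """Unpack one quiz item, checking that it has exactly four elements."""
    if len(item) != 4:
        raise ValueError(f"Expected 4 elements, got {len(item)}")
    return QuizItem(*item)


# Hypothetical sample values, for illustration only.
example = parse_quiz_item([3, "Q17", 70, 2450])
print(example.response, example.elapsed_ms)
```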
## Dataset Structure
Each configuration is provided as a single `train` split, which includes all collected data.
## Additional Information
For questions or issues, please refer to the repository documentation.
Provider:
Tianyi-Lab



