Aurora-chasing/data_sample_1000
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Aurora-chasing/data_sample_1000
Resource description:
---
license: cc-by-nc-4.0
---
# TAAC2026 Demo Dataset (1000 Samples)
A sample dataset containing 1000 user-item interaction records for the [TAAC2026 competition](https://algo.qq.com).
## Dataset Description
- **Rows**: 1,000
- **Format**: Parquet (`sample_data.parquet`)
- **File Size**: ~68 MB
## Columns
| Column | Type | Description |
|---|---|---|
| `item_id` | `int64` | **Target item** identifier. |
| `item_feature` | `array[struct]` | Array of **target item** feature dicts. Each element has `feature_id`, `feature_value_type`, and value fields (`float_value`, `int_array`, `int_value`). |
| `label` | `array[struct]` | Array of label dicts. Each element contains `action_time` and `action_type`. |
| `seq_feature` | `struct` | Sequence features dict with keys: `action_seq`, `content_seq`, `item_seq`. Each sub-key contains arrays of feature structs. |
| `timestamp` | `int64` | Event timestamp. |
| `user_feature` | `array[struct]` | Array of user feature dicts. Each element has `feature_id`, `feature_value_type`, and value fields (`float_array`, `int_array`, `int_value`). |
| `user_id` | `string` | User identifier. |
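To make the nesting in the table concrete, here is a minimal sketch that summarizes one record. It assumes the record is available as a plain mapping with the seven columns above (for example, one row converted with `df.iloc[0].to_dict()`); the helper name `summarize_record` and the toy record are hypothetical and not part of the dataset.

```python
SEQ_KEYS = ("action_seq", "content_seq", "item_seq")


def summarize_record(record):
    """Count the nested elements in one record (hypothetical helper)."""
    seq = record["seq_feature"]
    return {
        "n_item_features": len(record["item_feature"]),
        "n_user_features": len(record["user_feature"]),
        "n_labels": len(record["label"]),
        # Each sub-key of seq_feature holds an array of feature structs.
        "seq_feature_counts": {key: len(seq[key]) for key in SEQ_KEYS},
    }


# Toy record with made-up values, shaped like the table above.
toy_record = {
    "item_id": 1,
    "item_feature": [{"feature_id": 6, "feature_value_type": "int_value", "int_value": 96}],
    "label": [{"action_time": 1770694299, "action_type": 1}],
    "seq_feature": {
        "action_seq": [{"feature_id": 19, "feature_value_type": "int_array", "int_array": [1, 1, 1]}],
        "content_seq": [],
        "item_seq": [],
    },
    "timestamp": 1770694299,
    "user_feature": [{"feature_id": 65, "feature_value_type": "int_value", "int_value": 19}],
    "user_id": "u0",
}

summary = summarize_record(toy_record)
print(summary)
```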
## Feature Struct Schema
Each feature element contains `feature_id`, `feature_value_type`, and several value fields. Depending on `feature_value_type`, the corresponding value fields are populated and the rest are `null`.
**`item_feature`** — value fields: `int_value`, `float_value`, `int_array`
```json
{
"feature_id": 6,
"feature_value_type": "int_value",
"float_value": null,
"int_array": null,
"int_value": 96
}
```
**`user_feature`** — value fields: `int_value`, `float_array`, `int_array`
```json
{
"feature_id": 65,
"feature_value_type": "int_value",
"float_array": null,
"int_array": null,
"int_value": 19
}
```
**`seq_feature`** — value fields: `int_array`
```json
{
"feature_id": 19,
"feature_value_type": "int_array",
"int_array": [1, 1, 1, ...]
}
```
Possible `"feature_value_type"` values and their corresponding fields:
- `"int_value"` → `int_value`
- `"float_value"` → `float_value`
- `"int_array"` → `int_array`
- `"float_array"` → `float_array`
- Combined types also occur, e.g. `"int_array_and_float_array"` → both `int_array` and `float_array` are populated.
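The mapping above can be sketched as a small lookup that returns only the populated fields of a feature struct. This is an illustrative helper, not part of the dataset tooling: the name `populated_values` is hypothetical, and it assumes that every combined type joins its field names with `_and_`, as in the one example given.

```python
VALUE_FIELDS = ("int_value", "float_value", "int_array", "float_array")


def populated_values(feature):
    """Return {field_name: value} for the fields named by feature_value_type."""
    value_type = feature["feature_value_type"]
    # Assumption: combined types are field names joined by "_and_".
    field_names = value_type.split("_and_")
    unknown = [name for name in field_names if name not in VALUE_FIELDS]
    if unknown:
        raise ValueError(f"unrecognized feature_value_type: {value_type!r}")
    return {name: feature.get(name) for name in field_names}


item_feature = {
    "feature_id": 6,
    "feature_value_type": "int_value",
    "float_value": None,
    "int_array": None,
    "int_value": 96,
}
print(populated_values(item_feature))
```

Raising on an unrecognized type, rather than returning an empty dict, surfaces any value types that this list does not cover.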
## Label Schema
Each element in the `label` array:
```json
{
"action_time": 1770694299,
"action_type": 1
}
```
## Usage
```python
import pandas as pd
df = pd.read_parquet("sample_data.parquet")
print(df.shape) # (1000, 7)
print(df.columns.tolist()) # ['item_id', 'item_feature', 'label', 'seq_feature', 'timestamp', 'user_feature', 'user_id']
```
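After loading, a quick sanity check against the documented column list can catch a mismatched file early. The following is a minimal sketch; `missing_columns` is a hypothetical helper that accepts any iterable of column names, such as `df.columns`.

```python
EXPECTED_COLUMNS = [
    "item_id",
    "item_feature",
    "label",
    "seq_feature",
    "timestamp",
    "user_feature",
    "user_id",
]


def missing_columns(columns):
    """Return the documented column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# With a loaded DataFrame this would be called as missing_columns(df.columns).
print(missing_columns(["item_id", "label", "user_id"]))
```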
With Hugging Face `datasets`:
```python
from datasets import load_dataset
ds = load_dataset("TAAC2026/data_sample_1000")
print(ds)
```
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Provider:
Aurora-chasing



