News Recommendation Dataset
Source: Alibaba Cloud Tianchi | Updated: 2026-06-03 | Indexed: 2024-07-22
Download link:
https://tianchi.aliyun.com/dataset/183569
Description:
News Recommendation Dataset
The competition task is to predict which news articles users will click in the future. The dataset can be viewed and downloaded after registering for the competition. It comes from user interaction data of a news app and covers 300,000 users, nearly 3 million clicks, and more than 360,000 distinct news articles, each with a corresponding embedding vector. To keep the competition fair, the click logs of 200,000 users are used as the training set, those of 50,000 users as test set A, and those of another 50,000 users as test set B.
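Because the task is to predict future clicks, one common way to evaluate a model offline is to hold out each user's most recent click and predict it from the earlier ones. The sketch below illustrates that idea on a tiny made-up click log. The column names follow the field descriptions of this dataset, but the rows, the `split_last_click` helper, and the choice of a leave-last-out split are assumptions for illustration, not the competition's official protocol.

```python
import pandas as pd

# Tiny made-up click log; the real train_click_log.csv has more columns and rows.
clicks = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "click_article_id": [10, 11, 12, 10, 13, 14],
        "click_timestamp": [100, 200, 300, 150, 250, 400],
    }
)


def split_last_click(click_log: pd.DataFrame):
    """Return (history, holdout), where holdout is each user's latest click."""
    ordered = click_log.sort_values(["user_id", "click_timestamp"])
    # tail(1) per group keeps the most recent row for every user.
    holdout = ordered.groupby("user_id").tail(1)
    # Everything that is not a held-out row becomes the user's history.
    history = ordered.drop(holdout.index)
    return history, holdout


history, holdout = split_last_click(clicks)
print(holdout[["user_id", "click_article_id"]])
```

Note that a user with a single click (user 3 here) ends up with no history rows, so a real pipeline would need to decide how to handle such users.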
The dataset contains the following files:
1. train_click_log.csv: User click log of the training set
2. testA_click_log.csv: User click log of test set A
3. articles.csv: Information table of news articles
4. articles_emb.csv: Embedding vector representation table of news articles
5. sample_submit.csv: Sample submission file
Field descriptions:
- user_id: User ID
- click_article_id: ID of the clicked article
- click_timestamp: Click timestamp
- click_environment: Click environment
- click_deviceGroup: Click device group
- click_os: Click operating system
- click_country: Click country (the original description labels this field as the click city)
- click_region: Click region
- click_referrer_type: Type of click referral source
- article_id: Article ID, corresponding to click_article_id
- category_id: Article category ID
- created_at_ts: Article creation timestamp
- words_count: Word count of the article
- emb_1, emb_2, ..., emb_249: Embedding vector representations of the article
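Since article_id corresponds to click_article_id, click records can be enriched with article attributes through a join. The following is a minimal sketch with pandas, using small made-up frames in place of the real CSV files. The values are invented, and the per-category count is only one example of a derived feature, not something the dataset description prescribes.

```python
import pandas as pd

# Made-up stand-ins for a click log file and articles.csv.
clicks = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "click_article_id": [10, 11, 10],
        "click_timestamp": [100, 200, 150],
    }
)
articles = pd.DataFrame(
    {
        "article_id": [10, 11, 12],
        "category_id": [5, 7, 5],
        "created_at_ts": [50, 120, 90],
        "words_count": [180, 240, 95],
    }
)

# A left join keeps every click even if an article were missing from the table.
enriched = clicks.merge(
    articles, how="left", left_on="click_article_id", right_on="article_id"
)

# Number of clicks per article category, as one example of a derived feature.
clicks_per_category = enriched.groupby("category_id").size()
print(clicks_per_category)
```

With the real files, the two frames would instead come from `pd.read_csv` on the downloaded CSVs.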
Provider:
Alibaba Cloud Tianchi
Created:
2024-07-21
Collected Summary
Dataset Introduction

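Background and Challenges
Background Overview
This is a large-scale user interaction dataset for news recommendation, covering 300,000 users, nearly 3 million clicks, and more than 360,000 news articles, each with an embedding vector. It is split into training and test sets and provides detailed click logs and article information, making it suitable for training and evaluating machine learning models that predict user click behavior.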
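One typical use of per-article embedding vectors is finding articles similar to one a user has already clicked. The sketch below ranks articles by cosine similarity with NumPy. The 4-dimensional vectors, the article ids, and the `most_similar` helper are all hypothetical; the real embeddings in articles_emb.csv have far more dimensions, and cosine similarity is an assumed choice rather than a method stated by the dataset.

```python
import numpy as np

# Made-up 4-dimensional embeddings for three hypothetical articles.
article_ids = np.array([10, 11, 12])
embeddings = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
)


def most_similar(query_id: int, top_k: int = 1) -> list[int]:
    """Return the top_k article ids closest to query_id by cosine similarity."""
    # Normalize rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_pos = int(np.where(article_ids == query_id)[0][0])
    scores = unit @ unit[query_pos]
    scores[query_pos] = -np.inf  # exclude the query article itself
    best = np.argsort(-scores)[:top_k]
    return [int(article_ids[i]) for i in best]


print(most_similar(10))
```

For the full set of more than 360,000 articles, a brute-force scan like this may be slow per query, so an approximate nearest-neighbor index would be a reasonable alternative.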
The above content was collected and summarized by 遇见数据集.



