yeeb/C50
Source: Hugging Face. Updated 2022-10-26. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/yeeb/C50
Resource description:
---
license: openrail
---
## Dataset Description
This dataset is a subset of RCV1 (Reuters Corpus Volume 1) and has already been used in author identification experiments. The top 50 authors by total size of articles were selected, restricted to authors of texts labeled with at least one subtopic of the class CCAT (corporate/industrial). Restricting the topic in this way is intended to minimize the role of topic in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author), and the test corpus contains another 2,500 texts (50 per author) that do not overlap with the training texts.
- **Homepage:** https://archive.ics.uci.edu/ml/datasets/Reuter_50_50
- **Repository:** https://archive.ics.uci.edu/ml/datasets/Reuter_50_50
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
Provided by:
yeeb
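Original information summary
Dataset overview
Dataset name
RCV1 subset
Dataset description
This dataset is a subset of RCV1 that has been used in author identification experiments. The 50 authors with the largest total volume of articles were selected, limited to authors with at least one article labeled with a subtopic of the class CCAT (corporate/industrial). This is intended to minimize the influence of topic when distinguishing among texts. The training set contains 2,500 articles (50 per author), and the test set contains another 2,500 articles (50 per author) that do not overlap with the training set.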
Dataset structure
- Training set: 2,500 articles
- Test set: 2,500 articles
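
The balanced shape described here (50 authors, 50 texts each, in both splits) can be checked after download. The following is a minimal sketch, not part of the dataset card. It assumes each split has been unpacked into a directory containing one subdirectory per author with one `.txt` file per article. That layout and the function names `count_texts_per_author` and `check_split` are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def count_texts_per_author(split_dir):
    """Count the .txt files in each author subdirectory of one split.

    Assumes a hypothetical layout: split_dir/<author>/<article>.txt
    """
    counts = Counter()
    for author_dir in sorted(Path(split_dir).iterdir()):
        if author_dir.is_dir():
            counts[author_dir.name] = sum(1 for _ in author_dir.glob("*.txt"))
    return counts


def check_split(split_dir, n_authors=50, per_author=50):
    """Return True if the split has n_authors authors with per_author texts each."""
    counts = count_texts_per_author(split_dir)
    return len(counts) == n_authors and all(
        c == per_author for c in counts.values()
    )
```

With the defaults, a call such as `check_split("C50train")` (the directory name is a placeholder) returns `True` only when the split holds 50 × 50 = 2,500 texts distributed evenly across authors.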
License
openrail



