yeeb/C50

Name: yeeb/C50
Creator: yeeb
Published: 2022-10-26 05:55:06
License: No description available

Source: Hugging Face. Updated: 2022-10-26. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/yeeb/C50

Resource description:

---
license: openrail
---

## Dataset Description

This dataset is a subset of RCV1. The corpus has already been used in author identification experiments. The top 50 authors (by total size of articles) of texts labeled with at least one subtopic of the class CCAT (corporate/industrial) were selected. Restricting the texts to a single class is an attempt to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author), and the test corpus includes another 2,500 texts (50 per author) that do not overlap with the training texts.

- **Homepage:** https://archive.ics.uci.edu/ml/datasets/Reuter_50_50
- **Repository:** https://archive.ics.uci.edu/ml/datasets/Reuter_50_50
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
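As a rough illustration of how such a corpus might be read into memory, the sketch below walks a split directory and returns (author, text) pairs. The layout it expects (one subfolder per author, each holding `.txt` files) is an assumption and is not stated in the description above. The function names `load_split` and `texts_per_author` are hypothetical helpers introduced only for this example.

```python
from collections import Counter
from pathlib import Path


def load_split(split_dir):
    """Return (author, text) pairs from a split directory.

    Assumed layout (not confirmed by the dataset card):
    split_dir/<author name>/<any name>.txt
    """
    pairs = []
    for author_dir in sorted(Path(split_dir).iterdir()):
        if not author_dir.is_dir():
            continue  # skip stray files at the top level
        for text_file in sorted(author_dir.glob("*.txt")):
            # errors="replace" keeps the loader from failing on odd bytes
            text = text_file.read_text(encoding="utf-8", errors="replace")
            pairs.append((author_dir.name, text))
    return pairs


def texts_per_author(pairs):
    """Count how many texts each author contributes."""
    return Counter(author for author, _ in pairs)
```

If the layout assumption holds, calling `texts_per_author(load_split(path))` on either split should report 50 texts for each of 50 authors, matching the description.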

Provider:

yeeb

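Summary of original information

Dataset overview

Dataset name

RCV1 subset

Dataset description

This dataset is a subset of RCV1 that has already been used in author identification experiments. The 50 authors with the largest total volume of articles were selected from among the texts labeled with at least one subtopic of the class CCAT (corporate/industrial). The aim is to minimize the influence of topic when distinguishing among the texts. The training set contains 2,500 texts (50 per author), and the test set contains another 2,500 texts (50 per author) that do not overlap with the training texts.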

Dataset structure

Training set: 2,500 texts
Test set: 2,500 texts
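
The split sizes follow from 50 authors with 50 texts each (50 × 50 = 2,500 per split), with no text shared between splits. A minimal sketch of how those two properties could be checked is shown below. It runs on synthetic placeholder data rather than the real corpus, and all function names and author IDs are hypothetical.

```python
from collections import Counter


def make_placeholder_split(tag, n_authors=50, per_author=50):
    """Build synthetic (author, text) pairs; stand-ins for real articles."""
    return [
        (f"author_{a:02d}", f"{tag} text {a}-{i}")
        for a in range(n_authors)
        for i in range(per_author)
    ]


def has_expected_shape(pairs, n_authors=50, per_author=50):
    """True if the split has exactly n_authors, each with per_author texts."""
    counts = Counter(author for author, _ in pairs)
    return len(counts) == n_authors and all(
        c == per_author for c in counts.values()
    )


def overlapping_texts(train_pairs, test_pairs):
    """Return the set of texts that appear in both splits."""
    train_texts = {text for _, text in train_pairs}
    return {text for _, text in test_pairs if text in train_texts}


train = make_placeholder_split("train")
test = make_placeholder_split("test")
print(len(train), len(test))
print(has_expected_shape(train), has_expected_shape(test))
print(len(overlapping_texts(train, test)))
```

The overlap check compares exact text strings only, so near-duplicate articles would need a fuzzier comparison.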

License

openrail
