Experimental dataset for cross-lingual text classification based on sentence vector weighting

DataCite Commons | Updated 2025-04-27 | Indexed 2025-04-16
Download link:
https://www.scidb.cn/detail?dataSetId=97ab24dee41f488dac431e4872ad3f14
Description:
This experimental dataset comprises the data used in the three experiments reported in the paper. Dataset 1 is a two-category classification dataset (finance and economics; culture) covering four languages, Chinese, Russian, French and Spanish, with 1,610 texts in total. Dataset 2 is a four-category classification dataset (finance and economics, technology, sports, culture) covering four languages, Chinese, English, Russian and French, with 2,745 texts in total. Dataset 3 is drawn from the public multilingual Reuters RCV1/RCV2 corpus: texts labeled with exactly one category were selected from its Chinese, German, French and Danish collections, across four categories, CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social) and MCAT (Markets), for a total of 3,200 texts.
Provider:
Science Data Bank
Created:
2024-07-26