Experimental dataset for cross-lingual text classification based on sentence vector weighting
DataCite Commons | Updated 2025-04-27 | Indexed 2025-04-16
Download link:
https://www.scidb.cn/detail?dataSetId=97ab24dee41f488dac431e4872ad3f14
Resource description:
This experimental dataset contains the data used in the three experiments reported in the associated paper. Dataset 1 is a two-category classification dataset (finance/economy and culture) covering four languages: Chinese, Russian, French and Spanish, with 1610 texts in total. Dataset 2 is a four-category classification dataset (finance and economics, technology, sports, and culture) covering four languages: Chinese, English, Russian and French, with 2745 texts in total. Dataset 3 is drawn from the public multilingual Reuters RCV1/RCV2 corpus. From its Chinese, German, French and Danish collections, only texts labeled with a single category were selected, across four categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social) and MCAT (Markets), for 3200 texts in total.
Provider:
Science Data Bank
Created:
2024-07-26



