Experimental dataset for cross-lingual text classification based on sentence vector weighting

Name: Experimental dataset for cross-lingual text classification based on sentence vector weighting
Creator: Science Data Bank
Published: 2025-04-27 16:48:36
License: No description available

Source: DataCite Commons (updated 2025-04-27; indexed 2025-04-16)

Download link:

https://www.scidb.cn/detail?dataSetId=97ab24dee41f488dac431e4872ad3f14

Resource description:

This dataset comprises the data used in the three experiments reported in the paper. Dataset 1 is a two-category classification dataset, with categories listed as finance, economy and culture; it covers four languages (Chinese, Russian, French and Spanish) and contains 1,610 texts in total. Dataset 2 is a four-category classification dataset (finance and economics, technology, sports and culture) covering four languages (Chinese, English, Russian and French), with 2,745 texts in total. Dataset 3 is drawn from the public multilingual Reuters RCV1/RCV2 corpus: texts labeled with exactly one category were selected from its Chinese, German, French and Danish collections. The four categories are CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social) and MCAT (Markets), for a total of 3,200 texts.

Provider:

Science Data Bank

Created:

2024-07-26
