FrancophonIA/Cross-Language-Dataset
Source: Hugging Face. Updated 2025-03-30. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/FrancophonIA/Cross-Language-Dataset
Description:
This is a multilingual dataset for evaluating cross-language similarity detection algorithms. It covers three languages (French, English, and Spanish) and provides cross-language alignment information at three granularities: document level, sentence level, and chunk level. It is built from both parallel and comparable corpora and contains both human-translated and machine-translated text. Part of the dataset has been altered to make cross-language similarity detection more challenging, while the rest is left free of noise. The documents were written by various types of authors, ranging from average individuals to professionals.
Provided by:
FrancophonIA



