FrancophonIA/Cross-Language-Dataset

Name: FrancophonIA/Cross-Language-Dataset
Creator: FrancophonIA
Published: 2025-03-30 14:34:14
License: No description available

Hugging Face | Updated: 2025-03-30 | Indexed: 2025-04-12

Download link:

https://hf-mirror.com/datasets/FrancophonIA/Cross-Language-Dataset

Description:

This is a multilingual dataset for evaluating cross-language similarity detection algorithms. It covers French, English, and Spanish and provides cross-language alignment information at three granularities: document level, sentence level, and chunk level. It is built from both parallel and comparable corpora and contains both human-translated and machine-translated text. Part of the dataset has been altered to make cross-language similarity detection more challenging, while the rest remains noise-free. The documents were written by various types of authors, ranging from ordinary writers to professionals.
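
As a rough illustration of what evaluating a similarity detection algorithm against alignment information can look like, the sketch below scores a set of predicted cross-language pairs against gold alignments using precision, recall, and F1. The function name `evaluate_alignment`, the identifiers such as `fr-1` and `en-1`, and the pair-of-IDs representation are assumptions made for this example; they are not taken from the dataset, whose actual file layout is not described here.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (source unit id, target unit id); hypothetical format


def evaluate_alignment(predicted: Set[Pair], gold: Set[Pair]) -> Dict[str, float]:
    """Score predicted cross-language pairs against gold alignments."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration: three French sentences aligned to English.
gold = {("fr-1", "en-1"), ("fr-2", "en-2"), ("fr-3", "en-3")}
# Output of some hypothetical detector: one of the three pairs is wrong.
predicted = {("fr-1", "en-1"), ("fr-2", "en-3"), ("fr-3", "en-3")}

scores = evaluate_alignment(predicted, gold)
print(scores)
```

The same scoring could in principle be applied at document, sentence, or chunk level, since only the meaning of the identifiers changes.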

Provided by:

FrancophonIA
