Reubencf/Adaption-multilingual-sentences

Name: Reubencf/Adaption-multilingual-sentences
Creator: Reubencf
Published: 2026-04-24 06:33:24
License: CC BY 2.0 (as stated in the dataset description)

Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Reubencf/Adaption-multilingual-sentences


Description:

This is a multilingual sentence dataset containing 9,999 sentences across 123 languages, including high-resource, mid-resource, and low-resource languages as well as several constructed languages. It is derived from the Tatoeba project and processed by Adaption's Adaptive Data platform, which adds the fields enhanced_prompt, enhanced_completion, and reasoning_trace. Each row contains a source-language sentence, its translations (where available), and the Adaption-processed fields. The dataset is intended for multilingual instruction tuning, for benchmarking low-resource and constructed-language capabilities in multilingual LLMs, and as seed data for translation and cross-lingual transfer research. The distribution is skewed towards Turkish, Russian, Italian, English, and Esperanto, with long-tail languages appearing infrequently. Sentences are short, isolated utterances, not full-document text. Adaption-generated fields may inherit model bias and introduce subtle meaning drift, particularly for very low-resource languages. Licensed under CC BY 2.0.
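Because the language distribution is skewed, it can help to measure each language's share of the rows before using the data for tuning or evaluation. The snippet below is a minimal sketch of that check, not part of the dataset's own tooling. The column name `language`, the language codes, and the sample rows are assumptions made for illustration; only enhanced_prompt, enhanced_completion, and reasoning_trace are field names taken from the description. Check the dataset viewer for the real schema.

```python
from collections import Counter


def language_shares(rows, lang_key="language"):
    """Return (language, share) pairs, most frequent language first.

    `lang_key` is a hypothetical column name; adjust it to the real schema.
    """
    counts = Counter(row[lang_key] for row in rows)
    total = sum(counts.values())
    return [(lang, n / total) for lang, n in counts.most_common()]


# Hypothetical rows for illustration only; not real dataset content.
sample_rows = [
    {"language": "tur", "enhanced_prompt": "...", "enhanced_completion": "...", "reasoning_trace": "..."},
    {"language": "tur", "enhanced_prompt": "...", "enhanced_completion": "...", "reasoning_trace": "..."},
    {"language": "rus", "enhanced_prompt": "...", "enhanced_completion": "...", "reasoning_trace": "..."},
    {"language": "epo", "enhanced_prompt": "...", "enhanced_completion": "...", "reasoning_trace": "..."},
]

shares = language_shares(sample_rows)
for lang, share in shares:
    print(f"{lang}: {share:.1%}")
```

The same function should work on any iterable of dict-like rows, for example a split loaded with the Hugging Face `datasets` library, provided the language column name is adjusted to match.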

Provider:

Reubencf
