
Reubencf/Adaption-multilingual-sentences

Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Reubencf/Adaption-multilingual-sentences
Description:
This multilingual sentence dataset contains 9,999 sentences across 123 languages, spanning high-resource, mid-resource, and low-resource languages as well as several constructed languages. It is derived from the Tatoeba project and processed by Adaption's Adaptive Data platform, which adds fields such as enhanced_prompt, enhanced_completion, and reasoning_trace. Each row holds a source-language sentence, translations (where available), and the Adaption-processed fields. The dataset is intended for multilingual instruction tuning, for benchmarking low-resource and constructed-language capabilities in multilingual LLMs, and as seed data for translation and cross-lingual transfer research. The distribution is skewed toward Turkish, Russian, Italian, English, and Esperanto, with long-tail languages appearing infrequently. Sentences are short, isolated utterances, not full-document text. Adaption-generated fields may inherit model bias and introduce subtle meaning drift for very low-resource languages. Licensed under CC BY 2.0.
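The description notes a skewed language distribution and names the fields enhanced_prompt, enhanced_completion, and reasoning_trace. The following is a minimal sketch of how one might inspect that skew and assemble instruction-tuning pairs. It assumes rows are available as plain dictionaries and that the language code sits in a column called `language`; that column name, the helper function names, and the toy rows are hypothetical and not taken from the dataset itself.

```python
from collections import Counter


def language_counts(rows, lang_key="language"):
    """Count rows per language. `lang_key` is an assumed column name."""
    return Counter(row[lang_key] for row in rows)


def to_instruction_pairs(rows):
    """Keep only rows where both Adaption-generated fields are non-empty."""
    pairs = []
    for row in rows:
        prompt = row.get("enhanced_prompt")
        completion = row.get("enhanced_completion")
        if prompt and completion:
            pairs.append({"prompt": prompt, "completion": completion})
    return pairs


# Toy placeholder rows for illustration only; these are NOT real dataset records.
toy_rows = [
    {"language": "tur", "enhanced_prompt": "p1",
     "enhanced_completion": "c1", "reasoning_trace": "r1"},
    {"language": "tur", "enhanced_prompt": "p2",
     "enhanced_completion": "c2", "reasoning_trace": "r2"},
    {"language": "epo", "enhanced_prompt": "p3",
     "enhanced_completion": "", "reasoning_trace": "r3"},
]

counts = language_counts(toy_rows)
pairs = to_instruction_pairs(toy_rows)

# Most frequent languages first, to make any skew visible.
for lang, n in counts.most_common():
    print(lang, n)
print(len(pairs), "usable instruction pairs")
```

With the Hugging Face `datasets` library, real rows would presumably come from `load_dataset("Reubencf/Adaption-multilingual-sentences")`; the split names and exact column names should be checked against the dataset page before relying on this sketch.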
Provider:
Reubencf