
flores

ModelScope community · updated 2026-05-20 · indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/opencompass/flores
Resource description:
# Dataset Card for Flores 200

## Table of Contents

- [Dataset Card for Flores 200](#dataset-card-for-flores-200)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
    - [Dataset Creation](#dataset-creation)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)

## Dataset Description

- **Home:** [Flores](https://github.com/facebookresearch/flores)
- **Repository:** [Github](https://github.com/facebookresearch/flores)

### Dataset Summary

FLORES is a benchmark dataset for machine translation between English and low-resource languages.

> The creation of FLORES-200 doubles the existing language coverage of FLORES-101. Given the nature of the new languages, which have less standardization and require more specialized professional translations, the verification process became more complex. This required modifications to the translation workflow. FLORES-200 has several languages which were not translated from English. Specifically, several languages were translated from Spanish, French, Russian and Modern Standard Arabic. Moreover, FLORES-200 also includes two script alternatives for four languages. FLORES-200 consists of translations from 842 distinct web articles, totaling 3001 sentences. These sentences are divided into three splits: dev, devtest, and test (hidden). On average, sentences are approximately 21 words long.

**Disclaimer**: *The Flores-200 dataset is hosted by Facebook and licensed under the [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).*

### Supported Tasks and Leaderboards

#### Multilingual Machine Translation

Refer to the [Dynabench leaderboard](https://dynabench.org/flores/Flores%20MT%20Evaluation%20(FULL)) for additional details on model evaluation on FLORES-101 in the context of the WMT2021 shared task on [Large-Scale Multilingual Machine Translation](http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html). Flores 200 is an extension of this benchmark.

### Languages

The dataset contains parallel sentences for 200 languages, as listed on the original [Github](https://github.com/facebookresearch/flores/blob/master/README.md) page for the project. Languages are identified by their ISO 639-3 code (e.g. `eng`, `fra`, `rus`) plus an additional code describing the script (e.g. `eng_Latn`, `ukr_Cyrl`). See [the webpage for code descriptions](https://github.com/facebookresearch/flores/blob/main/flores200/README.md).

Use the configuration `all` to access the full set of parallel sentences for all available languages in a single command.

Use a hyphenated pairing to get two languages in one data point (e.g. `eng_Latn-ukr_Cyrl` provides sentences in the paired format shown under Data Instances).

## Dataset Structure

### Data Instances

A sample from the `dev` split for the Ukrainian language (`ukr_Cyrl` config) is provided below. All configurations have the same structure, and all sentences are aligned across configurations and splits.

```python
{
    'id': 1,
    'sentence': 'У понеділок, науковці зі Школи медицини Стенфордського університету оголосили про винайдення нового діагностичного інструменту, що може сортувати клітини за їх видами: це малесенький друкований чіп, який можна виготовити за допомогою стандартних променевих принтерів десь по одному центу США за штуку.',
    'URL': 'https://en.wikinews.org/wiki/Scientists_say_new_medical_diagnostic_chip_can_sort_cells_anywhere_with_an_inkjet',
    'domain': 'wikinews',
    'topic': 'health',
    'has_image': 0,
    'has_hyperlink': 0
}
```

When using a hyphenated pairing or the `all` configuration, data is presented as follows:

```python
{
    'id': 1,
    'URL': 'https://en.wikinews.org/wiki/Scientists_say_new_medical_diagnostic_chip_can_sort_cells_anywhere_with_an_inkjet',
    'domain': 'wikinews',
    'topic': 'health',
    'has_image': 0,
    'has_hyperlink': 0,
    'sentence_eng_Latn': 'On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.',
    'sentence_ukr_Cyrl': 'У понеділок, науковці зі Школи медицини Стенфордського університету оголосили про винайдення нового діагностичного інструменту, що може сортувати клітини за їх видами: це малесенький друкований чіп, який можна виготовити за допомогою стандартних променевих принтерів десь по одному центу США за штуку.'
}
```

The text is provided as in the original dataset, without further preprocessing or tokenization.

### Data Fields

- `id`: Row number for the data entry, starting at 1.
- `sentence`: The full sentence in the specific language. In paired and `all` configurations the field name carries a language suffix, e.g. `sentence_eng_Latn`.
- `URL`: The URL for the English article from which the sentence was extracted.
- `domain`: The domain of the sentence.
- `topic`: The topic of the sentence.
- `has_image`: Whether the original article contains an image.
- `has_hyperlink`: Whether the sentence contains a hyperlink.

### Data Splits

| config             | `dev` | `devtest` |
|-------------------:|------:|----------:|
| all configurations |   997 |      1012 |

### Dataset Creation

Please refer to the original article [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) for additional information on dataset creation.

## Additional Information

### Dataset Curators

See paper for details.

### Licensing Information

Licensed under Creative Commons Attribution-ShareAlike 4.0. The license is available [here](https://creativecommons.org/licenses/by-sa/4.0/).

### Citation Information

Please cite the authors if you use these corpora in your work:

```bibtex
@article{nllb2022,
  author = {NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang},
  title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  year = {2022}
}
```

Please also cite prior work that this dataset builds on:

```bibtex
@inproceedings{,
  title={The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
  author={Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm\'{a}n, Francisco and Fan, Angela},
  year={2021}
}
```

```bibtex
@inproceedings{,
  title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
  author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1902.01382},
  year={2019}
}
```
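
The configuration naming and the paired record layout described under Languages and Data Instances can be sketched in plain Python. This is a minimal illustration, not part of the dataset's tooling: the helper names `pair_config` and `sentences_by_language` are hypothetical, the regular expression encodes an assumption that every code has the shape of a three-letter lowercase language code plus a four-letter script code, and the example record is an abbreviated copy of the paired sample from the card.

```python
import re
from typing import Dict

# Assumed shape of a language code: ISO 639-3 code, underscore, script code
# (e.g. "eng_Latn", "ukr_Cyrl"). This pattern is an illustration, not a spec.
LANG_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def pair_config(src: str, tgt: str) -> str:
    """Build a hyphenated pair configuration name (hypothetical helper)."""
    for code in (src, tgt):
        if LANG_CODE.fullmatch(code) is None:
            raise ValueError(f"expected a code like 'eng_Latn', got {code!r}")
    return f"{src}-{tgt}"


def sentences_by_language(record: dict) -> Dict[str, str]:
    """Map language code -> sentence for a paired or `all` record.

    Relies on the `sentence_<code>` field naming shown in the card.
    """
    prefix = "sentence_"
    return {
        key[len(prefix):]: value
        for key, value in record.items()
        if key.startswith(prefix)
    }


# Abbreviated copy of the paired sample from the card; the sentences are
# truncated here for brevity and end in "...".
record = {
    "id": 1,
    "URL": "https://en.wikinews.org/wiki/Scientists_say_new_medical_diagnostic_chip_can_sort_cells_anywhere_with_an_inkjet",
    "domain": "wikinews",
    "topic": "health",
    "has_image": 0,
    "has_hyperlink": 0,
    "sentence_eng_Latn": "On Monday, scientists from the Stanford University School of Medicine ...",
    "sentence_ukr_Cyrl": "У понеділок, науковці зі Школи медицини Стенфордського університету ...",
}

config = pair_config("eng_Latn", "ukr_Cyrl")
print(config)  # eng_Latn-ukr_Cyrl

for lang, sentence in sentences_by_language(record).items():
    print(lang, "->", sentence)
```

Because all sentences are aligned across configurations and splits, the `id` field can serve as the join key when single-language records are combined by hand instead of through a paired configuration.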

Provider:
maas
Created:
2024-07-02
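Dataset introduction
Background overview
The flores dataset is a multilingual machine translation benchmark that covers 200 languages and contains 3001 parallel sentences. It is used mainly for translation research on low-resource languages. The dataset was released by Facebook under the Creative Commons Attribution-ShareAlike 4.0 International License.
The summary above was collected and generated by 遇见数据集.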