
bouquet | Machine translation dataset | Quality evaluation dataset

ModelScope Community. Updated 2025-06-20, indexed 2025-06-21.
Machine translation
Quality evaluation
Download link:
https://modelscope.cn/datasets/facebook/bouquet
Resource description:
# BOUQuET 💐: Benchmark and Open initiative for Universal Quality Evaluation in Translation

BOUQuET is a multi-way parallel, multi-centric, multi-register/domain dataset and benchmark for machine translation quality. The underlying texts were handcrafted by linguists in 8 diverse languages and translated into English, and the dataset is intended to be extensible to virtually any other written language. BOUQuET is described in the paper by [Omnilingual Team, 2025](https://arxiv.org/abs/2502.04314) and can be extended by contributing new translations at https://bouquet.metademolab.com.

## Dataset Details

### Dataset Description

- **Language(s) (NLP):** multilingual
- **Home:** https://bouquet.metademolab.com/
- **Paper:** https://arxiv.org/abs/2502.04314

## Uses

The dataset is intended for evaluating machine translation quality. In purpose it is similar to [FLORES+](https://huggingface.co/datasets/openlanguagedata/flores_plus) or [WMT24++](https://huggingface.co/datasets/google/wmt24pp); unlike those datasets, BOUQuET focuses more on linguistic diversity. It is not intended as a training dataset, but the `dev` subset may be used for validation during model development.

## Dataset Structure

### Composition

BOUQuET consists of short paragraphs that are fully parallel across all languages at the sentence level. The dataset is distributed both at the sentence level and at the paragraph level. By default, only sentence-level data is loaded; the `paragraph_level` config may be used to load the paragraph-level data. For convenience, every language is paired with English, but the dataset is in fact fully multi-way parallel, so any language can be paired with any other.

The public portion of the dataset contains two splits:

- `dev`: 504 unique sentences, 120 paragraphs
- `test`: 864 unique sentences, 200 paragraphs

An additional split of 632 unique sentences and 144 paragraphs is held out for quality-assurance purposes and is not distributed here.
### Columns

The dataset contains the following fields:

```
- level            # str, "sentence_level" or "paragraph_level"
- split            # str, "dev" or "test"
- uniq_id          # str, identifier of the dataset item (e.g. `P464-S1` for sentence-level, `P464` for paragraph-level data)
- src_lang         # str, NLLB-compatible non-English language code (such as `hin_Deva`)
- tgt_lang         # str, "eng_Latn"
- src_text         # str, non-English text
- tgt_text         # str, English text
- orig_text        # str, the original text (sentence or paragraph), which sometimes corresponds to src_text
- par_comment      # str, comment on the whole paragraph
- sent_comment     # str, comment on the sentence
- has_hashtag      # bool, whether a hashtag is present in the text
- has_emoji        # bool, whether an emoji is present in the text
- has_12p          # bool, whether the sentence has first- or second-person pronouns
- has_speaker_tag  # bool, whether the sentence starts with a speaker tag
- newline_next     # bool, whether the sentence should be followed by a newline in the paragraph
- par_id           # str, paragraph id (e.g. `P464`)
- domain           # str, one of the 8 domains (see the paper)
- register         # str, three-letter identifier of the register (see the paper)
- tags             # str, comma-separated linguistic tags of a sentence (see the paper)
```

### Languages

Currently, BOUQuET covers 9 language varieties:

| ISO 639-3 | ISO 15924 | Language | Family | Subgroup |
|--|--|--|--|--|
| arz | Arab | Egyptian Arabic | Afro-Asiatic | West Semitic |
| cmn | Hans | Mandarin Chinese | Sino-Tibetan | Sinitic |
| deu | Latn | German | Indo-European | West Germanic |
| eng | Latn | English | Indo-European | West Germanic |
| fra | Latn | French | Indo-European | Italic |
| hin | Deva | Hindi | Indo-European | Indo-Aryan |
| ind | Latn | Indonesian | Austronesian | Malayic |
| rus | Cyrl | Russian | Indo-European | Balto-Slavic |
| spa | Latn | Spanish | Indo-European | Italic |

Each language variety is characterized by an ISO 639-3 code for its language and an ISO 15924 code for its writing system.
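The `uniq_id` convention above (`P464-S1` for a sentence, `P464` for a paragraph) can be unpacked mechanically. The helper below is a hypothetical sketch based on that naming pattern, not part of any official dataset tooling:

```python
def parse_uniq_id(uniq_id: str):
    """Split a BOUQuET-style uniq_id into (par_id, sentence number or None).

    Sentence-level ids look like "P464-S1"; paragraph-level ids like "P464".
    """
    par_id, sep, sent = uniq_id.partition("-S")
    # If the "-S" separator is absent, this is a paragraph-level id.
    return par_id, (int(sent) if sep else None)

parse_uniq_id("P464-S1")  # ("P464", 1)
parse_uniq_id("P464")     # ("P464", None)
```

This mirrors the documented relationship between `uniq_id` and `par_id`: dropping the sentence suffix yields the paragraph id.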
To contribute translations for new languages, please use our crowdsourcing tool: https://bouquet.metademolab.com.

## Usage examples

The code below loads a pre-configured subset (French sentences paired with English) and selects the first instance:

```Python
import datasets

data = datasets.load_dataset("facebook/bouquet", "fra_Latn", split="dev")

# to demonstrate an example, we select a single data instance
data[0]
# {'uniq_id': 'P037-S1',
#  'src_lang': 'fra_Latn',
#  'src_text': 'Tu as des mains en or, la nourriture est délicieuse.',
#  'tgt_lang': 'eng_Latn',
#  'domain': 'comments',
#  'tgt_text': 'Bless your hands, the food was very delicious. ',
#  'par_comment': 'possessive pronoun "your" is 2nd person feminine',
#  'tags': 'second person, single tense (past)',
#  'register': 'mra',
#  'orig_text': 'تسلم ايديكي الاكل كان جميل جدًا',
#  'has_speaker_tag': False,
#  'has_hashtag': False,
#  'has_emoji': False,
#  'has_12p': True,
#  'newline_next': True,
#  'level': 'sentence_level',
#  'split': 'dev',
#  'par_id': 'P037',
#  'sent_comment': None}
```

Another example loads paragraph-level data paired with English, and then pairs Spanish paragraphs with their Russian translations:

```Python
import datasets
import pandas as pd

data = datasets.load_dataset("facebook/bouquet", "paragraph_level", split="dev").to_pandas()
spa2rus = pd.merge(
    data.loc[data["src_lang"].eq("spa_Latn")].drop(["tgt_lang", "tgt_text"], axis=1),
    data.loc[data["src_lang"].eq("rus_Cyrl"), ["src_lang", "src_text", "uniq_id"]].rename(
        {"src_lang": "tgt_lang", "src_text": "tgt_text"}, axis=1
    ),
    on="uniq_id",
)
```

## Dataset Creation

### Curation Rationale

The dataset was created manually from scratch by composing source sentences that cover a variety of domains and registers in 8 diverse non-English languages: Egyptian Arabic (alternating with Modern Standard Arabic when appropriate), French, German, Hindi, Indonesian, Mandarin Chinese, Russian, and Spanish.
For each of the source languages, sentences were created in the following 8 domains:

1. How-to (written tutorials or instructions)
2. Conversations (dialogues)
3. Narration (creative writing that doesn't include dialogues)
4. Social media posts
5. Social media comments (reactive)
6. Other web content
7. Reflective pieces
8. Miscellaneous (address to a nation, disaster response, etc.)

Apart from the domains, a variety of registers (contextual styles) were used. Each sentence is annotated with a register characterized by three features: connectedness, preparedness, and social differential. The linguists creating the dataset were instructed to maintain diversity of sentence lengths, word orders, sentence structures, and other linguistic characteristics.

Subsequently, the source sentences were translated from the 8 source languages into English. We plan to extend the dataset "in width" by translating it into even more languages. See the [paper](https://arxiv.org/abs/2502.04314) for more details.

## Contribution

To contribute to the dataset (adding translations for a new language, or verifying some of the existing translations), please use the web annotation tool at https://bouquet.metademolab.com.

## Citation

If you refer to this dataset, please cite the [BOUQuET paper](https://arxiv.org/abs/2502.04314).
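As an illustration of how the `domain` annotations can be used to slice the benchmark, the snippet below tallies sentences per domain with the standard library. The rows are made up for illustration; only the field names follow the schema described earlier:

```python
from collections import Counter

# Made-up sentence-level rows mimicking the BOUQuET schema (illustrative only).
rows = [
    {"uniq_id": "P001-S1", "domain": "conversations"},
    {"uniq_id": "P001-S2", "domain": "conversations"},
    {"uniq_id": "P002-S1", "domain": "how-to"},
]

# Count how many sentences fall into each of the 8 domains.
per_domain = Counter(r["domain"] for r in rows)
print(per_domain["conversations"])  # 2
```

The same pattern applies to the `register` and `tags` fields for finer-grained breakdowns of scores.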
```bibtex
@article{andrews2025bouquet,
  title={BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation},
  author={{The Omnilingual MT Team} and Andrews, Pierre and Artetxe, Mikel and Meglioli, Mariano Coria and Costa-juss{\`a}, Marta R and Chuang, Joe and Dupenthaler, Mark and Dale, David and Ekberg, Nate and Gao, Cynthia and Maillard, Jean and Licht, Daniel and Mourachko, Alex and Ropers, Christophe and Saleem, Safiyyah and S{\'a}nchez, Eduardo and Tsiamas, Ioannis and Turkatenko, Arina and Ventayol, Albert and Yates, Shireen},
  journal={arXiv preprint arXiv:2502.04314},
  year={2025}
}
```

## Glossary

- **Domain.** By *domain*, we mean the different spaces in which language is produced in speech, sign, or writing (e.g., books, social media, news, Wikipedia, organization websites, official documents, direct messaging, texting). In this paper, we focus solely on the written modality.
- **Register.** We understand *register* as a functional variety of language that includes socio-semiotic properties, as expressed in [Halliday and Matthiessen (2004)], or more simply as a "contextual style", as presented in [Labov (1991), pp. 79–99]. In that regard, a register is a specific variety of language used to best fit a specific communicative purpose in a specific situation.
Provider:
maas
Created:
2025-06-17