
opus_gnome

ModelScope Community. Updated 2025-12-05. Indexed 2025-08-16.
Download link:
https://modelscope.cn/datasets/Helsinki-NLP/opus_gnome
Resource description:
# Dataset Card for Opus Gnome

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://opus.nlpl.eu/GNOME.php
- **Repository:** None
- **Paper:** http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** [More Information Needed]

### Dataset Summary

To load a language pair that is not among the predefined configs, pass the two language codes as `lang1` and `lang2`. The valid pairs are listed on the homepage given under Dataset Description: http://opus.nlpl.eu/GNOME.php

For example:

`dataset = load_dataset("opus_gnome", lang1="it", lang2="pl")`

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

```
{
  'id': '0',
  'translation': {
    'ar': 'إعداد سياسة القفل',
    'bal': 'تنظیم کتن سیاست کبل'
  }
}
```

### Data Fields

Each instance has two fields:

- **id**: the id of the example
- **translation**: a dictionary that maps each of the two language codes to the text in that language.

### Data Splits

Each subset consists of a single train split. The number of examples for a selection of language pairs:

| language pair | train |
|:--------------|------:|
| ar-bal        |    60 |
| bg-csb        |    10 |
| ca-en_GB      |  7982 |
| cs-eo         |    73 |
| de-ha         |   216 |
| cs-tk         | 18686 |
| da-vi         |   149 |
| en_GB-my      | 28232 |
| el-sk         |   150 |
| de-tt         |  2169 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

    @InProceedings{TIEDEMANN12.463,
      author = {J{\"o}rg Tiedemann},
      title = {Parallel Data, Tools and Interfaces in OPUS},
      booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
      year = {2012},
      month = {may},
      date = {23-25},
      address = {Istanbul, Turkey},
      editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
      publisher = {European Language Resources Association (ELRA)},
      isbn = {978-2-9517408-7-7},
      language = {english}
    }

### Contributions

Thanks to [@rkc007](https://github.com/rkc007) for adding this dataset.
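
### Example: reading a record

The record layout described under Data Instances and Data Fields can be illustrated with a small sketch that needs no download. It assumes every record is a plain dictionary with an `id` and a `translation` mapping, as in the sample on this card. The helper name `to_pairs` is hypothetical and is not part of the dataset or of any library.

```python
from typing import Dict, Iterable, List, Tuple

# One record, copied from the "Data Instances" sample on this card.
SAMPLE: Dict = {
    "id": "0",
    "translation": {
        "ar": "إعداد سياسة القفل",
        "bal": "تنظیم کتن سیاست کبل",
    },
}


def to_pairs(records: Iterable[Dict], lang1: str, lang2: str) -> List[Tuple[str, str]]:
    """Collect (lang1 text, lang2 text) tuples from records.

    Hypothetical helper: records that lack either language code in their
    'translation' dictionary are skipped rather than raising an error.
    """
    pairs = []
    for record in records:
        translation = record.get("translation", {})
        if lang1 in translation and lang2 in translation:
            pairs.append((translation[lang1], translation[lang2]))
    return pairs


# Extract the Arabic/Balochi pair from the sample record.
pairs = to_pairs([SAMPLE], "ar", "bal")
print(pairs)
```

If the records returned by the `load_dataset` call in the Dataset Summary follow the same layout, the train split could presumably be passed to the same helper in place of `[SAMPLE]`.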

Provider:
maas
Created:
2025-08-15