community-datasets/scielo

Name: community-datasets/scielo
Creator: community-datasets
Published: 2024-06-26 06:13:16
License: No description available

Hugging Face | Updated 2024-06-26 | Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/community-datasets/scielo

Overview:

The SciELO dataset is a parallel corpus in English, Portuguese, and Spanish, intended primarily for machine translation. It is built from full-text scientific articles collected from the SciELO database. Sentences are aligned for all language pairs, and a small subset of sentences is aligned across all three languages. The alignment was performed with the Hunalign algorithm. The dataset card describes the data instances, data fields, and data splits.

Provider:

community-datasets

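Summary of Original Information

Dataset Card for SciELO

Dataset Description

Dataset Summary

A parallel corpus of full-text scientific articles collected from the SciELO database in the following languages: English, Portuguese, and Spanish. The corpus is sentence-aligned for all language pairs, and a small subset of sentences is aligned across all three languages. Alignment was carried out with the Hunalign algorithm.

Supported Tasks and Leaderboards

The underlying task is machine translation.

Languages

English
Spanish
Portuguese

Dataset Structure

Data Instances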

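Config name: en-es
- Features:
  - Name: translation
    - Data type:
      - Languages: English, Spanish
- Splits:
  - Name: train
    - Bytes: 71777213
    - Examples: 177782
- Download size: 22965217 bytes
- Dataset size: 71777213 bytes

Config name: en-pt
- Features:
  - Name: translation
    - Data type:
      - Languages: English, Portuguese
- Splits:
  - Name: train
    - Bytes: 1032669686
    - Examples: 2828917
- Download size: 322726075 bytes
- Dataset size: 1032669686 bytes

Config name: en-pt-es
- Features:
  - Name: translation
    - Data type:
      - Languages: English, Portuguese, Spanish
- Splits:
  - Name: train
    - Bytes: 147472132
    - Examples: 255915
- Download size: 45556562 bytes
- Dataset size: 147472132 bytes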
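To compare the three configurations, the listed figures can be turned into per-example averages. The following is a minimal sketch that only does arithmetic on the numbers above; the `CONFIGS` dictionary and the `summarize` helper are hypothetical names introduced for illustration and are not part of the dataset.

```python
# Figures copied from the configuration listing above (sizes in bytes).
CONFIGS = {
    "en-es": {"num_bytes": 71777213, "num_examples": 177782, "download_size": 22965217},
    "en-pt": {"num_bytes": 1032669686, "num_examples": 2828917, "download_size": 322726075},
    "en-pt-es": {"num_bytes": 147472132, "num_examples": 255915, "download_size": 45556562},
}


def summarize(stats):
    """Return average bytes per example and the download-to-dataset size ratio."""
    return {
        "bytes_per_example": stats["num_bytes"] / stats["num_examples"],
        "download_ratio": stats["download_size"] / stats["num_bytes"],
    }


def total_examples(configs):
    """Sum the example counts across all configurations."""
    return sum(stats["num_examples"] for stats in configs.values())


if __name__ == "__main__":
    for name, stats in CONFIGS.items():
        summary = summarize(stats)
        print(
            f"{name}: {summary['bytes_per_example']:.1f} bytes/example, "
            f"download is {summary['download_ratio']:.1%} of the dataset size"
        )
    print(f"total examples: {total_examples(CONFIGS)}")
```

The en-pt configuration is by far the largest, and the trilingual en-pt-es records are larger on average than the bilingual ones, which is what one would expect if each record carries three sentences instead of two.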

Data Fields

translation: the aligned sentences of one example, one entry per language of the configuration
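As a rough illustration of how the translation field might be consumed, the sketch below walks over records and yields source/target sentence pairs. It assumes each record is a dictionary with a `translation` entry keyed by language code (`en`, `es`, `pt`), which is inferred from the configuration names rather than stated in the card. The sample records and the `iter_pairs` helper are hypothetical, and the sentences are made up, not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical records for illustration only; not sentences from the corpus.
# The "en"/"es" keys are an assumption based on the configuration name en-es.
SAMPLE_RECORDS = [
    {"translation": {"en": "The results are shown in Table 1.",
                     "es": "Los resultados se muestran en la Tabla 1."}},
    {"translation": {"en": "Samples were stored at 4 °C.",
                     "es": "Las muestras se almacenaron a 4 °C."}},
]


def iter_pairs(
    records: Iterable[Dict[str, Dict[str, str]]], src: str, tgt: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs, skipping records missing either side."""
    for record in records:
        sentences = record.get("translation", {})
        source, target = sentences.get(src), sentences.get(tgt)
        if source and target:
            yield source, target


if __name__ == "__main__":
    for source, target in iter_pairs(SAMPLE_RECORDS, "en", "es"):
        print(f"{source}\t{target}")
```

With the Hugging Face `datasets` library installed, records of this shape would typically come from something like `load_dataset("community-datasets/scielo", "en-es", split="train")`; the exact keys should be checked against a loaded example before relying on them.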

Data Splits

train (the only split listed for each configuration)
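Because only a train split is listed, anyone who needs a development set has to carve one out themselves. One possible approach, sketched below, assigns each example to a split by hashing its source sentence, so the assignment is stable across runs and machines. This is not something the dataset card prescribes; the `assign_split` helper and the 1% default are assumptions chosen for illustration.

```python
import hashlib


def assign_split(source_sentence: str, dev_percent: int = 1) -> str:
    """Deterministically map a sentence to "dev" or "train".

    Hypothetical helper: the sentence is hashed into one of 100 buckets,
    and the first `dev_percent` buckets are held out as a development set.
    """
    if not 0 <= dev_percent <= 100:
        raise ValueError("dev_percent must be between 0 and 100")
    digest = hashlib.sha256(source_sentence.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "dev" if bucket < dev_percent else "train"


if __name__ == "__main__":
    # Made-up sentences, used only to show the call.
    for sentence in ["The results are shown in Table 1.", "Samples were stored at 4 °C."]:
        print(assign_split(sentence, dev_percent=10), sentence)
```

Hashing the sentence text rather than its position means identical source sentences always land in the same split, which reduces the chance of exact duplicates leaking between train and dev.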

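Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Source Language Producers

[More Information Needed]

Annotations

Annotation Process

[More Information Needed]

Annotators

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Unknown

Citation Information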

@inproceedings{soares2018large,
  title={A Large Parallel Corpus of Full-Text Scientific Articles},
  author={Soares, Felipe and Moreira, Viviane and Becker, Karin},
  booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)},
  year={2018}
}

Contributions

Thanks to @patil-suraj for adding this dataset.
