
ParaPat/para_pat | Machine Translation Dataset | Multilingual Dataset

hugging_face · Updated 2024-01-18 · Indexed 2024-06-15
Machine Translation
Multilingual Dataset
Download link:
https://hf-mirror.com/datasets/ParaPat/para_pat
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- expert-generated
language:
- cs
- de
- el
- en
- es
- fr
- hu
- ja
- ko
- pt
- ro
- ru
- sk
- uk
- zh
license:
- cc-by-4.0
multilinguality:
- translation
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
- translation
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: parapat
pretty_name: Parallel Corpus of Patents Abstracts
dataset_info:
- config_name: el-en
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - el
        - en
  splits:
  - name: train
    num_bytes: 24818840
    num_examples: 10855
  download_size: 24894705
  dataset_size: 24818840
- config_name: cs-en
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - cs
        - en
  splits:
  - name: train
    num_bytes: 117555722
    num_examples: 78977
  download_size: 118010340
  dataset_size: 117555722
- config_name: en-hu
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - hu
  splits:
  - name: train
    num_bytes: 80637157
    num_examples: 42629
  download_size: 80893995
  dataset_size: 80637157
- config_name: en-ro
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ro
  splits:
  - name: train
    num_bytes: 80290819
    num_examples: 48789
  download_size: 80562562
  dataset_size: 80290819
- config_name: en-sk
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - sk
  splits:
  - name: train
    num_bytes: 31510348
    num_examples: 23410
  download_size: 31707728
  dataset_size: 31510348
- config_name: en-uk
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - uk
  splits:
  - name: train
    num_bytes: 136808871
    num_examples: 89226
  download_size: 137391928
  dataset_size: 136808871
- config_name: es-fr
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - fr
  splits:
  - name: train
    num_bytes: 53767035
    num_examples: 32553
  download_size: 53989438
  dataset_size: 53767035
- config_name: fr-ru
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ru
  splits:
  - name: train
    num_bytes: 33915203
    num_examples: 10889
  download_size: 33994490
  dataset_size: 33915203
- config_name: de-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - fr
  splits:
  - name: train
    num_bytes: 655742822
    num_examples: 1167988
  download_size: 204094654
  dataset_size: 655742822
- config_name: en-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ja
  splits:
  - name: train
    num_bytes: 3100002828
    num_examples: 6170339
  download_size: 1093334863
  dataset_size: 3100002828
- config_name: en-es
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - es
  splits:
  - name: train
    num_bytes: 337690858
    num_examples: 649396
  download_size: 105202237
  dataset_size: 337690858
- config_name: en-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - fr
  splits:
  - name: train
    num_bytes: 6103179552
    num_examples: 12223525
  download_size: 1846098331
  dataset_size: 6103179552
- config_name: de-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - en
  splits:
  - name: train
    num_bytes: 1059631418
    num_examples: 2165054
  download_size: 339299130
  dataset_size: 1059631418
- config_name: en-ko
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ko
  splits:
  - name: train
    num_bytes: 1466703472
    num_examples: 2324357
  download_size: 475152089
  dataset_size: 1466703472
- config_name: fr-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ja
  splits:
  - name: train
    num_bytes: 211127021
    num_examples: 313422
  download_size: 69038401
  dataset_size: 211127021
- config_name: en-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - zh
  splits:
  - name: train
    num_bytes: 2297993338
    num_examples: 4897841
  download_size: 899568201
  dataset_size: 2297993338
- config_name: en-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ru
  splits:
  - name: train
    num_bytes: 1974874480
    num_examples: 4296399
  download_size: 567240359
  dataset_size: 1974874480
- config_name: fr-ko
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ko
  splits:
  - name: train
    num_bytes: 222006786
    num_examples: 120607
  download_size: 64621605
  dataset_size: 222006786
- config_name: ru-uk
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - ru
        - uk
  splits:
  - name: train
    num_bytes: 163442529
    num_examples: 85963
  download_size: 38709524
  dataset_size: 163442529
- config_name: en-pt
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - pt
  splits:
  - name: train
    num_bytes: 37372555
    num_examples: 23121
  download_size: 12781082
  dataset_size: 37372555
---

# Dataset Card for ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts](https://figshare.com/articles/ParaPat_The_Multi-Million_Sentences_Parallel_Corpus_of_Patents_Abstracts/12627632)
- **Repository:** [ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts](https://github.com/soares-f/parapat)
- **Paper:** [ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts](https://www.aclweb.org/anthology/2020.lrec-1.465/)
- **Point of Contact:** [Felipe Soares](fs@felipesoares.net)

### Dataset Summary

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the parallel corpus developed from the open-access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the 22 largest language pairs, while the others were aligned at the abstract (i.e. paragraph) level.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset contains samples in cs, de, el, en, es, fr, hu, ja, ko, pt, ro, ru, sk, uk, and zh.

## Dataset Structure

### Data Instances

Instances are of two types, depending on the config:

First type:

```
{
  "translation": {
    "en": "A method for converting a series of m-bit information words to a modulated signal is described.",
    "es": "Se describe un método para convertir una serie de palabras de informacion de bits m a una señal modulada."
  }
}
```

Second type:

```
{
  "family_id": 10944407,
  "index": 844,
  "translation": {
    "el": "αφές ο οποίος παρασκευάζεται με χαρμάνι ελληνικού καφέ είτε σε συσκευή καφέ εσπρέσο είτε σε συσκευή γαλλικού καφέ (φίλτρου) είτε κατά τον παραδοσιακό τρόπο του ελληνικού καφέ και διυλίζεται, κτυπιέται στη συνέχεια με πάγο σε χειροκίνητο ή ηλεκτρικόμίξερ ώστε να παγώσει ομοιόμορφα και να αποκτήσει πλούσιο αφρό και σερβίρεται σε ποτήρι. ΰ",
    "en": "offee prepared using the mix for Greek coffee either in an espresso - type coffee making machine, or in a filter coffee making machine or in the traditional way for preparing Greek coffee and is then filtered , shaken with ice manually or with an electric mixer so that it freezes homogeneously, obtains a rich froth and is served in a glass."
  }
}
```

### Data Fields

- **index:** position in the corpus
- **family_id:** patent family identifier for each abstract, such that researchers can use that information for other text mining purposes
- **translation:** dictionary containing the source and target sentence for that example

### Data Splits

No official train/validation/test splits are given.

Parallel corpora aligned at the sentence level:

|Language Pair|# Sentences|# Unique Tokens|
|--------|-----|------|
|EN/ZH|4.9M|155.8M|
|EN/JA|6.1M|189.6M|
|EN/FR|12.2M|455M|
|EN/KO|2.3M|91.4M|
|EN/DE|2.2M|81.7M|
|EN/RU|4.3M|107.3M|
|DE/FR|1.2M|38.8M|
|FR/JA|0.3M|9.9M|
|EN/ES|0.6M|24.6M|

Parallel corpora aligned at the abstract level:

|Language Pair|# Abstracts|
|--------|-----|
|FR/KO|120,607|
|EN/UK|89,227|
|RU/UK|85,963|
|CS/EN|78,978|
|EN/RO|48,789|
|EN/HU|42,629|
|ES/FR|32,553|
|EN/SK|23,410|
|EN/PT|23,122|
|BG/EN|16,177|
|FR/RU|10,889|

## Dataset Creation

### Curation Rationale

The availability of parallel corpora is required by current Statistical and Neural Machine Translation systems (SMT and NMT). Acquiring a high-quality parallel corpus that is large enough to train MT systems, particularly NMT ones, is not a trivial task due to the need for correct alignment and, in many cases, human curation. In this context, the automated creation of parallel corpora from freely available resources is extremely important in Natural Language Processing (NLP).

### Source Data

#### Initial Data Collection and Normalization

Google makes patents data available under the Google Cloud Public Datasets. BigQuery is a Google service that supports the efficient storage and querying of massive datasets, which is usually a challenging task for conventional SQL databases. For instance, filtering the September 2019 release of the dataset, which contains more than 119 million rows, can take less than 1 minute for text fields. The on-demand billing for BigQuery is based on the amount of data processed by each query run, thus for a single query that performs a full scan, the cost can be over USD 15.00, since the cost per TB is currently USD 5.00.

#### Who are the source language producers?

BigQuery is a Google service that supports the efficient storage and querying of massive datasets which are usually a challenging task for usual SQL databases.

### Annotations

#### Annotation process

The following steps describe the process of producing aligned patent abstracts:

1. Load the nth individual file.
2. Remove rows where the number of abstracts with more than one language is less than 2 for a given family id. The family id attribute is used to group patents that refer to the same invention. By removing these rows, we remove abstracts that are available only in one language.
3. From the resulting set, create all possible parallel abstracts from the available languages. For instance, an abstract may be available in English, French and German; thus, the possible language pairs are English/French, English/German, and French/German.
4. Store the parallel patents in an SQL database for easier future handling and sampling.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Funded by Google Tensorflow Research Cloud.

### Licensing Information

CC BY 4.0

### Citation Information

```
@inproceedings{soares-etal-2020-parapat,
  title = "{P}ara{P}at: The Multi-Million Sentences Parallel Corpus of Patents Abstracts",
  author = "Soares, Felipe and Stevenson, Mark and Bartolome, Diego and Zaretskaya, Anna",
  booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
  month = may,
  year = "2020",
  address = "Marseille, France",
  publisher = "European Language Resources Association",
  url = "https://www.aclweb.org/anthology/2020.lrec-1.465",
  pages = "3769--3774",
  language = "English",
  ISBN = "979-10-95546-34-4",
}
```

[DOI](https://doi.org/10.6084/m9.figshare.12627632)

### Contributions

Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.
Provider:
ParaPat
Original Information Summary

Dataset Card: ParaPat - The Multi-Million Sentences Parallel Corpus of Patents Abstracts

Dataset Description

Dataset Summary

ParaPat is a multi-million-sentence parallel corpus covering 74 language pairs, with more than 68 million sentences and 800 million tokens. Sentences in the 22 largest language pairs were automatically aligned with the Hunalign algorithm; the remaining pairs are aligned at the paragraph (abstract) level.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset contains samples in the following languages: cs, de, el, en, es, fr, hu, ja, ko, pt, ro, ru, sk, uk, zh.

Dataset Structure

Data Instances

There are two types of data instances:

First type:

```json
{
  "translation": {
    "en": "A method for converting a series of m-bit information words to a modulated signal is described.",
    "es": "Se describe un método para convertir una serie de palabras de informacion de bits m a una señal modulada."
  }
}
```

Second type:

```json
{
  "family_id": 10944407,
  "index": 844,
  "translation": {
    "el": "αφές ο οποίος παρασκευάζεται με χαρμάνι ελληνικού καφέ είτε σε συσκευή καφέ εσπρέσο είτε σε συσκευή γαλλικού καφέ (φίλτρου) είτε κατά τον παραδοσιακό τρόπο του ελληνικού καφέ και διυλίζεται, κτυπιέται στη συνέχεια με πάγο σε χειροκίνητο ή ηλεκτρικόμίξερ ώστε να παγώσει ομοιόμορφα και να αποκτήσει πλούσιο αφρό και σερβίρεται σε ποτήρι. ΰ",
    "en": "offee prepared using the mix for Greek coffee either in an espresso - type coffee making machine, or in a filter coffee making machine or in the traditional way for preparing Greek coffee and is then filtered , shaken with ice manually or with an electric mixer so that it freezes homogeneously, obtains a rich froth and is served in a glass."
  }
}
```
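As an illustration only (not part of the original card), a minimal Python sketch of handling both record shapes; `parse_record` is a hypothetical helper, and the dicts simply mirror the JSON examples above:

```python
def parse_record(record: dict, src: str, tgt: str) -> tuple[str, str]:
    """Extract a (source, target) text pair from a ParaPat-style record.

    Both record types carry a "translation" dict keyed by language code;
    abstract-aligned configs additionally carry "index" and "family_id".
    """
    translation = record["translation"]
    return translation[src], translation[tgt]

# First type: sentence-aligned, translation only.
sent = {"translation": {"en": "A method for converting ...",
                        "es": "Se describe un método ..."}}

# Second type: abstract-aligned, with index and family_id.
abstract = {"family_id": 10944407, "index": 844,
            "translation": {"el": "αφές ο οποίος ...", "en": "offee prepared ..."}}

en, es = parse_record(sent, "en", "es")
family = abstract.get("family_id")  # .get() returns None for sentence-aligned records
```

Using `.get("family_id")` rather than indexing lets the same code walk over both config types without branching.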

Data Fields

  • index: position in the corpus
  • family_id: patent family identifier for each abstract, which researchers can use for other text-mining purposes
  • translation: dictionary containing the source and target sentences for the example

Data Splits

No official train/validation/test splits are provided.
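Because only a single train split ships with each config, any held-out set has to be created by the user. A minimal, hypothetical sketch in plain Python (the function name and fraction are illustrative; a fixed seed keeps the split reproducible):

```python
import random

def make_splits(examples: list, val_frac: float = 0.01, seed: int = 42):
    """Carve a validation set out of a single train split.

    Shuffles indices with a seeded RNG so the same call always
    produces the same partition.
    """
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_val = max(1, int(len(idx) * val_frac))
    val = [examples[i] for i in idx[:n_val]]
    train = [examples[i] for i in idx[n_val:]]
    return train, val

# Toy usage with 1000 dummy records and a 10% validation fraction.
train, val = make_splits([{"id": i} for i in range(1000)], val_frac=0.1)
```

For patent data, splitting by `family_id` rather than by row would be safer, since abstracts from the same family are near-duplicates across languages.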

Dataset Creation

Data Collection and Normalization

Google makes patent data available under the Google Cloud Public Datasets. BigQuery is a Google service that supports efficient storage and querying of massive datasets, which is typically a challenging task for conventional SQL databases.

Data Alignment Process

  1. Load the nth individual file.
  2. Remove rows where fewer than two languages are available for a given family ID. The family ID attribute groups patents that refer to the same invention; removing these rows discards abstracts available in only one language.
  3. From the resulting set, create all possible parallel abstracts from the available languages. For instance, an abstract may be available in English, French, and German, so the possible language pairs are English/French, English/German, and French/German.
  4. Store the parallel patents in an SQL database for easier future handling and sampling.
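Steps 2 and 3 above can be sketched in a few lines of Python; the row layout and function name here are illustrative assumptions, not the authors' actual code, and the SQL storage step is omitted:

```python
from collections import defaultdict
from itertools import combinations

def build_parallel_abstracts(rows):
    """rows: iterable of (family_id, lang, abstract_text) tuples.

    Step 2: drop families whose abstract exists in fewer than two languages.
    Step 3: emit every possible language pair within each remaining family.
    """
    families = defaultdict(dict)
    for family_id, lang, text in rows:
        families[family_id][lang] = text

    pairs = []
    for family_id, by_lang in families.items():
        if len(by_lang) < 2:  # single-language family: discard
            continue
        for a, b in combinations(sorted(by_lang), 2):
            pairs.append((family_id, a, b, by_lang[a], by_lang[b]))
    return pairs

# Family 1 is available in en/fr/de -> 3 pairs; family 2 is dropped.
rows = [
    (1, "en", "An abstract."),
    (1, "fr", "Un résumé."),
    (1, "de", "Eine Zusammenfassung."),
    (2, "en", "Only English here."),
]
pairs = build_parallel_abstracts(rows)
```

`combinations` over the sorted language codes reproduces the "all possible pairs" behaviour described in step 3 (en/fr/de yields de/en, de/fr, en/fr).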

Dataset Information

Configuration Names and Features

| Config | Features | Train examples | Dataset size (bytes) | Download size (bytes) |
|--------|----------|---------------:|---------------------:|----------------------:|
| el-en | index, family_id, translation | 10855 | 24818840 | 24894705 |
| cs-en | index, family_id, translation | 78977 | 117555722 | 118010340 |
| en-hu | index, family_id, translation | 42629 | 80637157 | 80893995 |
| en-ro | index, family_id, translation | 48789 | 80290819 | 80562562 |
| en-sk | index, family_id, translation | 23410 | 31510348 | 31707728 |
| en-uk | index, family_id, translation | 89226 | 136808871 | 137391928 |
| es-fr | index, family_id, translation | 32553 | 53767035 | 53989438 |
| fr-ru | index, family_id, translation | 10889 | 33915203 | 33994490 |
| de-fr | translation | 1167988 | 655742822 | 204094654 |
| en-ja | translation | 6170339 | 3100002828 | 1093334863 |
| en-es | translation | 649396 | 337690858 | 105202237 |
| en-fr | translation | 12223525 | 6103179552 | 1846098331 |
| de-en | translation | 2165054 | 1059631418 | 339299130 |
| en-ko | translation | 2324357 | 1466703472 | 475152089 |
| fr-ja | translation | 313422 | 211127021 | 69038401 |
| en-zh | translation | 4897841 | 2297993338 | 899568201 |
| en-ru | translation | 4296399 | 1974874480 | 567240359 |
| fr-ko | index, family_id, translation | 120607 | 222006786 | 64621605 |
| ru-uk | index, family_id, translation | 85963 | 163442529 | 38709524 |
| en-pt | index, family_id, translation | 23121 | 37372555 | 12781082 |
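The per-config example counts above can also be queried programmatically. The dictionary below is simply a transcription of those statistics, and `configs_with` is a hypothetical convenience helper:

```python
# Train-example counts per config, transcribed from the statistics above.
NUM_EXAMPLES = {
    "el-en": 10855, "cs-en": 78977, "en-hu": 42629, "en-ro": 48789,
    "en-sk": 23410, "en-uk": 89226, "es-fr": 32553, "fr-ru": 10889,
    "de-fr": 1167988, "en-ja": 6170339, "en-es": 649396, "en-fr": 12223525,
    "de-en": 2165054, "en-ko": 2324357, "fr-ja": 313422, "en-zh": 4897841,
    "en-ru": 4296399, "fr-ko": 120607, "ru-uk": 85963, "en-pt": 23121,
}

def configs_with(lang: str) -> list[str]:
    """Return the configs that include `lang`, largest first."""
    hits = [c for c in NUM_EXAMPLES if lang in c.split("-")]
    return sorted(hits, key=NUM_EXAMPLES.get, reverse=True)

total = sum(NUM_EXAMPLES.values())   # rows across all 20 configs
largest_en = configs_with("en")[0]   # en-fr dominates the English pairs
```

This kind of lookup is handy when deciding which config to download first, since the configs differ in size by three orders of magnitude.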


Additional Information

License

CC BY 4.0

Citation Information

```
@inproceedings{soares-etal-2020-parapat,
  title = "{P}ara{P}at: The Multi-Million Sentences Parallel Corpus of Patents Abstracts",
  author = "Soares, Felipe and Stevenson, Mark and Bartolome, Diego and Zaretskaya, Anna",
  booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
  month = may,
  year = "2020",
  address = "Marseille, France",
  publisher = "European Language Resources Association",
  url = "https://www.aclweb.org/anthology/2020.lrec-1.465",
  pages = "3769--3774",
  language = "English",
  ISBN = "979-10-95546-34-4",
}
```

DOI: https://doi.org/10.6084/m9.figshare.12627632
