
ParaPat/para_pat

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/ParaPat/para_pat
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- expert-generated
language:
- cs
- de
- el
- en
- es
- fr
- hu
- ja
- ko
- pt
- ro
- ru
- sk
- uk
- zh
license:
- cc-by-4.0
multilinguality:
- translation
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
- translation
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: parapat
pretty_name: Parallel Corpus of Patents Abstracts
dataset_info:
- config_name: el-en
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - el
        - en
  splits:
  - name: train
    num_bytes: 24818840
    num_examples: 10855
  download_size: 24894705
  dataset_size: 24818840
- config_name: cs-en
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - cs
        - en
  splits:
  - name: train
    num_bytes: 117555722
    num_examples: 78977
  download_size: 118010340
  dataset_size: 117555722
- config_name: en-hu
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - hu
  splits:
  - name: train
    num_bytes: 80637157
    num_examples: 42629
  download_size: 80893995
  dataset_size: 80637157
- config_name: en-ro
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ro
  splits:
  - name: train
    num_bytes: 80290819
    num_examples: 48789
  download_size: 80562562
  dataset_size: 80290819
- config_name: en-sk
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - sk
  splits:
  - name: train
    num_bytes: 31510348
    num_examples: 23410
  download_size: 31707728
  dataset_size: 31510348
- config_name: en-uk
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - uk
  splits:
  - name: train
    num_bytes: 136808871
    num_examples: 89226
  download_size: 137391928
  dataset_size: 136808871
- config_name: es-fr
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - fr
  splits:
  - name: train
    num_bytes: 53767035
    num_examples: 32553
  download_size: 53989438
  dataset_size: 53767035
- config_name: fr-ru
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ru
  splits:
  - name: train
    num_bytes: 33915203
    num_examples: 10889
  download_size: 33994490
  dataset_size: 33915203
- config_name: de-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - fr
  splits:
  - name: train
    num_bytes: 655742822
    num_examples: 1167988
  download_size: 204094654
  dataset_size: 655742822
- config_name: en-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ja
  splits:
  - name: train
    num_bytes: 3100002828
    num_examples: 6170339
  download_size: 1093334863
  dataset_size: 3100002828
- config_name: en-es
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - es
  splits:
  - name: train
    num_bytes: 337690858
    num_examples: 649396
  download_size: 105202237
  dataset_size: 337690858
- config_name: en-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - fr
  splits:
  - name: train
    num_bytes: 6103179552
    num_examples: 12223525
  download_size: 1846098331
  dataset_size: 6103179552
- config_name: de-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - en
  splits:
  - name: train
    num_bytes: 1059631418
    num_examples: 2165054
  download_size: 339299130
  dataset_size: 1059631418
- config_name: en-ko
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ko
  splits:
  - name: train
    num_bytes: 1466703472
    num_examples: 2324357
  download_size: 475152089
  dataset_size: 1466703472
- config_name: fr-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ja
  splits:
  - name: train
    num_bytes: 211127021
    num_examples: 313422
  download_size: 69038401
  dataset_size: 211127021
- config_name: en-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - zh
  splits:
  - name: train
    num_bytes: 2297993338
    num_examples: 4897841
  download_size: 899568201
  dataset_size: 2297993338
- config_name: en-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ru
  splits:
  - name: train
    num_bytes: 1974874480
    num_examples: 4296399
  download_size: 567240359
  dataset_size: 1974874480
- config_name: fr-ko
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ko
  splits:
  - name: train
    num_bytes: 222006786
    num_examples: 120607
  download_size: 64621605
  dataset_size: 222006786
- config_name: ru-uk
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - ru
        - uk
  splits:
  - name: train
    num_bytes: 163442529
    num_examples: 85963
  download_size: 38709524
  dataset_size: 163442529
- config_name: en-pt
  features:
  - name: index
    dtype: int32
  - name: family_id
    dtype: int32
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - pt
  splits:
  - name: train
    num_bytes: 37372555
    num_examples: 23121
  download_size: 12781082
  dataset_size: 37372555
---

# Dataset Card for ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts](https://figshare.com/articles/ParaPat_The_Multi-Million_Sentences_Parallel_Corpus_of_Patents_Abstracts/12627632)
- **Repository:** [ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts](https://github.com/soares-f/parapat)
- **Paper:** [ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts](https://www.aclweb.org/anthology/2020.lrec-1.465/)
- **Point of Contact:** [Felipe Soares](fs@felipesoares.net)

### Dataset Summary

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains a parallel corpus built from the open-access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the 22 largest language pairs, while the remaining pairs were aligned at the abstract (i.e. paragraph) level.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset contains samples in cs, de, el, en, es, fr, hu, ja, ko, pt, ro, ru, sk, uk, zh

## Dataset Structure

### Data Instances

Instances are of two types, depending on the configuration:

First type

    {
      "translation":{
        "en":"A method for converting a series of m-bit information words to a modulated signal is described.",
        "es":"Se describe un método para convertir una serie de palabras de informacion de bits m a una señal modulada."
      }
    }

Second type

    {
      "family_id":10944407,
      "index":844,
      "translation":{
        "el":"αφές ο οποίος παρασκευάζεται με χαρμάνι ελληνικού καφέ είτε σε συσκευή καφέ εσπρέσο είτε σε συσκευή γαλλικού καφέ (φίλτρου) είτε κατά τον παραδοσιακό τρόπο του ελληνικού καφέ και διυλίζεται, κτυπιέται στη συνέχεια με πάγο σε χειροκίνητο ή ηλεκτρικόμίξερ ώστε να παγώσει ομοιόμορφα και να αποκτήσει πλούσιο αφρό και σερβίρεται σε ποτήρι. ΰ",
        "en":"offee prepared using the mix for Greek coffee either in an espresso - type coffee making machine, or in a filter coffee making machine or in the traditional way for preparing Greek coffee and is then filtered , shaken with ice manually or with an electric mixer so that it freezes homogeneously, obtains a rich froth and is served in a glass."
      }
    }

### Data Fields

- **index:** position in the corpus
- **family_id:** patent family identifier for each abstract, which researchers can use for other text mining purposes
- **translation:** dictionary containing the source and target sentence for that example

### Data Splits

No official train/val/test splits are given.

Parallel corpora aligned at the sentence level

|Language Pair|# Sentences|# Unique Tokens|
|--------|-----|------|
|EN/ZH|4.9M|155.8M|
|EN/JA|6.1M|189.6M|
|EN/FR|12.2M|455M|
|EN/KO|2.3M|91.4M|
|EN/DE|2.2M|81.7M|
|EN/RU|4.3M|107.3M|
|DE/FR|1.2M|38.8M|
|FR/JA|0.3M|9.9M|
|EN/ES|0.6M|24.6M|

Parallel corpora aligned at the abstract level

|Language Pair|# Abstracts|
|--------|-----|
|FR/KO|120,607|
|EN/UK|89,227|
|RU/UK|85,963|
|CS/EN|78,978|
|EN/RO|48,789|
|EN/HU|42,629|
|ES/FR|32,553|
|EN/SK|23,410|
|EN/PT|23,122|
|BG/EN|16,177|
|FR/RU|10,889|

## Dataset Creation

### Curation Rationale

Current Statistical and Neural Machine Translation systems (SMT and NMT) require parallel corpora. Acquiring a high-quality parallel corpus that is large enough to train MT systems, particularly NMT ones, is not a trivial task due to the need for correct alignment and, in many cases, human curation.
In this context, the automated creation of parallel corpora from freely available resources is extremely important in Natural Language Processing (NLP).

### Source Data

#### Initial Data Collection and Normalization

Google makes patents data available under the Google Cloud Public Datasets. BigQuery is a Google service that supports the efficient storage and querying of massive datasets, which are usually challenging for conventional SQL databases. For instance, filtering the September 2019 release of the dataset, which contains more than 119 million rows, can take less than 1 minute for text fields. On-demand billing for BigQuery is based on the amount of data processed by each query run, so a single query that performs a full scan can cost over USD 15.00 at the current price of USD 5.00 per TB (that is, more than 3 TB processed).

#### Who are the source language producers?

BigQuery is a Google service that supports the efficient storage and querying of massive datasets, which are usually challenging for conventional SQL databases.

### Annotations

#### Annotation process

The following steps describe the process of producing aligned patent abstracts:

1. Load the nth individual file.
2. Remove rows whose family id has abstracts in fewer than two languages. The family id attribute groups patents that refer to the same invention, so this step discards abstracts that are available in only one language.
3. From the resulting set, create all possible parallel abstracts from the available languages. For instance, an abstract may be available in English, French and German; the possible language pairs are then English/French, English/German, and French/German.
4. Store the parallel patents in an SQL database for easier future handling and sampling.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Funded by Google Tensorflow Research Cloud.

### Licensing Information

CC BY 4.0

### Citation Information

```
@inproceedings{soares-etal-2020-parapat,
    title = "{P}ara{P}at: The Multi-Million Sentences Parallel Corpus of Patents Abstracts",
    author = "Soares, Felipe and Stevenson, Mark and Bartolome, Diego and Zaretskaya, Anna",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.465",
    pages = "3769--3774",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```

[DOI](https://doi.org/10.6084/m9.figshare.12627632)

### Contributions

Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.
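Provided by:
ParaPat
Summary of original information

Dataset card: ParaPat - The Multi-Million Sentences Parallel Corpus of Patents Abstracts

Dataset description

Dataset summary

ParaPat is a multi-million sentence parallel corpus covering 74 language pairs, with more than 68 million sentences and 800 million tokens. Sentences were aligned automatically with the Hunalign algorithm for the 22 largest language pairs; the remaining pairs are aligned at the abstract (paragraph) level.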

Supported tasks and leaderboards

[More information needed]

Languages

The dataset contains samples in the following languages: cs, de, el, en, es, fr, hu, ja, ko, pt, ro, ru, sk, uk, zh

Dataset structure

Data instances

There are two types of data instance:

First type:

    {
      "translation":{
        "en":"A method for converting a series of m-bit information words to a modulated signal is described.",
        "es":"Se describe un método para convertir una serie de palabras de informacion de bits m a una señal modulada."
      }
    }

Second type:

    {
      "family_id":10944407,
      "index":844,
      "translation":{
        "el":"αφές ο οποίος παρασκευάζεται με χαρμάνι ελληνικού καφέ είτε σε συσκευή καφέ εσπρέσο είτε σε συσκευή γαλλικού καφέ (φίλτρου) είτε κατά τον παραδοσιακό τρόπο του ελληνικού καφέ και διυλίζεται, κτυπιέται στη συνέχεια με πάγο σε χειροκίνητο ή ηλεκτρικόμίξερ ώστε να παγώσει ομοιόμορφα και να αποκτήσει πλούσιο αφρό και σερβίρεται σε ποτήρι. ΰ",
        "en":"offee prepared using the mix for Greek coffee either in an espresso - type coffee making machine, or in a filter coffee making machine or in the traditional way for preparing Greek coffee and is then filtered , shaken with ice manually or with an electric mixer so that it freezes homogeneously, obtains a rich froth and is served in a glass."
      }
    }

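Data fields

  • index: position in the corpus
  • family_id: patent family identifier for each abstract; researchers can use it for other text mining purposes.
  • translation: dictionary containing the source and target sentences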
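
Because the two instance types differ only in whether `index` and `family_id` are present, downstream code can treat them uniformly. The following is a minimal sketch, not part of the dataset's own tooling; the `ParallelPair` type and the `to_pair` helper are hypothetical names introduced here, and sorting the language codes is an assumed convention for choosing which side counts as "source".

```python
from typing import NamedTuple, Optional


class ParallelPair(NamedTuple):
    """Hypothetical flat view of one ParaPat record."""
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str
    family_id: Optional[int]
    index: Optional[int]


def to_pair(record: dict) -> ParallelPair:
    """Flatten a record of either type into a ParallelPair."""
    translation = record["translation"]
    if len(translation) != 2:
        raise ValueError(f"expected exactly two languages, got {sorted(translation)}")
    # Sort by language code so the result does not depend on dict ordering
    # (an assumption of this sketch, not something the dataset prescribes).
    (src_lang, src_text), (tgt_lang, tgt_text) = sorted(translation.items())
    # Sentence-aligned configurations carry no family_id/index; use None there.
    return ParallelPair(src_lang, tgt_lang, src_text, tgt_text,
                        record.get("family_id"), record.get("index"))
```

With this helper, a record from a sentence-aligned configuration simply yields `family_id=None` and `index=None`.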

Data splits

There are no official train/validation/test splits.
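
Since no official splits exist, users have to define their own. One possible approach, sketched below under stated assumptions, is to hash a stable key so that the assignment is reproducible across runs. The function name `assign_split`, the default fractions, and the choice of key are all illustrative; for the abstract-aligned configurations, using `family_id` as the key would keep a patent family inside a single split, while the sentence-aligned configurations have no such field and would need another key (for example the source sentence).

```python
import hashlib


def assign_split(key: str, val_fraction: float = 0.01, test_fraction: float = 0.01) -> str:
    """Deterministically map a stable key to 'train', 'validation' or 'test'."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    test_cut = int(test_fraction * 10_000)
    val_cut = test_cut + int(val_fraction * 10_000)
    if bucket < test_cut:
        return "test"
    if bucket < val_cut:
        return "validation"
    return "train"
```

For a record of the second type, `assign_split(str(record["family_id"]))` would return the same split name every time it is called.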

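Dataset creation

Data collection and normalization

Google makes patent data available under Google Cloud Public Datasets. BigQuery is a Google service that supports efficient storage and querying of large-scale datasets, which are usually challenging for conventional SQL databases.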

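Data alignment process

  1. Load the nth individual file.
  2. Remove rows whose family ID has abstracts in fewer than two languages. The family ID attribute groups patents that refer to the same invention, so this step discards abstracts that are available in only one language.
  3. From the resulting set, create all possible parallel abstracts from the available languages. For example, an abstract may be available in English, French and German, so the possible language pairs are English/French, English/German and French/German.
  4. Store the parallel patents in an SQL database for easier future handling and sampling.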
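
Steps 2 and 3 above amount to grouping abstracts by family ID, dropping single-language families, and enumerating every unordered language pair. The sketch below illustrates that logic in memory; it is not the authors' pipeline. The function name, the `(family_id, language, abstract)` row layout, and the rule of keeping the first abstract seen per language are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_parallel_abstracts(rows):
    """rows: iterable of (family_id, language, abstract) tuples (assumed layout)."""
    by_family = defaultdict(dict)
    for family_id, lang, abstract in rows:
        # Assumption: keep the first abstract seen per language within a family.
        by_family[family_id].setdefault(lang, abstract)

    pairs = []
    for family_id, abstracts in by_family.items():
        if len(abstracts) < 2:
            continue  # step 2: abstract available in only one language
        # step 3: every unordered pair of available languages
        for lang_a, lang_b in combinations(sorted(abstracts), 2):
            pairs.append((family_id, lang_a, lang_b,
                          abstracts[lang_a], abstracts[lang_b]))
    return pairs
```

A family with English, French and German abstracts therefore contributes three pairs, matching the example in step 3.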

Dataset information

Config names and features

  • el-en

    • Features: index, family_id, translation
    • Split: train (10855 examples, 24818840 bytes)
    • Download size: 24894705 bytes
    • Dataset size: 24818840 bytes
  • cs-en

    • Features: index, family_id, translation
    • Split: train (78977 examples, 117555722 bytes)
    • Download size: 118010340 bytes
    • Dataset size: 117555722 bytes
  • en-hu

    • Features: index, family_id, translation
    • Split: train (42629 examples, 80637157 bytes)
    • Download size: 80893995 bytes
    • Dataset size: 80637157 bytes
  • en-ro

    • Features: index, family_id, translation
    • Split: train (48789 examples, 80290819 bytes)
    • Download size: 80562562 bytes
    • Dataset size: 80290819 bytes
  • en-sk

    • Features: index, family_id, translation
    • Split: train (23410 examples, 31510348 bytes)
    • Download size: 31707728 bytes
    • Dataset size: 31510348 bytes
  • en-uk

    • Features: index, family_id, translation
    • Split: train (89226 examples, 136808871 bytes)
    • Download size: 137391928 bytes
    • Dataset size: 136808871 bytes
  • es-fr

    • Features: index, family_id, translation
    • Split: train (32553 examples, 53767035 bytes)
    • Download size: 53989438 bytes
    • Dataset size: 53767035 bytes
  • fr-ru

    • Features: index, family_id, translation
    • Split: train (10889 examples, 33915203 bytes)
    • Download size: 33994490 bytes
    • Dataset size: 33915203 bytes
  • de-fr

    • Features: translation
    • Split: train (1167988 examples, 655742822 bytes)
    • Download size: 204094654 bytes
    • Dataset size: 655742822 bytes
  • en-ja

    • Features: translation
    • Split: train (6170339 examples, 3100002828 bytes)
    • Download size: 1093334863 bytes
    • Dataset size: 3100002828 bytes
  • en-es

    • Features: translation
    • Split: train (649396 examples, 337690858 bytes)
    • Download size: 105202237 bytes
    • Dataset size: 337690858 bytes
  • en-fr

    • Features: translation
    • Split: train (12223525 examples, 6103179552 bytes)
    • Download size: 1846098331 bytes
    • Dataset size: 6103179552 bytes
  • de-en

    • Features: translation
    • Split: train (2165054 examples, 1059631418 bytes)
    • Download size: 339299130 bytes
    • Dataset size: 1059631418 bytes
  • en-ko

    • Features: translation
    • Split: train (2324357 examples, 1466703472 bytes)
    • Download size: 475152089 bytes
    • Dataset size: 1466703472 bytes
  • fr-ja

    • Features: translation
    • Split: train (313422 examples, 211127021 bytes)
    • Download size: 69038401 bytes
    • Dataset size: 211127021 bytes
  • en-zh

    • Features: translation
    • Split: train (4897841 examples, 2297993338 bytes)
    • Download size: 899568201 bytes
    • Dataset size: 2297993338 bytes
  • en-ru

    • Features: translation
    • Split: train (4296399 examples, 1974874480 bytes)
    • Download size: 567240359 bytes
    • Dataset size: 1974874480 bytes
  • fr-ko

    • Features: index, family_id, translation
    • Split: train (120607 examples, 222006786 bytes)
    • Download size: 64621605 bytes
    • Dataset size: 222006786 bytes
  • ru-uk

    • Features: index, family_id, translation
    • Split: train (85963 examples, 163442529 bytes)
    • Download size: 38709524 bytes
    • Dataset size: 163442529 bytes
  • en-pt

    • Features: index, family_id, translation
    • Split: train (23121 examples, 37372555 bytes)
    • Download size: 12781082 bytes
    • Dataset size: 37372555 bytes
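
All twenty configuration names listed above put the two language codes in alphabetical order, so a pair can be mapped to its configuration regardless of the order in which the languages are given. The sketch below shows one way to do that and to load the corresponding `train` split with the Hugging Face `datasets` library. The helper names are hypothetical, the repository ID `ParaPat/para_pat` is taken from the download link on this page, and whether the call succeeds as written depends on the installed `datasets` version and on how the repository is hosted, so treat the loading part as an untested assumption.

```python
# Configuration names as listed on this page.
CONFIGS = frozenset({
    "el-en", "cs-en", "en-hu", "en-ro", "en-sk", "en-uk", "es-fr", "fr-ru",
    "de-fr", "en-ja", "en-es", "en-fr", "de-en", "en-ko", "fr-ja", "en-zh",
    "en-ru", "fr-ko", "ru-uk", "en-pt",
})


def config_name(lang_a: str, lang_b: str) -> str:
    """Return the configuration name for a language pair, in either order."""
    name = "-".join(sorted((lang_a.lower(), lang_b.lower())))
    if name not in CONFIGS:
        raise ValueError(f"no configuration for language pair {lang_a}/{lang_b}")
    return name


def load_pair(lang_a: str, lang_b: str):
    """Load the train split for a pair (requires network and the datasets library)."""
    # Imported lazily so config_name() works without the library installed.
    from datasets import load_dataset
    return load_dataset("ParaPat/para_pat", config_name(lang_a, lang_b), split="train")
```

For example, `config_name("en", "el")` resolves to `el-en`, the smallest configuration, which makes it a convenient first candidate for `load_pair`.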

Dataset information

License

CC BY 4.0

Citation information

    @inproceedings{soares-etal-2020-parapat,
        title = "{P}ara{P}at: The Multi-Million Sentences Parallel Corpus of Patents Abstracts",
        author = "Soares, Felipe and Stevenson, Mark and Bartolome, Diego and Zaretskaya, Anna",
        booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
        month = may,
        year = "2020",
        address = "Marseille, France",
        publisher = "European Language Resources Association",
        url = "https://www.aclweb.org/anthology/2020.lrec-1.465",
        pages = "3769--3774",
        language = "English",
        ISBN = "979-10-95546-34-4",
    }

DOI: https://doi.org/10.6084/m9.figshare.12627632

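Collected summary
Background and challenges
Background overview
ParaPat is a large-scale parallel corpus of patent abstracts containing more than 68 million sentences and 800 million tokens across 74 language pairs. The data comes from Google Patents and was aligned automatically. The dataset supports natural language processing tasks such as machine translation, offers both sentence-level and abstract-level alignment, and is suited to training and research on multilingual models.
The above content was collected and summarized by 遇见数据集.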