Thermostatic/texts_parallel_corpus_europarl_english_spanish

Name: Thermostatic/texts_parallel_corpus_europarl_english_spanish
Creator: Thermostatic
Published: 2024-04-18 04:21:43
License: No description available

Source: Hugging Face. Updated 2024-04-18; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/Thermostatic/texts_parallel_corpus_europarl_english_spanish

Resource description:

---
task_categories:
- translation
language:
- en
- es
tags:
- English
- Spanish
- Parallel Corpus
pretty_name: Europarl
size_categories:
- 1M<n<10M
---

# Dataset Card for Dataset Name

A massive parallel corpus of English-Spanish pairs. It has no specified license, but there doesn't seem to be any copyrighted material in the corpus. I have personally merged rows using a pseudo-random algorithm, making the dataset useful for training LLMs and reducing the risk of overfitting.

## Dataset Details

### Dataset Description

- **Curated by:** Philipp Koehn
- **Funded by [optional]:** In part funded by the European Commission (7th Framework Programme).
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** English & Spanish
- **License:** Not specified.

### Dataset Sources [optional]

- **Repository:** https://www.statmt.org/europarl/index.html
- **Paper [optional]:** https://aclanthology.org/2005.mtsummit-papers.11.pdf
- **Demo [optional]:** [More Information Needed]

## Uses

### Direct Use

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Dataset Structure

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Data Collection and Processing

[More Information Needed]

#### Who are the source data producers?

[More Information Needed]

### Annotations [optional]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

#### Personal and Sensitive Information

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
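The card says rows were merged "using a pseudo-random algorithm" but does not describe the algorithm. The following is only a minimal sketch of one way such a merge could work, under stated assumptions: the function name `merge_rows`, the `max_group` and `seed` parameters, and the uniform choice of group size are all hypothetical, and the sample sentences are toy data, not rows from the corpus.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (English text, Spanish text)


def merge_rows(pairs: List[Pair], max_group: int = 3, seed: int = 0) -> List[Pair]:
    """Merge consecutive sentence pairs into longer rows of random size.

    Hypothetical sketch: each merged row joins between 1 and `max_group`
    consecutive pairs, with the group size drawn uniformly at random.
    """
    rng = random.Random(seed)  # seeded so the merge is reproducible
    merged: List[Pair] = []
    i = 0
    while i < len(pairs):
        size = rng.randint(1, max_group)  # inclusive on both ends
        group = pairs[i:i + size]         # the last group may be shorter
        en = " ".join(p[0] for p in group)
        es = " ".join(p[1] for p in group)
        merged.append((en, es))
        i += size
    return merged


# Toy data for illustration only; these are not rows from the dataset.
sample: List[Pair] = [
    ("Thank you.", "Gracias."),
    ("The sitting is open.", "Se abre la sesión."),
    ("The debate is closed.", "El debate queda cerrado."),
    ("The vote will take place tomorrow.", "La votación tendrá lugar mañana."),
]

for en, es in merge_rows(sample, max_group=3, seed=42):
    print(f"EN: {en}\nES: {es}\n")
```

Merging consecutive pairs keeps the English and Spanish sides aligned while varying row length, which is one plausible reading of the card's claim about reducing overfitting. The actual procedure used for this dataset may differ.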

Provider:

Thermostatic

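Summary of original information

Dataset overview

Dataset description

Name: Europarl
Task category: Translation
Languages: English & Spanish
Tags: English, Spanish, Parallel Corpus
Size: 1M<n<10M
License: Not specified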

Dataset sources

Repository: https://www.statmt.org/europarl/index.html
Paper: https://aclanthology.org/2005.mtsummit-papers.11.pdf

Dataset creation

Curated by: Philipp Koehn
Funding: In part funded by the European Commission (7th Framework Programme)

Uses

Direct use: [More Information Needed]
Out-of-scope use: [More Information Needed]

Dataset structure

Structure description: [More Information Needed]

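Source data and annotations

Data collection and processing: [More Information Needed]
Source data producers: [More Information Needed]
Annotations: [More Information Needed]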

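Bias, risks, and limitations

Recommendations: Users should be made aware of the risks, biases, and limitations of the dataset. More information is needed for further recommendations.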
