ilsp/scipar_parallel_docs
Source: Hugging Face. Updated 2024-03-27; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ilsp/scipar_parallel_docs
Resource overview:
This dataset contains parallel documents (i.e., titles and abstracts) extracted from academic papers, dissertations, and other scientific texts. In the original paper, we extracted 9.17M sentence pairs in 31 language pairs from 86 repositories. This version goes through further processing and filtering to extract parallel documents rather than parallel sentences. To this end, we retained only the parallel titles and abstracts of academic records with high alignment scores, and applied additional filters (e.g., word ratio, non-empty abstracts). Please note that the dataset will be continuously updated to include more language pairs, as described in the original paper.
Provided by:
ilsp
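Summary of original information
Dataset overview
Dataset description
This dataset contains parallel documents (i.e., titles and abstracts) extracted from academic papers, dissertations, and other scientific texts. The original dataset comprises 9.17M sentence pairs extracted from 86 repositories, covering 31 language pairs. This version applies further processing and filtering to extract parallel documents rather than parallel sentences: it retains only the parallel titles and abstracts of academic records with a high (average) alignment score, and applies additional filters (e.g., word ratio, non-empty abstracts).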
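The document-level filtering described above (high average alignment score, word ratio, non-empty abstracts) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Record` fields, the `keep` and `word_ratio` helpers, and the thresholds `min_score=0.8` and `min_ratio=0.5` are all hypothetical and are not taken from the dataset card or the SciPar paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Record:
    """Hypothetical parallel record; field names are illustrative only."""
    src_title: str
    tgt_title: str
    src_abstract: str
    tgt_abstract: str
    alignment_scores: List[float]  # assumed per-sentence alignment scores


def word_ratio(a: str, b: str) -> float:
    """Word count of the shorter text divided by that of the longer one."""
    len_a, len_b = len(a.split()), len(b.split())
    if len_a == 0 or len_b == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


def keep(rec: Record, min_score: float = 0.8, min_ratio: float = 0.5) -> bool:
    """Return True if the record passes all three (assumed) filters."""
    # Filter 1: both abstracts must be non-empty.
    if not rec.src_abstract.strip() or not rec.tgt_abstract.strip():
        return False
    # Filter 2: the average alignment score must be high enough.
    if not rec.alignment_scores or mean(rec.alignment_scores) < min_score:
        return False
    # Filter 3: the two abstracts must have comparable lengths in words.
    return word_ratio(rec.src_abstract, rec.tgt_abstract) >= min_ratio
```

A record is kept only when every check passes, so the filters can be reordered or extended without changing the outcome; the actual criteria and thresholds used for the released dataset may differ.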
Language pairs and document counts
| Language pair | Number of parallel documents |
|---|---|
| EN-DE | 57,387 |
| EN-EL | 55,833 |
| EN-ES | 25,844 |
| EN-FR | 130,750 |
| EN-IT | 3,860 |
| Total | 273,674 |
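As a quick sanity check on the table, the per-pair counts can be summed and converted to shares. The snippet below is a small sketch that uses only the figures listed above; the variable names are illustrative.

```python
# Parallel document counts per language pair, copied from the table above.
counts = {
    "EN-DE": 57_387,
    "EN-EL": 55_833,
    "EN-ES": 25_844,
    "EN-FR": 130_750,
    "EN-IT": 3_860,
}

# The sum should match the total row of the table (273,674).
total = sum(counts.values())

# Fraction of all parallel documents contributed by each language pair.
shares = {pair: n / total for pair, n in counts.items()}

for pair, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pair}: {counts[pair]:>7,} ({share:.1%})")
```

The sum agrees with the listed total, and EN-FR alone accounts for close to half of the documents, which is worth keeping in mind when sampling across language pairs.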
Related datasets
The original dataset is available on ELRC-SHARE.
Citation
@inproceedings{roussis2022scipar,
  title={SciPar: A collection of parallel corpora from scientific abstracts},
  author={Roussis, Dimitrios and Papavassiliou, Vassilis and Prokopidis, Prokopis and Piperidis, Stelios and Katsouros, Vassilis},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={2652--2657},
  year={2022}
}



