ymoslem/UN-Arabic-English-Filtered
Dataset Overview
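Dataset information
- Features:
  - text_en: English text, of type string.
  - text_ar: Arabic text, of type string.
- Splits:
  - train: training set, 19,279,407 examples, 9,128,039,564 bytes.
  - test: test set, 8,752 examples, 3,677,574 bytes.
  - dev: development set, 8,752 examples, 3,714,475 bytes.
- Download size: 5,159,323,292 bytes
- Dataset size: 9,135,431,613 bytes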
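The figures above can be cross-checked with a little arithmetic. The sketch below copies the numbers from the card into a plain dictionary (the names `SPLITS`, `total_bytes`, and so on are illustrative, not part of any library) and confirms that the per-split byte counts add up to the stated dataset size. It also derives the average size of a sentence pair and the ratio of download size to dataset size.

```python
# Split statistics copied from the dataset card above.
SPLITS = {
    "train": {"num_examples": 19_279_407, "num_bytes": 9_128_039_564},
    "test": {"num_examples": 8_752, "num_bytes": 3_677_574},
    "dev": {"num_examples": 8_752, "num_bytes": 3_714_475},
}
DOWNLOAD_SIZE = 5_159_323_292  # bytes, as listed on the card
DATASET_SIZE = 9_135_431_613   # bytes, as listed on the card


def total_bytes(splits):
    """Sum the byte counts of all splits."""
    return sum(s["num_bytes"] for s in splits.values())


def total_examples(splits):
    """Sum the example counts of all splits."""
    return sum(s["num_examples"] for s in splits.values())


def mean_bytes_per_example(split):
    """Average number of bytes used by one sentence pair in a split."""
    return split["num_bytes"] / split["num_examples"]


# The split sizes account for the whole dataset size.
print(total_bytes(SPLITS) == DATASET_SIZE)

# Total number of sentence pairs, which falls in the 10M < n < 100M bucket.
print(total_examples(SPLITS))

# Average bytes per training pair, and download size relative to dataset size.
print(round(mean_bytes_per_example(SPLITS["train"]), 1))
print(round(DOWNLOAD_SIZE / DATASET_SIZE, 3))
```

The download is a little over half the dataset size, which is consistent with the files being stored in a compressed format, although the card itself does not state the storage format.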
Configurations
- Default configuration:
  - train: data path is data/train-*
  - test: data path is data/test-*
  - dev: data path is data/dev-*
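The data paths in the default configuration are glob patterns: every file whose path matches a pattern belongs to that split. The sketch below illustrates the mapping with Python's standard `fnmatch` module. The shard file names are hypothetical, chosen only to show the matching; the card does not list the actual file names, and `split_for` is an illustrative helper rather than part of any library.

```python
from fnmatch import fnmatchcase

# Glob patterns taken from the default configuration on the card.
SPLIT_PATTERNS = {
    "train": "data/train-*",
    "test": "data/test-*",
    "dev": "data/dev-*",
}


def split_for(path, patterns=SPLIT_PATTERNS):
    """Return the split whose glob pattern matches `path`, or None."""
    for split, pattern in patterns.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical file names, for illustration only.
example_files = [
    "data/train-00000-of-00019.parquet",
    "data/test-00000-of-00001.parquet",
    "data/dev-00000-of-00001.parquet",
    "README.md",
]

for name in example_files:
    print(name, "->", split_for(name))
```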
Task categories
- Translation
Languages
- Arabic (ar)
- English (en)
Size category
- 10M < n < 100M
License
- CC BY 4.0
Dataset structure
DatasetDict({
    train: Dataset({
        features: [text_en, text_ar],
        num_rows: 19279407
    })
    test: Dataset({
        features: [text_en, text_ar],
        num_rows: 8752
    })
    dev: Dataset({
        features: [text_en, text_ar],
        num_rows: 8752
    })
})
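A structure like this is what the Hugging Face `datasets` library returns for a repository on the Hub. The following is a minimal sketch of loading one split and turning rows into source/target pairs, assuming `datasets` is installed and the Hub is reachable. The helper names `load_un_ar_en` and `to_pair` are illustrative, and the sample row in the comments is made up rather than taken from the corpus. Streaming is used by default so that the multi-gigabyte training split is not downloaded in full.

```python
def load_un_ar_en(split="dev", streaming=True):
    """Load one split from the Hugging Face Hub.

    Requires `pip install datasets` and network access. With streaming=True
    the result is an iterable dataset that fetches rows lazily.
    """
    from datasets import load_dataset  # imported lazily; only needed here

    return load_dataset(
        "ymoslem/UN-Arabic-English-Filtered",
        split=split,
        streaming=streaming,
    )


def to_pair(row, src="ar", tgt="en"):
    """Map a row with text_ar/text_en columns to a source/target pair."""
    return {"source": row[f"text_{src}"], "target": row[f"text_{tgt}"]}


# Usage (needs network access, so it is left commented out):
#
#     ds = load_un_ar_en("dev")
#     for row in ds.take(3):
#         print(to_pair(row))
#
# With a made-up row such as {"text_en": "Hello", "text_ar": "مرحبا"},
# to_pair(row) puts the Arabic text under "source" and the English text
# under "target"; swapping src and tgt reverses the direction.
```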
Citation
@inproceedings{eisele-chen-2010-multiun,
    title = "{M}ulti{UN}: A Multilingual Corpus from United Nation Documents",
    author = "Eisele, Andreas and Chen, Yu",
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Maegaard, Bente and Mariani, Joseph and Odijk, Jan and Piperidis, Stelios and Rosner, Mike and Tapias, Daniel",
    booktitle = "Proceedings of the Seventh International Conference on Language Resources and Evaluation ({LREC}'10)",
    month = may,
    year = "2010",
    address = "Valletta, Malta",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2010/pdf/686_Paper.pdf",
}
@inproceedings{ziemski-etal-2016-united,
    title = "The {U}nited {N}ations Parallel Corpus v1.0",
    author = "Ziemski, Micha{\l} and Junczys-Dowmunt, Marcin and Pouliquen, Bruno",
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Goggi, Sara and Grobelnik, Marko and Maegaard, Bente and Mariani, Joseph and Mazo, Helene and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L16-1561",
    pages = "3530--3534",
}
@INPROCEEDINGS{Tiedemann2012-OPUS,
    title = "{Parallel Data, Tools and Interfaces in {OPUS}}",
    booktitle = "{Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)}",
    author = "Tiedemann, J{\"o}rg",
    publisher = "European Language Resources Association (ELRA)",
    pages = "2214--2218",
    month = may,
    year = 2012,
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    address = "Istanbul, Turkey"
}



