
LHF/escorpius-mr

Source: Hugging Face | Updated: 2023-05-11 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/LHF/escorpius-mr
Resource description:
---
license: cc-by-nc-nd-4.0
language:
- af
- ar
- bn
- ca
- cs
- da
- de
- el
- eu
- fa
- fi
- fr
- gl
- hi
- hr
- it
- ja
- ko
- mt
- nl
- no
- oc
- pa
- pl
- pt
- ro
- sl
- sr
- sv
- tr
- uk
- ur
multilinguality:
- multilingual
size_categories:
- 100B<n<1T
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
---

# esCorpius Multilingual Raw

In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, these datasets have important shortcomings for languages other than English: they are either too small or of low quality because of sub-optimal cleaning and deduplication.

In this repository, we introduce esCorpius-m, a multilingual crawling corpus obtained from nearly 1 Pb of Common Crawl data. For some of the languages covered, it is the most extensive corpus with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius-m has been released under the CC BY-NC-ND 4.0 license.

# Usage

```
from datasets import load_dataset
dataset = load_dataset('LHF/escorpius-m', split='train', streaming=True)
```

# Intended use

This corpus is the *raw version* of the esCorpius-m corpus. It can be used for benchmarking deduplication tools.
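Because the raw corpus is intended for benchmarking deduplication tools, a simple baseline to compare such tools against is exact paragraph-level deduplication by hashing. The following is a minimal sketch under stated assumptions, not the pipeline used by the authors: `normalize`, `dedup_paragraphs` and `duplicate_rate` are hypothetical helper names, paragraphs are assumed to be newline-separated, and only exact matches (after whitespace normalization) are detected.

```python
import hashlib
from typing import Iterable, List, Tuple


def normalize(paragraph: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(paragraph.split())


def dedup_paragraphs(documents: Iterable[str]) -> Tuple[List[str], int, int]:
    """Exact paragraph-level deduplication across documents.

    Returns (deduplicated documents, paragraphs seen, paragraphs dropped).
    Assumption: paragraphs are separated by newlines.
    """
    seen = set()
    total = 0
    dropped = 0
    result = []
    for doc in documents:
        kept = []
        for raw in doc.split("\n"):
            para = normalize(raw)
            if not para:
                continue  # ignore empty lines
            total += 1
            digest = hashlib.sha1(para.encode("utf-8")).digest()
            if digest in seen:
                dropped += 1  # an identical paragraph was already kept
                continue
            seen.add(digest)
            kept.append(para)
        # Document boundaries are preserved: a fully duplicated
        # document becomes an empty string instead of disappearing.
        result.append("\n".join(kept))
    return result, total, dropped


def duplicate_rate(total: int, dropped: int) -> float:
    """Fraction of paragraphs that were exact duplicates."""
    return dropped / total if total else 0.0


# Tiny in-memory sample standing in for corpus documents.
sample = [
    "Cookie notice\nFirst article body.",
    "Cookie  notice\nSecond article body.",
    "First article body.",
]
deduped, total, dropped = dedup_paragraphs(sample)
print(deduped)
print(f"{dropped}/{total} paragraphs dropped ({duplicate_rate(total, dropped):.0%})")
```

A tool under test can be run on the same input and its output compared with this baseline. Near-duplicate detection would need a different technique and is outside the scope of this sketch.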

## Other corpora

- esCorpius multilingual corpus (deduplicated): https://huggingface.co/datasets/LHF/escorpius-m
- esCorpius original *Spanish-only* corpus (deduplicated): https://huggingface.co/datasets/LHF/escorpius

## Citation

Link to paper: https://www.isca-speech.org/archive/pdfs/iberspeech_2022/gutierrezfandino22_iberspeech.pdf / https://arxiv.org/abs/2206.15147

Cite this work:

```
@inproceedings{gutierrezfandino22_iberspeech,
  author={Asier Gutiérrez-Fandiño and David Pérez-Fernández and Jordi Armengol-Estapé and David Griol and Zoraida Callejas},
  title={{esCorpius: A Massive Spanish Crawling Corpus}},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  year=2022,
  booktitle={Proc. IberSPEECH 2022},
  pages={126--130},
  doi={10.21437/IberSPEECH.2022-26}
}
```

## Disclaimer

We did not perform any kind of filtering and/or censorship on the corpus. We expect users to do so by applying their own methods. We are not liable for any misuse of the corpus.
Provider:
LHF
Summary of original information

esCorpius Multilingual Raw dataset overview

Basic information

  • License: CC BY-NC-ND 4.0
  • Languages: multiple languages, including but not limited to af, ar, bn, ca, cs, da, de, el, eu, fa, fi, fr, gl, hi, hr, it, ja, ko, mt, nl, no, oc, pa, pl, pt, ro, sl, sr, sv, tr, uk, ur
  • Multilinguality: multilingual
  • Size: 100B<n<1T
  • Source datasets: original

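Task categories

  • Text generation
  • Fill-mask

Use cases

  • Benchmarking deduplication tools

Dataset versions

  • Raw version: for testing deduplication tools
  • Deduplicated version: suitable for more precise language processing tasks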

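Dataset characteristics

  • Extracted from nearly 1 Pb of Common Crawl data, with high-quality cleaning and deduplication
  • Keeps both the source web page URL and the WARC shard origin URL in order to comply with EU regulations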

Usage

    from datasets import load_dataset
    dataset = load_dataset('LHF/escorpius-m', split='train', streaming=True)
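
The call above names the deduplicated `LHF/escorpius-m` repository, while this page describes the raw corpus at `LHF/escorpius-mr`. The sketch below shows one way to peek at the first few records of the raw corpus in streaming mode without downloading it in full. It rests on assumptions: `head` and `open_raw_stream` are hypothetical helper names, the raw repository is assumed to expose a `train` split just as the deduplicated one does, and the record field names are not documented on this page, so the code only lists whatever keys are present.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Repository ID taken from this page's download link.
RAW_REPO_ID = "LHF/escorpius-mr"


def head(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the first n records of a (possibly very long) stream."""
    return list(islice(records, n))


def open_raw_stream():
    """Open the raw corpus lazily.

    Requires `pip install datasets` and network access.
    Assumption: the repository has a 'train' split; it may also
    require a configuration name, which this page does not state.
    """
    from datasets import load_dataset  # imported here so `head` works offline
    return load_dataset(RAW_REPO_ID, split="train", streaming=True)


# Offline demonstration with stand-in records.
print(head(iter([{"text": "a"}, {"text": "b"}, {"text": "c"}]), 2))

# With network access, the same helper can inspect the real stream:
#   for record in head(open_raw_stream(), 3):
#       print(sorted(record.keys()))
```

With `streaming=True`, records are fetched lazily as they are iterated, which matters for a corpus in the 100B<n<1T size category.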

Citation

  • Paper: https://www.isca-speech.org/archive/pdfs/iberspeech_2022/gutierrezfandino22_iberspeech.pdf / https://arxiv.org/abs/2206.15147
  • Citation format:

@inproceedings{gutierrezfandino22_iberspeech, author={Asier Gutiérrez-Fandiño and David Pérez-Fernández and Jordi Armengol-Estapé and David Griol and Zoraida Callejas}, title={{esCorpius: A Massive Spanish Crawling Corpus}}, year=2022, booktitle={Proc. IberSPEECH 2022}, pages={126--130}, doi={10.21437/IberSPEECH.2022-26} }
