C4Corpus
Dataset Overview
Dataset Name
- C4Corpus: a preprocessed CommonCrawl dataset. The name stands for "Creative Commons from Common Crawl".
Dataset Description
- C4Corpus is a multilingual, web-scale corpus released under a free license.
Dataset Uses
- Used for processing tasks such as language detection and near-duplicate removal.
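To illustrate what near-duplicate removal can look like in practice, here is a minimal Python sketch of SimHash-style fingerprinting. It is an illustration only and is not taken from the C4Corpus pipeline: the function names, the 3-word shingle size, the 64-bit fingerprint width, and the distance threshold of 3 are all assumptions chosen for this example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, word tokens only)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: fall back to a single shingle, or nothing if empty.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, bits=64):
    """Compute a SimHash-style fingerprint of the text as an integer."""
    counts = [0] * bits
    for sh in shingles(text):
        # Use the first 8 bytes of MD5 as a stable 64-bit hash of the shingle.
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=3):
    """Treat two texts as near-duplicates if their fingerprints differ in few bits.

    max_distance is an assumed threshold, not a value from the dataset's authors.
    """
    return hamming(simhash(a), simhash(b)) <= max_distance
```

For example, `is_near_duplicate("Hello World foo bar", "hello   world FOO bar")` is true, because both strings reduce to the same word shingles. At web scale, fingerprints would typically be grouped or indexed rather than compared pairwise.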
Dataset Access
- The C4Corpus data can be accessed via S3.
Citation
@InProceedings{Habernal.et.al.2016.LREC,
  author    = {Habernal, Ivan and Zayed, Omnia and Gurevych, Iryna},
  title     = {{C4Corpus: Multilingual Web-size Corpus with Free License}},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  pages     = {914--922},
  month     = {May},
  year      = {2016},
  address   = {Portoro\v{z}, Slovenia},
  publisher = {European Language Resources Association (ELRA)},
  editor    = {Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  isbn      = {978-2-9517408-9-1},
  url       = {http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf}
}




