clean_mc4_it
Source: huggingface.co (indexed 2025-01-08)
Download link:
https://huggingface.co/datasets/gsarti/clean_mc4_it
Description:
A thoroughly cleaned version of the Italian portion of mC4, the multilingual
colossal, cleaned version of Common Crawl's web crawl corpus.
Based on the Common Crawl dataset: https://commoncrawl.org.
It builds on AllenAI's processed version of Google's mC4 dataset, with further cleaning
detailed in the repository README file.
Provider:
huggingface.co



