gsarti/clean_mc4_it

Name: gsarti/clean_mc4_it
Creator: gsarti
Published: 2024-06-17 13:20:30
License: no description provided

Source: Hugging Face. Updated: 2024-06-17. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/gsarti/clean_mc4_it

Summary:

The Clean Italian mC4 dataset is a thoroughly cleaned version of the Italian split of mC4, the multilingual colossal, cleaned version of Common Crawl's web crawl corpus. It contains over 200GB of cleaned Italian text, estimated at more than 41 billion words, making it the largest available corpus for the Italian language. Preprocessing removes documents containing bad words, filters out sentences that do not meet certain criteria, and removes documents that do not meet length and language requirements. Each record has URL, text, and timestamp fields, and the dataset is provided in splits of different sizes for ease of use. The dataset may have significant implications for the development of Italian language technology, but users should be aware of the biases it may contain.

Provider:

gsarti

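Original information summary

Dataset overview

Dataset name

Name: Clean Italian mC4 🇮🇹
Alias: mC4_it

Dataset attributes

Language: Italian (it)
License: ODC-BY
Multilinguality: monolingual
Task category: text generation
Task ID: language modeling
Papers with Code ID: mc4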

Dataset structure

Data instances: each record contains the fields timestamp, url, and text
Data splits: training and validation sets, available in several sizes (tiny, small, medium, large, full)
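
As a minimal sketch of how these fields and size configurations might be used, the snippet below assumes the Hugging Face `datasets` library is installed and that the configuration names match the sizes listed above. The helper names `has_expected_fields` and `load_split` are hypothetical and are not part of the dataset itself.

```python
from typing import Any, Mapping

# Size configurations and field names as listed in the dataset summary.
CONFIGS = ("tiny", "small", "medium", "large", "full")
EXPECTED_FIELDS = ("timestamp", "url", "text")


def has_expected_fields(record: Mapping[str, Any]) -> bool:
    """Return True if the record carries all three documented fields."""
    return all(field in record for field in EXPECTED_FIELDS)


def load_split(config: str = "tiny", split: str = "validation"):
    """Stream one split of one size configuration (hypothetical helper).

    Assumes the `datasets` library is installed and the Hub is reachable.
    Some library versions may require extra arguments for this dataset.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    from datasets import load_dataset  # imported lazily; needs network access

    # streaming=True avoids downloading the whole split up front.
    return load_dataset("gsarti/clean_mc4_it", config, split=split, streaming=True)


if __name__ == "__main__":
    first = next(iter(load_split("tiny", "validation")))
    print(has_expected_fields(first), first["url"])
```

Streaming is used here because even the smaller configurations contain millions of documents; a full download may be preferable when the data is read repeatedly.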

Dataset size

tiny: 1M<n<10M
small: 10M<n<100M
medium: 10M<n<100M
large: 10M<n<100M
full: 100M<n<1B

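Dataset creation and preprocessing

Source data: based on the Common Crawl dataset
Preprocessing: the data is cleaned with a dedicated pipeline, including removal of inappropriate content and of text in specific formats
Tools: LangDetect for language identification; processing was run on 96 CPU cores for efficiency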
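
The document-level checks mentioned in the summary (bad words, length, language) can be sketched as follows. This is an illustration only, not the authors' pipeline: the threshold, the bad-word list, and the function name `keep_document` are hypothetical, and the language detector is passed in as a callable so that any detector can be plugged in.

```python
from typing import Callable, Iterable


def keep_document(
    text: str,
    detect_lang: Callable[[str], str],
    bad_words: Iterable[str] = (),
    min_chars: int = 200,  # hypothetical threshold, not taken from the source
    target_lang: str = "it",
) -> bool:
    """Return True if the document passes all three illustrative checks."""
    # Length check: drop documents that are too short.
    if len(text) < min_chars:
        return False
    # Bad-word check on naive whitespace tokens (punctuation is not stripped).
    tokens = set(text.lower().split())
    if any(word.lower() in tokens for word in bad_words):
        return False
    # Language check: keep only documents identified as the target language.
    return detect_lang(text) == target_lang
```

If the `langdetect` package is installed, its `detect` function could be supplied as `detect_lang`. That detector is probabilistic, so results on short texts may vary between runs unless its seed is fixed.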

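Considerations for use

Social impact: provides large-scale Italian text to support language model research and application development
Discussion of biases: the dataset may reflect biases present in internet content; it is suitable for studying data bias and its effects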

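Additional information

Dataset curator: Gabriele Sarti
Contact: gabriele.sarti996@gmail.com
Citation information: citations should include the relevant publications of both the dataset author and the original mC4 authors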
