sanranjan/fineweb-CC-MAIN-2024-10-1B-en
Hugging Face. Updated 2024-09-15; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/sanranjan/fineweb-CC-MAIN-2024-10-1B-en
Description:
Each record in this dataset has the following fields: text, ID, dump, URL, date, file path, language, language score, and token count. The dataset stores web text with its metadata and could be used for natural language processing tasks such as text classification, sentiment analysis, or language model training. It contains a single training split with 1,500,000 samples and a total size of 5,554,137,895 bytes.
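To make the field list and size figures concrete, here is a minimal Python sketch. The attribute names (snake_case forms of the fields listed above) and their types are assumptions for illustration and are not confirmed by this listing; only the sample count and byte total come from the description.

```python
from dataclasses import dataclass, fields

# Figures quoted in the description.
NUM_SAMPLES = 1_500_000
TOTAL_BYTES = 5_554_137_895


@dataclass
class Record:
    """One dataset row. Names and types are assumed, not taken from the listing."""
    text: str
    id: str
    dump: str
    url: str
    date: str
    file_path: str
    language: str
    language_score: float
    token_count: int


def average_bytes_per_sample(total_bytes: int = TOTAL_BYTES,
                             num_samples: int = NUM_SAMPLES) -> float:
    """Return the mean size of one sample in bytes."""
    return total_bytes / num_samples


FIELD_NAMES = [f.name for f in fields(Record)]
AVG_BYTES = average_bytes_per_sample()

if __name__ == "__main__":
    print(FIELD_NAMES)
    print(f"{AVG_BYTES:.2f} bytes per sample")
```

Dividing the total size by the sample count gives roughly 3.7 kB per sample, which is a quick sanity check on how much storage and memory a full download would need.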
Provided by:
sanranjan



