kothasuhas/dclm-baseline-1.0_subset_1M
Source: Hugging Face. Updated 2024-12-29; indexed 2025-02-15.
Download link:
https://hf-mirror.com/datasets/kothasuhas/dclm-baseline-1.0_subset_1M
Description:
This is a web-text dataset with 1 million training examples. Each example has a text field holding the text content and a metadata field holding page-level metadata such as page size, type, and date.
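The description above can be explored with a short script. What follows is a minimal sketch, not an official loader: it assumes the third-party Hugging Face datasets package is installed, that the data is exposed under a split named "train", and that each record is a dictionary with the text and metadata fields described above. The helper names describe_record and first_records are hypothetical, and no specific metadata keys are assumed.

```python
from itertools import islice

# Repository ID taken from the listing title.
REPO_ID = "kothasuhas/dclm-baseline-1.0_subset_1M"


def describe_record(record):
    """Return (text length in characters, sorted metadata keys) for one record."""
    text = record.get("text", "")
    metadata = record.get("metadata") or {}
    return len(text), sorted(metadata)


def first_records(n=3):
    """Stream the first n records without downloading the whole dataset.

    Assumption: the split is named "train".
    """
    from datasets import load_dataset  # third-party: pip install datasets

    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    # Requires network access to the Hugging Face Hub (or a mirror).
    for record in first_records():
        length, keys = describe_record(record)
        print(f"{length} characters; metadata keys: {keys}")
```

Streaming is used so that a quick look at the schema does not require fetching all 1 million records first.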
Provider:
kothasuhas



