LLM360/TxT360
Source: Hugging Face | Updated: 2025-05-26 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/LLM360/TxT360
Description:
TxT360 is a dataset for pre-training large language models. It is the first dataset to globally deduplicate 99 CommonCrawl snapshots together with 14 commonly used non-web data sources (FreeLaw, PG-19, and others). Its sources include CommonCrawl, papers, Wikipedia, FreeLaw (legal documents), DM Math (math problems), USPTO (patents), PG-19, HackerNews, Ubuntu IRC (chat logs), EuroParl, and StackExchange (Q&A). Filtering and deduplication are applied across all sources to keep the data high quality and diverse. The dataset is organized by data source type and comes with a detailed schema description, so pre-training teams can adjust the weighting of each source when assembling a training mix. It is intended to give those teams a large, high-quality open-source dataset for training strong models.
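To illustrate what "global" deduplication means here, the following is a minimal sketch that removes duplicates across all sources at once rather than within each source separately. It is not the TxT360 pipeline: the exact-hash matching, the `normalize` helper, the `global_dedup` function, and the toy corpus are all assumptions made for illustration, and a production pipeline would typically also need near-duplicate detection.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def global_dedup(sources: Dict[str, Iterable[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, document) pairs, keeping only the first copy of each
    document seen across ALL sources (hypothetical helper, exact match only)."""
    seen = set()  # one shared set of digests for every source
    for name, docs in sources.items():
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already kept from this or an earlier source
            seen.add(digest)
            yield name, doc


# Toy corpus (made up): the same sentence appears in two different sources.
toy_corpus = {
    "common_crawl": ["The court ruled in favor.", "Hello   world"],
    "freelaw": ["the court ruled in favor.", "A distinct opinion."],
}
kept = list(global_dedup(toy_corpus))
```

Because a single `seen` set is shared by every source, the copy in `freelaw` is dropped even though it is unique within that source. Deduplicating each source independently would have kept both copies.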
Provided by:
LLM360



