Xdata

Name: Xdata
Creator: XiaoduoAILab
License: No description available

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/XiaoduoAILab/XmodelLM

Resource description:

Xdata is a self-built, balanced Chinese-English bilingual corpus, tuned for downstream task performance. It is used to pre-train the Xmodel-LM language model and contains approximately 2 trillion tokens. Despite the model's relatively small size, Xmodel-LM trained on this corpus performs competitively with other models.

Provider:

XiaoduoAILab
