mixture-vitae/mv_long
Source: Hugging Face | Updated: 2026-04-24 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/mixture-vitae/mv_long
Description:
The dataset combines several types of data, grouped into the following categories: GitHub, Books, Math + reasoning, Code, StackExchange, Wikipedia, Commit diffs, Few-shot, Nemo, Scientific papers, Web (C4), Multilingual, and Alignment. Code-related data makes up the largest portion (44.1%), with GitHub as the most significant contributor (30.1%). For each category, the dataset lists the specific files it contains, along with their sizes and percentage shares.
Provider:
mixture-vitae



