News-5m
Source: arXiv. Indexed: 2025-09-30
Download link:
http://data.statmt.org/news-commentary/v16/
Resource overview:
This dataset contains 5 million high-quality English sentences drawn from open-source news. It is used to build the unlabeled dataset for the knowledge distillation stage, and it is also relevant to the fine-tuning stage of the student model.
Provider:
Open-source news repositories



