finerweb-10bt
Source: ModelScope community (魔搭社区). Updated 2025-10-09; indexed 2025-08-09.
Download link:
https://modelscope.cn/datasets/TurkuNLP/finerweb-10bt
Resource description:
# Dataset Card for FinerWeb-10BT
## Dataset Details
### Dataset Description
This dataset extends the FineWeb-10BT sample (10 billion tokens) by adding a quality score for each line of text. The line-level scores come from an LLM-based filtering pipeline that distinguishes high-quality from low-quality content.
- **Curated by:** Erik Henriksson\*, Otto Tarkka\*, Filip Ginter (University of Turku; \*equal contribution)
- **Language(s):** English
- **License:** apache-2.0
### Dataset Sources
- **Repository:** https://huggingface.co/datasets/TurkuNLP/finerweb-10bt
- **Model:** https://huggingface.co/TurkuNLP/finerweb-quality-classifier
- **Paper:** https://arxiv.org/abs/2501.07314
## Dataset Structure
The dataset follows the original FineWeb-10BT structure, with an additional `line_quality` key for each document. This key holds a list of floating-point scores (0.0 to 1.0), one per line of the document, where lines are obtained by splitting the document's text on newlines. Scores closer to 1.0 indicate clean, natural-language text; lower scores indicate content such as formatting artifacts, copyright notices, or navigation elements.
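As an illustration of how `line_quality` lines up with the text, here is a minimal sketch that keeps only the lines scoring at or above a cutoff. It assumes the document text lives under a `text` key (as in FineWeb); the function name `keep_high_quality_lines`, the 0.5 threshold, and the toy record are all hypothetical and chosen only for demonstration, not taken from the dataset or the paper.

```python
from typing import Any, Dict


def keep_high_quality_lines(doc: Dict[str, Any], threshold: float = 0.5) -> str:
    """Return the document text with low-scoring lines removed.

    Assumes doc["text"] is the raw text and doc["line_quality"] has one
    score per line after splitting the text on newlines.
    The default threshold is an arbitrary illustrative value.
    """
    lines = doc["text"].split("\n")
    scores = doc["line_quality"]
    if len(lines) != len(scores):
        raise ValueError(
            f"expected {len(lines)} scores, found {len(scores)}"
        )
    kept = [line for line, score in zip(lines, scores) if score >= threshold]
    return "\n".join(kept)


# Toy record, invented for this example (not real dataset content).
example_doc = {
    "text": (
        "Home | About | Contact\n"
        "The river froze early that winter.\n"
        "© 2019 All rights reserved."
    ),
    "line_quality": [0.05, 0.97, 0.10],
}

cleaned = keep_high_quality_lines(example_doc, threshold=0.5)
print(cleaned)
```

A suitable threshold depends on the downstream use, so it is worth inspecting a sample of retained and dropped lines before fixing a value.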
## Dataset Creation
### Source Data
#### Data Collection and Processing
Quality scores were generated through a pipeline that:
1. Used GPT-4o mini to label a 20,000-document sample
2. Trained a DeBERTa-v3 classifier on the labeled data
3. Applied the classifier to generate quality scores for each line in the full dataset
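Step 3 can be pictured roughly as follows. This is only a sketch of the per-line scoring loop: `toy_score_line` is a hypothetical stand-in for the trained classifier (it merely measures the share of letters and spaces in a line), and `annotate_document` is an invented helper name. Neither reflects the authors' actual implementation.

```python
from typing import Any, Callable, Dict


def annotate_document(
    doc: Dict[str, Any], score_line: Callable[[str], float]
) -> Dict[str, Any]:
    """Return a copy of doc with a `line_quality` list, one score per line."""
    lines = doc["text"].split("\n")
    scores = [float(score_line(line)) for line in lines]
    return {**doc, "line_quality": scores}


def toy_score_line(line: str) -> float:
    """Hypothetical scorer: fraction of characters that are letters or spaces.

    A real pipeline would call the trained classifier here instead.
    """
    if not line:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in line)
    return good / len(line)


# Toy input, invented for this example.
doc = {"text": "Hello world\n>>> | | |"}
annotated = annotate_document(doc, toy_score_line)
print(annotated["line_quality"])
```

Passing the scorer in as a callable keeps the line-splitting logic separate from the model, so the same loop would work with any function that maps a line to a score in the 0.0 to 1.0 range.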
## Bias, Risks, and Limitations
The quality scores inherit some biases from the LLMs used in the labeling process. Users should note that the distinction between high and low-quality content can be subjective, and the scores should be interpreted as guidelines rather than absolute measures.
## Citation
```bibtex
@misc{henriksson2025finerweb10btrefiningwebdata,
title={FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering},
author={Erik Henriksson and Otto Tarkka and Filip Ginter},
year={2025},
eprint={2501.07314},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.07314},
}
```
Provider: maas
Created: 2025-08-08



