ratishsp/rephrased-web-data-quality-study
Hugging Face | Updated: 2026-03-21 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/ratishsp/rephrased-web-data-quality-study
Description:
---
configs:
- config_name: faq
  data_files:
  - split: train
    path: faq/train.jsonl
- config_name: table
  data_files:
  - split: train
    path: table/train.jsonl
- config_name: tutorial
  data_files:
  - split: train
    path: tutorial/train.jsonl
- config_name: math
  data_files:
  - split: train
    path: math/train.jsonl
---
# Rephrased Web Data Quality Study
LLM-as-judge evaluation of ~4,000 examples from [HuggingFaceFW/finephrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase) (1,000 sampled per split, 86 dropped due to judge parse failures, 3,914 successfully evaluated).
**Judge**: Claude Sonnet 4.6 via OpenRouter | **Cost**: ~$45
## Quality Scores (1-5 scale)
| Metric (mean score) | FAQ (n=965) | Table (n=979) | Tutorial (n=976) | Math (n=994) |
|---|---|---|---|---|
| Faithfulness | 1.82 | 1.72 | 1.90 | 1.49 |
| Info preservation | 1.93 | 1.64 | 1.99 | 1.47 |
| Appropriateness | 3.54 | 2.87 | 2.48 | 1.67 |
| Format compliance | 2.18 | 1.71 | 2.28 | 1.42 |
In every split, over 80% of outputs score faithfulness ≤ 2 and over 86% contain at least one hallucination flagged by the judge.
Math is the weakest split: 92.2% of its outputs score faithfulness ≤ 2 and 94.5% score format compliance ≤ 2. Its low mean appropriateness (1.67) suggests that most source documents lack numerical content suitable for math word problems.
For Table, faithfulness averages 1.72/5 and 85.1% of outputs score format compliance ≤ 2.
## Selected Score Distributions
| Metric | FAQ | Table | Tutorial | Math |
|---|---|---|---|---|
| Faithfulness = 1 (lowest) | 35.4% | 45.0% | 31.8% | 61.4% |
| Faithfulness ≤ 2 | 85.9% | 84.7% | 80.9% | 92.2% |
| Faithfulness ≥ 4 (good) | 2.5% | 1.7% | 2.4% | 1.8% |
| Format compliance ≤ 2 | 56.7% | 85.1% | 56.7% | 94.5% |
| Has hallucinations | 87.8% | 88.5% | 94.2% | 86.3% |
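The shares in the table above can be recomputed from the rows themselves. The following is a minimal sketch, not the author's evaluation code: `summarize` is a hypothetical helper, rows are assumed to be dicts keyed by the field names in this card, and a row is assumed to count as hallucinated when its `hallucinations` list is non-empty (the card does not state the exact rule).

```python
def summarize(rows, metric="faithfulness"):
    """Summarize one judge metric over a list of row dicts.

    Assumption: a row "has hallucinations" when its `hallucinations`
    list is non-empty; the dataset card does not spell this rule out.
    """
    scores = [row[metric] for row in rows]
    n = len(scores)
    if n == 0:
        raise ValueError("no rows to summarize")
    return {
        "n": n,
        "mean": sum(scores) / n,
        "share_eq_1": sum(s == 1 for s in scores) / n,
        "share_le_2": sum(s <= 2 for s in scores) / n,
        "share_ge_4": sum(s >= 4 for s in scores) / n,
        "share_hallucinated": sum(bool(row["hallucinations"]) for row in rows) / n,
    }


# Toy rows for illustration only; these are not values from the dataset.
toy_rows = [
    {"faithfulness": 1, "hallucinations": ["invented date"]},
    {"faithfulness": 2, "hallucinations": []},
    {"faithfulness": 2, "hallucinations": ["invented name", "invented figure"]},
    {"faithfulness": 5, "hallucinations": []},
]
toy_summary = summarize(toy_rows)
```

Calling `summarize(rows, metric="format_compliance")` on the rows of one config would give the corresponding format compliance shares under the same assumptions.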
## Fields
Each row contains:
- `source_text`, `source_url`, `source_token_count`: the original FineWeb-Edu document
- `output_text`, `completion_tokens`, `finish_reason`: SmolLM2-1.7B's synthetic output
- `faithfulness`, `info_preservation`, `appropriateness`, `format_compliance`: judge scores (1-5)
- `faithfulness_issues`, `info_preservation_issues`, `appropriateness_issues`, `format_issues`: detailed judge reasoning
- `hallucinations`: list of specific hallucinations flagged
- `judge_error`: non-empty if the judge response failed to parse (these rows are excluded from this dataset)
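As a rough illustration of how these fields might be read, the sketch below parses JSONL text (the format implied by the `train.jsonl` paths in the config), skips any row with a non-empty `judge_error`, and checks that the four scores fall in the 1-5 range. `parse_rows` and `load_config` are hypothetical helper names, and the value types (integer scores, string reasoning, list of hallucinations) are assumptions rather than a documented schema.

```python
import json

SCORE_FIELDS = (
    "faithfulness",
    "info_preservation",
    "appropriateness",
    "format_compliance",
)


def parse_rows(jsonl_text):
    """Parse JSONL text into row dicts, keeping only successfully judged rows."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("judge_error"):
            continue  # judge response failed to parse; no usable scores
        for field in SCORE_FIELDS:
            if not 1 <= row[field] <= 5:
                raise ValueError(f"{field} out of range: {row[field]!r}")
        rows.append(row)
    return rows


def load_config(config_name):
    """Load one config ("faq", "table", "tutorial", "math") from the Hub.

    Requires the `datasets` package and network access, so it is imported
    lazily and not called in this sketch.
    """
    from datasets import load_dataset

    return load_dataset(
        "ratishsp/rephrased-web-data-quality-study", config_name, split="train"
    )
```

Since the published files already exclude rows with judge errors, the `judge_error` check is a defensive step rather than something the data is expected to trigger.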
## Sampling
For the FAQ, Table, and Tutorial splits, 1,000 examples per split were drawn by block sampling: 10 blocks of 100 rows at random offsets (seed=42).
For the Math split, 10 blocks of 100 rows were read at offsets [1000, 10000, 20000, ..., 90000] by streaming with `load_dataset`, because the datasets server API returned 501 errors for this config.
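Block sampling of this kind can be sketched as follows. This is an illustration under stated assumptions, not the script used for the study: the helper names are hypothetical, the offsets are assumed to be aligned to block boundaries so that blocks never overlap, and the random draws will not reproduce the study's actual offsets even with the same seed. `take_blocks` works on any iterable, so it also covers the streaming case with fixed offsets.

```python
import random
from itertools import islice


def random_block_offsets(total_rows, n_blocks=10, block_size=100, seed=42):
    """Pick n_blocks distinct, block-aligned start offsets.

    Assumption: blocks are aligned to multiples of block_size, which
    guarantees they do not overlap. The card does not say how the
    original offsets were drawn.
    """
    rng = random.Random(seed)
    block_ids = rng.sample(range(total_rows // block_size), n_blocks)
    return sorted(block_id * block_size for block_id in block_ids)


def take_blocks(rows, offsets, block_size=100):
    """Collect block_size rows at each offset in a single pass over rows."""
    it = iter(rows)
    position = 0  # index of the next row the iterator will yield
    sampled = []
    for start in sorted(offsets):
        if start < position:
            raise ValueError("blocks overlap")
        skip = start - position
        # Advance the iterator by `skip` rows without storing them.
        next(islice(it, skip, skip), None)
        block = list(islice(it, block_size))
        sampled.extend(block)
        position = start + len(block)
    return sampled


# Illustration on a stand-in "dataset" of 10,000 integer rows.
offsets = random_block_offsets(total_rows=10_000)
sample = take_blocks(range(10_000), offsets)
```

Reading the stream once and skipping between blocks avoids random access, which is useful when the data can only be consumed sequentially.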
## Discussion
See [discussion post on FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase/discussions/5).
Provider:
ratishsp



