
ratishsp/rephrased-web-data-quality-study

Hugging Face | Updated 2026-03-21 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ratishsp/rephrased-web-data-quality-study
Description:
---
configs:
- config_name: faq
  data_files:
  - split: train
    path: faq/train.jsonl
- config_name: table
  data_files:
  - split: train
    path: table/train.jsonl
- config_name: tutorial
  data_files:
  - split: train
    path: tutorial/train.jsonl
- config_name: math
  data_files:
  - split: train
    path: math/train.jsonl
---

# Rephrased Web Data Quality Study

LLM-as-judge evaluation of ~4,000 examples from [HuggingFaceFW/finephrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase) (1,000 sampled per split, 86 dropped due to judge parse failures, 3,914 successfully evaluated).

**Judge**: Claude Sonnet 4.6 via OpenRouter | **Cost**: ~$45

## Quality Scores (1-5 scale)

| Metric | FAQ (n=965) | Table (n=979) | Tutorial (n=976) | Math (n=994) |
|---|---|---|---|---|
| Faithfulness | 1.82 | 1.72 | 1.90 | 1.49 |
| Info preservation | 1.93 | 1.64 | 1.99 | 1.47 |
| Appropriateness | 3.54 | 2.87 | 2.48 | 1.67 |
| Format compliance | 2.18 | 1.71 | 2.28 | 1.42 |

Over 80% of outputs score faithfulness ≤ 2. Over 87% contain hallucinations detected by the judge.

Math has the lowest mean score on all four metrics: 92.2% of Math outputs score faithfulness ≤ 2 and 94.5% fail format compliance (score ≤ 2). The low appropriateness score for Math (1.67) suggests that most source documents lack numerical content suitable for math word problems.

For Table, faithfulness averages 1.72/5 and about 85% of outputs score ≤ 2 on format compliance.
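The threshold figures quoted above (share of outputs with a score ≤ 2, share with hallucinations) can be recomputed from the per-row judge fields. The following is a minimal sketch, assuming each row is a dict with integer scores under keys such as `faithfulness` and a `hallucinations` list, as the card's field list describes. The helper names `summarize_scores` and `hallucination_share` are hypothetical, the toy rows are invented for illustration, and the card does not say that its own percentages were computed this way.

```python
from statistics import mean


def summarize_scores(rows, metric, low_threshold=2):
    """Return (mean score, share of rows scoring <= low_threshold) for one metric."""
    scores = [row[metric] for row in rows]
    if not scores:
        raise ValueError("no rows to summarize")
    low_share = sum(s <= low_threshold for s in scores) / len(scores)
    return mean(scores), low_share


def hallucination_share(rows):
    """Share of rows whose `hallucinations` list is non-empty."""
    if not rows:
        raise ValueError("no rows to summarize")
    return sum(bool(row["hallucinations"]) for row in rows) / len(rows)


# Toy rows, invented for illustration only; they are NOT taken from the dataset.
toy_rows = [
    {"faithfulness": 1, "hallucinations": ["invented date"]},
    {"faithfulness": 2, "hallucinations": ["invented name"]},
    {"faithfulness": 2, "hallucinations": []},
    {"faithfulness": 4, "hallucinations": []},
]

avg, low = summarize_scores(toy_rows, "faithfulness")
print(f"mean={avg:.2f}, share<=2={low:.0%}, "
      f"hallucinated={hallucination_share(toy_rows):.0%}")
```

To run this on the real data, the rows could come from `datasets.load_dataset("ratishsp/rephrased-web-data-quality-study", "faq", split="train")`, using the repository ID and config names given in the card.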

## Some Score Distributions

| Metric | FAQ | Table | Tutorial | Math |
|---|---|---|---|---|
| Faithfulness = 1 (lowest) | 35.4% | 45.0% | 31.8% | 61.4% |
| Faithfulness ≤ 2 | 85.9% | 84.7% | 80.9% | 92.2% |
| Faithfulness ≥ 4 (good) | 2.5% | 1.7% | 2.4% | 1.8% |
| Format compliance ≤ 2 | 56.7% | 85.1% | 56.7% | 94.5% |
| Has hallucinations | 87.8% | 88.5% | 94.2% | 86.3% |

## Fields

Each row contains:

- `source_text`, `source_url`, `source_token_count`: the original FineWeb-Edu document
- `output_text`, `completion_tokens`, `finish_reason`: SmolLM2-1.7B's synthetic output
- `faithfulness`, `info_preservation`, `appropriateness`, `format_compliance`: judge scores (1-5)
- `faithfulness_issues`, `info_preservation_issues`, `appropriateness_issues`, `format_issues`: detailed judge reasoning
- `hallucinations`: list of specific hallucinations flagged
- `judge_error`: non-empty if the judge response failed to parse (these rows are excluded from this dataset)

## Sampling

For the FAQ, Table, and Tutorial splits, 1,000 examples per split were drawn by block sampling: 10 blocks of 100 at random offsets (seed=42).

For the Math split, 10 blocks of 100 were read at offsets [1000, 10000, 20000, ..., 90000] via streaming with `load_dataset`, because the datasets server API returned 501 errors for this config.

## Discussion

See [discussion post on FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase/discussions/5).
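
The block sampling described under Sampling can be sketched as follows. This is an illustration under stated assumptions, not the author's script: the card only says "random offsets (seed=42)", so aligning blocks to multiples of the block size (which rules out overlap) is an assumption, and both function names are hypothetical. `take_blocks` reads blocks in a single forward pass, which is how fixed offsets would have to be read from a streamed split.

```python
import random
from itertools import islice


def random_block_offsets(n_rows, n_blocks=10, block_size=100, seed=42):
    """Draw sorted start offsets for n_blocks non-overlapping blocks.

    Assumption: blocks are aligned to multiples of block_size; the card
    does not specify how its random offsets were drawn.
    """
    n_slots = n_rows // block_size
    if n_blocks > n_slots:
        raise ValueError("not enough rows for the requested blocks")
    rng = random.Random(seed)
    return sorted(slot * block_size for slot in rng.sample(range(n_slots), n_blocks))


def take_blocks(stream, offsets, block_size=100):
    """Collect block_size consecutive items starting at each offset.

    Consumes the iterable once, front to back, so it also works on
    streamed data. Offsets must be sorted and blocks must not overlap.
    """
    it = iter(stream)
    position = 0  # number of items consumed from the stream so far
    sampled = []
    for offset in offsets:
        if offset < position:
            raise ValueError("offsets must be sorted and non-overlapping")
        skip = offset - position
        # Skip ahead to the block start, then read the block itself.
        block = list(islice(it, skip, skip + block_size))
        sampled.extend(block)
        position = offset + len(block)
    return sampled


# Demo on a stand-in stream of integers rather than real dataset rows.
offsets = random_block_offsets(n_rows=5_000, n_blocks=3, block_size=100)
sample = take_blocks(range(5_000), offsets, block_size=100)
print(len(sample))  # 300
```

With the `datasets` library, the `stream` argument could be the iterable returned by `load_dataset(..., streaming=True)`, and the fixed Math offsets listed in the card would be passed as `offsets` in place of the random ones.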
Provider:
ratishsp