Akaashiiii/TFPD
Source: Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Akaashiiii/TFPD
Description:
---
arxiv_id: 2604.02176
license: mit
language:
- en
task_categories:
- text-generation
- question-answering
- translation
pretty_name: Textual Frequency Paired Dataset (TFPD)
size_categories:
- 1K<n<10K
tags:
- mathematical-reasoning
- machine-translation
- frequency-benchmark
- high-frequency
- low-frequency
---
# Textual Frequency Paired Dataset (TFPD)
## Overview
This dataset accompanies the paper **“Adam's Law: Textual Frequency Law on Large Language Models”** (arXiv:2604.02176).
It is designed to validate the **Textual Frequency Law (TFL)**, **Textual Frequency Distillation (TFD)**, and **Curriculum Textual Frequency Training (CTFT)** methods on two core tasks:
- **Mathematical Reasoning (MR)** – using GSM8K and CSQA
- **Machine Translation (MT)** – using FLORES‑200
For each original sentence, we used GPT‑4o‑mini to generate multiple paraphrases, then selected the **highest‑frequency** and **lowest‑frequency** versions based on sentence‑level frequency estimation. All pairs were manually verified by three human annotators to ensure semantic equivalence.
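
The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's estimator: it assumes sentence-level frequency is the mean smoothed log relative frequency of a sentence's words in a reference corpus. The function names `sentence_frequency` and `pick_extremes`, the toy corpus, and the example paraphrases are all hypothetical.

```python
import math
from collections import Counter

def sentence_frequency(sentence, word_counts, total):
    """Mean add-one-smoothed log relative frequency of the words in a sentence.

    ASSUMPTION: this scoring rule is illustrative; the paper's
    sentence-level frequency estimator may differ.
    """
    words = sentence.lower().split()
    if not words:
        return float("-inf")
    vocab = len(word_counts)
    log_probs = [
        math.log((word_counts.get(w, 0) + 1) / (total + vocab + 1))
        for w in words
    ]
    return sum(log_probs) / len(log_probs)

def pick_extremes(paraphrases, word_counts):
    """Return (highest-frequency, lowest-frequency) paraphrase."""
    total = sum(word_counts.values())
    ranked = sorted(
        paraphrases,
        key=lambda s: sentence_frequency(s, word_counts, total),
    )
    return ranked[-1], ranked[0]

# Toy reference corpus and paraphrases (made up for illustration).
word_counts = Counter("the cat sat on the mat the dog sat on the rug".split())
paraphrases = [
    "the cat sat on the mat",
    "the feline reposed upon the mat",
]
high, low = pick_extremes(paraphrases, word_counts)
print("high:", high)
print("low:", low)
```

With a real corpus, `word_counts` would come from large-scale token counts; the ranking logic stays the same.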
## Dataset Structure
The dataset is organised into **JSONL files** (one JSON object per line, with a `"text"` field containing the sentence).
The files are listed below as described in the paper; the MR table is complete, while the MT table shows example files only:
### Mathematical Reasoning (MR)
| File | Source | Frequency | # Sentences |
|------|--------|-----------|-----------------------------|
| `gsm8k-highfrequency.jsonl` | GSM8K | High | 738 |
| `gsm8k-lowfrequency.jsonl` | GSM8K | Low | 738 |
| `csqa-highfrequency.jsonl` | CSQA | High | 526 |
| `csqa-lowfrequency.jsonl` | CSQA | Low | 526 |
### Machine Translation (MT) – FLORES‑200
Example files (full list available in the paper appendix):
| File | Language (code in file name) | Frequency |
|------|---------------------|-----------|
| `eng_Latn-highfrequency.jsonl` | English | High |
| `eng_Latn-lowfrequency.jsonl` | English | Low |
| `kea_Latn-highfrequency.jsonl` | English → Kabuverdianu | High |
| `kea_Latn-lowfrequency.jsonl` | English → Kabuverdianu | Low |
| `pag_Latn-highfrequency.jsonl` | English → Pangasinan | High |
| `pag_Latn-lowfrequency.jsonl` | English → Pangasinan | Low |
> All JSONL files follow the same format: `{"text": "sentence to translate or solve"}`.
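
Given the file naming and record format above, high- and low-frequency versions can be read and paired with the standard library alone. The sketch below makes one assumption that this card does not state explicitly: that line *i* of a high-frequency file corresponds to line *i* of the matching low-frequency file. The helper names (`tfpd_filename`, `read_texts`, `pair_texts`) and the sample records are hypothetical.

```python
import json

def tfpd_filename(prefix, frequency):
    """Build a name following the `<prefix>-<high|low>frequency.jsonl` pattern."""
    if frequency not in ("high", "low"):
        raise ValueError("frequency must be 'high' or 'low'")
    return f"{prefix}-{frequency}frequency.jsonl"

def read_texts(lines):
    """Parse JSONL lines and return the "text" field of each non-empty line."""
    return [json.loads(line)["text"] for line in lines if line.strip()]

def pair_texts(high_lines, low_lines):
    """Zip high/low sentences by position.

    ASSUMPTION: rows are aligned across the two files; verify this
    before relying on the pairing.
    """
    high, low = read_texts(high_lines), read_texts(low_lines)
    if len(high) != len(low):
        raise ValueError("high and low files have different lengths")
    return list(zip(high, low))

# Made-up sample records standing in for the contents of
# tfpd_filename("gsm8k", "high") and tfpd_filename("gsm8k", "low").
high_lines = ['{"text": "Tom has 3 apples and buys 2 more. How many does he have?"}']
low_lines = ['{"text": "Tom possesses 3 apples and procures 2 more. What is his total?"}']

pairs = pair_texts(high_lines, low_lines)
for high_text, low_text in pairs:
    print(high_text, "|", low_text)
```

For local files, pass an open file handle as `lines`, for example `read_texts(open(path, encoding="utf-8"))`.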
## Usage Example
Load a specific split with Hugging Face `datasets`:
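```python
from datasets import load_dataset
# Load GSM8K high-frequency math problems.
# With `data_files`, the data lands in the default "train" split; request it
# explicitly so the result is an indexable Dataset rather than a DatasetDict.
dataset = load_dataset("Akaashiiii/TFPD", data_files="gsm8k-highfrequency.jsonl", split="train")
print(dataset[0]["text"])
```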
Provider:
Akaashiiii



