Akaashiiii/TFPD

Name: Akaashiiii/TFPD
Creator: Akaashiiii
Published: 2026-04-21 14:18:29
License: No description available

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/Akaashiiii/TFPD

Resource description:

---
arxiv_id: 2604.02176
license: mit
language:
- en
task_categories:
- text-generation
- question-answering
- translation
pretty_name: Textual Frequency Paired Dataset (TFPD)
size_categories:
- 1K<n<10K
tags:
- mathematical-reasoning
- machine-translation
- frequency-benchmark
- high-frequency
- low-frequency
---

# Textual Frequency Paired Dataset (TFPD)

## Overview

This dataset accompanies the paper **“Adam's Law: Textual Frequency Law on Large Language Models”** (arXiv:2604.02176). It is designed to validate the **Textual Frequency Law (TFL)**, **Textual Frequency Distillation (TFD)**, and **Curriculum Textual Frequency Training (CTFT)** methods on two core tasks:

- **Mathematical Reasoning (MR)** – using GSM8K and CSQA
- **Machine Translation (MT)** – using FLORES-200

For each original sentence, we used GPT-4o-mini to generate multiple paraphrases, then selected the **highest-frequency** and **lowest-frequency** versions based on sentence-level frequency estimation. All pairs were manually verified by three human annotators to ensure semantic equivalence.

## Dataset Structure

The dataset is organised into **JSONL files** (one JSON object per line, with a `"text"` field containing the sentence). Below is the complete file list as described in the paper:

### Mathematical Reasoning (MR)

| File | Source | Frequency | # Sentences |
|------|--------|-----------|-------------|
| `gsm8k-highfrequency.jsonl` | GSM8K | High | 738 |
| `gsm8k-lowfrequency.jsonl` | GSM8K | Low | 738 |
| `csqa-highfrequency.jsonl` | CSQA | High | 526 |
| `csqa-lowfrequency.jsonl` | CSQA | Low | 526 |

### Machine Translation (MT) – FLORES-200

Example files (full list available in the paper appendix):

| File | Language (ISO code) | Frequency |
|------|---------------------|-----------|
| `eng_Latn-highfrequency.jsonl` | English | High |
| `eng_Latn-lowfrequency.jsonl` | English | Low |
| `kea_Latn-highfrequency.jsonl` | English → Kabuverdianu | High |
| `kea_Latn-lowfrequency.jsonl` | English → Kabuverdianu | Low |
| `pag_Latn-highfrequency.jsonl` | English → Pangasinan | High |
| `pag_Latn-lowfrequency.jsonl` | English → Pangasinan | Low |

> All JSONL files follow the same format: `{"text": "sentence to translate or solve"}`.

## Usage Example

Load a specific file with Hugging Face `datasets`:

```python
from datasets import load_dataset

# Load GSM8K high-frequency math problems.
# Files loaded via data_files land in a single "train" split,
# so request it explicitly to get an indexable Dataset.
dataset = load_dataset(
    "Akaashiiii/TFPD",
    data_files="gsm8k-highfrequency.jsonl",
    split="train",
)
print(dataset[0]["text"])
```
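
The file naming pattern and record format described above can also be handled without the `datasets` library. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `tfpd_filename` and `read_texts` are hypothetical, and the sample lines are made-up placeholders rather than real dataset rows. It assumes only what the card states, namely that files are named `<source>-<high|low>frequency.jsonl` and that each line is a JSON object with a `"text"` field.

```python
import json


def tfpd_filename(source: str, frequency: str) -> str:
    """Build a file name such as 'gsm8k-highfrequency.jsonl' (hypothetical helper)."""
    if frequency not in ("high", "low"):
        raise ValueError("frequency must be 'high' or 'low'")
    return f"{source}-{frequency}frequency.jsonl"


def read_texts(lines):
    """Parse JSONL lines and return the "text" field of each non-empty line."""
    return [json.loads(line)["text"] for line in lines if line.strip()]


# Placeholder lines standing in for the contents of a downloaded file.
sample = ['{"text": "first sentence"}', '{"text": "second sentence"}']

print(tfpd_filename("gsm8k", "high"))
print(read_texts(sample))
```

With a downloaded file, `read_texts` could be applied to an open file handle, for example `read_texts(open(tfpd_filename("csqa", "low"), encoding="utf-8"))`.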

Provided by:

Akaashiiii
