
Akaashiiii/TFPD

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Akaashiiii/TFPD
Resource description:
---
arxiv_id: 2604.02176
license: mit
language:
- en
task_categories:
- text-generation
- question-answering
- translation
pretty_name: Textual Frequency Paired Dataset (TFPD)
size_categories:
- 1K<n<10K
tags:
- mathematical-reasoning
- machine-translation
- frequency-benchmark
- high-frequency
- low-frequency
---

# Textual Frequency Paired Dataset (TFPD)

## Overview

This dataset accompanies the paper **“Adam's Law: Textual Frequency Law on Large Language Models”** (arXiv:2604.02176). It is designed to validate the **Textual Frequency Law (TFL)**, **Textual Frequency Distillation (TFD)**, and **Curriculum Textual Frequency Training (CTFT)** methods on two core tasks:

- **Mathematical Reasoning (MR)** – using GSM8K and CSQA
- **Machine Translation (MT)** – using FLORES‑200

For each original sentence, we used GPT‑4o‑mini to generate multiple paraphrases, then selected the **highest‑frequency** and **lowest‑frequency** versions based on sentence‑level frequency estimation. All pairs were manually verified by three human annotators to ensure semantic equivalence.

## Dataset Structure

The dataset is organised into **JSONL files** (one JSON object per line, with a `"text"` field containing the sentence). The files described in the paper are listed below.

### Mathematical Reasoning (MR)

| File | Source | Frequency | # Sentences |
|------|--------|-----------|-------------|
| `gsm8k-highfrequency.jsonl` | GSM8K | High | 738 |
| `gsm8k-lowfrequency.jsonl` | GSM8K | Low | 738 |
| `csqa-highfrequency.jsonl` | CSQA | High | 526 |
| `csqa-lowfrequency.jsonl` | CSQA | Low | 526 |

### Machine Translation (MT) – FLORES‑200

Example files (full list available in the paper appendix):

| File | Language (ISO code) | Frequency |
|------|---------------------|-----------|
| `eng_Latn-highfrequency.jsonl` | English | High |
| `eng_Latn-lowfrequency.jsonl` | English | Low |
| `kea_Latn-highfrequency.jsonl` | English → Kabuverdianu | High |
| `kea_Latn-lowfrequency.jsonl` | English → Kabuverdianu | Low |
| `pag_Latn-highfrequency.jsonl` | English → Pangasinan | High |
| `pag_Latn-lowfrequency.jsonl` | English → Pangasinan | Low |

> All JSONL files follow the same format: `{"text": "sentence to translate or solve"}`.

## Usage Example

Load a specific file with Hugging Face `datasets`:

```python
from datasets import load_dataset

# Load GSM8K high-frequency math problems.
# A single file passed via data_files is placed in the default "train" split.
dataset = load_dataset("Akaashiiii/TFPD", data_files="gsm8k-highfrequency.jsonl")
print(dataset["train"][0]["text"])
```
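
The high- and low-frequency files come in matching pairs with equal sentence counts, so one natural use is to read both files of a pair and compare the two paraphrases side by side. The following is a minimal sketch using only the Python standard library on locally downloaded files. The helper names `read_texts` and `pair_by_line` are hypothetical, and the code assumes that line *i* of the high-frequency file and line *i* of the low-frequency file paraphrase the same source sentence; the dataset card does not state this alignment explicitly, so it should be checked before relying on it.

```python
import json
from pathlib import Path


def read_texts(path):
    """Return the "text" field of every non-empty line in a JSONL file."""
    texts = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                texts.append(json.loads(line)["text"])
    return texts


def pair_by_line(high_path, low_path):
    """Zip two JSONL files line by line into (high, low) tuples.

    Assumption (not confirmed by the dataset card): both files list
    their sentences in the same order.
    """
    high = read_texts(high_path)
    low = read_texts(low_path)
    if len(high) != len(low):
        raise ValueError(f"line counts differ: {len(high)} vs {len(low)}")
    return list(zip(high, low))
```

After downloading the two GSM8K files, calling `pair_by_line("gsm8k-highfrequency.jsonl", "gsm8k-lowfrequency.jsonl")` would return a list of `(high, low)` string tuples. The length check fails early if the two files cannot be aligned one to one.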
Provided by:
Akaashiiii