LisaMegaWatts/wikitext-103-quality-scored

Name: LisaMegaWatts/wikitext-103-quality-scored
Creator: LisaMegaWatts
Published: 2026-02-26 00:44:15
License: Apache 2.0

Source: Hugging Face. Updated: 2026-02-26. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/LisaMegaWatts/wikitext-103-quality-scored

Description:

---
license: apache-2.0
language:
- en
task_categories:
- text-generation
tags:
- wikipedia
- wikitext
- quality-scored
- curriculum-learning
- character-level
- slm
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2592530
  - name: validation
    num_examples: 288060
---

# WikiText-103 Quality-Scored Corpus

Cleaned and quality-scored subset of [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext) (curated Wikipedia Good and Featured articles), prepared for character-level language model training with curriculum learning support.

## Dataset Description

This dataset contains cleaned text from WikiText-103, with each batch scored on multiple quality dimensions for curriculum-based training. The text has been lowercased and filtered to an ASCII character set suitable for character-level tokenization.

### Source

- **Original dataset**: [Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext) (wikitext-103-v1 split)
- **Content**: ~29,000 Wikipedia Good and Featured articles covering science, history, literature, philosophy, geography, technology, music, law, mathematics, and more

### Cleaning Applied

- Moses-style detokenization (recombined subword artifacts from original WikiText tokenization)
- `<unk>` token removal (WikiText-103 replaces rare words with `<unk>`)
- Lowercased to ASCII character set: `a-z .,;:?!'"()-`
- Gutenberg/boilerplate header/footer stripping
- Whitespace normalization and empty line removal

### Quality Scoring

Each batch of ~200 articles is scored using heuristic quality metrics:

| Metric | Description |
|--------|-------------|
| `quality_score` | Weighted composite (vocab diversity, word diversity, length, repetition) |
| `citation_score` | Attribution density from citation signals (references, footnotes, bibliographic patterns) |
| `avg_mtld` | Measure of Textual Lexical Diversity (higher = richer vocabulary) |
| `avg_flesch` | Flesch Reading Ease (lower = more complex text) |
| `pop_culture_density` | Pop culture keyword density (lower = more academic) |
| `academic_density` | Academic/scholarly vocabulary density |
| `topic_tags` | Detected topic categories per batch |

### Quality Tiers

Batches are classified into tiers for curriculum scheduling:

| Tier | Criteria | Count |
|------|----------|-------|
| Gold | High MTLD (>76), high citation score (>0.35), low pop culture | 0 |
| Silver | Moderate MTLD (>70), moderate citations (>0.25) | 145 |
| Bronze | Below silver thresholds | 0 |
| Excluded | High pop culture density (celebrity bios, reality TV, tabloid content) | 0 |

> **Note**: All batches score as silver because each batch file mixes ~200 diverse articles, diluting both pop-culture and academic signals. For finer-grained tier separation, per-article scoring is recommended.

### Dataset Statistics

| Split | Examples | Size |
|-------|----------|------|
| Train | 2,592,530 | 448 MB |
| Validation | 288,060 | 50 MB |

- **Average quality score**: 0.3967
- **Deduplication**: 2.1% duplicate chunks removed (62,280 of 2,942,872)
- **90/10 train/validation split** (shuffled)

### Topic Distribution

Topics detected across 145 batch files:

| Topic | Batches |
|-------|---------|
| Literature | 145 |
| History | 145 |
| Science | 144 |
| Law | 72 |
| Technology | 49 |
| Music | 48 |
| Geography | 35 |
| Mathematics | 23 |
| Philosophy | 22 |
| Economics | 21 |

## Additional Files

- **`wikitext_manifest.jsonl`**: Per-batch quality scores, tier assignments, topic tags, and metadata. Each line is a JSON object with fields: `batch_file`, `chunk_count`, `tier`, `quality_score`, `citation_score`, `avg_mtld`, `avg_flesch`, `topic_tags`, `pop_culture_density`, `academic_density`.
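
The criteria in the Quality Tiers table can be applied to a single manifest-style record, for example when re-scoring at per-article granularity as the note suggests. The following is a minimal sketch, not the scoring code used to build this dataset. The MTLD and citation thresholds come from the table. The two pop-culture cut-offs (`POP_EXCLUDE`, `POP_GOLD_MAX`), the order in which the rules are checked, the lowercase tier labels, and the sample records are all assumptions made for illustration.

```python
from typing import Mapping

# Thresholds stated in the Quality Tiers table.
GOLD_MTLD, GOLD_CITATION = 76.0, 0.35
SILVER_MTLD, SILVER_CITATION = 70.0, 0.25

# Hypothetical cut-offs: the card gives no numeric pop-culture thresholds.
POP_EXCLUDE = 0.10   # assumed meaning of "high pop culture density"
POP_GOLD_MAX = 0.02  # assumed meaning of "low pop culture"


def classify_tier(record: Mapping[str, float]) -> str:
    """Assign a tier to one record using the manifest field names."""
    mtld = record["avg_mtld"]
    citation = record["citation_score"]
    pop = record["pop_culture_density"]
    # Assumption: exclusion is checked before any other tier.
    if pop > POP_EXCLUDE:
        return "excluded"
    if mtld > GOLD_MTLD and citation > GOLD_CITATION and pop <= POP_GOLD_MAX:
        return "gold"
    if mtld > SILVER_MTLD and citation > SILVER_CITATION:
        return "silver"
    return "bronze"


# Made-up records for illustration; these are not values from the real manifest.
examples = [
    {"avg_mtld": 80.0, "citation_score": 0.40, "pop_culture_density": 0.01},
    {"avg_mtld": 72.0, "citation_score": 0.30, "pop_culture_density": 0.03},
    {"avg_mtld": 65.0, "citation_score": 0.20, "pop_culture_density": 0.03},
    {"avg_mtld": 80.0, "citation_score": 0.40, "pop_culture_density": 0.50},
]
tiers = [classify_tier(r) for r in examples]
```

With per-article scores in place of batch averages, a function like this would be expected to spread records across more than one tier, since the dilution described in the note no longer applies.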

## Usage

```python
import json

from datasets import load_dataset
from huggingface_hub import hf_hub_download

ds = load_dataset("LisaMegaWatts/wikitext-103-quality-scored")

# Training data
for example in ds["train"]:
    text = example["text"]

# Load quality manifest for curriculum weighting.
# load_dataset does not place the manifest in the working directory, so fetch it
# first (assumes the file sits at the repository root).
manifest_path = hf_hub_download(
    repo_id="LisaMegaWatts/wikitext-103-quality-scored",
    filename="wikitext_manifest.jsonl",
    repo_type="dataset",
)
manifest = []
with open(manifest_path) as f:
    for line in f:
        manifest.append(json.loads(line))
```

## Related Datasets

- [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) - Curated philosophy texts (Aristotle, Plato, Kant, etc.)
- [LisaMegaWatts/classical-humanities-corpus](https://huggingface.co/datasets/LisaMegaWatts/classical-humanities-corpus) - Extended classical humanities collection

## License

Apache 2.0 (following WikiText-103 licensing)
