LisaMegaWatts/wikitext-103-quality-scored
Source: Hugging Face. Updated 2026-02-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LisaMegaWatts/wikitext-103-quality-scored
Resource description:
---
license: apache-2.0
language:
- en
task_categories:
- text-generation
tags:
- wikipedia
- wikitext
- quality-scored
- curriculum-learning
- character-level
- slm
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2592530
  - name: validation
    num_examples: 288060
---
# WikiText-103 Quality-Scored Corpus
Cleaned and quality-scored subset of [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext) (curated Wikipedia Good and Featured articles), prepared for character-level language model training with curriculum learning support.
## Dataset Description
This dataset contains cleaned text from WikiText-103, with each batch scored on multiple quality dimensions for curriculum-based training. The text has been lowercased and filtered to an ASCII character set suitable for character-level tokenization.
### Source
- **Original dataset**: [Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext) (`wikitext-103-v1` configuration)
- **Content**: ~29,000 Wikipedia Good and Featured articles covering science, history, literature, philosophy, geography, technology, music, law, mathematics, and more
### Cleaning Applied
- Moses-style detokenization (recombined subword artifacts from original WikiText tokenization)
- `<unk>` token removal (WikiText-103 replaces rare words with `<unk>`)
- Lowercased and restricted to a small ASCII character set: `a-z .,;:?!'"()-`
- Gutenberg/boilerplate header/footer stripping
- Whitespace normalization and empty line removal
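The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function name `clean_text` and the order of operations are assumptions, and detokenization and boilerplate stripping are left out.

```python
import re

# Anything outside the allowed set listed above: a-z, space, and .,;:?!'"()-
DISALLOWED = re.compile("[^a-z .,;:?!'\"()-]")


def clean_text(raw: str) -> str:
    """Illustrative cleaning pass (hypothetical helper, not the dataset's code)."""
    cleaned_lines = []
    for line in raw.splitlines():
        line = line.replace("<unk>", " ").lower()  # drop placeholder tokens, lowercase
        line = re.sub(r"\s+", " ", line)           # tabs and runs of whitespace become one space
        line = DISALLOWED.sub("", line)            # remove characters outside the allowed set
        line = re.sub(r" +", " ", line).strip()    # collapse gaps left behind by removals
        if line:                                   # skip lines that end up empty
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


print(clean_text("The <unk> Cat , 42 !\n\n  Second   LINE\t here "))
```

Note that digits are not in the allowed set, so a pass like this removes them along with all non-ASCII characters.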
### Quality Scoring
Each batch of ~200 articles is scored using heuristic quality metrics:
| Metric | Description |
|--------|-------------|
| `quality_score` | Weighted composite (vocab diversity, word diversity, length, repetition) |
| `citation_score` | Attribution density from citation signals (references, footnotes, bibliographic patterns) |
| `avg_mtld` | Measure of Textual Lexical Diversity (higher = richer vocabulary) |
| `avg_flesch` | Flesch Reading Ease (lower = more complex text) |
| `pop_culture_density` | Pop culture keyword density (lower = more academic) |
| `academic_density` | Academic/scholarly vocabulary density |
| `topic_tags` | Detected topic categories per batch |
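For reference, the standard Flesch Reading Ease formula is given below. That `avg_flesch` follows exactly this formulation, averaged over a batch, is an assumption; the card does not say how sentences or syllables are counted.

$$
\text{FRE} = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}
$$

Longer sentences and more syllables per word both lower the score, which is why lower values indicate more complex text.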
### Quality Tiers
Batches are classified into tiers for curriculum scheduling:
| Tier | Criteria | Count |
|------|----------|-------|
| Gold | High MTLD (>76), high citation score (>0.35), low pop culture | 0 |
| Silver | Moderate MTLD (>70), moderate citations (>0.25) | 145 |
| Bronze | Below silver thresholds | 0 |
| Excluded | High pop culture density (celebrity bios, reality TV, tabloid content) | 0 |
> **Note**: All batches score as silver because each batch file mixes ~200 diverse articles, diluting both pop-culture and academic signals. For finer-grained tier separation, per-article scoring is recommended.
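The tier rules in the table can be sketched as a small classifier. The MTLD and citation thresholds come from the table; the numeric cutoffs `pop_low` and `pop_high` are hypothetical, since the card does not define "low" or "high" pop culture density, and the order in which the rules are checked is also an assumption.

```python
def classify_tier(
    avg_mtld: float,
    citation_score: float,
    pop_culture_density: float,
    pop_low: float = 0.01,   # hypothetical cutoff for "low pop culture"
    pop_high: float = 0.05,  # hypothetical cutoff for "high pop culture density"
) -> str:
    """Assign a curriculum tier to one batch (illustrative sketch only)."""
    if pop_culture_density > pop_high:
        return "excluded"
    if avg_mtld > 76 and citation_score > 0.35 and pop_culture_density < pop_low:
        return "gold"
    if avg_mtld > 70 and citation_score > 0.25:
        return "silver"
    return "bronze"


print(classify_tier(avg_mtld=72.0, citation_score=0.30, pop_culture_density=0.02))
```

Applied per article rather than per batch, the same rules would be expected to spread results across more tiers, as the note above suggests.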
### Dataset Statistics
| Split | Examples | Size |
|-------|----------|------|
| Train | 2,592,530 | 448 MB |
| Validation | 288,060 | 50 MB |
- **Average quality score**: 0.3967
- **Deduplication**: 2.1% duplicate chunks removed (62,280 of 2,942,872)
- **90/10 train/validation split** (shuffled)
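A deduplication step followed by a shuffled 90/10 split could look like the sketch below. It assumes exact-match deduplication and a fixed random seed; the card does not state which matching criterion or seed was actually used, and `dedup_and_split` is a hypothetical helper.

```python
import random


def dedup_and_split(chunks, val_fraction=0.1, seed=0):
    """Exact-match dedup, then a shuffled train/validation split (illustrative only)."""
    unique = list(dict.fromkeys(chunks))   # keeps the first occurrence of each chunk
    rng = random.Random(seed)              # seeded so the split is reproducible
    rng.shuffle(unique)
    n_val = int(len(unique) * val_fraction)
    return unique[n_val:], unique[:n_val]  # (train, validation)


chunks = ["a", "b", "a", "c", "d", "e", "f", "g", "h", "i", "j", "b"]
train, validation = dedup_and_split(chunks)
print(len(train), len(validation))
```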
### Topic Distribution
Topics detected across 145 batch files:
| Topic | Batches |
|-------|---------|
| Literature | 145 |
| History | 145 |
| Science | 144 |
| Law | 72 |
| Technology | 49 |
| Music | 48 |
| Geography | 35 |
| Mathematics | 23 |
| Philosophy | 22 |
| Economics | 21 |
## Additional Files
- **`wikitext_manifest.jsonl`**: Per-batch quality scores, tier assignments, topic tags, and metadata. Each line is a JSON object with fields: `batch_file`, `chunk_count`, `tier`, `quality_score`, `citation_score`, `avg_mtld`, `avg_flesch`, `topic_tags`, `pop_culture_density`, `academic_density`.
## Usage
```python
import json

from datasets import load_dataset

ds = load_dataset("LisaMegaWatts/wikitext-103-quality-scored")

# Iterate over the training data
for example in ds["train"]:
    text = example["text"]

# Load the quality manifest for curriculum weighting
# (assumes wikitext_manifest.jsonl has been downloaded to the working directory)
manifest = []
with open("wikitext_manifest.jsonl") as f:
    for line in f:
        manifest.append(json.loads(line))
```
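One way the manifest records might feed a curriculum is sketched here. The field names `batch_file` and `quality_score` are the documented ones; the sample records, the file names in them, and the choice of highest-score-first ordering are made up for illustration.

```python
def curriculum_order(manifest, key="quality_score", descending=True):
    """Return batch file names sorted by a manifest score (illustrative)."""
    ranked = sorted(manifest, key=lambda rec: rec[key], reverse=descending)
    return [rec["batch_file"] for rec in ranked]


def sampling_weights(manifest, key="quality_score"):
    """Normalize a score column into sampling probabilities that sum to 1."""
    total = sum(rec[key] for rec in manifest)
    return {rec["batch_file"]: rec[key] / total for rec in manifest}


# Made-up records using the documented field names; real values come from the manifest.
sample = [
    {"batch_file": "batch_000.txt", "quality_score": 0.41},
    {"batch_file": "batch_001.txt", "quality_score": 0.38},
    {"batch_file": "batch_002.txt", "quality_score": 0.40},
]

print(curriculum_order(sample))
print(sampling_weights(sample))
```

Because every batch is in the silver tier, a continuous score such as `quality_score` or `avg_mtld` is likely more useful for ordering than the `tier` field.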
## Related Datasets
- [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) - Curated philosophy texts (Aristotle, Plato, Kant, etc.)
- [LisaMegaWatts/classical-humanities-corpus](https://huggingface.co/datasets/LisaMegaWatts/classical-humanities-corpus) - Extended classical humanities collection
## License
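Apache 2.0, as declared in the dataset metadata. The upstream WikiText-103 dataset itself is distributed under the Creative Commons Attribution-ShareAlike license.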
Provider:
LisaMegaWatts



