MichielBuisman/Leesplank-vloeiend-nl-curriculum
Source: Hugging Face. Updated 2026-03-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/MichielBuisman/Leesplank-vloeiend-nl-curriculum
Description:
---
license: cc-by-4.0
task_categories:
- text-classification
- text-generation
language:
- nl
tags:
- fluency
- levenshtein
- wikipedia
- Leesplank
- Granite
---
# Leesplank NL — Levenshtein Annotated (Checkpoint 1)
A reshaped and edit-distance-annotated version of the
[UWV/Leesplank_NL_wikipedia_simplifications_preprocessed](https://huggingface.co/datasets/UWV/Leesplank_NL_wikipedia_simplifications_preprocessed)
dataset, prepared as the first checkpoint of a Dutch Fluency LoRA project targeting
IBM Granite 4.0 3B Dense.
The original dataset contains ~2.7M *paragraph* pairs: a Dutch Wikipedia source paragraph
(`prompt`) and a synthetically simplified version (`result`). This checkpoint reshapes
those pairs into one text per row, labels each text as simple (synthetic) or complex
(original), and annotates every row with the *raw* Levenshtein edit distance between the
two texts of its source pair.
---
## What this dataset adds
The upstream Leesplank dataset is already valuable. This checkpoint adds three things:
**1. One text per row.** The original pair structure (`prompt` / `result` columns) is
unpacked into a flat list of individual texts with an `is_simple` boolean label. This
makes the dataset directly usable for embedding, classification, perplexity measurement,
and curriculum learning without any preprocessing.
**2. Absolute Levenshtein distance annotation.** Every row carries `lev_distance`: the
raw character-level edit distance between the complex and simple version of its source
pair. Absolute rather than normalized (relative) distance is used deliberately: for
curriculum learning, what matters is the volume of character-level changes the model
must learn to produce. A 500-character rewrite represents more learning signal than a
5-character one regardless of the original text length. Normalizing by length would
obscure this: a near-total rewrite of a short text would score higher than a moderate
edit of a long one, even when the long edit changes far more characters. Absolute
distance is therefore the signal used here for ordering training difficulty.
- `lev_distance = 0` — the text was already plain Dutch (only 82 such rows exist in this
dataset — these are genuinely rare cases where no simplification was needed at all)
- low values — minor edits: word substitutions, short deletions
- high values — heavy rewrites or structural changes
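To make the absolute-versus-normalized argument concrete, here is a minimal sketch in plain Python. It is not the code used to build the dataset (that was DuckDB's `levenshtein`). The two toy pairs are hypothetical, and the `normalized` helper is an assumption added only for contrast. Python compares Unicode code points; whether that matches DuckDB's behaviour for non-ASCII text has not been verified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: insert, delete, substitute, each costing 1."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized(a: str, b: str) -> float:
    """Distance divided by the longer length (hypothetical helper, for contrast only)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical toy pairs, not taken from the dataset.
short_pair = ("Hij at.", "Zij dronk.")
long_pair = (
    "Amsterdam is de hoofdstad van Nederland en ligt in Noord-Holland.",
    "Amsterdam is de hoofdstad van Nederland.",
)

for name, pair in (("short", short_pair), ("long", long_pair)):
    print(name, levenshtein(*pair), round(normalized(*pair), 3))
```

The short pair has the smaller absolute distance but the larger normalized one, so the two metrics would place these pairs in opposite order in a curriculum.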
**3. Sorted by edit distance.** Rows are ordered ascending by `lev_distance`, which
makes curriculum learning straightforward: training on easy examples first (low edit
distance) before progressing to harder rewrites.
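One way to turn the ascending order into a curriculum is to cut the rows into consecutive stages. The sketch below is an illustration only: the function name `curriculum_stages`, the number of stages, and the toy rows are assumptions, not part of the dataset or the project's training code.

```python
from typing import Iterator


def curriculum_stages(rows: list[dict], n_stages: int) -> Iterator[list[dict]]:
    """Yield consecutive, roughly equal-sized slices of rows ordered by lev_distance."""
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    # Sorting leaves the order unchanged when rows already arrive sorted, as in this dataset.
    ordered = sorted(rows, key=lambda row: row["lev_distance"])
    size, extra = divmod(len(ordered), n_stages)
    start = 0
    for stage in range(n_stages):
        end = start + size + (1 if stage < extra else 0)  # spread the remainder over early stages
        yield ordered[start:end]
        start = end


# Hypothetical toy rows following the dataset schema.
toy_rows = [
    {"text": f"tekst {d}", "is_simple": True, "lev_distance": d}
    for d in (5, 0, 12, 40, 3, 7, 300)
]
stages = list(curriculum_stages(toy_rows, n_stages=3))
for index, stage in enumerate(stages):
    print(index, [row["lev_distance"] for row in stage])
```

Equal-sized stages are only one possible design. Stage boundaries at fixed `lev_distance` thresholds would work as well and keep rows with the same distance together.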
---
## Schema
| Column | Type | Description |
|---|---|---|
| `text` | string | One Dutch text — either the complex or simple version of a paragraph pair |
| `is_simple` | boolean | `true` = simplified text (originally the `result` column); `false` = complex source text (originally `prompt`) |
| `lev_distance` | int32 | Raw Levenshtein edit distance between the complex/simple pair this text belongs to |
---
## Row counts
| Metric | Count |
|---|---|
| Total rows | 5,389,028 |
| `is_simple = true` rows | 2,694,555 |
| `is_simple = false` rows | 2,694,473 |
| `lev_distance = 0` rows | 82 |
The simple and complex counts are not perfectly equal because rows where
`lev_distance = 0 AND is_simple = false` were removed — these are complex-labeled texts
that are identical to their simplified counterpart, making them noise rather than signal.
The 82 retained `lev_distance = 0` rows are all `is_simple = true`: texts that were
already written in plain Dutch and required no simplification.
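The counts in the table can be cross-checked with a few lines of arithmetic. The sketch assumes that the number of source pairs equals the number of `is_simple = true` rows, which is what the description above implies. The `keep` predicate is a hypothetical restatement of the removal rule.

```python
pairs = 2_694_555          # one simplified text per source pair (assumption, see above)
removed_complex = 82       # complex texts identical to their simplified counterpart

simple_rows = pairs
complex_rows = pairs - removed_complex
total_rows = simple_rows + complex_rows

print(simple_rows, complex_rows, total_rows)


def keep(row: dict) -> bool:
    """Removal rule: drop complex-labeled rows whose pair has edit distance 0."""
    return not (row["lev_distance"] == 0 and not row["is_simple"])
```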
---
## Intended uses
**This dataset is not structured for text simplification fine-tuning.** The original
complex/simple pairs have been deliberately unpacked into individual texts. If you need
paired prompt/result rows for seq2seq or instruction-tuning, use the upstream
[UWV/Leesplank_NL_wikipedia_simplifications_preprocessed](https://huggingface.co/datasets/UWV/Leesplank_NL_wikipedia_simplifications_preprocessed)
dataset instead.
What this dataset *is* designed for:
- **Dutch fluency curriculum training** — the primary intended use. Feed individual texts
to a small language model ordered by `lev_distance` to progressively expose it to
increasing levels of Dutch complexity. The goal is building fluency as a base
capability, not teaching a task. This is a non-standard training approach aimed at
small models (sub-4B parameters) that benefit from carefully ordered exposure to a
language register gradient.
- **Readability classification** — binary classification of Dutch text complexity using
the `is_simple` label, useful for probing or for training a complexity-aware
classification head.
- **Perplexity-based data selection** — stratified sampling by `lev_distance` bucket
before running model-specific perplexity scoring, as part of a difficulty-aware data
selection pipeline.
- **Dutch language modeling** — a large flat corpus of Dutch Wikipedia-derived text
spanning a wide register range, usable for causal or masked language modeling on Dutch.
- **Dutch embeddings research** — 5M paragraphs is a substantial Dutch corpus, and the
complexity gradient makes it particularly interesting for probing whether embedding
spaces capture readability as a geometric property. Because each text comes from a
complex/simple pair at a known edit distance, you can measure whether complex/simple pairs
cluster together and whether Levenshtein distance correlates with distance in embedding
space. That is a concrete, quantifiable readability probe that most Dutch corpora cannot
support. Note that the schema has no pair identifier, so matching a text to its
counterpart in practice means joining back to the upstream dataset.
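The stratified sampling mentioned in the list above could look roughly like the following. This is a sketch under assumptions: the bucket edges, the per-bucket sample size, and the function name `stratified_sample` are all hypothetical, and it holds every row in memory, which may not suit the full dataset.

```python
import random
from bisect import bisect_left
from collections import defaultdict


def stratified_sample(rows, edges, per_bucket, seed=0):
    """Draw up to per_bucket rows from each lev_distance bucket.

    edges are inclusive upper bounds in ascending order; values above the
    last edge fall into one extra bucket.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    buckets = defaultdict(list)
    for row in rows:
        buckets[bisect_left(edges, row["lev_distance"])].append(row)
    sample = []
    for key in sorted(buckets):
        members = buckets[key]
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample


# Hypothetical toy rows; real rows would come from the Parquet files.
toy_rows = [{"text": f"tekst {d}", "lev_distance": d} for d in range(100)]
picked = stratified_sample(toy_rows, edges=[9, 49], per_bucket=5)
print(sorted(row["lev_distance"] for row in picked))
```

The sampled rows can then be passed to whichever perplexity scorer is used, so that each difficulty band is represented regardless of how many rows it contains.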
---
## Out-of-scope uses
- **Text simplification fine-tuning** — the paired structure required for this has been
removed. This is by design, not an oversight.
- The simplified texts are **synthetic** — generated, not written by humans. Do not
treat them as gold-standard human simplifications.
- This dataset is Dutch only and is not suitable for multilingual tasks without
additional filtering.
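- `lev_distance` is a character-level metric. It correlates with simplification effort
but is not a semantic measure: a short deletion can have low edit distance while
substantially changing meaning. Deleting "niet " ("not") from a sentence, for example,
costs only five character edits yet inverts its meaning.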
---
## Source dataset
**UWV/Leesplank_NL_wikipedia_simplifications_preprocessed**
Published by UWV (Uitvoeringsinstituut Werknemersverzekeringen, the Dutch Employee
Insurance Agency). The original dataset pairs Dutch Wikipedia paragraphs with
synthetically simplified versions intended for readers with lower literacy levels
("leesplank" = reading primer).
Please refer to the upstream dataset page for its license and full documentation.
---
## Processing steps
1. Downloaded all three splits (`train`, `val`, `test`) from Hugging Face
2. Combined splits into a single DuckDB table
3. Computed `levenshtein(prompt, result)` natively in DuckDB for all ~2.7M pairs
4. Reshaped from pair rows to single-text rows via `UNION ALL` of prompt and result
5. Exported to Parquet with ZSTD compression, row group size 100,000
All processing used DuckDB on a local Windows machine. The 82 rows where `lev_distance = 0 AND is_simple = false` were removed as noise. Their simplified counterparts, the 82 `is_simple = true` rows with `lev_distance = 0`, are retained.
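The shape of steps 3 and 4, plus the noise removal, can be sketched with Python's built-in `sqlite3` module. This is a stand-in, not the author's DuckDB script. SQLite has no native `levenshtein`, so a Python implementation is registered under that name. The table names `pairs`, `scored` and `texts` and the toy data are hypothetical, and the Parquet export is omitted.

```python
import sqlite3


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance, registered below as a SQL function."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)  # DuckDB ships this natively

# Hypothetical toy pairs standing in for the combined upstream splits.
conn.execute("CREATE TABLE pairs (prompt TEXT, result TEXT)")
conn.executemany("INSERT INTO pairs VALUES (?, ?)", [
    ("De kat zit op de mat.", "De kat zit op de mat."),  # identical pair
    ("Hij nuttigde de maaltijd.", "Hij at het eten."),
    ("Zij is arts.", "Zij is dokter."),
])

# Step 3: one distance per pair.
conn.execute("""
    CREATE TABLE scored AS
    SELECT prompt, result, levenshtein(prompt, result) AS lev_distance FROM pairs
""")
# Step 4: unpack each pair into two single-text rows.
conn.execute("""
    CREATE TABLE texts AS
    SELECT prompt AS text, 0 AS is_simple, lev_distance FROM scored
    UNION ALL
    SELECT result AS text, 1 AS is_simple, lev_distance FROM scored
""")
# Noise removal: drop complex rows that are identical to their simplification.
conn.execute("DELETE FROM texts WHERE lev_distance = 0 AND is_simple = 0")

rows = conn.execute(
    "SELECT text, is_simple, lev_distance FROM texts ORDER BY lev_distance, is_simple"
).fetchall()
for row in rows:
    print(row)
```

Three toy pairs yield five rows: the identical pair contributes only its `is_simple` copy, mirroring the 82 retained zero-distance rows.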
---
## What comes next
This dataset is Checkpoint 1 of a larger project. Planned subsequent checkpoints:
- **Checkpoint 2** — Semantic embeddings via Granite 4.0 3B Dense (llama.cpp + ROCm),
dimensionality reduction via PCA, K-means clustering to assign `cluster_id` per row
- **Checkpoint 3** — Per-row perplexity scores across multiple quantization levels
(Q4_K_M, Q5_K_M, Q6_K), stored as additional columns
- **Training** — QLoRA fine-tuning of Granite 4.0 3B Dense on a curriculum built from
`lev_distance` + `cluster_id` + `perplexity`, targeting fluent plain Dutch generation
---
## Citation
If you use this dataset, please cite it as follows, and also cite the upstream Leesplank dataset:
```
@dataset{leesplank_checkpoint1,
title = {Leesplank vloeiend nl curriculum (checkpoint 1)},
author = {MichielBuisman},
year = {2026},
note = {Derived from UWV/Leesplank\_NL\_wikipedia\_simplifications\_preprocessed},
url = {https://huggingface.co/datasets/MichielBuisman/Leesplank-vloeiend-nl-curriculum}
}
```
Provider:
MichielBuisman



