MichielBuisman/Leesplank-vloeiend-nl-curriculum

Source: Hugging Face, updated 2026-03-04, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/MichielBuisman/Leesplank-vloeiend-nl-curriculum
Description:
---
license: cc-by-4.0
task_categories:
- text-classification
- text-generation
language:
- nl
tags:
- fluency
- levenshtein
- wikipedia
- Leesplank
- Granite
---

# Leesplank NL: Levenshtein Annotated (Checkpoint 1)

A reshaped and edit-distance-annotated version of the [UWV/Leesplank_NL_wikipedia_simplifications_preprocessed](https://huggingface.co/datasets/UWV/Leesplank_NL_wikipedia_simplifications_preprocessed) dataset, prepared as the first checkpoint of a Dutch Fluency LoRA project targeting IBM Granite 4.0 3B Dense.

The original dataset contains ~2.7M *paragraph* pairs: a Dutch Wikipedia source paragraph (`prompt`) and a synthetically simplified version (`result`). This checkpoint reshapes those pairs into one text per row, labels each text as simple (synthetic) or not, and annotates every row with the *raw* Levenshtein edit distance between the two texts of its source pair.

---

## What this dataset adds

The upstream Leesplank dataset is already valuable. This checkpoint adds three things:

**1. One text per row.** The original pair structure (`prompt` / `result` columns) is unpacked into a flat list of individual texts with an `is_simple` boolean label. This makes the dataset directly usable for embedding, classification, perplexity measurement, and curriculum learning without any preprocessing.

**2. Absolute Levenshtein distance annotation.** Every row carries `lev_distance`: the raw character-level edit distance between the complex and simple version of its source pair. Absolute rather than normalized (relative) distance is used deliberately: for curriculum learning, what matters is the volume of character-level changes the model must learn to produce. A 500-character rewrite represents more learning signal than a 5-character one regardless of the original text length. Normalizing by length would obscure this, rewarding heavy rewrites of short texts and penalizing light edits of long ones. Absolute distance is the correct signal for ordering training difficulty.
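To make the absolute-versus-normalized argument concrete, here is a minimal sketch in plain Python. It assumes unit costs for insertion, deletion, and substitution. The `normalized` helper (dividing by the longer text's length) is a hypothetical normalization chosen for illustration, and the example strings are made up rather than taken from the dataset. It is not the DuckDB function used to build the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, substitute each cost 1."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized(a: str, b: str) -> float:
    """Edit distance divided by the longer length (hypothetical normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Made-up examples, not taken from the dataset.
short_pair = ("kat", "hond")
long_pair = ("De kat zit op de mat. " * 10, "De hond zit op de mat. " * 10)

for complex_text, simple_text in (short_pair, long_pair):
    print(levenshtein(complex_text, simple_text),
          round(normalized(complex_text, simple_text), 3))
```

Under these assumptions the short pair reaches a normalized score of 1.0 from only 4 edits, while the long pair needs more edits in absolute terms yet scores far lower once divided by its length.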
- `lev_distance = 0`: the text was already plain Dutch (only 82 such rows exist in this dataset; these are genuinely rare cases where no simplification was needed at all)
- low values: minor edits such as word substitutions and short deletions
- high values: heavy rewrites or structural changes

**3. Sorted by edit distance.** Rows are ordered ascending by `lev_distance`, which makes curriculum learning straightforward: train on easy examples first (low edit distance) before progressing to harder rewrites.

---

## Schema

| Column | Type | Description |
|---|---|---|
| `text` | string | One Dutch text, either the complex or simple version of a paragraph pair |
| `is_simple` | boolean | `true` = simplified text (originally the `result` column); `false` = complex source text (originally `prompt`) |
| `lev_distance` | int32 | Raw Levenshtein edit distance between the complex/simple pair this text belongs to |

---

## Row counts

| Metric | Count |
|---|---|
| Total rows | 5,389,028 |
| `is_simple = true` rows | 2,694,555 |
| `is_simple = false` rows | 2,694,473 |
| `lev_distance = 0` rows | 82 |

The simple and complex counts are not perfectly equal because rows where `lev_distance = 0 AND is_simple = false` were removed. These are complex-labeled texts that are identical to their simplified counterpart, making them noise rather than signal. The 82 retained `lev_distance = 0` rows are all `is_simple = true`: texts that were already written in plain Dutch and required no simplification.

---

## Intended uses

**This dataset is not structured for text simplification fine-tuning.** The original complex/simple pairs have been deliberately unpacked into individual texts. If you need paired prompt/result rows for seq2seq or instruction-tuning, use the upstream [UWV/Leesplank_NL_wikipedia_simplifications_preprocessed](https://huggingface.co/datasets/UWV/Leesplank_NL_wikipedia_simplifications_preprocessed) dataset instead.
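The flat layout and the unequal row counts can be illustrated with a small sketch. It assumes each upstream pair already carries its edit distance. The `Row` type, the `flatten_pairs` helper, and the toy pairs (with placeholder distances) are hypothetical, and plain Python stands in here for the DuckDB processing described in this card.

```python
from typing import NamedTuple


class Row(NamedTuple):
    text: str
    is_simple: bool
    lev_distance: int


def flatten_pairs(pairs):
    """Turn (prompt, result, lev_distance) pairs into one-text-per-row records."""
    rows = []
    for prompt, result, distance in pairs:
        rows.append(Row(result, True, distance))
        # A complex text identical to its simplification is dropped as noise.
        if distance != 0:
            rows.append(Row(prompt, False, distance))
    # Ascending edit distance; the source does not specify how ties are ordered.
    rows.sort(key=lambda row: row.lev_distance)
    return rows


# Hypothetical toy pairs with placeholder distances; not real dataset content.
toy_pairs = [
    ("complexe tekst B", "simpele tekst B", 9),
    ("tekst A", "tekst A", 0),
    ("complexe tekst C", "simpele tekst C", 3),
]
rows = flatten_pairs(toy_pairs)
n_simple = sum(row.is_simple for row in rows)
n_complex = len(rows) - n_simple
print(len(rows), n_simple, n_complex)  # 5 3 2

# The published counts follow the same pattern: one dropped row per zero-distance pair.
assert 2_694_555 + 2_694_473 == 5_389_028
assert 2_694_555 - 2_694_473 == 82
```

Since the schema has no pair identifier, two rows that share a `lev_distance` value cannot be matched back to each other from this table alone.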
What this dataset *is* designed for:

- **Dutch fluency curriculum training**: the primary intended use. Feed individual texts to a small language model ordered by `lev_distance` to progressively expose it to increasing levels of Dutch complexity. The goal is building fluency as a base capability, not teaching a task. This is a non-standard training approach aimed at small models (sub-4B parameters) that benefit from carefully ordered exposure to a language register gradient.
- **Readability classification**: binary classification of Dutch text complexity using the `is_simple` label, useful for probing or for training a complexity-aware classification head.
- **Perplexity-based data selection**: stratified sampling by `lev_distance` bucket before running model-specific perplexity scoring, as part of a difficulty-aware data selection pipeline.
- **Dutch language modeling**: a large flat corpus of Dutch Wikipedia-derived text spanning a wide register range, usable for causal or masked language modeling on Dutch.
- **Dutch embeddings research**: 5M paragraphs is a substantial Dutch corpus, and the complexity gradient makes it particularly interesting for probing whether embedding spaces capture readability as a geometric property. Because every text has a paired counterpart at a known edit distance, you can measure whether complex/simple pairs cluster together and whether Levenshtein distance correlates with distance in embedding space, a concrete, quantifiable readability probe that most Dutch corpora cannot support.

---

## Out-of-scope uses

- **Text simplification fine-tuning**: the paired structure required for this has been removed. This is by design, not an oversight.
- The simplified texts are **synthetic**: generated, not written by humans. Do not treat them as gold-standard human simplifications.
- This dataset is Dutch only and is not suitable for multilingual tasks without additional filtering.
- `lev_distance` is a character-level metric. It correlates with simplification effort but is not a semantic measure: a short deletion can have low edit distance while substantially changing meaning.

---

## Source dataset

**UWV/Leesplank_NL_wikipedia_simplifications_preprocessed**

Published by UWV (Uitvoeringsinstituut Werknemersverzekeringen, the Dutch Employee Insurance Agency). The original dataset pairs Dutch Wikipedia paragraphs with synthetically simplified versions intended for readers with lower literacy levels ("leesplank" = reading primer). Please refer to the upstream dataset page for its license and full documentation.

---

## Processing steps

1. Downloaded all three splits (`train`, `val`, `test`) from Hugging Face
2. Combined splits into a single DuckDB table
3. Computed `levenshtein(prompt, result)` natively in DuckDB for all ~2.7M pairs
4. Reshaped from pair rows to single-text rows via `UNION ALL` of prompt and result
5. Exported to Parquet with ZSTD compression, row group size 100,000

All processing used DuckDB on a local Windows machine. The 82 rows where `lev_distance = 0 AND is_simple = false` were removed as noise. Their `is_simple = true` counterparts with `lev_distance = 0` are retained.

---

## What comes next

This dataset is Checkpoint 1 of a larger project.
Planned subsequent checkpoints:

- **Checkpoint 2**: Semantic embeddings via Granite 4.0 3B Dense (llama.cpp + ROCm), dimensionality reduction via PCA, K-means clustering to assign `cluster_id` per row
- **Checkpoint 3**: Per-row perplexity scores across multiple quantization levels (Q4_K_M, Q5_K_M, Q6_K), stored as additional columns
- **Training**: QLoRA fine-tuning of Granite 4.0 3B Dense on a curriculum built from `lev_distance` + `cluster_id` + `perplexity`, targeting fluent plain Dutch generation

---

## Citation

If you use this dataset, please also cite the upstream Leesplank dataset:

```
@dataset{leesplank_checkpoint1,
  title  = {Leesplank vloeiend nl curriculum (checkpoint 1)},
  author = {MichielBuisman},
  year   = {2026},
  note   = {Derived from UWV/Leesplank\_NL\_wikipedia\_simplifications\_preprocessed},
  url    = {https://huggingface.co/datasets/MichielBuisman/Leesplank-vloeiend-nl-curriculum}
}
```
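The perplexity-based data selection use listed under Intended uses starts with stratified sampling by `lev_distance` bucket. The following is a minimal sketch of that step only, under stated assumptions: the bucket edges, the `per_bucket` size, and the helper names `bucket_of` and `stratified_sample` are hypothetical, and the rows are placeholders that merely follow the schema.

```python
import random
from collections import defaultdict


def bucket_of(lev_distance, edges):
    """Return the index of the first upper edge that lev_distance does not exceed."""
    for index, upper in enumerate(edges):
        if lev_distance <= upper:
            return index
    return len(edges)  # open-ended top bucket


def stratified_sample(rows, edges, per_bucket, seed=0):
    """Draw up to per_bucket rows from each lev_distance bucket, lowest bucket first."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row["lev_distance"], edges)].append(row)
    sample = []
    for index in sorted(buckets):
        members = buckets[index]
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample


# Hypothetical rows following the schema; texts and distances are placeholders.
rows = [{"text": f"tekst {d}", "is_simple": d % 2 == 0, "lev_distance": d}
        for d in (0, 4, 12, 40, 95, 180, 420, 800)]
sample = stratified_sample(rows, edges=[10, 100, 500], per_bucket=2)
print([row["lev_distance"] for row in sample])
```

With these placeholder edges the eight rows fall into four buckets, and the sample keeps at most two rows from each, so sparse high-distance buckets are not crowded out by the more numerous light edits.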
Provider:
MichielBuisman