tvosch/GPT-NL-propella-annotations
Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link: https://hf-mirror.com/datasets/tvosch/GPT-NL-propella-annotations
Resource description:
---
license: cc-by-4.0
language:
- nl
tags:
- dutch
- annotations
- data-quality
- propella
pretty_name: GPT-NL Propella Annotations
---
# GPT-NL Propella Annotations
Document-level quality and content annotations for the **Dutch** subset of [GPT-NL/GPT-NL_Public_Corpus](https://huggingface.co/datasets/GPT-NL/GPT-NL_Public_Corpus).
## What is Propella?
[Propella](https://huggingface.co/ellamind/propella-1-4b) (`ellamind/propella-1-4b`) is a 4B-parameter language model fine-tuned from Qwen3 for **document-level annotation of LLM pretraining data**. Given a document in any language, it produces a structured JSON object with 18 quality, classification, and safety properties (listed in the Columns section below). These annotations can be used downstream to filter or weight training data, for example by removing low-quality documents, deduplicating by content type, or balancing domain coverage. See [openeurollm/propella-annotations](https://huggingface.co/datasets/openeurollm/propella-annotations) for more information.
## Dataset description
Each row corresponds to one Dutch document from the GPT-NL Public Corpus and contains Propella's structured annotation alongside the original text. The annotation covers content quality, type, audience, safety, and several other dimensions useful for data curation and filtering.
**Size:** 31,291,097 documents
**Tokens:** 50,297,442,743 tokens
## Subsets
Subsets have been filtered based on GPT-NL's original Dutch language filter, as recorded in the `language` column of the source corpus. This results in the following document counts:
| Subset | Config name | Documents |
|--------|-------------|----------:|
| American-stories | `american_stories` | 15 |
| Auditdienst Rijk | `auditdienst_rijk` | 555 |
| Belgian Journal | `belgian_journal` | 208,242 |
| C5 Filtered | `c5_filtered` | 63,519 |
| CC-English-PD | `cc_english_pd` | 1,329 |
| CC-Eurovoc | `cc_eurovoc` | 34,948 |
| CC-German-PD | `cc_german_pd` | 8,177 |
| CC-Github Code | `cc_github_code` | 38,083 |
| CC-Loc-PD-Books | `cc_loc_pd_books` | 14 |
| DANS-KNAW | `dans_knaw` | 52,880 |
| De Rechtspraak | `de_rechtspraak` | 918,634 |
| Dienst Publiek en Communicatie | `dienst_publiek_en_communicatie` | 127,715 |
| Eurlex | `eurlex` | 36,810 |
| European Parliament | `european_parliament` | 654 |
| Koninklijke Bibliotheek | `koninklijke_bibliotheek` | 1,571,895 |
| Nationaal Archief | `nationaal_archief` | 1,924,127 |
| Naturalis | `naturalis` | 2,652 |
| Noord-Hollands Archief | `noord_hollands_archief` | 38,737 |
| Officiele Bekendmakingen | `officiele_bekendmakingen` | 1,822,093 |
| Openraadsinformatie | `openraadsinformatie` | 2,712,533 |
| PBL | `pbl` | 341 |
| Tweede Kamer | `tweede_kamer` | 229,600 |
| Utrechts Archief | `utrechts_archief` | 525,886 |
| Wikidata-Synth | `wikidata_synth` | 14,582,582 |
| Wikiwijs | `wikiwijs` | 119,187 |
| Woogle | `woogle` | 4,088,931 |
| YouTube-Commons-Synth | `youtube_commons_synth` | 2,147,061 |
| Zeeuws Archief | `zeeuws_archief` | 33,897 |
| **Total** | | **31,291,097** |
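
As a minimal sketch of how one subset might be loaded, the helper below derives a config name from a subset label using the pattern visible in the table (lowercase, with spaces and hyphens replaced by underscores) and passes it to `datasets.load_dataset`. The `train` split name and the helper names `to_config_name` and `load_subset` are assumptions for illustration, not something this card specifies.

```python
import re

REPO_ID = "tvosch/GPT-NL-propella-annotations"


def to_config_name(subset: str) -> str:
    """Lowercase the subset label and turn runs of spaces/hyphens into underscores."""
    return re.sub(r"[\s\-]+", "_", subset.strip().lower())


def load_subset(subset: str, split: str = "train", streaming: bool = True):
    """Load one subset; the default split name "train" is an assumption."""
    # Imported lazily so to_config_name() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID, to_config_name(subset), split=split, streaming=streaming
    )
```

Streaming is the default here because the larger subsets contain millions of rows; for example, `load_subset("Tweede Kamer")` would iterate over the `tweede_kamer` config without downloading it in full first.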
## Note on `source_id` uniqueness
The `source_id` column is the original `id` field from GPT-NL_Public_Corpus. **It is not a document-level unique identifier for the majority of the corpus.**
`source_id` can be used together with `dataset_name` (available in the original corpus) to identify the sub-corpus a document comes from, but it does not uniquely identify individual documents for most sub-corpora. Use `doc_id` as the stable unique identifier for this dataset.
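
The difference between the two identifiers can be checked with a small duplicate counter. This is a sketch only: the helper name `duplicate_keys` and the toy rows are hypothetical and do not come from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def duplicate_keys(
    rows: Iterable[Mapping[str, object]], key_columns: Sequence[str]
) -> Dict[Tuple[object, ...], int]:
    """Return every key tuple that occurs more than once, with its count."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Toy rows with made-up values, for illustration only.
sample_rows = [
    {"doc_id": "doc-a", "source_id": "1", "dataset_name": "Woogle"},
    {"doc_id": "doc-b", "source_id": "1", "dataset_name": "Woogle"},
    {"doc_id": "doc-c", "source_id": "1", "dataset_name": "Wikiwijs"},
]

# (dataset_name, source_id) collides for the two Woogle rows; doc_id does not.
print(duplicate_keys(sample_rows, ["dataset_name", "source_id"]))
print(duplicate_keys(sample_rows, ["doc_id"]))
```

On real data the counter holds one entry per distinct key, so it is probably best run per subset rather than over the full corpus at once.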
## Annotation errors
A small fraction of documents (~239,000 out of 31M, <1%) could not be annotated due to model output validation failures. These rows have `null` values for all annotation columns and a non-null `annotation_error` string. Filter them with:
```python
ds = ds.filter(lambda x: x["annotation_error"] is None)
```
## Annotation setup
The model was served with [vLLM](https://github.com/vllm-project/vllm) (v0.19.0) inside an Apptainer container, with structured JSON output constrained by [xgrammar](https://github.com/mlc-ai/xgrammar):
```bash
vllm serve ellamind/propella-1-4b \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --async-scheduling \
  --enable-prefix-caching
```
## Source corpus & license
Source: [GPT-NL/GPT-NL_Public_Corpus](https://huggingface.co/datasets/GPT-NL/GPT-NL_Public_Corpus)
License: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
## Columns
| Column | Type | Description |
|--------|------|-------------|
| `doc_id` | string | Deterministic UUID5 unique identifier, stable across re-runs. Derived from the source shard index and row position. |
| `source_id` | string | Original `id` value from GPT-NL_Public_Corpus. See the note on `source_id` uniqueness above. |
| `dataset_name` | string | Sub-corpus name from GPT-NL_Public_Corpus (e.g. `C5 Filtered`, `Koninklijke Bibliotheek`). |
| `text` | string | Document text (CC-BY-4.0, from GPT-NL_Public_Corpus). |
| `content_integrity` | string | Whether the text is complete and coherent. |
| `content_ratio` | string | Ratio of meaningful content vs. boilerplate/noise. |
| `content_length` | string | Qualitative length category. |
| `one_sentence_description` | string | One-sentence summary of the document. |
| `content_type` | list[string] | Content type tags (e.g. `news`, `legal_document`, `boilerplate`). |
| `business_sector` | list[string] | Relevant industry/sector tags. |
| `technical_content` | list[string] | Degree and type of technical content. |
| `information_density` | string | How information-dense the text is. |
| `content_quality` | string | Overall quality rating. |
| `audience_level` | string | Intended audience (e.g. `general`, `expert`). |
| `commercial_bias` | string | Degree of commercial intent. |
| `time_sensitivity` | string | Whether content is time-sensitive or evergreen. |
| `content_safety` | string | Safety classification. |
| `educational_value` | string | Educational value rating. |
| `reasoning_indicators` | string | Presence of reasoning or argumentation. |
| `pii_presence` | string | Whether the document contains personally identifiable information. |
| `regional_relevance` | list[string] | Geographic relevance tags. |
| `country_relevance` | list[string] | Specific country relevance tags. |
| `annotation_error` | string | Populated only when the annotation model failed for this document; all annotation columns will be `null` in that case. |
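
The columns above can be combined into a row-level filter. The sketch below drops rows whose annotation failed and rows tagged with an excluded content type. It uses only the `boilerplate` tag mentioned in the table; the function name `keep_row` and the choice of excluded tags are assumptions, and other columns would need their actual label values checked before filtering on them.

```python
from typing import AbstractSet, Mapping

DEFAULT_EXCLUDED = frozenset({"boilerplate"})


def keep_row(
    row: Mapping[str, object],
    excluded_content_types: AbstractSet[str] = DEFAULT_EXCLUDED,
) -> bool:
    """Keep rows that were annotated successfully and carry no excluded tag."""
    # Failed annotations have a non-null error and null annotation columns.
    if row.get("annotation_error") is not None:
        return False
    # content_type is a list of tags; treat a missing/null value as empty.
    content_types = row.get("content_type") or []
    return not any(tag in excluded_content_types for tag in content_types)
```

With the `datasets` library this can be applied as `ds = ds.filter(keep_row)`, which calls the function once per row with the row as a dictionary.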

Provider: tvosch



