tvosch/GPT-NL-propella-annotations

Source: Hugging Face. Updated: 2026-04-21. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/tvosch/GPT-NL-propella-annotations
Description:
---
license: cc-by-4.0
language:
- nl
tags:
- dutch
- annotations
- data-quality
- propella
pretty_name: GPT-NL Propella Annotations
---

# GPT-NL Propella Annotations

Document-level quality and content annotations for the **Dutch** subset of [GPT-NL/GPT-NL_Public_Corpus](https://huggingface.co/datasets/GPT-NL/GPT-NL_Public_Corpus).

## What is Propella?

[Propella](https://huggingface.co/ellamind/propella-1-4b) (`ellamind/propella-1-4b`) is a 4B-parameter language model fine-tuned from Qwen3 for **document-level annotation of LLM pretraining data**. Given a document in any language, it produces a structured JSON object with 18 quality, classification, and safety properties (listed under Columns below).

These annotations can be used downstream to filter or weight training data, for example by removing low-quality documents, deduplicating by content type, or balancing domain coverage.

See also [openeurollm/propella-annotations](https://huggingface.co/datasets/openeurollm/propella-annotations) for more information.

## Dataset description

Each row corresponds to one Dutch document from the GPT-NL Public Corpus and contains Propella's structured annotation alongside the original text. The annotation covers content quality, type, audience, safety, and several other dimensions useful for data curation and filtering.

**Size:** 31,291,097 documents

**Tokens:** 50,297,442,743 tokens

## Subsets

Subsets have been filtered based on GPT-NL's original Dutch language filter, as recorded in the `language` column. That results in:

| Subset | Config name | Documents |
|--------|-------------|----------:|
| American-stories | `american_stories` | 15 |
| Auditdienst Rijk | `auditdienst_rijk` | 555 |
| Belgian Journal | `belgian_journal` | 208,242 |
| C5 Filtered | `c5_filtered` | 63,519 |
| CC-English-PD | `cc_english_pd` | 1,329 |
| CC-Eurovoc | `cc_eurovoc` | 34,948 |
| CC-German-PD | `cc_german_pd` | 8,177 |
| CC-Github Code | `cc_github_code` | 38,083 |
| CC-Loc-PD-Books | `cc_loc_pd_books` | 14 |
| DANS-KNAW | `dans_knaw` | 52,880 |
| De Rechtspraak | `de_rechtspraak` | 918,634 |
| Dienst Publiek en Communicatie | `dienst_publiek_en_communicatie` | 127,715 |
| Eurlex | `eurlex` | 36,810 |
| European Parliament | `european_parliament` | 654 |
| Koninklijke Bibliotheek | `koninklijke_bibliotheek` | 1,571,895 |
| Nationaal Archief | `nationaal_archief` | 1,924,127 |
| Naturalis | `naturalis` | 2,652 |
| Noord-Hollands Archief | `noord_hollands_archief` | 38,737 |
| Officiele Bekendmakingen | `officiele_bekendmakingen` | 1,822,093 |
| Openraadsinformatie | `openraadsinformatie` | 2,712,533 |
| PBL | `pbl` | 341 |
| Tweede Kamer | `tweede_kamer` | 229,600 |
| Utrechts Archief | `utrechts_archief` | 525,886 |
| Wikidata-Synth | `wikidata_synth` | 14,582,582 |
| Wikiwijs | `wikiwijs` | 119,187 |
| Woogle | `woogle` | 4,088,931 |
| YouTube-Commons-Synth | `youtube_commons_synth` | 2,147,061 |
| Zeeuws Archief | `zeeuws_archief` | 33,897 |
| **Total** | | **31,291,097** |

## Note on `source_id` uniqueness

The `source_id` column is the original `id` field from GPT-NL_Public_Corpus. **It is not a document-level unique identifier for the majority of the corpus.** `source_id` can be used together with `dataset_name` (available in the original corpus) to identify the sub-corpus a document comes from, but it does not uniquely identify individual documents for most sub-corpora. Use `doc_id` as the stable unique identifier for this dataset.
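The uniqueness caveat above can be checked on any slice of the data by counting repeated keys. The following is a minimal sketch on made-up toy rows (the `doc_id`, `source_id`, and `dataset_name` values are invented for illustration, and `duplicate_keys` is a hypothetical helper, not part of the dataset tooling):

```python
from collections import Counter

# Toy rows standing in for dataset records; all values are invented for illustration.
rows = [
    {"doc_id": "a1", "source_id": "0", "dataset_name": "PBL"},
    {"doc_id": "a2", "source_id": "0", "dataset_name": "PBL"},
    {"doc_id": "a3", "source_id": "0", "dataset_name": "Wikiwijs"},
    {"doc_id": "a4", "source_id": "1", "dataset_name": "Wikiwijs"},
]


def duplicate_keys(rows, key_fields):
    """Return the key tuples that occur on more than one row (hypothetical helper)."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key for key, n in counts.items() if n > 1}


# Even combined with dataset_name, source_id may repeat across documents.
source_dupes = duplicate_keys(rows, ["dataset_name", "source_id"])
# doc_id is expected to yield no repeated keys.
doc_dupes = duplicate_keys(rows, ["doc_id"])
```

In this toy example `source_dupes` contains the repeated `("PBL", "0")` pair while `doc_dupes` is empty, which is the pattern the note describes for most sub-corpora.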
## Annotation errors

A small fraction of documents (~239,000 out of 31M, <1%) could not be annotated due to model output validation failures. These rows have `null` values for all annotation columns and a non-null `annotation_error` string. Filter them with:

```python
# Keep only the rows that were annotated successfully.
ds = ds.filter(lambda x: x["annotation_error"] is None)
```

## Annotation setup

The model was served with [vLLM](https://github.com/vllm-project/vllm) (v0.19.0) inside an Apptainer container, with structured JSON output constrained by [xgrammar](https://github.com/mlc-ai/xgrammar):

```bash
vllm serve ellamind/propella-1-4b \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --async-scheduling \
  --enable-prefix-caching
```

## Source corpus & license

Source: [GPT-NL/GPT-NL_Public_Corpus](https://huggingface.co/datasets/GPT-NL/GPT-NL_Public_Corpus)

License: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)

## Columns

| Column | Type | Description |
|--------|------|-------------|
| `doc_id` | string | Deterministic UUID5 unique identifier, stable across re-runs. Derived from the source shard index and row position. |
| `source_id` | string | Original `id` value from GPT-NL_Public_Corpus. See "Note on `source_id` uniqueness". |
| `dataset_name` | string | Sub-corpus name from GPT-NL_Public_Corpus (e.g. `C5 Filtered`, `Koninklijke Bibliotheek`). |
| `text` | string | Document text (CC-BY-4.0, from GPT-NL_Public_Corpus). |
| `content_integrity` | string | Whether the text is complete and coherent. |
| `content_ratio` | string | Ratio of meaningful content vs. boilerplate/noise. |
| `content_length` | string | Qualitative length category. |
| `one_sentence_description` | string | One-sentence summary of the document. |
| `content_type` | list[string] | Content type tags (e.g. `news`, `legal_document`, `boilerplate`). |
| `business_sector` | list[string] | Relevant industry/sector tags. |
| `technical_content` | list[string] | Degree and type of technical content. |
| `information_density` | string | How information-dense the text is. |
| `content_quality` | string | Overall quality rating. |
| `audience_level` | string | Intended audience (e.g. `general`, `expert`). |
| `commercial_bias` | string | Degree of commercial intent. |
| `time_sensitivity` | string | Whether content is time-sensitive or evergreen. |
| `content_safety` | string | Safety classification. |
| `educational_value` | string | Educational value rating. |
| `reasoning_indicators` | string | Presence of reasoning or argumentation. |
| `pii_presence` | string | Whether the document contains personally identifiable information. |
| `regional_relevance` | list[string] | Geographic relevance tags. |
| `country_relevance` | list[string] | Specific country relevance tags. |
| `annotation_error` | string | Populated only when the annotation model failed for this document; all annotation columns will be `null` in that case. |
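As an illustration of how these columns might be combined for curation, here is a minimal sketch of a row-level predicate. It assumes rows are plain dictionaries keyed by the column names above; the helper `keep_row`, the toy `doc_id` values, and the error string are hypothetical, and dropping `boilerplate` is only one possible curation choice, not a recommendation from the dataset authors.

```python
def keep_row(row, drop_types=frozenset({"boilerplate"})):
    """Keep rows that were annotated and carry none of the unwanted content types."""
    # Rows where annotation failed have null annotation columns, so skip them first.
    if row["annotation_error"] is not None:
        return False
    content_types = row["content_type"] or []
    return not drop_types.intersection(content_types)


# Toy rows; the doc_id values and the error message are invented for illustration.
rows = [
    {"doc_id": "d1", "content_type": ["news"], "annotation_error": None},
    {"doc_id": "d2", "content_type": ["legal_document", "boilerplate"], "annotation_error": None},
    {"doc_id": "d3", "content_type": None, "annotation_error": "validation failed"},
]

kept = [row["doc_id"] for row in rows if keep_row(row)]
```

With the `datasets` library, the same predicate could presumably be passed directly as `ds.filter(keep_row)`, since it takes a single example dictionary and returns a boolean.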
Provider:
tvosch