davanstrien/hf-dataset-domain-labels-v3

Name: davanstrien/hf-dataset-domain-labels-v3
Creator: davanstrien
Published: 2026-03-31 15:33:23
License: No description available

Source: Hugging Face. Updated 2026-03-31. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/davanstrien/hf-dataset-domain-labels-v3

Resource description:

---
language:
- en
license: apache-2.0
tags:
- text-classification
- domain-classification
- huggingface-hub
- active-learning
size_categories:
- 1K<n<10K
task_categories:
- text-classification
---

# HF Dataset Domain Labels v3

Training and evaluation data for classifying HuggingFace datasets by knowledge domain. Built through iterative data-centric development with LLM-in-the-loop active learning.

## Dataset Structure

### Splits

| Split | Examples | Purpose |
|-------|----------|---------|
| train | 3,437 | Training data from 3 sources |
| test | 72 | Gold evaluation set (multi-model consensus validated, truly OOD) |

### Label Distribution (train)

| Label | Count | Description |
|-------|-------|-------------|
| none | 940 | General-purpose, no specific domain |
| medical | 485 | Healthcare, clinical, biomedical |
| code | 425 | Programming, software engineering |
| legal | 348 | Law, regulations, court cases |
| cybersecurity | 288 | Security threats, malware |
| math | 271 | Mathematics, theorem proving |
| finance | 267 | Banking, trading, economics |
| biology | 185 | Genomics, ecology, life sciences |
| climate | 132 | Weather, earth observation |
| chemistry | 96 | Molecules, reactions, materials |

### Fields

- `text`: Preprocessed dataset card text (YAML frontmatter stripped, HTML cleaned, dataset ID prepended)
- `label`: Domain classification (one of the 10 labels above)
- `datasetId`: Original HuggingFace dataset identifier

## Data Sources

### Tag-based labels (1,990 examples)

Derived from existing HuggingFace dataset tags. A tag-to-domain mapping converts tags like `medical`, `healthcare`, `clinical` → `medical`. High precision but limited to datasets that already have domain tags.

### LLM-labelled "none" examples (607 examples)

Qwen3-235B classified 300 randomly sampled untagged datasets. 86% were classified as "none", reflecting the real distribution on HuggingFace. Combined with tag-based "none" candidates (datasets with no domain-related tags).
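The tag-to-domain mapping described under "Tag-based labels" can be illustrated with a small sketch. Only the three `medical` entries come from the card. The other entries, the function name `domain_from_tags`, and the choice to skip datasets whose tags point to more than one domain are assumptions made for illustration, not the authors' actual mapping.

```python
# Illustrative tag-to-domain lookup. Only the three "medical" entries are
# taken from the dataset card; the remaining entries are hypothetical.
TAG_TO_DOMAIN = {
    "medical": "medical",
    "healthcare": "medical",
    "clinical": "medical",
    "legal": "legal",      # hypothetical entry
    "finance": "finance",  # hypothetical entry
}


def domain_from_tags(tags):
    """Return a domain label derived from dataset tags, or None.

    Assumption: a dataset is labelled only when all of its recognised
    tags agree on a single domain. Datasets with no recognised tag, or
    with conflicting tags, return None and are left for another source.
    """
    domains = {
        TAG_TO_DOMAIN[tag.lower()]
        for tag in tags
        if tag.lower() in TAG_TO_DOMAIN
    }
    if len(domains) == 1:
        return domains.pop()
    return None
```

Returning `None` on conflicting tags favours precision over coverage, which matches the card's description of this source as high precision but limited in reach.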
### Active learning hard negatives (800 examples)

The highest-value data. Generated through disagreement-based active learning:

1. Ran ModernBERT v2 on 5,000 non-training datasets
2. Ran Qwen3-4B on the 800 most uncertain predictions
3. Found a 50% disagreement rate: 315 cases where the model said "domain" but the LLM said "none"
4. 85 domain-vs-domain disagreements adjudicated by Qwen3-235B
5. 382 of 800 v2 predictions were overturned

These examples specifically target the model's weaknesses: false positives and domain boundary confusion.

## Gold Test Set (72 examples)

The test split was validated by 3-model consensus:

- **Qwen3-235B-A22B-Instruct-2507** (Together)
- **DeepSeek-V3** (Together)
- **Llama-3.3-70B-Instruct** (Together)

Consensus quality: 51 unanimous, 20 majority (2/3), 1 split.

These are truly out-of-distribution: none appear in the training data, and they were selected from model predictions on unseen data.

## Development History

| Version | Train Size | OOD Accuracy | Key Change |
|---------|-----------|-------------|------------|
| v0 | 2,954 | n/a | Tag-based only, no "none" class |
| v1 | 2,637 | 52.8% | Added "none" class (400 examples) |
| v2 | 2,637 | 76.4% | +260 diverse LLM-labelled "none" examples |
| v3 | 3,437 | 84.7% (90.3% with filter) | +800 active learning hard negatives |

## How This Was Built

This dataset was built collaboratively by [@davanstrien](https://huggingface.co/davanstrien) and [Hermes Agent](https://github.com/nousresearch/hermes-agent) (NousResearch). The agent drove the full data-centric loop: diagnosing model failures via multi-model consensus, designing the disagreement-based active learning pipeline, running batch LLM inference via HF Jobs (`uv-scripts/transformers-inference`), and iterating through three rounds of data improvement. The methodology is captured as a reusable [data-centric model development skill](https://github.com/nousresearch/hermes-agent).
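The disagreement-based loop listed under "Active learning hard negatives" can be sketched roughly as follows. The card does not say how uncertainty was measured, so ranking by lowest top-class probability is an assumption. The `Prediction` class, the function names, and the bucket names are likewise hypothetical, and the `model_none_llm_domain` bucket is included only for completeness, since the card does not report that case.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Prediction:
    dataset_id: str
    label: str         # classifier's top label
    confidence: float  # classifier's top-class probability


def most_uncertain(predictions, k):
    """Pick the k predictions with the lowest top-class probability.

    Assumption: "most uncertain" means least confident; the card does
    not specify the criterion that was actually used.
    """
    return sorted(predictions, key=lambda p: p.confidence)[:k]


def route(prediction, llm_label):
    """Bucket one classifier/LLM label pair."""
    if prediction.label == llm_label:
        return "agree"
    if llm_label == "none":
        # Classifier predicted a domain, LLM said "none".
        return "false_positive"
    if prediction.label == "none":
        # Not reported in the card; kept so every pair has a bucket.
        return "model_none_llm_domain"
    # Both picked a domain but a different one: candidates for
    # adjudication by a stronger LLM.
    return "domain_vs_domain"


def summarize(predictions, llm_labels):
    """Count buckets for predictions paired with LLM labels by dataset id."""
    return Counter(route(p, llm_labels[p.dataset_id]) for p in predictions)
```

A usage sketch with made-up values: `summarize(most_uncertain(preds, 800), llm_labels)` would return per-bucket counts, which could then be compared against the figures reported in the list above.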
## Related Resources

- Model: [davanstrien/modernbert-hf-dataset-domain-v3](https://huggingface.co/davanstrien/modernbert-hf-dataset-domain-v3)
- Source data: [librarian-bots/dataset_cards_with_metadata](https://huggingface.co/datasets/librarian-bots/dataset_cards_with_metadata)
- Intermediate artifacts:
  - [5K predictions](https://huggingface.co/datasets/davanstrien/domain-classifier-5k-predictions)
  - [LLM responses](https://huggingface.co/datasets/davanstrien/domain-classifier-llm-responses)
  - [Disagreement labels](https://huggingface.co/datasets/davanstrien/domain-classifier-disagreement-labels)
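As a starting point for working with the data, the following minimal sketch loads a split and tallies its labels. It assumes the Hugging Face `datasets` library is installed and that the repository loads with its default configuration. The helper names `load_split` and `label_counts` are hypothetical. Whether the `label` column is stored as strings or as class indices is not stated in the card, so the tally works on whatever values it receives.

```python
from collections import Counter

REPO_ID = "davanstrien/hf-dataset-domain-labels-v3"


def load_split(split="train"):
    """Load one split ("train" or "test") from the Hugging Face Hub.

    The import is local so that the pure helper below can be used
    without the `datasets` library installed.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


def label_counts(labels):
    """Return (label, count) pairs, most frequent first."""
    return Counter(labels).most_common()
```

For example, `label_counts(load_split("train")["label"])` should yield counts that can be checked against the Label Distribution table, and `load_split("test")` gives access to the 72 gold examples.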

Provider:

davanstrien
