davanstrien/hf-dataset-domain-labels-v3
Source: Hugging Face | Updated: 2026-03-31 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/davanstrien/hf-dataset-domain-labels-v3
Resource description:
---
language:
- en
license: apache-2.0
tags:
- text-classification
- domain-classification
- huggingface-hub
- active-learning
size_categories:
- 1K<n<10K
task_categories:
- text-classification
---
# HF Dataset Domain Labels v3
Training and evaluation data for classifying Hugging Face datasets by knowledge domain, built through iterative, data-centric development with LLM-in-the-loop active learning.
## Dataset Structure
### Splits
| Split | Examples | Purpose |
|-------|----------|---------|
| train | 3,437 | Training data from 3 sources |
| test | 72 | Gold evaluation set (validated by multi-model consensus, out-of-distribution) |
### Label Distribution (train)
| Label | Count | Description |
|-------|-------|-------------|
| none | 940 | General-purpose, no specific domain |
| medical | 485 | Healthcare, clinical, biomedical |
| code | 425 | Programming, software engineering |
| legal | 348 | Law, regulations, court cases |
| cybersecurity | 288 | Security threats, malware |
| math | 271 | Mathematics, theorem proving |
| finance | 267 | Banking, trading, economics |
| biology | 185 | Genomics, ecology, life sciences |
| climate | 132 | Weather, earth observation |
| chemistry | 96 | Molecules, reactions, materials |
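The counts above are imbalanced (the largest class is roughly ten times the smallest). The sketch below computes each label's share and a set of inverse-frequency class weights from the table. The weighting formula `total / (num_classes * count)` is a common convention and an assumption here; the card does not say whether any class weighting was used in training.

```python
# Train-split label counts copied from the table above.
LABEL_COUNTS = {
    "none": 940, "medical": 485, "code": 425, "legal": 348,
    "cybersecurity": 288, "math": 271, "finance": 267,
    "biology": 185, "climate": 132, "chemistry": 96,
}


def label_shares(counts):
    """Return each label's fraction of all examples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * count).

    This weighting scheme is an illustrative assumption, not the
    author's documented training setup.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    shares = label_shares(LABEL_COUNTS)
    weights = balanced_class_weights(LABEL_COUNTS)
    for label in LABEL_COUNTS:
        print(f"{label:14s} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

Rare labels such as `chemistry` receive weights above 1 and the dominant `none` label a weight below 1, which is one way to keep a loss function from being dominated by the majority class.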
### Fields
- `text`: Preprocessed dataset card text (YAML frontmatter stripped, HTML cleaned, dataset ID prepended)
- `label`: Domain classification (one of the 10 labels above)
- `datasetId`: Original HuggingFace dataset identifier
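The `text` field is described as a dataset card with the YAML frontmatter stripped, HTML cleaned, and the dataset ID prepended. A minimal sketch of such a preprocessing step follows. The function name `preprocess_card`, the regular expressions, and the exact output layout are assumptions for illustration; the author's actual preprocessing code is not shown in this card.

```python
import re
from html import unescape

# Leading YAML frontmatter: a '---' line, any content, then a closing '---' line.
FRONTMATTER_RE = re.compile(r"\A---\s*\n.*?\n---\s*(?:\n|\Z)", re.DOTALL)
# Naive HTML tag matcher; enough for a sketch, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def preprocess_card(dataset_id: str, card: str) -> str:
    """Hypothetical preprocessing: strip frontmatter, clean HTML, prepend the ID."""
    body = FRONTMATTER_RE.sub("", card, count=1)  # drop frontmatter if present
    body = TAG_RE.sub(" ", body)                  # replace tags with a space
    body = unescape(body)                         # decode entities such as &amp;
    body = re.sub(r"[ \t]+", " ", body)           # collapse runs of spaces and tabs
    body = re.sub(r"\n{3,}", "\n\n", body).strip()
    return f"{dataset_id}\n\n{body}"


if __name__ == "__main__":
    raw = "---\nlicense: mit\n---\n# Title\n<p>Hello &amp; world</p>\n"
    print(preprocess_card("org/name", raw))
```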
## Data Sources
### Tag-based labels (1,990 examples)
These labels are derived from existing Hugging Face dataset tags. A tag-to-domain mapping converts tags such as `medical`, `healthcare`, and `clinical` to the `medical` label. Precision is high, but coverage is limited to datasets that already carry domain tags.
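A tag-to-domain mapping of this kind might look like the sketch below. Only the three `medical` entries come from this card; every other entry, the function name `domain_from_tags`, and the choice to return `None` when tags point to more than one domain are assumptions made for illustration.

```python
# Tag-to-domain lookup. Only the three "medical" entries are taken from the
# card; the remaining entries are hypothetical examples.
TAG_TO_DOMAIN = {
    "medical": "medical",
    "healthcare": "medical",
    "clinical": "medical",
    "code": "code",        # hypothetical
    "legal": "legal",      # hypothetical
    "finance": "finance",  # hypothetical
}


def domain_from_tags(tags):
    """Return the single domain implied by the tags, or None.

    None is returned when no tag is mapped, or when the tags map to more
    than one domain (an assumed rule that favours precision over coverage).
    """
    domains = {
        TAG_TO_DOMAIN[tag.lower()]
        for tag in tags
        if tag.lower() in TAG_TO_DOMAIN
    }
    if len(domains) == 1:
        return next(iter(domains))
    return None


if __name__ == "__main__":
    print(domain_from_tags(["Healthcare", "english"]))
    print(domain_from_tags(["legal", "finance"]))
```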
### LLM-labelled "none" examples (607 examples)
Qwen3-235B classified 300 randomly sampled untagged datasets and labelled 86% of them "none", which reflects the real distribution on Hugging Face. These were combined with tag-based "none" candidates (datasets with no domain-related tags).
### Active learning hard negatives (800 examples)
This is the highest-value data, generated through disagreement-based active learning:
1. Ran ModernBERT v2 on 5,000 non-training datasets
2. Ran Qwen3-4B on the 800 most uncertain predictions
3. Found a 50% disagreement rate, including 315 cases where the model predicted a domain but the LLM said "none"
4. Sent 85 domain-vs-domain disagreements to Qwen3-235B for adjudication
5. Overturned 382 of the 800 v2 predictions
These examples specifically target the model's weaknesses: false positives and domain boundary confusion.
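The selection and triage steps of this loop can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline that produced the dataset: uncertainty is taken to be the classifier's lowest top-class confidence, and the names `Prediction`, `most_uncertain`, `disagreement_kind`, and `triage` are hypothetical. In practice the model and LLM labels would come from real inference runs rather than hand-written values.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    dataset_id: str
    label: str         # classifier's predicted label
    confidence: float  # classifier's top-class probability


def most_uncertain(preds, k):
    """Pick the k predictions with the lowest confidence (assumed measure)."""
    return sorted(preds, key=lambda p: p.confidence)[:k]


def disagreement_kind(model_label, llm_label):
    """Classify how the classifier and the LLM relate on one example."""
    if model_label == llm_label:
        return "agree"
    if llm_label == "none":
        return "model_domain_llm_none"  # likely false positive
    if model_label == "none":
        return "model_none_llm_domain"
    return "domain_vs_domain"           # candidate for adjudication


def triage(preds, llm_labels, k):
    """Bucket the k most uncertain predictions by disagreement kind."""
    buckets = {
        "agree": [],
        "model_domain_llm_none": [],
        "model_none_llm_domain": [],
        "domain_vs_domain": [],
    }
    for pred in most_uncertain(preds, k):
        kind = disagreement_kind(pred.label, llm_labels[pred.dataset_id])
        buckets[kind].append(pred.dataset_id)
    return buckets


def disagreement_rate(buckets):
    """Fraction of triaged examples where the two labellers differ."""
    total = sum(len(ids) for ids in buckets.values())
    return 0.0 if total == 0 else 1 - len(buckets["agree"]) / total
```

Examples in the `domain_vs_domain` bucket are the ones a stronger model would adjudicate, while `model_domain_llm_none` collects the likely false positives.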
## Gold Test Set (72 examples)
The test split was validated by 3-model consensus:
- **Qwen3-235B-A22B-Instruct-2507** (Together)
- **DeepSeek-V3** (Together)
- **Llama-3.3-70B-Instruct** (Together)
Consensus quality: 51 unanimous, 20 majority (2/3), 1 split.
These examples are out-of-distribution: none appear in the training data, and all were selected from model predictions on unseen data.
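The three consensus categories (unanimous, majority, split) follow from a simple vote count. The sketch below shows one way to compute them; the function name `consensus` and the decision to return no label for a split vote are assumptions, since the card does not say how the single split case was resolved.

```python
from collections import Counter


def consensus(votes):
    """Return (label, quality) for a list of per-model labels.

    quality is "unanimous" when all models agree, "majority" when more
    than half agree, and "split" otherwise (label is then None).
    """
    label, count = Counter(votes).most_common(1)[0]
    if count == len(votes):
        return label, "unanimous"
    if count * 2 > len(votes):
        return label, "majority"
    return None, "split"


if __name__ == "__main__":
    print(consensus(["medical", "medical", "medical"]))
    print(consensus(["medical", "medical", "none"]))
    print(consensus(["medical", "biology", "none"]))
```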
## Development History
| Version | Train Size | OOD Accuracy | Key Change |
|---------|-----------|-------------|------------|
| v0 | 2,954 | — | Tag-based only, no "none" class |
| v1 | 2,637 | 52.8% | Added "none" class (400 examples) |
| v2 | 2,637 | 76.4% | +260 diverse LLM-labelled "none" examples |
| v3 | 3,437 | 84.7% (90.3% with filter) | +800 active learning hard negatives |
## How This Was Built
This dataset was built collaboratively by [@davanstrien](https://huggingface.co/davanstrien) and [Hermes Agent](https://github.com/nousresearch/hermes-agent) (NousResearch). The agent drove the full data-centric loop: diagnosing model failures via multi-model consensus, designing the disagreement-based active learning pipeline, running batch LLM inference via HF Jobs (`uv-scripts/transformers-inference`), and iterating through three rounds of data improvement. The methodology is captured as a reusable [data-centric model development skill](https://github.com/nousresearch/hermes-agent).
## Related Resources
- Model: [davanstrien/modernbert-hf-dataset-domain-v3](https://huggingface.co/davanstrien/modernbert-hf-dataset-domain-v3)
- Source data: [librarian-bots/dataset_cards_with_metadata](https://huggingface.co/datasets/librarian-bots/dataset_cards_with_metadata)
- Intermediate artifacts:
  - [5K predictions](https://huggingface.co/datasets/davanstrien/domain-classifier-5k-predictions)
  - [LLM responses](https://huggingface.co/datasets/davanstrien/domain-classifier-llm-responses)
  - [Disagreement labels](https://huggingface.co/datasets/davanstrien/domain-classifier-disagreement-labels)
Provider:
davanstrien



