meridian-online/finetype-training
Source: Hugging Face. Updated 2026-03-11; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/meridian-online/finetype-training
Resource description:
---
language:
- en
license: mit
task_categories:
- text-classification
tags:
- semantic-type-detection
- data-profiling
- synthetic-data
- text-classification
size_categories:
- 100K<n<1M
pretty_name: FineType Training Data
---
# FineType Training Dataset
Synthetic training and evaluation data for [FineType](https://github.com/noon-org/finetype) — a semantic type classifier that detects the format of text values (dates, IPs, emails, UUIDs, etc.) from a taxonomy of **151 types**.
- **Model:** [noon-org/finetype-char-cnn](https://huggingface.co/noon-org/finetype-char-cnn)
- **GitHub:** [noon-org/finetype](https://github.com/noon-org/finetype)
## Dataset Description
Each example is a `(text, classification)` pair where:
- **text** — a string value (e.g., `"2024-01-15"`, `"192.168.1.1"`, `"hello@example.com"`)
- **classification** — the semantic type label in `domain.category.type` format (e.g., `datetime.date.iso`, `technology.internet.ip_v4`, `identity.person.email`)
### Schema
```json
{
  "classification": "datetime.component.day_of_week",
  "text": "Thursday"
}
```
One JSON object per line (NDJSON format).
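As a minimal sketch of how one record might be consumed, the snippet below parses a single NDJSON line and splits the label into its three parts. The `Label` tuple and the helper names are illustrative and not part of FineType; the sketch assumes every label has exactly three dot-separated components, as the `domain.category.type` format implies.

```python
import json
from typing import NamedTuple


class Label(NamedTuple):
    """Hypothetical container for the three parts of a label."""
    domain: str
    category: str
    type_name: str


def parse_label(classification: str) -> Label:
    """Split a `domain.category.type` label into its three parts."""
    parts = classification.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected domain.category.type, got {classification!r}")
    return Label(*parts)


def parse_record(line: str) -> tuple[str, Label]:
    """Parse one NDJSON line into (text, Label)."""
    record = json.loads(line)
    return record["text"], parse_label(record["classification"])


# The example record from the schema above, written as a single NDJSON line.
line = '{"classification": "datetime.component.day_of_week", "text": "Thursday"}'
text, label = parse_record(line)
print(text, label.domain, label.category, label.type_name)
```

Splitting the label this way makes it straightforward to group examples by domain or category before training.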
## Dataset Versions
Three versions of the dataset are provided, corresponding to training iterations:
| Version | Train | Test | Types | Notes |
|---------|-------|------|-------|-------|
| **v1** | 74,500 | 14,900 | 149 | Initial balanced dataset, 500 per type |
| **v2** | 75,500 | 15,100 | 151 | Added 2 types, improved generators |
| **v3** | 205,500 | 41,100 | 151 | Extended with tiered model training data |
**Recommended:** Use `train.ndjson` and `test.ndjson` (v1) for the flat model, and `train_v3.ndjson` and `test_v3.ndjson` for tiered models.
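The counts in the version table can be sanity-checked with a little arithmetic. The sketch below uses only the numbers from the table; the `summarize` helper is illustrative. It reports the train/test ratio and whether each split divides evenly across the listed number of types.

```python
# (train, test, types) per version, copied from the version table.
VERSIONS = {
    "v1": (74_500, 14_900, 149),
    "v2": (75_500, 15_100, 151),
    "v3": (205_500, 41_100, 151),
}


def summarize(train: int, test: int, types: int):
    """Return (train/test ratio, train per type, test per type, divides evenly?)."""
    ratio = train / test
    train_per_type, train_rem = divmod(train, types)
    test_per_type, test_rem = divmod(test, types)
    evenly = train_rem == 0 and test_rem == 0
    return ratio, train_per_type, test_per_type, evenly


for name, row in VERSIONS.items():
    ratio, per_train, per_test, evenly = summarize(*row)
    print(f"{name}: ratio={ratio:.1f} train/type={per_train} "
          f"test/type={per_test} evenly_divisible={evenly}")
```

All three versions come out at a 5:1 train/test ratio. v1 and v2 work out to 500 training and 100 test examples per type, while the v3 totals do not divide evenly by 151, which suggests (the card does not say so explicitly) that v3 is not uniformly balanced across types.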
## Label Distribution
### By Domain (v1 train)
| Domain | Types | Examples | Description |
|--------|-------|----------|-------------|
| datetime | 46 | 23,000 | Dates, times, timestamps, epochs, components |
| technology | 34 | 17,000 | IPs, MACs, UUIDs, hashes, URLs, file paths |
| identity | 25 | 12,000 | Emails, phones, credit cards, names, SSNs |
| representation | 19 | 9,000 | JSON, CSV, XML, integers, floats, booleans |
| geography | 16 | 8,000 | Coordinates, postal codes, country codes |
| container | 11 | 5,500 | Arrays, key-value pairs, structured formats |
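All types are balanced at **500 examples per type** in v1. Note that the Types column sums to 151, the current taxonomy size, while the Examples column sums to 74,500 (149 types × 500). The identity and representation rows each hold 500 fewer examples than their type counts imply, which is consistent with each row counting one type that was added after v1.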
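One way to check the balance claim against a downloaded file is to count labels directly. The following is a sketch rather than part of the FineType tooling: `count_labels` and `is_balanced` are hypothetical helpers, and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count examples per full label and per top-level domain."""
    per_type = Counter()
    per_domain = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        label = json.loads(line)["classification"]
        per_type[label] += 1
        per_domain[label.split(".", 1)[0]] += 1
    return per_type, per_domain


def is_balanced(per_type, expected):
    """True if at least one type is present and every type has `expected` examples."""
    return bool(per_type) and all(n == expected for n in per_type.values())


# Illustrative in-memory sample; any iterable of NDJSON lines works.
sample = [
    '{"classification": "datetime.date.iso", "text": "2024-01-15"}',
    '{"classification": "datetime.date.iso", "text": "2023-12-31"}',
    '{"classification": "technology.internet.ip_v4", "text": "192.168.1.1"}',
    '{"classification": "technology.internet.ip_v4", "text": "10.0.0.1"}',
]
per_type, per_domain = count_labels(sample)
print(dict(per_domain), is_balanced(per_type, 2))
```

For the real data, pass an open file handle, for example `count_labels(open("train.ndjson", encoding="utf-8"))`, and compare against 500 for the v1 training split.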
## Generation Methodology
Data is generated using type-specific Rust generators defined in the FineType taxonomy:
1. **YAML definitions** specify each type's format, regex pattern, DuckDB cast expression, and example values
2. **Rust generators** produce synthetic examples with:
   - Locale-aware formatting (16+ locales for dates, addresses, phone numbers)
   - Priority-weighted sampling (common formats appear more frequently)
   - Edge case coverage (boundary values, unusual but valid formats)
   - Checksum-valid values where applicable (credit cards via Luhn, IBANs, ISBNs)
3. **Validation** ensures every generated value matches the type's regex pattern and DuckDB cast expression
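The generate-then-validate loop described above can be sketched in a few lines. This is an illustration in Python, not the project's Rust generators: the card pattern, the function names, and the choice of a 16-digit number starting with 4 are all assumptions made for the example, and the DuckDB cast step is omitted. The Luhn computation itself is the standard algorithm.

```python
import random
import re

# Hypothetical pattern for illustration; the real regexes live in the YAML definitions.
CARD_PATTERN = re.compile(r"4[0-9]{15}")


def luhn_check_digit(payload: str) -> str:
    """Return the digit that makes payload + digit pass the Luhn check."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """True if the last digit of `number` is the correct Luhn check digit."""
    if not (number.isascii() and number.isdigit()):
        return False
    return luhn_check_digit(number[:-1]) == number[-1]


def generate_card(rng: random.Random) -> str:
    """Generate a 16-digit, Luhn-valid number with a leading 4 (illustrative)."""
    payload = "4" + "".join(str(rng.randrange(10)) for _ in range(14))
    return payload + luhn_check_digit(payload)


def validate(value: str) -> bool:
    """Accept a value only if it matches the pattern and passes the checksum."""
    return CARD_PATTERN.fullmatch(value) is not None and luhn_valid(value)


rng = random.Random(0)  # seeded so runs are reproducible
values = [generate_card(rng) for _ in range(5)]
print(values, all(validate(v) for v in values))
```

Computing the check digit at generation time, rather than sampling random digits and discarding failures, keeps every generated value valid by construction; the validation pass then acts as an independent guard against generator bugs.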
### Generator Quality
- All generators validated against type definitions via `finetype check`
- Taxonomy alignment verified: every type has a generator, every generator has a type
- 155 automated tests covering generation, inference, and column disambiguation
## Usage
### Load with Python
```python
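import json

# Each non-empty line of the NDJSON file is one JSON object.
with open("train.ndjson", encoding="utf-8") as f:
    data = [json.loads(line) for line in f if line.strip()]
# Split into parallel lists of inputs and labels.
texts = [d["text"] for d in data]
labels = [d["classification"] for d in data]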
```
### Load with DuckDB
```sql
SELECT * FROM read_json_auto('train.ndjson', format='newline_delimited');
```
### Load with Nushell
```nushell
open train.ndjson | lines | each { from json }
```
## Limitations
- **Synthetic data:** All examples are machine-generated, not sampled from real-world datasets. Real-world data may contain formatting variations not covered by generators.
- **English-centric:** While locale-aware for dates and addresses, the dataset primarily targets English-language data patterns.
- **Balanced distribution:** Real-world data is highly imbalanced (some types are far more common than others). The balanced training set may not reflect deployment distributions.
## Citation
```bibtex
@dataset{finetype_training2026,
title = {FineType Training Data: Synthetic Examples for Semantic Type Classification},
author = {Cameron, Hugh},
year = {2026},
url = {https://huggingface.co/datasets/noon-org/finetype-training},
license = {MIT}
}
```
Provider:
meridian-online



