
meridian-online/finetype-training

Hugging Face · Updated 2026-03-11 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/meridian-online/finetype-training
Resource description:
---
language:
- en
license: mit
task_categories:
- text-classification
tags:
- semantic-type-detection
- data-profiling
- synthetic-data
- text-classification
size_categories:
- 100K<n<1M
pretty_name: FineType Training Data
---

# FineType Training Dataset

Synthetic training and evaluation data for [FineType](https://github.com/noon-org/finetype), a semantic type classifier that detects the format of text values (dates, IPs, emails, UUIDs, etc.) from a taxonomy of **151 types**.

- **Model:** [noon-org/finetype-char-cnn](https://huggingface.co/noon-org/finetype-char-cnn)
- **GitHub:** [noon-org/finetype](https://github.com/noon-org/finetype)

## Dataset Description

Each example is a `(text, classification)` pair where:

- **text**: a string value (e.g., `"2024-01-15"`, `"192.168.1.1"`, `"hello@example.com"`)
- **classification**: the semantic type label in `domain.category.type` format (e.g., `datetime.date.iso`, `technology.internet.ip_v4`, `identity.person.email`)

### Schema

```json
{
  "classification": "datetime.component.day_of_week",
  "text": "Thursday"
}
```

One JSON object per line (NDJSON format).

## Dataset Versions

Three versions of the dataset are provided, corresponding to training iterations:

| Version | Train | Test | Types | Notes |
|---------|-------|------|-------|-------|
| **v1** | 74,500 | 14,900 | 149 | Initial balanced dataset, 500 per type |
| **v2** | 75,500 | 15,100 | 151 | Added 2 types, improved generators |
| **v3** | 205,500 | 41,100 | 151 | Extended with tiered model training data |

**Recommended:** Use `train.ndjson` and `test.ndjson` (v1) for the flat model, `train_v3.ndjson` and `test_v3.ndjson` for tiered models.
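Because every label follows the `domain.category.type` format described above, it can be split into its three parts to group examples by domain or category. The following is a minimal sketch, not part of the FineType codebase: the names `Label`, `parse_label`, and `count_by_domain` are hypothetical, and it assumes that the domain and category parts never contain a dot.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Label(NamedTuple):
    """Hypothetical container for the three parts of a label."""
    domain: str
    category: str
    type: str


def parse_label(classification: str) -> Label:
    """Split a `domain.category.type` label into its three parts."""
    # maxsplit=2 keeps any further dots inside the final `type` part.
    parts = classification.split(".", 2)
    if len(parts) != 3:
        raise ValueError(
            f"expected domain.category.type, got {classification!r}"
        )
    return Label(*parts)


def count_by_domain(classifications: Iterable[str]) -> Counter:
    """Count how many labels fall under each top-level domain."""
    return Counter(parse_label(c).domain for c in classifications)


# Labels taken from the examples in the dataset description.
sample = [
    "datetime.date.iso",
    "technology.internet.ip_v4",
    "identity.person.email",
    "datetime.component.day_of_week",
]
print(count_by_domain(sample))
```

Applied to the `classification` field of a full split, the same counter gives a per-domain tally that can be compared with the counts published for that split.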

## Label Distribution

### By Domain (v1 train)

| Domain | Types | Examples | Description |
|--------|-------|----------|-------------|
| datetime | 46 | 23,000 | Dates, times, timestamps, epochs, components |
| technology | 34 | 17,000 | IPs, MACs, UUIDs, hashes, URLs, file paths |
| identity | 25 | 12,000 | Emails, phones, credit cards, names, SSNs |
| representation | 19 | 9,000 | JSON, CSV, XML, integers, floats, booleans |
| geography | 16 | 8,000 | Coordinates, postal codes, country codes |
| container | 11 | 5,500 | Arrays, key-value pairs, structured formats |

All types are balanced at **500 examples per type** in v1.

## Generation Methodology

Data is generated using type-specific Rust generators defined in the FineType taxonomy:

1. **YAML definitions** specify each type's format, regex pattern, DuckDB cast expression, and example values
2. **Rust generators** produce synthetic examples with:
   - Locale-aware formatting (16+ locales for dates, addresses, phone numbers)
   - Priority-weighted sampling (common formats appear more frequently)
   - Edge case coverage (boundary values, unusual but valid formats)
   - Checksum-valid values where applicable (credit cards via Luhn, IBANs, ISBNs)
3. **Validation** ensures every generated value matches the type's regex pattern and DuckDB cast expression

### Generator Quality

- All generators validated against type definitions via `finetype check`
- Taxonomy alignment verified: every type has a generator, every generator has a type
- 155 automated tests covering generation, inference, and column disambiguation

## Usage

### Load with Python

```python
import json

with open("train.ndjson") as f:
    data = [json.loads(line) for line in f]

texts = [d["text"] for d in data]
labels = [d["classification"] for d in data]
```

### Load with DuckDB

```sql
SELECT * FROM read_json_auto('train.ndjson', format='newline_delimited');
```

### Load with Nushell

```nushell
open train.ndjson | lines | each { from json }
```

## Limitations

- **Synthetic data:** All examples are machine-generated, not sampled from real-world datasets. Real-world data may contain formatting variations not covered by generators.
- **English-centric:** While locale-aware for dates and addresses, the dataset primarily targets English-language data patterns.
- **Balanced distribution:** Real-world data is highly imbalanced (some types are far more common than others). The balanced training set may not reflect deployment distributions.

## Citation

```bibtex
@dataset{finetype_training2026,
  title = {FineType Training Data: Synthetic Examples for Semantic Type Classification},
  author = {Cameron, Hugh},
  year = {2026},
  url = {https://huggingface.co/datasets/noon-org/finetype-training},
  license = {MIT}
}
```
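
The Generation Methodology section mentions two checks that are easy to illustrate: matching a value against a type's regex pattern, and producing checksum-valid credit card numbers via Luhn. The sketch below shows what such a check could look like in Python. It is an illustration under stated assumptions, not FineType's implementation (which is written in Rust): `CARD_PATTERN` is a hypothetical pattern rather than the one in the FineType YAML definitions, and `luhn_valid` and `is_valid_card` are hypothetical helper names.

```python
import re

# Hypothetical pattern for illustration only; the real per-type patterns
# live in FineType's YAML definitions and may differ.
CARD_PATTERN = re.compile(r"[0-9]{13,19}")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def is_valid_card(value: str) -> bool:
    """Check the (assumed) regex first, then the Luhn checksum."""
    return CARD_PATTERN.fullmatch(value) is not None and luhn_valid(value)


# 4111111111111111 is a widely used Luhn-valid test number.
print(is_valid_card("4111111111111111"))
print(is_valid_card("4111111111111112"))
```

Running the regex before the checksum keeps `luhn_valid` from ever seeing non-digit input, which mirrors the order of checks in the validation step: pattern match first, then the stricter type-specific test.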
Provider:
meridian-online