
respinosamena/Helios-Nexus-JSON-Data

Source: Hugging Face. Updated 2026-03-30. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/respinosamena/Helios-Nexus-JSON-Data
Resource description:
---
language:
- en
license: apache-2.0
tags:
- information-extraction
- json
- rag
- structured-data
- synthetic
- legacy-database-modernization
task_categories:
- text-generation
- feature-extraction
size_categories:
- 1B<n<10B
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train-*.parquet"
---

# Helios Nano JSON Data

Large-scale synthetic dataset for training small language models (SLMs) on **structured information extraction**: converting unstructured text into JSON.

## Purpose

Designed for fine-tuning a 400M-parameter extraction engine that:

- Reads unstructured business documents (invoices, medical records, contracts, etc.)
- Follows a provided JSON schema
- Outputs clean, structured JSON

Ideal for **legacy database modernization** and **RAG pipelines**.

## Dataset Structure

Each row contains:

| Column | Type | Description |
|---|---|---|
| `industry` | string | Source industry (finance, healthcare, hr, legal, …) |
| `doc_type` | string | Document type (invoice, prescription, contract, …) |
| `schema_json` | string | JSON schema the model should extract |
| `raw_text` | string | Unstructured source document |
| `extracted_json` | string | Gold-standard structured extraction |

## Coverage

**16 industries**, **41 document types**, including:

- Finance: invoices, receipts, payroll, wire transfers, tax summaries, bank transactions
- Healthcare: patient records, prescriptions, lab results, referrals
- HR: employee records, job postings, performance reviews
- Legal: contract summaries
- Real Estate: property listings, lease agreements
- Logistics: shipping notices, purchase orders, inventory, customs declarations
- Retail: orders, returns
- Insurance: claims
- Education: enrollment, scholarships
- Manufacturing: quality inspections, maintenance logs
- Government: business licenses, building permits
- And more…

## Format Diversity

Text fields use randomized formatting for dates (`Sept 29` / `09-29-2024` / `2024-09-29`), currency (`$1,234.56` / `USD 1234.56`), phone numbers, IDs, and document layout (formal headers vs. narrative prose vs. email style).

## Stats

- **Shards**: 26
- **Disk size**: 12.2 GB (Snappy-compressed Parquet)
- **Target**: 10B tokens (BPE, vocab 32768)

## Usage

```python
from datasets import load_dataset

ds = load_dataset("respinosamena/Helios-Nano-JSON-Data", split="train")
print(ds[0])
```

## Training Prompt Format

```
<|schema|>{schema_json}<|end_turn|>
<|document|>{raw_text}<|end_turn|>
<|extract|>{extracted_json}<|end_turn|>
```

## License

Apache 2.0
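
## Example: Assembling a Training Sample

The training prompt format above can be sketched as a small helper that turns one row into a prompt/target pair. This is an illustrative sketch, not the author's actual preprocessing code: the function names (`split_prompt_and_target`, `format_example`, `is_valid_json`) and the sample row are hypothetical, and joining the three segments with newlines is an assumption, since the card does not state the exact separator.

```python
import json

# Assumption: segments are separated by newlines; the card only lists the
# three tagged segments and does not specify the separator.
PROMPT_TEMPLATE = (
    "<|schema|>{schema_json}<|end_turn|>\n"
    "<|document|>{raw_text}<|end_turn|>\n"
    "<|extract|>"
)
TARGET_TEMPLATE = "{extracted_json}<|end_turn|>"


def split_prompt_and_target(row):
    """Return (prompt, target) for one row; the prompt ends at <|extract|>."""
    prompt = PROMPT_TEMPLATE.format(
        schema_json=row["schema_json"], raw_text=row["raw_text"]
    )
    target = TARGET_TEMPLATE.format(extracted_json=row["extracted_json"])
    return prompt, target


def format_example(row):
    """Return the full training string (prompt followed by target)."""
    prompt, target = split_prompt_and_target(row)
    return prompt + target


def is_valid_json(text):
    """Check that a string column (e.g. extracted_json) parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical row using the column names from the Dataset Structure table;
# the values are made up for illustration and are not taken from the dataset.
sample_row = {
    "industry": "finance",
    "doc_type": "invoice",
    "schema_json": json.dumps({"invoice_id": "string", "total": "number"}),
    "raw_text": "Invoice INV-001, total due USD 1234.56.",
    "extracted_json": json.dumps({"invoice_id": "INV-001", "total": 1234.56}),
}

print(format_example(sample_row))
```

Splitting the string at `<|extract|>` keeps the same helper usable at inference time, where only the prompt half is fed to the model, and makes it straightforward to mask the prompt tokens from the loss if a training setup calls for that.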
Provided by:
respinosamena