dathuynh1108/ner-address-standard-dataset
Source: Hugging Face, updated 2025-11-27, indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/dathuynh1108/ner-address-standard-dataset
Description:
---
license: mit
---
This dataset contains fully auto‑annotated Vietnamese administrative addresses synthesized from the national unit hierarchy (old 3-level wards/districts/provinces and the new 2-level schema).
Each sample is a tokenized address string accompanied by BIO tags for four entity types: STREET, WARD, DISTRICT, and PROVINCE. The generator blends authentic administrative names (including all official aliases and legacy→modern mappings) with realistic street templates, connectors, abbreviations, accent/no-accent variants, and ordering permutations (street+ward+district+province, ward+province only, etc.). Both synthetic rows and parser-labeled real addresses are normalized, shuffled, and split, so models see noisy capitalization, missing diacritics, compact “p./q./tp.” abbreviations, and both old and new administrative structures. The goal is to train robust Vietnamese NER systems that can recover the full administrative hierarchy from unstructured addresses.
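To make the BIO scheme concrete, here is a minimal sketch of how a tagged sample could be decoded back into entities. The sample address, its tokenization, and the choice to keep the "p./q./tp." prefixes inside the entity spans are illustrative assumptions, not rows taken from the dataset; `bio_to_entities` is a hypothetical helper name.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span that is currently open, if any.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag (or a stray I- tag of a different type) starts a new span.
            flush()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-"):
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O" tag: close any open span.
            flush()
            current_type = None
            current_tokens = []

    flush()
    return entities


# Hypothetical sample (assumed tokenization and labels, for illustration only).
tokens = ["123", "Nguyễn", "Huệ", ",", "p.", "Bến", "Nghé", ",",
          "q.", "1", ",", "tp.", "Hồ", "Chí", "Minh"]
tags = ["B-STREET", "I-STREET", "I-STREET", "O", "B-WARD", "I-WARD", "I-WARD", "O",
        "B-DISTRICT", "I-DISTRICT", "O", "B-PROVINCE", "I-PROVINCE", "I-PROVINCE", "I-PROVINCE"]

entities = bio_to_entities(tokens, tags)
for entity_type, text in entities:
    print(f"{entity_type}: {text}")
```

Treating a stray `I-` tag as the start of a new span is one common convention for tolerating noisy model output; a stricter decoder could raise an error instead.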
Built with real data from my addresses.jsonl file for a better dataset.
Repo: https://github.com/dathuynh1108/address-parser/tree/main/ner
Provider:
dathuynh1108



