
dathuynh1108/ner-address-standard-dataset

Source: Hugging Face. Updated 2025-11-27; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/dathuynh1108/ner-address-standard-dataset
Description:
License: MIT. This dataset contains fully auto-annotated Vietnamese administrative addresses synthesized from the national administrative unit hierarchy (both the old 3-level ward/district/province schema and the new 2-level schema). Each sample is a tokenized address string accompanied by BIO tags for four entity types: STREET, WARD, DISTRICT, and PROVINCE. The generator blends authentic administrative names (including all official aliases and legacy→modern mappings) with realistic street templates, connectors, abbreviations, accent/no-accent variants, and ordering permutations (street+ward+district+province, ward+province only, etc.). Both synthetic rows and parser-labeled real addresses are normalized, shuffled, and split, so models see noisy capitalization, missing diacritics, compact “p./q./tp.” abbreviations, and both old and new administrative structures. The goal is to train robust Vietnamese NER systems that can recover the full administrative hierarchy from unstructured addresses. Built with real data from my addresses.jsonl file for a better dataset. Repo: https://github.com/dathuynh1108/address-parser/tree/main/ner
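To make the BIO scheme described above concrete, here is a minimal sketch of how a tagged sample could be collapsed back into labeled address parts. The four entity types come from the description; everything else is an assumption for illustration: the function name `decode_bio`, the sample address, its tokenization, and the choice to tag abbreviations such as "p." and "q." as part of the entity are hypothetical and may not match the dataset's actual rows.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the open span, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type = None
        current_tokens = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open span; the token is dropped.
            flush()
    flush()
    return spans


# Hypothetical sample; the real dataset's tokenization may differ.
tokens = ["123", "Lê", "Lợi", ",", "p.", "Bến", "Nghé", ",",
          "q.", "1", ",", "tp.", "Hồ", "Chí", "Minh"]
tags = ["B-STREET", "I-STREET", "I-STREET", "O",
        "B-WARD", "I-WARD", "I-WARD", "O",
        "B-DISTRICT", "I-DISTRICT", "O",
        "B-PROVINCE", "I-PROVINCE", "I-PROVINCE", "I-PROVINCE"]

for entity_type, text in decode_bio(tokens, tags):
    print(f"{entity_type}: {text}")
```

Under the new 2-level schema a sample would simply carry no DISTRICT span, so a decoder like this needs no special casing for old versus new structures.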
Provider:
dathuynh1108