ngocthanhdoan/vietnerm-gplx-dataset

Name: ngocthanhdoan/vietnerm-gplx-dataset
Creator: ngocthanhdoan
Published: 2026-03-28 09:43:29
License: MIT

Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ngocthanhdoan/vietnerm-gplx-dataset

Description:

---
language: vi
tags:
- ner
- vietnamese
- document-ai
- gplx
- synthetic-data
task_categories:
- token-classification
size_categories:
- 1K<n<10K
license: mit
---

# VietNerm: gplx NER Dataset

Synthetic BIO-tagged NER dataset for entity extraction from Vietnamese **gplx** (giấy phép lái xe, driving licence) documents.

## ⚠️ DISCLAIMER: SYNTHETIC / MOCKUP DATA

> **This dataset is generated entirely automatically from templates (synthetic/mockup data) and does NOT contain real personal data.**

- All data is **generated automatically** by a Jinja2 template system plus a random generator
- **No** real documents, real personal information, or data collected from users are used
- Identification numbers (ID, CCCD...) are randomly generated and designed **not to collide** with real data
- OCR noise is injected into the data to simulate real-world conditions
- Purpose: **AI research, Document AI, OCR/NER pipelines**
- Must **not** be used to forge documents, create fake documents, commit fraud, or deceive

## Dataset Description

This dataset contains BIO-tagged token sequences for training NER models on Vietnamese **gplx** documents. The data is synthetically generated, with simulated OCR noise added so that models trained on it are more robust to recognition errors.
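To make the idea of OCR noise on BIO-tagged data concrete, here is a minimal sketch of one way such noise could be injected without breaking the token/tag alignment. It is not the VietNerm generator: the confusion table `OCR_CONFUSIONS`, the function `add_ocr_noise`, its `rate` and `seed` parameters, and the `B-ID` tag are all hypothetical names chosen for illustration. The sketch only swaps single characters, so the number of tokens never changes and every tag still lines up with its token.

```python
import random

# Hypothetical confusion table of visually similar characters.
# The actual noise model used to build this dataset is not documented here.
OCR_CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}


def add_ocr_noise(tokens, ner_tags, rate=0.1, seed=None):
    """Return (noisy_tokens, tags) with confusable characters swapped at `rate`.

    Substitution is one character for one character, so the token count
    is preserved and the BIO tags remain aligned with the tokens.
    """
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        chars = [
            OCR_CONFUSIONS[c] if c in OCR_CONFUSIONS and rng.random() < rate else c
            for c in token
        ]
        noisy.append("".join(chars))
    return noisy, list(ner_tags)


if __name__ == "__main__":
    tokens = ["No", ":", "105"]
    tags = ["O", "O", "B-ID"]  # hypothetical label name
    print(add_ocr_noise(tokens, tags, rate=0.5, seed=0))
```

Noise schemes that insert or delete whitespace would change the token count and would need the tags to be re-aligned, which is why this sketch restricts itself to character substitution.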
### Dataset Statistics

| Split | Samples |
|-------|---------|
| Train | 1800    |
| Test  | 200     |

### Labels

| Label      | Type |
|------------|------|
| `id2label` | -    |
| `label2id` | -    |
| `labels`   | -    |

## Format

Each sample is a JSON object with two fields:

| Field      | Type        | Description                |
|------------|-------------|----------------------------|
| `tokens`   | `List[str]` | Whitespace-tokenized words |
| `ner_tags` | `List[str]` | BIO label for each token   |

## Data Mockup Example

Below is a representative (synthetic) sample from the dataset:

```json
{
  "tokens": [
    "CỘNG", "HÒA", "XÃ", "HỘI", "CHỦ", "NGHĨA", "VIỆT", "NAM",
    "Độc", "lập", "-", "Tự", "do", "-", "Hạnh", "phúc",
    "SOCIALIST", "REPUBLIC", "OF", "VIET"
  ],
  "ner_tags": [
    "O", "O", "O", "O", "O", "O", "O", "O",
    "O", "O", "O", "O", "O", "O", "O", "O",
    "O", "O", "O", "O"
  ]
}
```

## Usage

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("ngocthanhdoan/vietnerm-gplx-dataset")
train = dataset["train"]

# Access a sample
sample = train[0]
print(sample["tokens"])    # e.g. ['CỘNG', 'HÒA', 'XÃ', 'HỘI', ...]
print(sample["ner_tags"])  # e.g. ['O', 'O', 'O', 'O', ...]
```

## Training the NER Model

This dataset is used to train the companion model [`ngocthanhdoan/phobert-gplx-ner`](https://huggingface.co/ngocthanhdoan/phobert-gplx-ner).

```python
from vietnerm import VietNerm

# Select the gplx document type and the Hugging Face account hosting the model
ner = VietNerm(doc_type="gplx", hf_username="ngocthanhdoan")

# Run entity extraction on OCR text
result = ner.extract("your document OCR text here")
print(result)
```

## Ethical Use

This dataset is built for **research and development purposes only**:

- ✅ AI/NLP research
- ✅ Document AI development
- ✅ OCR/NER pipeline prototyping
- ✅ Educational purposes
- ❌ Forging documents
- ❌ Creating fake identity papers
- ❌ Fraud or deception

## About VietNerm

VietNerm is a Document AI Factory for Vietnamese documents. It provides a complete pipeline from template-based synthetic data generation to model training and deployment.

- **Repository**: [Devhub-Solutions/VietNerm](https://github.com/Devhub-Solutions/VietNerm)
- **SDK**: `pip install vietnerm`
- **License**: MIT, Copyright (c) 2026 Devhub Solutions
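
The `tokens` / `ner_tags` format described under Format pairs each whitespace token with one BIO label. As a minimal sketch of how such a pair can be turned back into entity strings, the helper below groups a `B-` tag and the `I-` tags that follow it into one span. The function name `bio_to_spans` and the label names `NAME` and `ID` are hypothetical; the real label set is whatever the dataset's `ner_tags` contain. An `I-` tag that does not continue a span of the same type is treated as outside, which is one common convention and not necessarily the one used by VietNerm.

```python
def bio_to_spans(tokens, ner_tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        # Emit the span collected so far, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, ner_tags):
        if tag.startswith("B-"):
            # A B- tag always closes the previous span and opens a new one.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continue the open span of the same entity type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span.
            flush()
            current_label, current_tokens = None, []
    flush()
    return spans


if __name__ == "__main__":
    tokens = ["Name", ":", "NGUYEN", "VAN", "A", "No", ":", "123"]
    tags = ["O", "O", "B-NAME", "I-NAME", "I-NAME", "O", "O", "B-ID"]
    print(bio_to_spans(tokens, tags))
```

Applied to a row loaded with `load_dataset`, the call would be `bio_to_spans(sample["tokens"], sample["ner_tags"])`.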

Provider:

ngocthanhdoan
