
LocalDoc/pii_ner_azerbaijani_extended

Source: Hugging Face. Updated: 2026-03-16. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LocalDoc/pii_ner_azerbaijani_extended
Description:
---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: uid
    dtype: int64
  - name: translated_text
    dtype: string
  - name: privacy_mask
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 191472478
    num_examples: 530913
  download_size: 73794173
  dataset_size: 191472478
task_categories:
- token-classification
language:
- az
tags:
- pii
- ner
- private
- transliteration
- data-augmentation
- llm-generated
- hard-negatives
pretty_name: PII NER Azerbaijani Extended Dataset
size_categories:
- 100K<n<1M
---

# PII NER Azerbaijani Extended Dataset

Extended version of [LocalDoc/pii_ner_azerbaijani](https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani): a synthetic Azerbaijani dataset for **PII-aware Named Entity Recognition** (token classification).

This dataset combines three data generation strategies:

1. **Template-based**: original synthetic data + transliterated variants
2. **LLM-generated PII**: natural sentences with realistic PII in diverse contexts
3. **LLM-generated hard negatives**: sentences WITHOUT PII but with tricky look-alike words

**Note:** All examples are **synthetically generated**. No real persons or contact details are included.

## Dataset Composition

**Total: 530,913 rows**

### Template-Based Data (~481K rows)

The original dataset (~121K rows) augmented with 3 transliteration strategies:

| Source | Rows | Description |
|---|---|---|
| `original` | ~121K | Original Azerbaijani text with special characters |
| `translit_standard` | ~121K | Common informal: `ş→sh`, `ç→ch`, `ğ→gh`, `ə→e`, `ö→o`, `ü→u`, `ı→i` |
| `translit_minimal` | ~121K | Diacritics removed: `ş→s`, `ç→c`, `ğ→g`, `ə→a`, `ö→o`, `ü→u`, `ı→i` |
| `translit_no_digraph` | ~121K | No digraphs: `ş→s`, `ç→c`, `ğ→g`, `ə→e`, `ö→o`, `ü→u`, `ı→i` |

All character offsets are automatically recalculated to match the transliterated text.
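The offset recalculation described above can be illustrated with a short sketch. This is not the authors' implementation: only the `STANDARD_MAP` table is taken from the `translit_standard` row, while the function name `transliterate_with_spans`, the capitalization handling, and the sample sentence are hypothetical. It also assumes the text uses precomposed (NFC) characters.

```python
# Mapping copied from the `translit_standard` row of the table above.
STANDARD_MAP = {
    "ş": "sh", "ç": "ch", "ğ": "gh",
    "ə": "e", "ö": "o", "ü": "u", "ı": "i",
}


def transliterate_with_spans(text, spans, mapping=STANDARD_MAP):
    """Transliterate `text` and shift character spans to the new text.

    Hypothetical helper: `spans` is a list of dicts shaped like the
    `privacy_mask` items (label, start, end, value).
    """
    pieces = []
    new_pos = []  # new_pos[i] = offset in the new text of original char i
    length = 0
    for ch in text:
        new_pos.append(length)
        lower = ch.lower()
        if ch in mapping:
            piece = mapping[ch]
        elif lower in mapping:
            # Assumption: uppercase letters map to a capitalized replacement.
            piece = mapping[lower].capitalize()
        else:
            piece = ch
        pieces.append(piece)
        length += len(piece)
    new_pos.append(length)  # sentinel so that end == len(text) also maps

    new_text = "".join(pieces)
    new_spans = []
    for span in spans:
        start, end = new_pos[span["start"]], new_pos[span["end"]]
        new_spans.append({
            "label": span["label"],
            "start": start,
            "end": end,
            "value": new_text[start:end],  # re-sliced from the new text
        })
    return new_text, new_spans


# Made-up sentence and span, for illustration only.
text = "Mən Şəhla ilə görüşdüm."
spans = [{"label": "GIVENNAME", "start": 4, "end": 9, "value": "Şəhla"}]

new_text, new_spans = transliterate_with_spans(text, spans)
print(new_text)   # Men Shehla ile gorushdum.
print(new_spans)  # the GIVENNAME span now covers characters 4..10
```

Building a per-character position table keeps the logic independent of the mapping, so the same function would work for a one-to-one table such as the `translit_minimal` row, where offsets do not change at all.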
### LLM-Generated Data (~49K rows)

Generated using GPT-4 with [`az-data-generator`](https://github.com/LocalDoc-Azerbaijan/az-data-generator) for realistic PII values:

| Source | Rows | Description |
|---|---|---|
| `llm_pii` | ~25K | Natural sentences with 1-4 PII entities in diverse contexts (CV, bank statements, police reports, chat messages, medical records, contracts, etc.) |
| `llm_mixed` | ~10K | Sentences with BOTH real PII AND non-PII look-alike words (e.g., city name as adjective + city name as address in same sentence) |
| `llm_hard_neg_*` | ~15K | Sentences with ZERO PII but containing trap words that resemble PII |

#### Hard Negative Categories

| Source | Rows | What it teaches |
|---|---|---|
| `llm_hard_neg_city_adjective` | ~2.9K | "bakı küləyi" (weather) ≠ city address |
| `llm_hard_neg_name_common` | ~2.0K | "nərgiz çiçəyi" (flower) ≠ person name |
| `llm_hard_neg_url_hashtag` | ~2.0K | "www.example.az" ≠ email |
| `llm_hard_neg_academic` | ~1.9K | Student IDs, grades ≠ personal IDs |
| `llm_hard_neg_number_context` | ~1.5K | "25 dərəcə" (temperature) ≠ age |
| `llm_hard_neg_year_reference` | ~1.5K | "2008-ci ildə" (year) ≠ date PII |
| `llm_hard_neg_business` | ~1.5K | Contract amounts ≠ personal numbers |
| `llm_hard_neg_news_stats` | ~1.2K | News statistics ≠ PII |

#### LLM Entity Distribution

| Entity | Count |
|---|---|
| GIVENNAME | 33,010 |
| SURNAME | 16,480 |
| CITY | 8,307 |
| DATE | 4,467 |
| CREDITCARDNUMBER | 4,367 |
| PASSPORTNUM | 4,346 |
| TIME | 4,342 |
| EMAIL | 4,335 |
| TAXNUM | 4,330 |
| IDCARDNUM | 4,320 |
| TELEPHONENUM | 4,311 |
| ZIPCODE | 4,309 |
| BUILDINGNUM | 4,276 |
| AGE | 4,270 |
| STREET | 3,883 |

## Dataset Summary

Each row contains:

- `uid` *(int)*: unique record id
- `translated_text` *(string)*: Azerbaijani sentence
- `privacy_mask` *(string; JSON-encoded list)*: character-span annotations for PII entities
  - Each item: `{ "label": str, "start": int, "end": int, "value": str }`
  - Empty list `[]` for hard
negative examples (no PII)
- `source` *(string)*: origin marker (see tables above)

## Entities (PII Labels)

- `GIVENNAME`, `SURNAME`
- `EMAIL`, `TELEPHONENUM`
- `DATE`, `TIME`, `AGE`
- `IDCARDNUM`, `PASSPORTNUM`, `TAXNUM`
- `CREDITCARDNUMBER`
- `CITY`, `STREET`, `BUILDINGNUM`
- `ZIPCODE`

`start`/`end` are **character offsets** in `translated_text` (Python slice semantics).

## Intended Use

- Train/evaluate **token classification** models for Azerbaijani PII detection
- Improve model robustness on **informal/transliterated** Azerbaijani text
- Reduce **false positives** on non-PII text using hard negatives
- Train models to distinguish PII from look-alike words in **mixed contexts**
- Benchmark multilingual NER models on Azerbaijani PII

**Limitations:** synthetic language and formats may differ from real-world distributions; recommended to complement with carefully curated data for production use.

## Quick Start

```python
from datasets import load_dataset
import json

ds = load_dataset("LocalDoc/pii_ner_azerbaijani_extended", split="train")
print(f"Total rows: {len(ds)}")

# Filter by source type
original_only = ds.filter(lambda x: x['source'] == 'original')
llm_pii = ds.filter(lambda x: x['source'] == 'llm_pii')
hard_negs = ds.filter(lambda x: 'hard_neg' in x['source'])
mixed = ds.filter(lambda x: x['source'] == 'llm_mixed')
translit = ds.filter(lambda x: 'translit' in x['source'])

# All template-based (original + transliterations)
template_based = ds.filter(lambda x: x['source'] in (
    'original', 'translit_standard', 'translit_minimal', 'translit_no_digraph'))

# All LLM-generated
llm_all = ds.filter(lambda x: x['source'].startswith('llm_'))

# Inspect a row
row = ds[0]
text = row["translated_text"]
spans = json.loads(row["privacy_mask"])
print(f"[{row['source']}] {text}")
print(spans[:2])
```

## Source & Generation

- **Language:** Azerbaijani (`az`)
- **Base dataset:** [LocalDoc/pii_ner_azerbaijani](https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani) (~121K
rows)
- **Template data:** Synthetic generation using [`az-data-generator`](https://pypi.org/project/az-data-generator/) with programmatic transliteration augmentation
- **LLM data:** Generated using GPT-4 with [`az-data-generator`](https://github.com/LocalDoc-Azerbaijan/az-data-generator) for realistic PII values, verified with automatic offset validation

## CC BY 4.0 License: What It Allows

The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:

### ✅ You Can:

- **Use** the dataset for any purpose, including commercial use.
- **Share** it: copy and redistribute in any medium or format.
- **Adapt** it: remix, transform, and build upon it for any purpose, even commercially.

### 📝 You Must:

- **Give appropriate credit**: attribute the original creator (e.g., by name), link to the license, and indicate if changes were made.
- **Not imply endorsement**: do not suggest the original author endorses you or your use.

### ❌ You Cannot:

- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).

### Summary:

You are free to use, modify, and distribute the dataset, even for commercial purposes, as long as you give proper credit to the original creator.

For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY 4.0 license</a>.

## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
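As a complement to the Quick Start snippet, the sketch below shows one possible way to check `privacy_mask` offsets against Python slice semantics and to turn the character spans into word-level BIO tags for token classification. The whitespace tokenization, the BIO scheme, the helper names `check_spans` and `to_bio`, and the sample row are illustrative assumptions; the dataset card does not prescribe a tagging scheme or tokenizer.

```python
import json
import re


def check_spans(text, spans):
    """Return the spans whose slice of `text` does not equal their `value`."""
    return [s for s in spans if text[s["start"]:s["end"]] != s["value"]]


def to_bio(text, spans):
    """Tag whitespace-separated tokens with BIO labels from character spans.

    A token takes a span's label when the two overlap by at least one
    character; the first overlapping token gets `B-`, later ones `I-`.
    """
    tokens, tags = [], []
    started = set()  # indices of spans that already produced a B- tag
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for i, span in enumerate(spans):
            if match.start() < span["end"] and span["start"] < match.end():
                tag = ("I-" if i in started else "B-") + span["label"]
                started.add(i)
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Made-up row in the dataset's schema (not taken from the dataset).
row = {
    "translated_text": "Əli Məmmədov Bakı şəhərində yaşayır.",
    "privacy_mask": json.dumps([
        {"label": "GIVENNAME", "start": 0, "end": 3, "value": "Əli"},
        {"label": "SURNAME", "start": 4, "end": 12, "value": "Məmmədov"},
        {"label": "CITY", "start": 13, "end": 17, "value": "Bakı"},
    ], ensure_ascii=False),
}

text = row["translated_text"]
spans = json.loads(row["privacy_mask"])  # the column is a JSON string

bad_spans = check_spans(text, spans)
tokens, tags = to_bio(text, spans)
print(bad_spans)
print(list(zip(tokens, tags)))
```

For a hard-negative row, `privacy_mask` decodes to an empty list, so every token receives the `O` tag. When training with a subword tokenizer, aligning spans through the tokenizer's own character offsets is usually preferable to whitespace splitting.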
Provider:
LocalDoc