LocalDoc/pii_ner_azerbaijani_extended
Source: Hugging Face. Updated: 2026-03-16. Indexed: 2026-03-29.
Download link: https://hf-mirror.com/datasets/LocalDoc/pii_ner_azerbaijani_extended
---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: uid
    dtype: int64
  - name: translated_text
    dtype: string
  - name: privacy_mask
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 191472478
    num_examples: 530913
  download_size: 73794173
  dataset_size: 191472478
task_categories:
- token-classification
language:
- az
tags:
- pii
- ner
- private
- transliteration
- data-augmentation
- llm-generated
- hard-negatives
pretty_name: PII NER Azerbaijani Extended Dataset
size_categories:
- 100K<n<1M
---
# PII NER Azerbaijani Extended Dataset
Extended version of [LocalDoc/pii_ner_azerbaijani](https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani) — a synthetic Azerbaijani dataset for **PII-aware Named Entity Recognition** (token classification).
This dataset combines three data generation strategies:
1. **Template-based** — original synthetic data + transliterated variants
2. **LLM-generated PII** — natural sentences with realistic PII in diverse contexts
3. **LLM-generated hard negatives** — sentences WITHOUT PII but with tricky look-alike words
**Note:** All examples are **synthetically generated**. No real persons or contact details are included.
## Dataset Composition
**Total: 530,913 rows**
### Template-Based Data (~481K rows)
The original dataset (~121K rows) augmented with 3 transliteration strategies:
| Source | Rows | Description |
|---|---|---|
| `original` | ~121K | Original Azerbaijani text with special characters |
| `translit_standard` | ~121K | Common informal: `ş→sh`, `ç→ch`, `ğ→gh`, `ə→e`, `ö→o`, `ü→u`, `ı→i` |
| `translit_minimal` | ~121K | Diacritics removed: `ş→s`, `ç→c`, `ğ→g`, `ə→a`, `ö→o`, `ü→u`, `ı→i` |
| `translit_no_digraph` | ~121K | No digraphs: `ş→s`, `ç→c`, `ğ→g`, `ə→e`, `ö→o`, `ü→u`, `ı→i` |
All character offsets are automatically recalculated to match the transliterated text.
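The offset recalculation can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `transliterate_with_offsets` is hypothetical, the mapping is copied from the `translit_standard` row above, and only lowercase letters are handled. The idea is to record where each original character lands in the output, then look up each span's `start` and `end` in that table.

```python
# Mapping copied from the `translit_standard` row of the table above.
# Assumption: lowercase only; uppercase letters are left out of this sketch.
TRANSLIT_STANDARD = {
    "ş": "sh", "ç": "ch", "ğ": "gh",
    "ə": "e", "ö": "o", "ü": "u", "ı": "i",
}


def transliterate_with_offsets(text, spans, mapping=TRANSLIT_STANDARD):
    """Transliterate `text` and shift span offsets to match the new string."""
    pieces = []
    # new_pos[i] is the output index where original character i begins;
    # the extra final entry maps the end-of-text offset.
    new_pos = []
    length = 0
    for ch in text:
        new_pos.append(length)
        repl = mapping.get(ch, ch)
        pieces.append(repl)
        length += len(repl)
    new_pos.append(length)
    new_text = "".join(pieces)

    new_spans = []
    for span in spans:
        start, end = new_pos[span["start"]], new_pos[span["end"]]
        new_spans.append({
            "label": span["label"],
            "start": start,
            "end": end,
            "value": new_text[start:end],
        })
    return new_text, new_spans


# Toy example written for this sketch, not taken from the dataset.
text = "Dostum Rəşad Gəncədə yaşayır."
spans = [
    {"label": "GIVENNAME", "start": 7, "end": 12, "value": "Rəşad"},
    {"label": "CITY", "start": 13, "end": 18, "value": "Gəncə"},
]
new_text, new_spans = transliterate_with_offsets(text, spans)
print(new_text)
print(new_spans)
```

Because digraphs such as `ş→sh` lengthen the text, every span after the first replacement shifts to the right, which is why a per-character position table is used rather than a fixed offset.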
### LLM-Generated Data (~49K rows)
Generated using GPT-4 with [`az-data-generator`](https://github.com/LocalDoc-Azerbaijan/az-data-generator) for realistic PII values:
| Source | Rows | Description |
|---|---|---|
| `llm_pii` | ~25K | Natural sentences with 1-4 PII entities in diverse contexts (CV, bank statements, police reports, chat messages, medical records, contracts, etc.) |
| `llm_mixed` | ~10K | Sentences containing both PII entities and non-PII look-alike words (e.g., a city name used as an adjective and a city name used as an address in the same sentence) |
| `llm_hard_neg_*` | ~15K | Sentences with ZERO PII but containing trap words that resemble PII |
#### Hard Negative Categories
| Source | Rows | What it teaches |
|---|---|---|
| `llm_hard_neg_city_adjective` | ~2.9K | "bakı küləyi" (weather) ≠ city address |
| `llm_hard_neg_name_common` | ~2.0K | "nərgiz çiçəyi" (flower) ≠ person name |
| `llm_hard_neg_url_hashtag` | ~2.0K | "www.example.az" ≠ email |
| `llm_hard_neg_academic` | ~1.9K | Student IDs, grades ≠ personal IDs |
| `llm_hard_neg_number_context` | ~1.5K | "25 dərəcə" (temperature) ≠ age |
| `llm_hard_neg_year_reference` | ~1.5K | "2008-ci ildə" (year) ≠ date PII |
| `llm_hard_neg_business` | ~1.5K | Contract amounts ≠ personal numbers |
| `llm_hard_neg_news_stats` | ~1.2K | News statistics ≠ PII |
#### LLM Entity Distribution
| Entity | Count |
|---|---|
| GIVENNAME | 33,010 |
| SURNAME | 16,480 |
| CITY | 8,307 |
| DATE | 4,467 |
| CREDITCARDNUMBER | 4,367 |
| PASSPORTNUM | 4,346 |
| TIME | 4,342 |
| EMAIL | 4,335 |
| TAXNUM | 4,330 |
| IDCARDNUM | 4,320 |
| TELEPHONENUM | 4,311 |
| ZIPCODE | 4,309 |
| BUILDINGNUM | 4,276 |
| AGE | 4,270 |
| STREET | 3,883 |
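A distribution like the one above could be recomputed from the `privacy_mask` column. The sketch below assumes the table covers rows whose `source` starts with `llm_`; the helper name `count_labels` and the toy rows are hypothetical and only illustrate the counting logic.

```python
import json
from collections import Counter


def count_labels(rows, source_prefix="llm_"):
    """Count entity labels over rows whose `source` starts with `source_prefix`."""
    counts = Counter()
    for row in rows:
        if not row["source"].startswith(source_prefix):
            continue
        # `privacy_mask` is a JSON-encoded list; hard negatives decode to [].
        for span in json.loads(row["privacy_mask"]):
            counts[span["label"]] += 1
    return counts


# Toy rows standing in for the real dataset (assumption: same field names).
toy_rows = [
    {"source": "llm_pii", "privacy_mask": json.dumps([
        {"label": "GIVENNAME", "start": 0, "end": 5, "value": "Aysel"},
        {"label": "CITY", "start": 6, "end": 10, "value": "Bakı"},
    ])},
    {"source": "llm_hard_neg_city_adjective", "privacy_mask": "[]"},
    {"source": "original", "privacy_mask": json.dumps([
        {"label": "GIVENNAME", "start": 0, "end": 5, "value": "Aysel"},
    ])},
]
counts = count_labels(toy_rows)
print(counts.most_common())
```

The same function accepts a loaded `datasets` split in place of `toy_rows`, since iterating over a split yields one dictionary per row.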
## Dataset Summary
Each row contains:
- `uid` *(int)* — unique record id
- `translated_text` *(string)* — Azerbaijani sentence
- `privacy_mask` *(string; JSON-encoded list)* — character-span annotations for PII entities
- Each item: `{ "label": str, "start": int, "end": int, "value": str }`
- Empty list `[]` for hard negative examples (no PII)
- `source` *(string)* — origin marker (see tables above)
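A quick consistency check on a row can be sketched as follows. It assumes that `start`/`end` follow Python slice semantics and that `value` equals the sliced substring. The helper name `find_bad_spans` and the toy rows are hypothetical.

```python
import json


def find_bad_spans(row):
    """Return the spans whose `value` does not match text[start:end]."""
    text = row["translated_text"]
    bad = []
    for span in json.loads(row["privacy_mask"]):
        if text[span["start"]:span["end"]] != span["value"]:
            bad.append(span)
    return bad


# Toy rows written for this sketch, not taken from the dataset.
good_row = {
    "uid": 1,
    "translated_text": "Aysel Bakıda yaşayır.",
    "privacy_mask": json.dumps([
        {"label": "GIVENNAME", "start": 0, "end": 5, "value": "Aysel"},
        {"label": "CITY", "start": 6, "end": 10, "value": "Bakı"},
    ]),
    "source": "original",
}
# Same sentence with a deliberately wrong `end` offset on the CITY span.
bad_row = dict(good_row, privacy_mask=json.dumps([
    {"label": "CITY", "start": 6, "end": 11, "value": "Bakı"},
]))

print(find_bad_spans(good_row))
print(find_bad_spans(bad_row))
```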
## Entities (PII Labels)
- `GIVENNAME`, `SURNAME`
- `EMAIL`, `TELEPHONENUM`
- `DATE`, `TIME`, `AGE`
- `IDCARDNUM`, `PASSPORTNUM`, `TAXNUM`
- `CREDITCARDNUMBER`
- `CITY`, `STREET`, `BUILDINGNUM`
- `ZIPCODE`
`start`/`end` are **character offsets** in `translated_text` (Python slice semantics).
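For token classification, the character spans have to be projected onto tokens. The sketch below uses whitespace tokens and BIO tags, which is a simplifying assumption: a real training setup would more likely align spans with a subword tokenizer's offset mapping. The function name `spans_to_bio` is hypothetical.

```python
import re


def spans_to_bio(text, spans):
    """Project character spans onto whitespace tokens as BIO tags."""
    # Each token keeps its character range so it can be compared with spans.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for span in sorted(spans, key=lambda s: s["start"]):
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token is tagged if its range overlaps the span's range.
            if tok_start < span["end"] and tok_end > span["start"]:
                tags[i] = ("B-" if first else "I-") + span["label"]
                first = False
    return [tok for tok, _, _ in tokens], tags


# Toy example written for this sketch, not taken from the dataset.
text = "Aysel Məmmədova Bakıda yaşayır."
spans = [
    {"label": "GIVENNAME", "start": 0, "end": 5, "value": "Aysel"},
    {"label": "SURNAME", "start": 6, "end": 15, "value": "Məmmədova"},
    {"label": "CITY", "start": 16, "end": 20, "value": "Bakı"},
]
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that the token `Bakıda` carries a case suffix outside the `CITY` span. Whitespace tokenization tags the whole token, which is one reason subword alignment is usually preferable for agglutinative languages such as Azerbaijani.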
## Intended Use
- Train/evaluate **token classification** models for Azerbaijani PII detection
- Improve model robustness on **informal/transliterated** Azerbaijani text
- Reduce **false positives** on non-PII text using hard negatives
- Train models to distinguish PII from look-alike words in **mixed contexts**
- Benchmark multilingual NER models on Azerbaijani PII
**Limitations:** The synthetic language and formats may differ from real-world distributions. For production use, complement this dataset with carefully curated data.
## Quick Start
```python
from datasets import load_dataset
import json
ds = load_dataset("LocalDoc/pii_ner_azerbaijani_extended", split="train")
print(f"Total rows: {len(ds)}")
# Filter by source type
original_only = ds.filter(lambda x: x['source'] == 'original')
llm_pii = ds.filter(lambda x: x['source'] == 'llm_pii')
hard_negs = ds.filter(lambda x: 'hard_neg' in x['source'])
mixed = ds.filter(lambda x: x['source'] == 'llm_mixed')
translit = ds.filter(lambda x: 'translit' in x['source'])
# All template-based (original + transliterations)
template_based = ds.filter(lambda x: x['source'] in (
'original', 'translit_standard', 'translit_minimal', 'translit_no_digraph'))
# All LLM-generated
llm_all = ds.filter(lambda x: x['source'].startswith('llm_'))
# Inspect a row
row = ds[0]
text = row["translated_text"]
spans = json.loads(row["privacy_mask"])
print(f"[{row['source']}] {text}")
print(spans[:2])
```
## Source & Generation
- **Language:** Azerbaijani (`az`)
- **Base dataset:** [LocalDoc/pii_ner_azerbaijani](https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani) (~121K rows)
- **Template data:** Synthetic generation using [`az-data-generator`](https://pypi.org/project/az-data-generator/) with programmatic transliteration augmentation
- **LLM data:** Generated using GPT-4 with [`az-data-generator`](https://github.com/LocalDoc-Azerbaijan/az-data-generator) for realistic PII values, verified with automatic offset validation
## CC BY 4.0 License — What It Allows
The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:
### ✅ You Can:
- **Use** the dataset for any purpose, including commercial use.
- **Share** it — copy and redistribute in any medium or format.
- **Adapt** it — remix, transform, and build upon it for any purpose, even commercially.
### 📝 You Must:
- **Give appropriate credit** — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
- **Not imply endorsement** — Do not suggest the original author endorses you or your use.
### ❌ You Cannot:
- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).
### Summary:
You are free to use, modify, and distribute the dataset, even for commercial purposes, as long as you give proper credit to the original creator.
For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY 4.0 license</a>.
## Contact
For more information, questions, or issues, please contact LocalDoc at v.resad.89@gmail.com.
Provider: LocalDoc