
fawoenix/silent-witness

Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/fawoenix/silent-witness
Resource description:
---
language:
- ar
- fr
- en
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- human-rights
- multilingual
- multimodal
- document-understanding
- arabic
- french
pretty_name: Silent Witness
size_categories:
- 1K<n<10K
---

# Silent Witness

**A Multimodal Multilingual Dataset for Human Rights Violation Detection**

## Dataset Summary

Silent Witness is a passage-level multi-label classification dataset for human rights violation detection in formal institutional documents. Each instance pairs an extracted text passage with the rendered page image from which it originates, annotated against an 11-category violation taxonomy derived from OHCHR international human rights law. Source material is drawn from United Nations bodies (OHCHR) and established NGOs (Human Rights Watch, Amnesty International, FIDH).

## Languages

| Code | Language | Script | Source |
|---|---|---|---|
| `ar` | Arabic (MSA) | Arabic | OHCHR |
| `fr` | French | Latin | FIDH, Amnesty International |
| `en` | English | Latin | Human Rights Watch, Amnesty International |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|---|---|---|
| `passage_id` | string | Unique passage identifier encoding doc, page, index |
| `doc_id` | string | Source document identifier |
| `page_number` | int | Source page number (1-indexed) |
| `language` | string | ISO 639-1 / ISO 639-3 language code |
| `text` | string | Extracted passage text (min 30 words, max 2000 chars) |
| `page_image` | string | Path to rendered page PNG (150 DPI) |
| `labels` | list[string] | Multi-hot violation category labels |
| `confidence` | string | Annotation confidence: `high` / `medium` / `low` |
| `source_org` | string | Source organization |
| `geographic_scope` | string | ISO 3166-1 alpha-2 country code |
| `synthetic` | bool | `true` for synthetic instances only |

### Data Splits

| Split | Passages | Notes |
|---|---|---|
| Train | 1,243 | Stratified by language |
| Validation | 155 | Stratified by language |
| Test | 158 | Stratified by language |
| **Total** | **1,556** | |

## Violation Taxonomy

The label set has 11 categories derived from OHCHR top-level categories. Labels are not mutually exclusive: each passage receives a multi-hot vector. `ENVIRONMENTAL_RIGHTS` was added based on corpus analysis (present in 40% of documents, grounded in UN Resolution 76/300).

| Label | Description |
|---|---|
| `TORTURE_ILL_TREATMENT` | Torture, cruel, inhuman, or degrading treatment |
| `ARBITRARY_DETENTION` | Detention without legal basis or due process |
| `EXTRAJUDICIAL_KILLING` | Killing outside any legal framework |
| `ENFORCED_DISAPPEARANCE` | Abduction by state agents with denial of custody |
| `FREEDOM_OF_EXPRESSION` | Suppression of speech, press, or information |
| `FREEDOM_OF_ASSEMBLY` | Suppression of peaceful assembly or association |
| `DISPLACEMENT_FORCED` | Forced displacement, expulsion, or relocation |
| `SEXUAL_GENDER_BASED_VIOLENCE` | Sexual violence or gender-based persecution |
| `DISCRIMINATION` | Discriminatory treatment based on protected characteristics |
| `FAIR_TRIAL_VIOLATION` | Denial of fair trial or judicial guarantees |
| `ENVIRONMENTAL_RIGHTS` | Violations of the right to a clean, healthy, and sustainable environment |

### Label Distribution (full corpus)

| Label | Count |
|---|---|
| DISCRIMINATION | 1,307 |
| FAIR_TRIAL_VIOLATION | 949 |
| ARBITRARY_DETENTION | 676 |
| TORTURE_ILL_TREATMENT | 674 |
| DISPLACEMENT_FORCED | 576 |
| SEXUAL_GENDER_BASED_VIOLENCE | 530 |
| EXTRAJUDICIAL_KILLING | 497 |
| ENFORCED_DISAPPEARANCE | 397 |
| FREEDOM_OF_ASSEMBLY | 381 |
| FREEDOM_OF_EXPRESSION | 329 |
| ENVIRONMENTAL_RIGHTS | 59 |

## Annotation

**Annotator:** Single primary annotator with multilingual background (Arabic, French, English).
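Because each passage carries a multi-hot label vector, agreement between two annotators is naturally computed one label at a time. The following is a minimal sketch, not the author's annotation tooling: the index order in `LABELS`, the helper names `to_multi_hot` and `cohen_kappa_binary`, and the toy annotations are all assumptions made for illustration. The last lines recompute the macro average from the per-label kappa values reported in this card.

```python
from typing import Sequence

# Index order is an assumption; the card names the 11 labels but fixes no order.
LABELS = [
    "TORTURE_ILL_TREATMENT", "ARBITRARY_DETENTION", "EXTRAJUDICIAL_KILLING",
    "ENFORCED_DISAPPEARANCE", "FREEDOM_OF_EXPRESSION", "FREEDOM_OF_ASSEMBLY",
    "DISPLACEMENT_FORCED", "SEXUAL_GENDER_BASED_VIOLENCE", "DISCRIMINATION",
    "FAIR_TRIAL_VIOLATION", "ENVIRONMENTAL_RIGHTS",
]


def to_multi_hot(labels: Sequence[str]) -> list[int]:
    """Encode one passage's label list as an 11-dimensional 0/1 vector."""
    present = set(labels)
    unknown = present - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in present else 0 for name in LABELS]


def cohen_kappa_binary(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators' 0/1 decisions on a single label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Degenerate case: both annotators gave the same constant answer.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for four passages (not taken from the dataset).
annotator_1 = [
    ["DISCRIMINATION", "FAIR_TRIAL_VIOLATION"],
    ["ARBITRARY_DETENTION"],
    ["DISCRIMINATION"],
    ["TORTURE_ILL_TREATMENT", "ARBITRARY_DETENTION"],
]
annotator_2 = [
    ["DISCRIMINATION"],
    ["ARBITRARY_DETENTION"],
    ["FAIR_TRIAL_VIOLATION"],
    ["TORTURE_ILL_TREATMENT", "ARBITRARY_DETENTION"],
]
m1 = [to_multi_hot(p) for p in annotator_1]
m2 = [to_multi_hot(p) for p in annotator_2]

col = LABELS.index("DISCRIMINATION")
kappa_discrimination = cohen_kappa_binary(
    [row[col] for row in m1], [row[col] for row in m2]
)

# Per-label kappa values as reported in the card, in LABELS order.
reported_kappa = [0.778, 0.799, 0.867, 0.812, 0.847, 0.814,
                  0.789, 0.908, 0.524, 0.697, 0.675]
macro_kappa = sum(reported_kappa) / len(reported_kappa)
print(round(kappa_discrimination, 3), round(macro_kappa, 3))
```

Averaging the per-label kappas with equal weight gives every category the same influence regardless of how often it occurs, which is why a single low-agreement label such as `DISCRIMINATION` visibly lowers the overall figure.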
**Inter-annotator agreement** (15% sample, n=233 passages, second annotator):

| Label | Cohen's Kappa | Interpretation |
|---|---|---|
| TORTURE_ILL_TREATMENT | 0.778 | Substantial |
| ARBITRARY_DETENTION | 0.799 | Substantial |
| EXTRAJUDICIAL_KILLING | 0.867 | Substantial |
| ENFORCED_DISAPPEARANCE | 0.812 | Substantial |
| FREEDOM_OF_EXPRESSION | 0.847 | Substantial |
| FREEDOM_OF_ASSEMBLY | 0.814 | Substantial |
| DISPLACEMENT_FORCED | 0.789 | Substantial |
| SEXUAL_GENDER_BASED_VIOLENCE | 0.908 | Substantial |
| DISCRIMINATION | 0.524 | Moderate |
| FAIR_TRIAL_VIOLATION | 0.697 | Substantial |
| ENVIRONMENTAL_RIGHTS | 0.675 | Substantial |
| **Overall (macro avg)** | **0.774** | **Substantial** |

`DISCRIMINATION` scores only moderate agreement because it overlaps conceptually with several other categories. This is documented as a known limitation.

## Baseline Results

Three baseline experiments on the test set:

| Experiment | Model | Test F1 Macro | Test F1 Micro |
|---|---|---|---|
| E1 (Text-only) | XLM-RoBERTa-base | 0.7931 | 0.8799 |
| E2 (Multimodal) | XLM-RoBERTa-base + ViT-base (frozen) | 0.8070 | 0.8942 |
| E3 (Zero-shot) | CLIP ViT-H-14 (no fine-tuning) | 0.5123 | 0.5474 |

### Per-Language Test F1 Macro

| Language | E1 XLM-R | E2 XLM-R + ViT | E3 CLIP zero-shot |
|---|---|---|---|
| Arabic (MSA) | 0.5854 | 0.5962 | 0.5649 |
| English | 0.6659 | 0.6659 | 0.5631 |
| French | 0.4877 | 0.5197 | 0.3067 |

**Key findings:**

- The multimodal model (E2) matches or outperforms the text-only model (E1) in every language: it improves Arabic and French and ties on English (0.6659). Visual page features provide signal beyond text alone.
- French F1 is the lowest in all three experiments. Likely cause: FIDH legal French terminology is underrepresented relative to English and Arabic in XLM-R pretraining.
- Zero-shot CLIP (E3) achieves high recall but low precision at its optimal threshold (0.1), indicating that the model predicts positive for most passages. Fine-tuning is necessary for reliable detection.
- `ENVIRONMENTAL_RIGHTS` scores near 0.00 in E1 and E2 due to very low test support (7 passages). To be addressed in v1.0 with an expanded corpus.

## Sources

### UN Bodies

| Organization | Portal | Languages |
|---|---|---|
| OHCHR | ohchr.org/en/documents | AR |

### NGOs

| Organization | Portal | Languages |
|---|---|---|
| Human Rights Watch | hrw.org/reports | EN |
| Amnesty International | amnesty.org/en/documents | FR |
| FIDH | fidh.org/en/issues | FR |

## Known Limitations

- A single primary annotator labeled most of the corpus. Inter-annotator agreement was measured on a 15% sample only.
- Page-level image pairing: one visual unit may span multiple unrelated topics. Individual image extraction deferred to v1.0.
- `ENVIRONMENTAL_RIGHTS` has very low support (7 test passages), which is insufficient for reliable model learning at pilot scale.
- No negative examples (passages with no violation content) are included. A balanced negative class is scoped to v1.0.
- The French performance gap (F1 Macro ~0.49–0.52) warrants investigation in v1.0 with an expanded French corpus.

## Versioning

| Version | Scope |
|---|---|
| v0.1 (current) | Pilot: 1,556 passages, 3 languages, 11 labels, 3 baselines |
| v1.0 (planned) | Expanded corpus, negative class, second full annotator, Darija extension |
| v2.0 (planned) | Individual image extraction, additional MENA languages, OCR for scanned Arabic |

## Citation

```bibtex
@dataset{silent_witness_2026,
  author    = {BEN SALEM, Lamia},
  title     = {Silent Witness: A Multimodal Multilingual Dataset for Human Rights Violation Detection},
  year      = {2026},
  publisher = {HuggingFace Datasets},
  url       = {https://huggingface.co/datasets/fawoenix/silent-witness}
}
```

## License

Source documents are used under the following policies:

- UN documents: public domain (ST/AI/189/Add.9/Rev.2)
- Human Rights Watch: open access
- Amnesty International: open access
- FIDH: open access
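A note on the metrics reported under Baseline Results: macro F1 averages per-label F1 with equal weight, so a low-support label that scores near zero (as `ENVIRONMENTAL_RIGHTS` does here) pulls it down far more than it affects micro F1, which pools counts across labels. The sketch below illustrates this on toy data. It is not the evaluation code behind the reported numbers; the function names, the three-label toy matrices, and the choice to score F1 as 0.0 when a label has no true or predicted positives are all assumptions.

```python
from typing import Sequence


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 here when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> tuple[float, float]:
    """Return (macro F1, micro F1) for multi-hot ground truth and predictions."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores with equal weight.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return macro, micro


# Toy data: two frequent labels predicted perfectly, one rare label missed.
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

In this toy case a single missed rare label costs a third of the macro score while micro F1 stays above 0.9, the same direction of gap seen between the macro and micro columns of the baseline table.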
Provided by:
fawoenix