lszoszk/treaty-bodies-general-comments
Source: Hugging Face. Updated 2026-03-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/lszoszk/treaty-bodies-general-comments
Resource summary:
---
pretty_name: Treaty Bodies General Comments
homepage: https://doi.org/10.5281/zenodo.14781691
license: mit
annotations_creators:
- expert-generated
language_creators:
- found
multilinguality:
- monolingual
language:
- en
source_datasets:
- original
task_categories:
- text-classification
- text-retrieval
task_ids:
- multi-label-classification
size_categories:
- 1K<n<10K
tags:
- treaty-bodies
- general-comments
- human-rights
- legal
- policy
viewer: true
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-00000-of-00001.parquet
---
# Dataset Card for Treaty Bodies General Comments
## Dataset Summary
This dataset packages the current JSON files in `General Comments/` as a Hugging Face-ready parquet dataset. Each row is a text segment from a treaty body general comment or general recommendation, paired with zero or more population-group labels from the source annotation and enriched with document metadata from `GC_info.json`.
The package was prepared from the current top-level `General Comments/*.json` files only; it does not use `General Comments/old_files` or the broader `export-uhri.json` export.
`GC_info.json` contains 182 metadata records, 181 of which match the current source files in `General Comments/`.
Current package statistics:
- 181 source JSON documents
- 6,608 text segments
- 4,206 segments with at least one label
- 19 distinct labels
- 110 general comments and 71 general recommendations
- 3 joint documents
Treaty bodies represented in the current file set:
- `CAT`
- `CCPR`
- `CED`
- `CEDAW`
- `CERD`
- `CESCR`
- `CMW`
- `CRC`
- `CRPD`
## Supported Tasks and Leaderboards
This dataset is suitable for:
- Multi-label classification of population groups mentioned in treaty body text segments
- Semantic search and retrieval over segmented general comments
- Weak supervision, label enrichment, or taxonomy alignment work on human rights text
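For the multi-label classification task, the `labels` list on each row usually has to be turned into a fixed-width 0/1 target vector. The following is a minimal sketch of that encoding using only the standard library. The helper names `build_label_index` and `multi_hot` and the sample rows are illustrative assumptions, not part of the package.

```python
from typing import Dict, List, Sequence


def build_label_index(rows: Sequence[dict]) -> Dict[str, int]:
    """Map every label seen in `rows` to a column index (sorted for stability)."""
    vocabulary = sorted({label for row in rows for label in row["labels"]})
    return {label: i for i, label in enumerate(vocabulary)}


def multi_hot(labels: Sequence[str], label_index: Dict[str, int]) -> List[int]:
    """Encode one segment's label list as a 0/1 vector."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


# Illustrative rows only: the texts are placeholders, not dataset content.
sample_rows = [
    {"text": "placeholder segment A", "labels": ["Children", "Women/girls"]},
    {"text": "placeholder segment B", "labels": []},
    {"text": "placeholder segment C", "labels": ["Migrants"]},
]

label_index = build_label_index(sample_rows)
targets = [multi_hot(row["labels"], label_index) for row in sample_rows]
```

Rows with an empty `labels` list map to an all-zero vector, so unlabeled segments can stay in the training data as negatives if that suits the experiment.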
## Languages
The source text in the current files is English.
## Dataset Structure
The dataset has a single `train` split stored as parquet.
The package also includes the following supplemental files:
- `document_index.parquet`: one row per source document, with title, signature, committee metadata, adoption date, source URL, and document-level label coverage statistics
- `GC_info.json`: bundled source metadata used to enrich the dataset
- `scripts/prepare_treaty_bodies_hf.py`: bundled preparation script used to generate the package
### Data Instances
```json
{
  "row_id": "CAT_GC1_art.3_v1:1",
  "source_file": "Annotated_CAT_GC1_art.3_v1.json",
  "document_slug": "CAT_GC1_art.3_v1",
  "document_title": "General Comment No. 01: Implementation of article 3 of the Convention in the context of article 22",
  "document_title_short": "GC1: Implementation of Art. 3 in the context of Art. 22",
  "signature": "A/53/44",
  "adoption_date": "21 Nov 1997",
  "adoption_date_iso": "1997-11-21",
  "adoption_year": 1997,
  "adoption_year_source": 1997,
  "adoption_year_mismatch": false,
  "committee": "CAT",
  "committee_codes": ["CAT"],
  "source_url": "https://tbinternet.ohchr.org/_layouts/15/treatybodyexternal/Download.aspx?symbolno=A%2F53%2F44&Lang=en",
  "treaty_body_codes": ["CAT"],
  "is_joint_document": false,
  "document_type_code": "GC",
  "document_type": "general_comment",
  "document_number": 1,
  "topic_slug": "art.3-v1",
  "segment_position": 1,
  "segment_id": "1",
  "labels": [],
  "label_count": 0,
  "has_labels": false,
  "text": "Article 3 is confined in its application to cases where there are substantial grounds for believing that the author would be in danger of being subjected to torture as defined in article 1 of the Convention.",
  "text_length_chars": 207,
  "text_length_words": 36
}
```
### Data Fields
- `row_id`: synthetic unique identifier built from `document_slug` and `segment_position`
- `source_file`: source JSON filename inside `General Comments/`
- `document_slug`: filename-derived document identifier without the `Annotated_` prefix
- `document_title`: full document title from `GC_info.json`
- `document_title_short`: simplified title from `GC_info.json`
- `signature`: official signature from `GC_info.json`
- `adoption_date`: adoption date from `GC_info.json`
- `adoption_date_iso`: adoption date normalized to ISO `YYYY-MM-DD`
- `adoption_year`: normalized adoption year, derived from `adoption_date_iso` when available
- `adoption_year_source`: raw adoption year from `GC_info.json`
- `adoption_year_mismatch`: whether the raw source year conflicts with the parsed date year
- `committee`: committee string from `GC_info.json`
- `committee_codes`: normalized committee codes parsed from `GC_info.json`
- `source_url`: source link from `GC_info.json`
- `treaty_body_codes`: one or more normalized treaty body codes parsed from the filename
- `is_joint_document`: whether the document is a joint text across more than one treaty body
- `document_type_code`: short source-derived type code, usually `GC` or `GR`
- `document_type`: normalized type label, either `general_comment` or `general_recommendation`
- `document_number`: document number parsed from the filename
- `topic_slug`: filename-derived topic slug
- `segment_position`: 1-based position of the segment within the source file
- `segment_id`: original `ID` value from the source file when present
- `labels`: zero or more population-group labels from the source annotation
- `label_count`: number of labels attached to the segment
- `has_labels`: whether the segment has at least one label
- `text`: normalized segment text
- `text_length_chars`: character count of `text`
- `text_length_words`: whitespace-token word count of `text`
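Several of these fields are simple functions of other fields. The sketch below shows one way they could be derived. The `row_id` format (`document_slug`, a colon, then `segment_position`) is inferred from the example instance above, and the function name `derive_row_fields` is hypothetical, so this is not necessarily how the preparation script computes them.

```python
def derive_row_fields(document_slug, segment_position, labels, text):
    """Recompute the derived per-row fields from their inputs (illustrative)."""
    return {
        # Assumed format, inferred from the example "CAT_GC1_art.3_v1:1".
        "row_id": f"{document_slug}:{segment_position}",
        "label_count": len(labels),
        "has_labels": len(labels) > 0,
        "text_length_chars": len(text),
        # Whitespace-token word count, as described for `text_length_words`.
        "text_length_words": len(text.split()),
    }


# Text copied from the example instance in this card.
example_text = (
    "Article 3 is confined in its application to cases where there are "
    "substantial grounds for believing that the author would be in danger "
    "of being subjected to torture as defined in article 1 of the Convention."
)

derived = derive_row_fields("CAT_GC1_art.3_v1", 1, [], example_text)
```

A check like this can be useful after filtering or re-segmenting rows, when the stored length and label-count columns may no longer match the text.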
## Dataset Creation
### Source Data
The source consists of annotated JSON files stored in the repository's `General Comments/` folder. Every current top-level JSON file follows the same basic schema:
```json
{
  "ID": 1,
  "Labels": ["Children"],
  "Text": "..."
}
```
### Processing
The HF package is produced from the local preparation script `scripts/prepare_treaty_bodies_hf.py`. The script:
1. Reads only `General Comments/*.json`
2. Uses `GC_info.json` as the document metadata source file
3. Excludes `old_files` by construction because it does not recurse into subdirectories
4. Normalizes line breaks and whitespace in `Text`
5. Preserves multilabel annotations in `Labels`
6. Joins each source file to its metadata entry by filename
7. Normalizes committee codes and adoption dates
8. Validates that every source file has matching metadata and that committee codes agree between `GC_info.json` and filenames
9. Writes a row-level parquet split, `document_index.parquet`, and `dataset_metadata.json`
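Steps 4 and 7 above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `scripts/prepare_treaty_bodies_hf.py`: the function names are hypothetical, the date format `%d %b %Y` is inferred from the example value `21 Nov 1997`, and other date formats in `GC_info.json` are not handled.

```python
import re
from datetime import datetime


def normalize_text(raw: str) -> str:
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def normalize_adoption_date(raw_date: str, raw_year):
    """Parse a date such as '21 Nov 1997' and compare its year with the raw year.

    Note: `%b` relies on English month abbreviations, which is what Python
    uses unless the process locale has been changed.
    """
    parsed = datetime.strptime(raw_date, "%d %b %Y").date()
    return {
        "adoption_date_iso": parsed.isoformat(),
        "adoption_year": parsed.year,
        "adoption_year_source": raw_year,
        "adoption_year_mismatch": raw_year is not None and raw_year != parsed.year,
    }


clean = normalize_text("Article 3\n is   confined ")
dates = normalize_adoption_date("21 Nov 1997", 1997)
```

Keeping both the raw year and the parsed year in the result mirrors the card's `adoption_year_source` and `adoption_year` fields, so a mismatch stays visible instead of being silently overwritten.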
### Quality Checks
The preparation script performs a few basic integrity checks:
- every current `General Comments/*.json` file must have a matching entry in `GC_info.json`
- duplicate metadata entries by filename are rejected
- committee codes parsed from `GC_info.json` must match the filename-derived treaty body codes
- adoption year mismatches are detected and reported while preserving both the raw source value and the normalized year
- empty text rows are counted and reported in `dataset_metadata.json`
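The first two checks amount to comparing two lists of filenames. A minimal sketch, assuming the source filenames and the metadata filenames have already been collected into plain lists (the function name `check_metadata_coverage` is hypothetical):

```python
from collections import Counter


def check_metadata_coverage(source_files, metadata_filenames):
    """Reject duplicate or missing metadata; return metadata entries with no file."""
    counts = Counter(metadata_filenames)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate metadata entries: {duplicates}")

    missing = sorted(set(source_files) - set(counts))
    if missing:
        raise ValueError(f"source files without metadata: {missing}")

    # Extra metadata entries are tolerated and only reported.
    return sorted(set(counts) - set(source_files))


unused = check_metadata_coverage(
    ["Annotated_CAT_GC1_art.3_v1.json"],
    ["Annotated_CAT_GC1_art.3_v1.json", "Annotated_CRC-GC18-Harmful.json"],
)
```

Treating unmatched metadata as a report rather than an error is a design choice in this sketch; it is consistent with the card's note that `GC_info.json` holds one entry with no current source file.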
### Label Distribution
Most frequent labels in the current package:
- `Children`: 2,210
- `Women/girls`: 1,487
- `Persons with disabilities`: 687
- `Migrants`: 528
- `Indigenous peoples`: 335
- `Persons deprived of their liberty`: 316
- `Refugees & asylum-seekers`: 248
- `Adolescents`: 232
- `Persons living in rural/remote areas`: 220
- `Persons affected by armed conflict`: 209
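Counts of this kind can be recomputed from the `labels` column. The sketch below uses a few made-up rows to show the counting logic; the helper name `label_frequencies` is an assumption, and the sample rows do not reproduce the figures above.

```python
from collections import Counter


def label_frequencies(rows):
    """Count how many rows carry each label, most frequent first."""
    counter = Counter()
    for row in rows:
        counter.update(row["labels"])
    return counter.most_common()


# Illustrative rows only, not taken from the dataset.
sample_rows = [
    {"labels": ["Children", "Women/girls"]},
    {"labels": []},
    {"labels": ["Children"]},
]

frequencies = label_frequencies(sample_rows)
labeled_rows = sum(1 for row in sample_rows if row["labels"])
```

Because a segment can carry several labels, the per-label counts can sum to more than the number of labeled segments.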
### Labels and Annotation Process
According to the project [README](https://github.com/lszoszk/UN-TreatyBodiesDocSearch) and the repository's [`labels_annotation.py`](https://raw.githubusercontent.com/lszoszk/UN-TreatyBodiesDocSearch/main/labels_annotation.py), the labels in this dataset are "concerned groups/persons" labels that the application uses to filter and search paragraph-level results.
In the current dataset, the label inventory is:
- `Adolescents`
- `Children`
- `Children in alternative care`
- `Indigenous peoples`
- `Internally displaced persons`
- `LGBTI+`
- `Migrants`
- `Non-citizens and stateless`
- `Persons affected by armed conflict`
- `Persons affected by natural disasters`
- `Persons deprived of their liberty`
- `Persons in street situations`
- `Persons living in poverty`
- `Persons living in rural/remote areas`
- `Persons living with HIV/AIDS`
- `Persons with disabilities`
- `Refugees & asylum-seekers`
- `Roma, Gypsies, Sinti and Travellers`
- `Women/girls`
The label creation process is rule-based rather than manual paragraph-by-paragraph annotation. The repository script defines a mapping from each label to a curated list of keywords and phrases, then assigns every label whose keyword list matches a paragraph's text. Because of that design, these labels are best understood as heuristic weak labels for search, filtering, and exploratory analysis, not as exhaustive expert annotations.
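The general shape of such rule-based labelling can be sketched as below. The keyword lists are hypothetical stand-ins, not the curated lists in `labels_annotation.py`, and the case-insensitive whole-word matching is an assumption about the matching strategy.

```python
import re

# Hypothetical keyword lists for illustration only.
HYPOTHETICAL_KEYWORDS = {
    "Children": ["child", "children"],
    "Migrants": ["migrant", "migrants", "migrant workers"],
}


def assign_labels(text, keyword_map):
    """Return every label with at least one keyword matching `text`."""
    lowered = text.lower()
    labels = []
    for label, keywords in keyword_map.items():
        patterns = (r"\b" + re.escape(k.lower()) + r"\b" for k in keywords)
        if any(re.search(p, lowered) for p in patterns):
            labels.append(label)
    return sorted(labels)


both = assign_labels("States should protect migrant children.", HYPOTHETICAL_KEYWORDS)
# A negated mention still matches, which is one source of false positives.
negated = assign_labels("This paragraph does not concern children.", HYPOTHETICAL_KEYWORDS)
```

The second call shows why keyword matches are better treated as weak labels: the matcher sees the word, not the context it appears in.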
## Considerations for Use
- Labels are sparse: 2,402 rows have no labels in the current source files.
- `GC_info.json` contains one extra metadata entry, `Annotated_CRC-GC18-Harmful.json`, that does not correspond to a current file in `General Comments/`.
- Two metadata records contain a year/date mismatch in `GC_info.json`; the package preserves the raw source year in `adoption_year_source` and exposes the normalized year in `adoption_year`.
- Labels are generated through keyword matching, so false positives, false negatives, and missed contextual mentions are possible.
- Three rows in the current source files are missing an `ID`; use `segment_position` or `row_id` as the stable row identifier.
- This package does not include predefined validation or test splits; all rows are in the single `train` split.
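Users who need held-out data have to create their own partition. One option, sketched below, is to assign whole documents to a split by hashing `document_slug`, so that segments from the same general comment never end up on both sides of an evaluation. The function name `assign_split` and the 10% shares are arbitrary assumptions, not a recommendation from the dataset authors.

```python
import hashlib


def assign_split(document_slug, validation_share=0.1, test_share=0.1):
    """Deterministically map a document to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(document_slug.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_share:
        return "test"
    if bucket < test_share + validation_share:
        return "validation"
    return "train"


# Illustrative rows only; every row of a document gets that document's split.
sample_rows = [
    {"document_slug": "CAT_GC1_art.3_v1", "segment_position": 1},
    {"document_slug": "CAT_GC1_art.3_v1", "segment_position": 2},
]
splits = [assign_split(row["document_slug"]) for row in sample_rows]
```

Hashing rather than random sampling keeps the assignment stable across runs and across dataset updates, at the cost of split sizes that only approximate the requested shares.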
## Citation
Suggested citation for the dataset and companion scripts:
Szoszkiewicz, L., & zuzkow. (2025). *lszoszk/UN-TreatyBodiesDocSearch: UN Treaty Body General Comments/Recommendations - Full Dataset* (v1.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.14781691
```bibtex
@software{szoszkiewicz_un_2025,
  author = {Szoszkiewicz, Lukasz and zuzkow},
  title = {lszoszk/UN-TreatyBodiesDocSearch: UN Treaty Body General Comments/Recommendations - Full Dataset},
  year = {2025},
  month = jan,
  publisher = {Zenodo},
  version = {v1.0},
  doi = {10.5281/zenodo.14781691},
  url = {https://doi.org/10.5281/zenodo.14781691}
}
```
## Direct Use
```python
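from datasets import load_dataset
import pandas as pd

# Load the segment-level rows from the local parquet file as the `train` split.
dataset = load_dataset(
    "parquet",
    data_files={"train": "data/train-00000-of-00001.parquet"},
)

# Load the supplemental document-level index (one row per source document).
document_index = pd.read_parquet("document_index.parquet")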
```
If the folder is uploaded as a Hugging Face dataset repository, consumers can also load it directly from the Hub repository name.
Provider:
lszoszk



