mouseart2025/ChiNovelKE
Source: Hugging Face (updated 2026-03-26, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/mouseart2025/ChiNovelKE
Resource description:

---
license: cc-by-4.0
language:
- zh
task_categories:
- token-classification
- text-classification
tags:
- narrative
- knowledge-extraction
- chinese
- literature
- information-extraction
- spatial-reasoning
pretty_name: ChiNovelKE - Chinese Novel Knowledge Extraction Benchmark
size_categories:
- n<1K
---
# ChiNovelKE: Chinese Novel Knowledge Extraction Benchmark
**The first benchmark for evaluating structured knowledge extraction from Chinese long-form fiction.**
## Overview
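ChiNovelKE provides human-annotated ground truth for evaluating five dimensions of narrative knowledge extraction across three classical Chinese novels. Apart from the Chapters column, the table lists the number of annotated entries per novel and annotation type; a dash means that annotation type is not provided for that novel: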
| Novel | Genre | Chapters | Characters | Relations | Aliases | Location Hierarchy |
|-------|-------|----------|------------|-----------|---------|-------------------|
| 西游记 (Journey to the West) | Fantasy | 100 | 50 | 50 | 28 | 74 |
| 红楼梦 (Dream of the Red Chamber) | Realistic | 122 | 50 | 50 | — | 61 |
| 水浒传 (Water Margin) | Wuxia | 112 | 50 | 50 | 17 | — |
**Total: 480 annotated entries across 5 evaluation dimensions.**
## Evaluation Dimensions
### 1. Character Extraction (Entity Precision)
Each entry contains the system-extracted character name, mention frequency, and human annotation:
- `is_valid_character`: true (named character) / false (generic term, e.g., 土地, 小妖)
- `correct_name`: canonical name for alias merging (e.g., 行者 → 孙悟空)
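
As a rough illustration of how these fields could be scored, the sketch below computes entity precision and groups names under their canonical form. It assumes precision is simply the share of entries with `is_valid_character` set to true, and that a missing `correct_name` means the extracted name is already canonical; the official computation lives in `eval_dashboard.py` and may differ. The sample entries are toy data, not dataset content.

```python
from collections import defaultdict


def entity_precision(entries):
    """Share of system-extracted names that annotators marked as real characters."""
    if not entries:
        return 0.0
    valid = sum(1 for e in entries if e["is_valid_character"])
    return valid / len(entries)


def merge_by_canonical(entries):
    """Group valid names under their canonical name (assumed fallback: the name itself)."""
    groups = defaultdict(list)
    for e in entries:
        if e["is_valid_character"]:
            groups[e.get("correct_name") or e["name"]].append(e["name"])
    return dict(groups)


# Toy entries for illustration only; not taken from the dataset.
sample = [
    {"name": "孙悟空", "is_valid_character": True, "correct_name": "孙悟空"},
    {"name": "行者", "is_valid_character": True, "correct_name": "孙悟空"},
    {"name": "小妖", "is_valid_character": False, "correct_name": None},
    {"name": "唐僧", "is_valid_character": True, "correct_name": None},
]

print(f"precision = {entity_precision(sample):.2f}")
print(merge_by_canonical(sample))
```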
### 2. Relationship Classification
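Each entry contains a character pair with the fields below; the baseline results report type accuracy and category accuracy as separate metrics: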
- `system_type`: LLM-extracted relationship type
- `correct_type`: human-annotated correct type (e.g., 师徒, 兄弟, 敌对)
- `correct_category`: family / intimate / hierarchical / social / hostile / other
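
One plausible way to score these entries is exact-match accuracy between `system_type` and `correct_type`, plus a tally of the most common mislabelings. This is a sketch under that assumption, not the benchmark's official metric code. Category accuracy is left out because the fields listed above do not name a system-side category field. The sample entries are toy data.

```python
from collections import Counter


def type_accuracy(entries):
    """Fraction of pairs whose system type exactly matches the human label."""
    if not entries:
        return 0.0
    hits = sum(1 for e in entries if e["system_type"] == e["correct_type"])
    return hits / len(entries)


def type_confusions(entries):
    """Count (system_type, correct_type) pairs where the system was wrong."""
    return Counter(
        (e["system_type"], e["correct_type"])
        for e in entries
        if e["system_type"] != e["correct_type"]
    )


# Toy entries for illustration only; not taken from the dataset.
sample = [
    {"system_type": "师徒", "correct_type": "师徒", "correct_category": "hierarchical"},
    {"system_type": "朋友", "correct_type": "兄弟", "correct_category": "family"},
    {"system_type": "敌对", "correct_type": "敌对", "correct_category": "hostile"},
]

print(f"type accuracy = {type_accuracy(sample):.3f}")
print(type_confusions(sample).most_common(3))
```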
### 3. Alias Resolution
Each entry contains an alias group with:
- `canonical_name`: the primary name
- `system_aliases`: system-detected aliases
- `is_correct_grouping`: human judgment on group correctness
- `wrong_aliases` / `missing_aliases`: specific errors identified
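
The annotations above are enough to both score a system's alias groups and repair them. The sketch below assumes that `system_aliases`, `wrong_aliases`, and `missing_aliases` are lists of strings and that group accuracy is the share of groups with `is_correct_grouping` set to true; neither assumption is confirmed by the dataset card. The sample groups are toy data.

```python
def alias_group_accuracy(groups):
    """Share of alias groups that annotators judged entirely correct."""
    if not groups:
        return 0.0
    return sum(1 for g in groups if g["is_correct_grouping"]) / len(groups)


def corrected_aliases(group):
    """Apply the annotated fixes: drop wrong aliases, then append missing ones."""
    wrong = set(group.get("wrong_aliases") or [])
    kept = [a for a in group["system_aliases"] if a not in wrong]
    for alias in group.get("missing_aliases") or []:
        if alias not in kept:
            kept.append(alias)
    return kept


# Toy groups for illustration only; not taken from the dataset.
sample = [
    {
        "canonical_name": "孙悟空",
        "system_aliases": ["行者", "悟空", "八戒"],
        "is_correct_grouping": False,
        "wrong_aliases": ["八戒"],
        "missing_aliases": ["齐天大圣"],
    },
    {
        "canonical_name": "宋江",
        "system_aliases": ["及时雨", "宋公明"],
        "is_correct_grouping": True,
        "wrong_aliases": [],
        "missing_aliases": [],
    },
]

print(f"group accuracy = {alias_group_accuracy(sample):.2f}")
print(corrected_aliases(sample[0]))
```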
### 4. Location Hierarchy (Gold Standard)
Each entry contains a location with:
- `name`: location name
- `correct_parent`: direct parent in the containment hierarchy
- `tier`: geographic scale (continent / kingdom / region / city / site / building)
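
Because every entry stores only its direct parent, the full containment chain can be recovered by walking parent links. The sketch below does that and also scores a set of predicted parents against the gold ones. It assumes top-level locations carry an empty or null `correct_parent`, and that precision counts only exact direct-parent matches, so a prediction that skips a level is scored as wrong. The helper names and sample entries are illustrative, not part of the dataset.

```python
def build_parent_map(entries):
    """Map each location name to its gold direct parent."""
    return {e["name"]: e["correct_parent"] for e in entries}


def ancestors(name, parent_of):
    """Walk direct-parent links upward; stops at a root or if a cycle is detected."""
    chain, seen = [], {name}
    current = parent_of.get(name)
    while current and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


def parent_precision(predicted, parent_of):
    """Share of predicted parents (for locations in the gold set) that match exactly."""
    scored = [n for n in predicted if n in parent_of]
    if not scored:
        return 0.0
    return sum(1 for n in scored if predicted[n] == parent_of[n]) / len(scored)


# Toy gold entries for illustration only; not taken from the dataset.
gold = [
    {"name": "东胜神洲", "correct_parent": None, "tier": "continent"},
    {"name": "傲来国", "correct_parent": "东胜神洲", "tier": "kingdom"},
    {"name": "花果山", "correct_parent": "傲来国", "tier": "region"},
    {"name": "水帘洞", "correct_parent": "花果山", "tier": "site"},
]
parent_of = build_parent_map(gold)

# The second prediction skips a level, so it does not count as correct.
predicted = {"水帘洞": "花果山", "花果山": "东胜神洲"}

print(ancestors("水帘洞", parent_of))
print(f"parent precision = {parent_precision(predicted, parent_of):.2f}")
```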
## Annotation Protocol
- **Entity annotation**: Top-50 most frequent characters per novel, annotated for validity and canonical names
- **Relationship annotation**: Top-50 most frequent character pairs, annotated for correct type and category
- **Alias annotation**: All system-generated alias groups, annotated for correctness
- **Location hierarchy**: Manually constructed gold standard following a direct-parent-only rule (no level skipping), using the novel's final narrative state for ambiguous cases
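
The top-50 sampling step can be sketched with a frequency counter. The protocol does not say how pair frequency is measured, so the example assumes one co-occurrence record per pair sighting and treats pairs as unordered; `top_k_names` and `top_k_pairs` are hypothetical helpers, not functions from the AI Reader codebase.

```python
from collections import Counter


def top_k_names(mentions, k=50):
    """Return the k most frequently mentioned names with their counts."""
    return Counter(mentions).most_common(k)


def top_k_pairs(cooccurrences, k=50):
    """Return the k most frequent unordered character pairs with their counts."""
    counts = Counter(tuple(sorted(pair)) for pair in cooccurrences)
    return counts.most_common(k)


# Toy mention stream for illustration only; not taken from the dataset.
mentions = ["行者", "八戒", "行者", "唐僧", "行者", "八戒"]
pairs = [("行者", "八戒"), ("八戒", "行者"), ("行者", "唐僧")]

print(top_k_names(mentions, k=2))
print(top_k_pairs(pairs, k=1))
```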
## Usage
```python
import json

# Load the benchmark file (assumed to be in the working directory)
with open("chinovelke.json", encoding="utf-8") as f:
    data = json.load(f)

# Access Journey to the West character annotations
jtw_chars = data["novels"]["journey_to_west"]["annotations"]["characters"]["entries"]
for char in jtw_chars[:5]:
    print(f"{char['name']}: valid={char['is_valid_character']}, canonical={char.get('correct_name')}")
```
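
To get an overview of what a loaded file contains, the entries can be counted per novel and annotation type. The sketch below assumes every annotation type follows the same `annotations -> <type> -> entries` layout shown for `characters`; only that one path is confirmed by the usage example, so the `aliases` key in the toy data is hypothetical. It runs on an in-memory dictionary rather than the real file.

```python
def count_entries(data):
    """Count annotated entries per (novel, annotation type)."""
    counts = {}
    for novel_key, novel in data["novels"].items():
        for dimension, block in novel.get("annotations", {}).items():
            counts[(novel_key, dimension)] = len(block.get("entries", []))
    return counts


# Toy structure mirroring the path used in the usage example;
# the "aliases" key is an assumed name, not confirmed by the dataset card.
toy = {
    "novels": {
        "journey_to_west": {
            "annotations": {
                "characters": {"entries": [{"name": "孙悟空"}, {"name": "唐僧"}]},
                "aliases": {"entries": [{"canonical_name": "孙悟空"}]},
            }
        }
    }
}

for (novel_key, dimension), n in count_entries(toy).items():
    print(f"{novel_key} / {dimension}: {n} entries")
```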
## Evaluation Script
See `eval_dashboard.py` in the [AI Reader repository](https://github.com/mouseart2025/AI-Reader-V2/blob/main/backend/src/utils/eval_dashboard.py) for standardized metric computation.
## Baseline Results
| Metric | Journey to the West | Dream of the Red Chamber | Water Margin | Average |
|--------|-------------------|------------------------|-------------|---------|
| Entity Precision | 78.0% | 96.0% | 100.0% | 91.3% |
| Relation Type Accuracy | 76.0% | 82.0% | 22.0% | 60.0% |
| Relation Category Accuracy | 64.0% | 86.0% | 34.0% | 61.3% |
| Location Hierarchy Precision | 65.6% | 55.8% | — | 60.7% |
| Alias Group Accuracy | 42.9% | — | 47.1% | 45.0% |
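
The Average column is consistent with an unweighted mean over the novels that have a score for a given metric, with missing cells skipped. The table itself does not state how the average is defined, so treat this as an inferred reading. The check below recomputes it from the per-novel values in the table.

```python
# Per-novel scores copied from the table above, in the order
# Journey to the West, Dream of the Red Chamber, Water Margin.
# None marks a cell shown as a dash.
baseline = {
    "Entity Precision": [78.0, 96.0, 100.0],
    "Relation Type Accuracy": [76.0, 82.0, 22.0],
    "Relation Category Accuracy": [64.0, 86.0, 34.0],
    "Location Hierarchy Precision": [65.6, 55.8, None],
    "Alias Group Accuracy": [42.9, None, 47.1],
}


def macro_average(scores):
    """Unweighted mean over the novels that report a score, rounded to one decimal."""
    present = [s for s in scores if s is not None]
    if not present:
        return None
    return round(sum(present) / len(present), 1)


for metric, scores in baseline.items():
    print(f"{metric}: {macro_average(scores)}%")
```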
## Citation
```bibtex
@inproceedings{feng2026aireader,
  title={AI Reader: Taming LLM Hallucinations in Long-Form Narrative Knowledge Extraction through Multi-Layer Validation},
  author={Feng, Lei},
  booktitle={Proceedings of the 2026 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year={2026}
}
```
## License
CC-BY-4.0

Provider: mouseart2025



