SSHAFER/agency-personalities-trails
Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/SSHAFER/agency-personalities-trails
Resource description:
---
pretty_name: Personalities-Trails
language:
- en
- zh
license: other
license_name: research-only-custom
license_link: https://huggingface.co/datasets/SSHAFER/agency-personalities-trails/blob/main/LICENSE
task_categories:
- text-generation
- feature-extraction
tags:
- literature
- character-analysis
- roleplay
- rag
- bilingual
- research
size_categories:
- 10K<n<100K
configs:
- config_name: en-traits
  data_files:
  - split: train
    path: en-traits/*.json
- config_name: en-retrieval
  data_files:
  - split: train
    path: en-retrieval/*.json
- config_name: en-retrieval+en-traits
  data_files:
  - split: train
    path: en-retrieval+en-traits/*.json
- config_name: ch-traits
  data_files:
  - split: train
    path: ch-traits/*.json
- config_name: ch-retrieval
  data_files:
  - split: train
    path: ch-retrieval/*.json
- config_name: ch-retrieval+ch-traits
  data_files:
  - split: train
    path: ch-retrieval+ch-traits/*.json
---
# Personalities-Trails
## Overview
Personalities-Trails is a bilingual literary analysis dataset for research on artificial agency, literary character modeling, retrieval-augmented generation, and role-playing evaluation.
The dataset is built from selected literary works and organized into multiple subsets for detailed trait analysis, retrieval-oriented structured summaries, and merged settings that combine both views.
This repository contains processed research data only, including structured annotations, metadata, and limited text excerpts. It does not provide complete literary works and should not be treated as a substitute for the original publications.
## Dataset Summary
- Root directory: `resource/`
- Total JSON files: `412`
- Approximate size: `~5.3 GB`
- Estimated total records: `50,000+`
- Languages: Chinese and English
### Subsets
| Subset | Files | Size | Language | Description |
|------|------:|------:|------|------|
| `en-traits/` | 90 | 1.3 GB | English | Full English literary analysis |
| `en-retrieval/` | 37 | 869 MB | English | English analysis with retrieval-oriented short summaries |
| `en-retrieval+en-traits/` | 90 | 1.7 GB | English | Merged English subset |
| `ch-traits/` | 90 | 664 MB | Chinese | Full Chinese literary analysis |
| `ch-retrieval/` | 15 | 100 MB | Chinese | Chinese analysis with retrieval-oriented short summaries |
| `ch-retrieval+ch-traits/` | 90 | 694 MB | Chinese | Merged Chinese subset |

Other files currently present in the directory include `dataset_comparison.xlsx`, `fig.pptx`, and `find_en.py`.
## Data Structure
All JSON files use a top-level array. Each element is a sample containing instructions, source text, outputs, and metadata.
### Common Fields
```json
{
  "instruction": {
    "intro": "Prompt for intro analysis",
    "personalities_trails": "Prompt for character trait analysis",
    "self_awareness": "Prompt for self-awareness analysis",
    "scene": "Prompt for scene analysis"
  },
  "text": "Source literary excerpt",
  "input": "",
  "output": {
    "intro": "Structured analysis of character/location/background/event",
    "personalities_trails": "Detailed character profile",
    "self_awareness": "Self-awareness analysis",
    "scene": "Scene analysis"
  },
  "metadata": {
    "element_id": "Unique identifier",
    "filename": "Source EPUB filename",
    "languages": "eng / zho"
  }
}
```
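As a minimal sketch of how the common fields fit together, the snippet below pairs each prompt under `instruction` with the same-named entry under `output`, yielding one prompt/response pair per analysis dimension. The helper name `to_pairs` and the sample record are illustrative assumptions, not part of the dataset tooling. The sketch also assumes that both objects are keyed by the four dimension names shown above, and it skips any dimension that is missing or empty on either side.

```python
from typing import Any

# The four analysis dimensions named in the schema above.
DIMENSIONS = ("intro", "personalities_trails", "self_awareness", "scene")


def to_pairs(sample: dict[str, Any]) -> list[dict[str, Any]]:
    """Split one sample into per-dimension prompt/response pairs (hypothetical helper)."""
    instructions = sample.get("instruction") or {}
    outputs = sample.get("output") or {}
    pairs = []
    for dim in DIMENSIONS:
        # Skip dimensions that are absent or empty on either side.
        if not instructions.get(dim) or not outputs.get(dim):
            continue
        pairs.append(
            {
                "dimension": dim,
                "prompt": instructions[dim],
                "text": sample.get("text", ""),
                "response": outputs[dim],
                "element_id": (sample.get("metadata") or {}).get("element_id"),
            }
        )
    return pairs


# Made-up record that mirrors the documented layout.
sample = {
    "instruction": {"intro": "Summarize the setting.", "scene": "Describe the scene."},
    "text": "A short excerpt.",
    "input": "",
    "output": {"intro": "A city in winter.", "scene": ""},
    "metadata": {"element_id": "demo-0001", "filename": "demo.epub", "languages": "eng"},
}

pairs = to_pairs(sample)
print([p["dimension"] for p in pairs])
```

Keeping `element_id` on every pair makes it possible to trace a generated example back to its source sample.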
### Retrieval-Specific Field
Some retrieval-related subsets contain an additional `output-short` field for structured summaries.
```json
{
  "output-short": {
    "scenario": {
      "place": "Location",
      "background": "Background",
      "event": "Event"
    },
    "people": [
      {
        "character-profile": {
          "name": "Character name",
          "sketch": "Character sketch"
        },
        "literary-characterization": {
          "appearance": "Appearance",
          "language": "Language style",
          "action": "Behavior",
          "psychology": "Psychology",
          "demeanor": "Demeanor"
        },
        "psychological-analysis": {
          "perspective-on-life": "View of life"
        }
      }
    ]
  }
}
```
Note: the `scenario` field appears in English retrieval files; Chinese retrieval files may only contain the `people` field under `output-short`.
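Because `scenario` may be absent, code that reads `output-short` is safer when it treats every level as optional. The following is a minimal sketch, assuming the nesting shown above; `summarize_short` is a hypothetical helper and the sample record is made up.

```python
from typing import Any


def summarize_short(sample: dict[str, Any]) -> dict[str, Any]:
    """Flatten `output-short` into a small dict, tolerating missing keys."""
    short = sample.get("output-short") or {}
    scenario = short.get("scenario") or {}  # may be absent in Chinese retrieval files
    people = short.get("people") or []
    names = []
    for person in people:
        profile = person.get("character-profile") or {}
        if profile.get("name"):
            names.append(profile["name"])
    return {
        "place": scenario.get("place"),
        "event": scenario.get("event"),
        "characters": names,
    }


# Made-up record without a `scenario` block.
record = {
    "output-short": {
        "people": [{"character-profile": {"name": "Character A", "sketch": "..."}}]
    }
}
summary = summarize_short(record)
print(summary)
```

Samples from the `traits` subsets, which carry no `output-short` field at all, pass through the same function and simply return empty values.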
## Subset Types
| Type | Example Directories | Fields | Purpose |
|------|------|------|------|
| `traits` | `en-traits/`, `ch-traits/` | Common fields | Detailed literary analysis |
| `retrieval` | `en-retrieval/`, `ch-retrieval/` | Common fields + `output-short` | Analysis plus compact structured summaries |
| `retrieval+traits` | `en-retrieval+en-traits/`, `ch-retrieval+ch-traits/` | Combined content | Merged subsets for broader use |
## Example Statistics
- Example file: `traits_1984.epub.json` contains `926` records
- Naming pattern: `traits_[book-title].epub.json`
- Main analysis dimensions: `intro`, `personalities_trails`, `self_awareness`, `scene`
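The naming pattern can be inverted to recover a book title from a file name. The following is a minimal sketch, assuming files follow `traits_[book-title].epub.json` exactly. Files in other subsets may use a different prefix, which the hypothetical `book_title` helper reports as `None` rather than guessing.

```python
import re
from pathlib import Path
from typing import Optional

# Pattern taken from the naming convention above: traits_[book-title].epub.json
_NAME_RE = re.compile(r"^traits_(?P<title>.+)\.epub\.json$")


def book_title(path: str) -> Optional[str]:
    """Return the book-title part of a file name, or None if it does not match."""
    match = _NAME_RE.match(Path(path).name)
    return match.group("title") if match else None


print(book_title("en-traits/traits_1984.epub.json"))
print(book_title("en-traits/notes.txt"))
```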
## Source and Construction
The dataset is derived from EPUB-format books spanning Chinese and English literary works, including both classic and modern titles. Literary passages are processed into structured annotations with detailed outputs (`output`) and, in some subsets, short summaries (`output-short`).
## Intended Use
This dataset is intended for non-commercial research use, including:
- literary character modeling
- artificial agency research
- retrieval-augmented generation experiments
- role-playing and character simulation evaluation
- analysis of trait representation and self-perception
## Prohibited Use
This dataset must not be used for:
- commercial use of any kind
- commercial training or fine-tuning
- reconstructing or substituting the original books
- unlawful redistribution of excerpted text
- any use that infringes the rights of authors, translators, publishers, or other rights holders
## Copyright and License Notice
This dataset contains limited excerpts derived from copyrighted literary works. Rights in the original texts remain with their respective rights holders.
Please review the full license terms in [`LICENSE`](./LICENSE). If you are a rights holder and believe any content should be revised or removed, please contact the maintainer.
## Usage
### Download the full dataset
```bash
git lfs install
git clone https://huggingface.co/datasets/SSHAFER/agency-personalities-trails
```
### Download selected subsets
```python
from huggingface_hub import snapshot_download

# Fetch only the English traits subset plus the card and license.
local_dir = snapshot_download(
    repo_id="SSHAFER/agency-personalities-trails",
    repo_type="dataset",
    allow_patterns=[
        "en-traits/*",
        "README.md",
        "LICENSE",
    ],
)
print(local_dir)  # local path of the downloaded snapshot
```
### Load a JSON file directly
```python
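import json
from pathlib import Path

# Path is relative to the root of the cloned or downloaded repository.
path = Path("en-traits/traits_1984.epub.json")
with path.open("r", encoding="utf-8") as f:
    data = json.load(f)  # top-level JSON array of samples

print(len(data))
print(data[0]["metadata"])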
```
### Load with `datasets`
```python
from datasets import load_dataset

# Load a single local JSON file with the generic "json" builder.
dataset = load_dataset(
    "json",
    data_files="en-traits/traits_1984.epub.json",
    split="train",
)
print(dataset[0]["text"])
```
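The `configs` block in the card's front matter declares one named configuration per subset, so a subset can presumably also be loaded from the Hub by name instead of by file path. The following is a minimal sketch with three assumptions: the `datasets` library is installed, the repository ID is `SSHAFER/agency-personalities-trails` (taken from the license link), and the declared configurations resolve as written. `config_name` and `load_subset` are hypothetical helpers.

```python
REPO_ID = "SSHAFER/agency-personalities-trails"  # assumed from the license link

LANGUAGES = ("en", "ch")
KINDS = ("traits", "retrieval", "retrieval+traits")


def config_name(language: str, kind: str) -> str:
    """Build a config name such as 'en-traits' or 'ch-retrieval+ch-traits'."""
    if language not in LANGUAGES or kind not in KINDS:
        raise ValueError(f"unknown subset: {language!r}, {kind!r}")
    # Merged subsets repeat the language prefix on both halves.
    return "+".join(f"{language}-{part}" for part in kind.split("+"))


def load_subset(language: str, kind: str):
    """Load one subset's train split from the Hub (requires network access)."""
    from datasets import load_dataset  # imported lazily so config_name works offline

    return load_dataset(REPO_ID, config_name(language, kind), split="train")


print(config_name("en", "traits"))
print(config_name("ch", "retrieval+traits"))
```

Calling `load_subset("en", "traits")` would then fetch the files matched by `en-traits/*.json`. If the files in a subset do not share one schema, loading individual files as shown above may be the more reliable route.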
## Limitations
- The dataset includes only limited excerpts rather than complete literary works.
- Redistribution constraints may apply because the data is derived from copyrighted books.
- Coverage depends on the selected books and processing pipeline, and does not represent all literary traditions or styles.
## Citation
If you use this dataset in research, please cite the repository or the associated paper/project page when available.
Provider: SSHAFER



