
SSHAFER/agency-personalities-trails

Source: Hugging Face. Updated: 2026-03-24. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/SSHAFER/agency-personalities-trails
Resource description:
---
pretty_name: Personalities-Trails
language:
- en
- zh
license: other
license_name: research-only-custom
license_link: https://huggingface.co/datasets/SSHAFER/agency-personalities-trails/blob/main/LICENSE
task_categories:
- text-generation
- feature-extraction
tags:
- literature
- character-analysis
- roleplay
- rag
- bilingual
- research
size_categories:
- 10K<n<100K
configs:
- config_name: en-traits
  data_files:
  - split: train
    path: en-traits/*.json
- config_name: en-retrieval
  data_files:
  - split: train
    path: en-retrieval/*.json
- config_name: en-retrieval+en-traits
  data_files:
  - split: train
    path: en-retrieval+en-traits/*.json
- config_name: ch-traits
  data_files:
  - split: train
    path: ch-traits/*.json
- config_name: ch-retrieval
  data_files:
  - split: train
    path: ch-retrieval/*.json
- config_name: ch-retrieval+ch-traits
  data_files:
  - split: train
    path: ch-retrieval+ch-traits/*.json
---

# Personalities-Trails

## Overview

Personalities-Trails is a bilingual literary analysis dataset for research on artificial agency, literary character modeling, retrieval-augmented generation, and role-playing evaluation. The dataset is built from selected literary works and organized into multiple subsets for detailed trait analysis, retrieval-oriented structured summaries, and merged settings that combine both views.

This repository contains processed research data only, including structured annotations, metadata, and limited text excerpts. It does not provide complete literary works and should not be treated as a substitute for the original publications.

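
The `configs` entries in the card metadata map each config name to a glob pattern under the repository root. The sketch below copies that mapping into a dictionary so it can be reused, for example to build a pattern list for a partial download. The names `CONFIG_PATTERNS`, `patterns_for`, and `config_of` are illustrative helpers and are not part of the dataset or of any library.

```python
from fnmatch import fnmatchcase

# Config names and glob patterns copied from the card metadata.
CONFIG_PATTERNS = {
    "en-traits": "en-traits/*.json",
    "en-retrieval": "en-retrieval/*.json",
    "en-retrieval+en-traits": "en-retrieval+en-traits/*.json",
    "ch-traits": "ch-traits/*.json",
    "ch-retrieval": "ch-retrieval/*.json",
    "ch-retrieval+ch-traits": "ch-retrieval+ch-traits/*.json",
}


def patterns_for(configs):
    """Return glob patterns for the requested configs plus README and LICENSE.

    Hypothetical helper: raises KeyError for a name not listed in the card.
    """
    unknown = [name for name in configs if name not in CONFIG_PATTERNS]
    if unknown:
        raise KeyError(f"unknown config(s): {unknown}")
    return [CONFIG_PATTERNS[name] for name in configs] + ["README.md", "LICENSE"]


def config_of(path):
    """Return the config whose pattern matches a repository-relative path, else None."""
    for name, pattern in CONFIG_PATTERNS.items():
        if fnmatchcase(path, pattern):
            return name
    return None


print(patterns_for(["ch-retrieval"]))
print(config_of("en-traits/traits_1984.epub.json"))
```

The list returned by `patterns_for` has the same shape as the `allow_patterns` argument used in the download example in the Usage section.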
## Dataset Summary

- Root directory: `resource/`
- Total JSON files: `412`
- Approximate size: `~5.3 GB`
- Estimated total records: `50,000+`
- Languages: Chinese and English

### Subsets

| Subset | Files | Size | Language | Description |
|------|------:|------:|------|------|
| `en-traits/` | 90 | 1.3 GB | English | Full English literary analysis |
| `en-retrieval/` | 37 | 869 MB | English | English analysis with retrieval-oriented short summaries |
| `en-retrieval+en-traits/` | 90 | 1.7 GB | English | Merged English subset |
| `ch-traits/` | 90 | 664 MB | Chinese | Full Chinese literary analysis |
| `ch-retrieval/` | 15 | 100 MB | Chinese | Chinese analysis with retrieval-oriented short summaries |
| `ch-retrieval+ch-traits/` | 90 | 694 MB | Chinese | Merged Chinese subset |

Other files currently present in the directory include `dataset_comparison.xlsx`, `fig.pptx`, and `find_en.py`.

## Data Structure

All JSON files use a top-level array. Each element is a sample containing instructions, source text, outputs, and metadata.

### Common Fields

```json
{
  "instruction": {
    "intro": "Prompt for intro analysis",
    "personalities_trails": "Prompt for character trait analysis",
    "self_awareness": "Prompt for self-awareness analysis",
    "scene": "Prompt for scene analysis"
  },
  "text": "Source literary excerpt",
  "input": "",
  "output": {
    "intro": "Structured analysis of character/location/background/event",
    "personalities_trails": "Detailed character profile",
    "self_awareness": "Self-awareness analysis",
    "scene": "Scene analysis"
  },
  "metadata": {
    "element_id": "Unique identifier",
    "filename": "Source EPUB filename",
    "languages": "eng / zho"
  }
}
```

### Retrieval-Specific Field

Some retrieval-related subsets contain an additional `output-short` field for structured summaries.

```json
{
  "output-short": {
    "scenario": {
      "place": "Location",
      "background": "Background",
      "event": "Event"
    },
    "people": [
      {
        "character-profile": {
          "name": "Character name",
          "sketch": "Character sketch"
        },
        "literary-characterization": {
          "appearance": "Appearance",
          "language": "Language style",
          "action": "Behavior",
          "psychology": "Psychology",
          "demeanor": "Demeanor"
        },
        "psychological-analysis": {
          "perspective-on-life": "View of life"
        }
      }
    ]
  }
}
```

Note: the `scenario` field appears in English retrieval files; Chinese retrieval files may only contain the `people` field under `output-short`.

## Subset Types

| Type | Example Directories | Fields | Purpose |
|------|------|------|------|
| `traits` | `en-traits/`, `ch-traits/` | Common fields | Detailed literary analysis |
| `retrieval` | `en-retrieval/`, `ch-retrieval/` | Common fields + `output-short` | Analysis plus compact structured summaries |
| `retrieval+traits` | `en-retrieval+en-traits/`, `ch-retrieval+ch-traits/` | Combined content | Merged subsets for broader use |

## Example Statistics

- Example file: `traits_1984.epub.json` contains `926` records
- Naming pattern: `traits_[book-title].epub.json`
- Main analysis dimensions: `intro`, `personalities_trails`, `self_awareness`, `scene`

## Source and Construction

The dataset is derived from EPUB-format books spanning Chinese and English literary works, including both classic and modern titles. Literary passages are processed into structured annotations with detailed outputs (`output`) and, in some subsets, short summaries (`output-short`).

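
Because `output-short` is present only in some subsets, and `scenario` may be missing from Chinese retrieval files, code that reads the short summaries benefits from defensive access. The following is a minimal sketch, assuming records follow the field layout documented in the Data Structure section. The `iter_people` helper and the sample record are illustrative and are not part of the dataset.

```python
def iter_people(record):
    """Yield (name, sketch, place) for each person in a record's `output-short`.

    `place` is None when the `scenario` block is missing. Records without
    `output-short` yield nothing.
    """
    short = record.get("output-short") or {}
    place = (short.get("scenario") or {}).get("place")
    for person in short.get("people") or []:
        profile = person.get("character-profile") or {}
        yield profile.get("name"), profile.get("sketch"), place


# Hypothetical record for illustration only: it has `people` but no `scenario`.
sample = {
    "text": "...",
    "output-short": {
        "people": [
            {"character-profile": {"name": "A", "sketch": "A short sketch"}}
        ]
    },
}

for name, sketch, place in iter_people(sample):
    print(name, sketch, place)
```

Returning `None` for a missing `place` rather than raising lets the same loop run over English and Chinese retrieval files, and over `traits` files where it simply yields nothing.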
## Intended Use

This dataset is intended for non-commercial research use, including:

- literary character modeling
- artificial agency research
- retrieval-augmented generation experiments
- role-playing and character simulation evaluation
- analysis of trait representation and self-perception

## Prohibited Use

This dataset must not be used for:

- commercial use of any kind
- commercial training or fine-tuning
- reconstructing or substituting the original books
- unlawful redistribution of excerpted text
- any use that infringes the rights of authors, translators, publishers, or other rights holders

## Copyright and License Notice

This dataset contains limited excerpts derived from copyrighted literary works. Rights in the original texts remain with their respective rights holders.

Please review the full license terms in [`LICENSE`](./LICENSE).

If you are a rights holder and believe any content should be revised or removed, please contact the maintainer.

## Usage

### Download the full dataset

```bash
# Git LFS is required to fetch the large JSON files.
git lfs install
git clone https://huggingface.co/datasets/SSHAFER/agency-personalities-trails
```

### Download selected subsets

```python
from huggingface_hub import snapshot_download

# Fetch only the English traits subset plus the README and license.
local_dir = snapshot_download(
    repo_id="SSHAFER/agency-personalities-trails",
    repo_type="dataset",
    allow_patterns=[
        "en-traits/*",
        "README.md",
        "LICENSE",
    ],
)

print(local_dir)
```

### Load a JSON file directly

```python
import json
from pathlib import Path

# Path is relative to the downloaded dataset root.
path = Path("en-traits/traits_1984.epub.json")

with path.open("r", encoding="utf-8") as f:
    data = json.load(f)  # top-level array of samples

print(len(data))
print(data[0]["metadata"])
```

### Load with `datasets`

```python
from datasets import load_dataset

# Load a single local JSON file with the generic "json" builder.
dataset = load_dataset(
    "json",
    data_files="en-traits/traits_1984.epub.json",
    split="train",
)

print(dataset[0]["text"])
```

## Limitations

- The dataset includes only limited excerpts rather than complete literary works.
- Redistribution constraints may apply because the data is derived from copyrighted books.
- Coverage depends on the selected books and processing pipeline, and does not represent all literary traditions or styles.

## Citation

If you use this dataset in research, please cite the repository or the associated paper/project page when available.
Provider:
SSHAFER