---
license: cc-by-nc-sa-4.0
language:
- en
size_categories:
- 10K<n<100K
---
# Dataset Description
CLAUSE-ATLAS is a corpus that contains six books annotated with narrative categories at the level of clauses.
The books (i.e., Alice's Adventures in Wonderland, The Adventures of Pinocchio, Peter Pan, Pride and Prejudice, Frankenstein, and The Great Gatsby) were extracted from [Project Gutenberg](https://www.gutenberg.org).
Clauses were obtained by chunking the books with ChatGPT (gpt-3.5-turbo, 16k-token context), called via the official [OpenAI](https://openai.com) API and instructed with the prompt reported in `./clauseatlas_notebook.ipynb`.
# Clause Annotation
The clauses in CLAUSE-ATLAS are annotated as expressing one of three types of information:
* __a subjective experience__, internal to a character in the novel (e.g., thoughts, memories, perceptions);
* __an objective event__ that happens in the external narrative world;
* __additional information__ about the characters or the narrative world.
Clauses marked as subjective experiences were further associated with the characters who have those experiences.
## Annotation Setup
Clauses were annotated in two setups: manually, by human annotators, and automatically, by prompting ChatGPT (gpt-3.5-turbo, 16k-token context) with different sets of instructions.
|*Setup*|*Annotators*|*Data*|*Instructions*|
|---|---|---|---|
|Human Annotation|3 people|First chapter of Alice's Adventures in Wonderland,<br>The Adventures of Pinocchio, and The Great Gatsby|Stored in `./Human_Guidelines.pdf`|
|ChatGPT Annotation|3 prompts|Whole corpus|Prompts and function calling as in `./clauseatlas_notebook.ipynb`|
NOTE: All three human annotators also identified the characters who have the subjective experiences. In the automatic setup, this annotation layer was obtained (with the instructions in `./clauseatlas_notebook.ipynb`) only for the clauses labeled with one of the three prompts.
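Because the same clauses carry labels from several annotators (humans and prompts), one common way to compare two of them is Cohen's kappa. The snippet below is a minimal sketch of that measure in plain Python. It is not taken from `./clauseatlas_notebook.ipynb`, and the function name `cohen_kappa` and the toy label sequences are assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Share of clauses on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for six clauses (not corpus data).
human = ["S", "E", "C", "E", "S", "C"]
prompt = ["S", "E", "C", "E", "E", "C"]
kappa = cohen_kappa(human, prompt)
```

With real data, the two sequences would be the label columns of two annotators restricted to the clauses that both of them annotated.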
# Data Fields
|*Field*|*Type*|*Description*|
|---|---|---|
|book|str|Title of the book containing the clause.|
|chapter_id|i64|Number of the book chapter containing the clause.|
|paragraph_id|i64|Number of the paragraph containing the clause.|
|clause_number|i64|ID of the clause in the corpus.|
|text|str|Clause text.|
|prompt_one<br>prompt_two<br>prompt_three|str|Annotations obtained with the three prompts. S: subjective experience. E: objective (external) event. C: additional (contextual) information.|
|human_one<br>human_two<br>human_three|str|Annotations produced by the three humans.|
|experiencer_prompt_one|str|Characters involved in the subjective experiences identified by prompt_one.|
|experiencer_human_one<br>experiencer_human_two<br>experiencer_human_three|str|Characters involved in the subjective experiences identified by the three annotators.|
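As a rough illustration of how these fields fit together, the sketch below derives a majority label from the three prompt columns and groups subjective clauses by experiencer. The rows are invented and only mimic the documented schema. The helper names `majority_label` and `subjective_by_character` are hypothetical, and the sketch assumes that `experiencer_prompt_one` holds a single character name (the actual format of that field may differ).

```python
from collections import Counter, defaultdict

PROMPT_FIELDS = ("prompt_one", "prompt_two", "prompt_three")

# Invented rows following the documented fields; texts and labels are NOT corpus data.
rows = [
    {"book": "Alice's Adventures in Wonderland", "chapter_id": 1, "paragraph_id": 1,
     "clause_number": 1, "text": "She felt rather sleepy",
     "prompt_one": "S", "prompt_two": "S", "prompt_three": "C",
     "experiencer_prompt_one": "Alice"},
    {"book": "Alice's Adventures in Wonderland", "chapter_id": 1, "paragraph_id": 1,
     "clause_number": 2, "text": "the day was hot",
     "prompt_one": "E", "prompt_two": "C", "prompt_three": "C",
     "experiencer_prompt_one": ""},
    {"book": "Alice's Adventures in Wonderland", "chapter_id": 1, "paragraph_id": 2,
     "clause_number": 3, "text": "she wondered what to do",
     "prompt_one": "S", "prompt_two": "E", "prompt_three": "C",
     "experiencer_prompt_one": "Alice"},
]


def majority_label(row, fields=PROMPT_FIELDS):
    """Return the label given by at least two of the fields, or None if all differ."""
    label, freq = Counter(row[f] for f in fields).most_common(1)[0]
    return label if freq >= 2 else None


def subjective_by_character(rows):
    """Map each experiencer to the clause numbers that prompt_one labeled as S."""
    grouped = defaultdict(list)
    for row in rows:
        if row["prompt_one"] == "S" and row["experiencer_prompt_one"]:
            grouped[row["experiencer_prompt_one"]].append(row["clause_number"])
    return dict(grouped)


majorities = [majority_label(r) for r in rows]
by_character = subjective_by_character(rows)
```

The same logic carries over to the `human_*` and `experiencer_human_*` columns by passing a different tuple of field names.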
# Copyright and License
The books in CLAUSE-ATLAS are copyright-free. The corpus is licensed under the non-commercial [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license. If you use this corpus, please cite us as follows:
<pre><p>@inproceedings{TroianoVossen2024,
author = {Enrica Troiano and Piek Vossen},
title = {CLAUSE-ATLAS: A Corpus of Narrative Information
to Scale Up Computational Literary Analysis},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
year = {2024},
address = {Turin, Italy}
}</p></pre>