
troianea/CLAUSE-ATLAS

Hugging Face, updated 2024-03-11, indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/troianea/CLAUSE-ATLAS
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- en
size_categories:
- 10K<n<100K
---

# Dataset Description

CLAUSE-ATLAS is a corpus of six books annotated with narrative categories at the clause level. The books (Alice's Adventures in Wonderland, The Adventures of Pinocchio, Peter Pan, Pride and Prejudice, Frankenstein, and The Great Gatsby) were extracted from [Project Gutenberg](https://www.gutenberg.org). Clauses were obtained by chunking the books with ChatGPT (gpt-3.5-turbo, 16k-token context), called via the official [OpenAI](https://openai.com) API and instructed with the prompt reported in `./clauseatlas_notebook.ipynb`.

# Clause Annotation

The clauses in CLAUSE-ATLAS are annotated as expressing one of three types of information:

* __a subjective experience__, internal to the character in the novel (e.g., thoughts, memories, perceptions);
* __an objective event__ that happens in the external narrative world;
* __additional information__ about the characters or the narrative world.

Clauses marked as subjective experiences were further associated with the corresponding characters.

## Annotation Setup

Clauses were annotated in two setups: by human annotators, and by prompting ChatGPT (gpt-3.5-turbo, 16k-token context) with different instructions.

|*Setup*|*Annotators*|*Data*|*Instructions*|
|---|---|---|---|
|Human Annotation|3 people|First chapter of Alice's Adventures in Wonderland,<br>The Adventures of Pinocchio, and The Great Gatsby|Prompts and function calling as in `./clauseatlas_notebook.ipynb`|
|ChatGPT Annotation|3 prompts|Whole corpus|Stored in `./Human_Guidelines.pdf`|

NOTE: The task of identifying the characters of subjective experiences was performed by all human annotators. In the automatic setup, this layer of annotation was obtained (instructions as in `./clauseatlas_notebook.ipynb`) only for the clauses labeled with one of the prompts.

# Data Fields

|*Field*|*Type*|*Description*|
|---|---|---|
|book|str|Title of the book containing a given clause.|
|chapter_id|i64|Number of the book chapter containing the clause.|
|paragraph_id|i64|Number of the paragraph containing the clause.|
|clause_number|i64|ID of the clause in the corpus.|
|text|str|Clause text.|
|prompt_one<br>prompt_two<br>prompt_three|str|Annotations obtained with the three prompts. S: subjective experience. E: external events. C: contextual information.|
|human_one<br>human_two<br>human_three|str|Annotations produced by the three humans.|
|experiencer_prompt_one|str|Characters involved in the subjective experiences identified by prompt_one.|
|experiencer_human_one<br>experiencer_human_two<br>experiencer_human_three|str|Characters involved in the subjective experiences identified by the three annotators.|

# Copyright and License

The books in CLAUSE-ATLAS are copyright-free. The corpus is licensed under the non-commercial [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license. If you use this corpus, please cite us as follows:

<pre><p>@inproceedings{TroianoVossen2024,
  author = {Enrica Troiano and Piek Vossen},
  title = {CLAUSE-ATLAS: A Corpus of Narrative Information to Scale Up Computational Literary Analysis},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  year = {2024},
  address = {Turin, Italy}
}</p></pre>
Provider:
troianea
Summary of Original Information

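Dataset Overview

Dataset Name

CLAUSE-ATLAS

Dataset Description

CLAUSE-ATLAS is a corpus of six books annotated with narrative categories at the clause level. The books are Alice's Adventures in Wonderland, The Adventures of Pinocchio, Peter Pan, Pride and Prejudice, Frankenstein, and The Great Gatsby, all extracted from Project Gutenberg. Clauses were obtained by chunking the books with ChatGPT (gpt-3.5-turbo, 16k-token context) through the OpenAI API, instructed with the prompt in ./clauseatlas_notebook.ipynb.

Clause Annotation

Clauses in CLAUSE-ATLAS are annotated as expressing one of three types of information:

  • a subjective experience, internal to a character in the novel (e.g., thoughts, memories, perceptions);
  • an objective event that happens in the external narrative world;
  • additional information about the characters or the narrative world.

Clauses marked as subjective experiences are further linked to the corresponding characters.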
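
The three-way scheme and the character link can be illustrated with a minimal sketch. It assumes the single-letter codes listed in the dataset card's Data Fields table (S, E, C) and the field names `text`, `prompt_one`, and `experiencer_prompt_one`. The rows, the character name, and the use of an empty string for "no experiencer" are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict

# Label codes as listed in the dataset card's Data Fields table.
LABEL_NAMES = {
    "S": "subjective experience",
    "E": "objective event",
    "C": "contextual information",
}

# Hypothetical rows for illustration only; not taken from the corpus.
# Assumption: a clause without an experiencer stores an empty string.
rows = [
    {"text": "Mara remembered the old house", "prompt_one": "S", "experiencer_prompt_one": "Mara"},
    {"text": "the door slammed shut", "prompt_one": "E", "experiencer_prompt_one": ""},
    {"text": "she feared the worst", "prompt_one": "S", "experiencer_prompt_one": "Mara"},
    {"text": "the village lay by the sea", "prompt_one": "C", "experiencer_prompt_one": ""},
]


def clauses_by_experiencer(rows, label_field="prompt_one",
                           experiencer_field="experiencer_prompt_one"):
    """Group the text of subjective clauses (label S) by the linked character."""
    grouped = defaultdict(list)
    for row in rows:
        if row[label_field] == "S" and row[experiencer_field]:
            grouped[row[experiencer_field]].append(row["text"])
    return dict(grouped)


subjective = clauses_by_experiencer(rows)
for character, texts in subjective.items():
    print(character, "->", texts)
```

The same function could be pointed at a human annotator's columns by passing, for example, `label_field="human_one"` and `experiencer_field="experiencer_human_one"`.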

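Annotation Setup

Clauses were annotated in two setups: by human annotators, and automatically by prompting ChatGPT with different instructions.

|Setup|Annotators|Data|Instructions|
|---|---|---|---|
|Human annotation|3 people|First chapter of Alice's Adventures in Wonderland, The Adventures of Pinocchio, and The Great Gatsby|./clauseatlas_notebook.ipynb|
|ChatGPT annotation|3 prompts|Whole corpus|./Human_Guidelines.pdf|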
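
Because the same clauses carry labels from three human annotators, one natural check is how often pairs of annotators agree. The sketch below computes raw percent agreement and Cohen's kappa for each pair. The choice of these two measures is an assumption made for illustration, not a description of the authors' evaluation, and the label sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Share of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the two annotators' label proportions.
    p_e = sum(freq_a[label] * freq_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six clauses; not taken from the corpus.
annotations = {
    "human_one":   ["S", "E", "E", "C", "S", "E"],
    "human_two":   ["S", "E", "C", "C", "S", "E"],
    "human_three": ["S", "E", "E", "C", "E", "E"],
}

pairwise = {
    (x, y): (percent_agreement(annotations[x], annotations[y]),
             cohen_kappa(annotations[x], annotations[y]))
    for x, y in combinations(annotations, 2)
}

for pair, (agreement, kappa) in pairwise.items():
    print(pair, round(agreement, 3), round(kappa, 3))
```

The same functions apply unchanged to the `prompt_one`, `prompt_two`, and `prompt_three` columns, or to a human column paired with a prompt column.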

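Data Fields

|Field|Type|Description|
|---|---|---|
|book|str|Title of the book containing the clause|
|chapter_id|i64|Number of the book chapter containing the clause|
|paragraph_id|i64|Number of the paragraph containing the clause|
|clause_number|i64|ID of the clause in the corpus|
|text|str|Clause text|
|prompt_one|str|Annotation obtained with the first prompt|
|prompt_two|str|Annotation obtained with the second prompt|
|prompt_three|str|Annotation obtained with the third prompt|
|human_one|str|Annotation by the first human annotator|
|human_two|str|Annotation by the second human annotator|
|human_three|str|Annotation by the third human annotator|
|experiencer_prompt_one|str|Characters of subjective experiences identified with the first prompt|
|experiencer_human_one|str|Characters of subjective experiences identified by the first human annotator|
|experiencer_human_two|str|Characters of subjective experiences identified by the second human annotator|
|experiencer_human_three|str|Characters of subjective experiences identified by the third human annotator|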
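
The fields above map naturally onto a typed record. The sketch below models a subset of them as a Python dataclass and derives a single label per clause by majority vote over the three prompt annotations. The majority vote is only one possible way to consolidate the labels and is not described in the source. The `Clause` class, the `majority_label` helper, and the example values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clause:
    """One corpus row, restricted to the fields needed for this sketch."""
    book: str
    chapter_id: int
    paragraph_id: int
    clause_number: int
    text: str
    prompt_one: str
    prompt_two: str
    prompt_three: str


def majority_label(clause: Clause) -> Optional[str]:
    """Return the label chosen by at least two prompts, or None on a three-way split."""
    counts = Counter([clause.prompt_one, clause.prompt_two, clause.prompt_three])
    label, votes = counts.most_common(1)[0]
    return label if votes >= 2 else None


# Hypothetical values for illustration only; not taken from the corpus.
example = Clause(
    book="Hypothetical Book",
    chapter_id=1,
    paragraph_id=3,
    clause_number=42,
    text="she felt the room grow cold",
    prompt_one="S",
    prompt_two="S",
    prompt_three="C",
)

print(majority_label(example))
```

Returning `None` on a three-way split keeps disagreements visible instead of silently picking one label.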

Copyright and License

The books in CLAUSE-ATLAS are out of copyright. The corpus is licensed under the non-commercial CC BY-NC-SA 4.0 license. If you use this corpus, please cite it as follows:

@inproceedings{TroianoVossen2024,
  author = {Enrica Troiano and Piek Vossen},
  title = {CLAUSE-ATLAS: A Corpus of Narrative Information to Scale Up Computational Literary Analysis},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  year = {2024},
  address = {Turin, Italy}
}
