markozubac/novelQA_triplets

Name: markozubac/novelQA_triplets
Creator: markozubac
Published: 2026-03-19 08:33:24
License: No description available

Hugging Face · Updated 2026-03-19 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/markozubac/novelQA_triplets


Resource description:

---
task_categories:
- question-answering
- information-extraction
language:
- en
tags:
- triplets
- llm
- novelqa
- knowledge-graph
pretty_name: NovelQA Triplets
size_categories:
- 1M<n<10M
---

# Dataset Card for NovelQA Triplets

This dataset consists of triplets generated from the NovelQA dataset using Cohere’s Command-A LLM, under two different generation strategies. It is designed to evaluate and compare the impact of context-aware prompting on triplet extraction performance across long-form narrative texts.

---

## Dataset Details

### Dataset Description

The NovelQA triplets dataset contains triplets extracted from selected books in the NovelQA dataset using two distinct methods with Cohere’s Command-A large language model. It enables comparative analysis of different prompting strategies for structured knowledge extraction from long-form literary texts.

- **Curated by:** Marko Zubac and Ognjen Kundačina, The Institute for Artificial Intelligence and Development of Serbia
- **Funded by [optional]:** Cohere
- **Shared by [optional]:** The Institute for Artificial Intelligence and Development of Serbia
- **Language(s) (NLP):** English

---

### Dataset Sources [optional]

- **Repository:** https://github.com/markozubac/PronounceReplacer

---

## Uses

### Direct Use

This dataset can be used for:

- Evaluating context-aware prompting strategies in knowledge extraction pipelines
- Benchmarking LLM-based triplet extraction on long-form narrative texts
- Training or fine-tuning models for information extraction and relational reasoning
- Studying entity consistency and coreference handling in book-length documents

### Out-of-Scope Use

- Commercial deployment without verifying license conditions
- Using the dataset to infer personal data or identities
- Tasks unrelated to triplet or knowledge graph construction

---

## Dataset Structure

- **Columns:** `chunk_ID | question_ID | triplet`
- **Format:** CSV

The dataset is organized into:

- Individual datasets for each book:
  - **B03**
  - **B28**
  - **B39**
  - **B42**
  - **B54**
- For each book:
  - **Base method**
  - **Method 3 (context-aware prompt switching)**
- Additionally:
  - **Merged dataset (all books, Base method)**
  - **Merged dataset (all books, Method 3)**

---

## Dataset Creation

### Curation Rationale

The dataset was created to study how different prompting strategies affect the accuracy and completeness of triplet extraction from long-form narrative texts in the NovelQA dataset.

---

### Source Data

The source data comes from the NovelQA dataset, which contains question–answer pairs derived from literary works.

---

### Data Collection and Processing

Text segments from selected NovelQA books (B03, B28, B39, B42, B54) were chunked and processed using Cohere’s Command-A model under two prompting strategies:

- **Base Method:** Standard triplet extraction with no additional contextual augmentation.
- **Method 3 – Context-Aware Prompt Switching:** If a pronoun is detected in generated triplets, the model halts generation and switches to a context-aware prompt that includes triplets from the previous chunk as contextual input.

---

### Who are the source data producers?

The original NovelQA dataset was created for question answering over long-form narrative texts. This derivative dataset transforms that content into structured triplets using Cohere’s Command-A LLM.

---

## Annotations [optional]

### Annotation process

No manual annotation. All triplets were automatically generated using Cohere’s Command-A model. No inter-annotator agreement or validation metrics are included.

### Who are the annotators?

Triplets were generated by an automated large language model (Cohere Command-A).

### Personal and Sensitive Information

This dataset does not contain personal, sensitive, or private information. All text is derived from publicly available literary question–answer datasets.
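The context-aware prompt switching described under Data Collection and Processing can be illustrated with a minimal sketch. This is not the authors' implementation: the pronoun list, the prompt wording, the triplet string format, and the `generate` callable (a stand-in for a call to the LLM) are all assumptions made for illustration. The sketch also checks for pronouns after a chunk's triplets have been produced, whereas the card describes halting generation when a pronoun is detected.

```python
import re
from typing import Callable, List

# Assumed pronoun list; the card does not say how pronouns are detected.
PRONOUNS = {
    "he", "she", "it", "they", "him", "her", "them",
    "his", "hers", "its", "their", "theirs",
}


def has_pronoun(triplets: List[str]) -> bool:
    """Return True if any triplet contains a token from PRONOUNS."""
    for triplet in triplets:
        tokens = re.findall(r"[a-z']+", triplet.lower())
        if any(token in PRONOUNS for token in tokens):
            return True
    return False


def build_prompt(chunk: str, context: List[str]) -> str:
    """Build a hypothetical extraction prompt, optionally with prior triplets."""
    prompt = "Extract (subject, relation, object) triplets from the text.\n"
    if context:
        prompt += "Triplets from the previous chunk:\n" + "\n".join(context) + "\n"
    return prompt + "Text:\n" + chunk


def extract_all(
    chunks: List[str],
    generate: Callable[[str], List[str]],
) -> List[List[str]]:
    """Extract triplets per chunk, re-prompting with context when pronouns appear."""
    results: List[List[str]] = []
    previous: List[str] = []
    for chunk in chunks:
        # Base prompt first: no contextual augmentation.
        triplets = generate(build_prompt(chunk, []))
        # Switch to the context-aware prompt only if a pronoun slipped through
        # and there is a previous chunk to draw context from.
        if has_pronoun(triplets) and previous:
            triplets = generate(build_prompt(chunk, previous))
        results.append(triplets)
        previous = triplets
    return results
```

Passing the model call in as `generate` keeps the switching logic independent of any particular client library, so it can be exercised with a stub before being wired to a real model.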
---

## Bias, Risks, and Limitations

- The dataset may reflect linguistic or narrative biases present in the original books and NovelQA dataset
- Triplets generated by LLMs can include hallucinations or inconsistent entity linking
- Long-form context may introduce coreference errors despite mitigation strategies

---

### Recommendations

- Users should evaluate triplet correctness before using for downstream tasks
- Not all generated triplets are validated for factual consistency
- Additional filtering or post-processing is recommended for high-precision applications

---

## Citation [optional]

**BibTeX:**

```bibtex
@dataset{zubac2025novelqatriplets,
  title={NovelQA Triplets},
  author={Marko Zubac and Ognjen Kundačina},
  institution={The Institute for Artificial Intelligence and Development of Serbia},
  year={2025},
  note={Generated using Cohere Command-A LLM},
  url={https://github.com/markozubac/PronounceReplacer}
}
```

Provider:

markozubac
