
sapienzanlp/INDAQA_CALAMITA

Hugging Face | Updated: 2026-02-24 | Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/sapienzanlp/INDAQA_CALAMITA
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: qas
    list:
    - name: answers
      list: string
    - name: choices
      list: string
    - name: entity
      dtype: string
    - name: kind
      dtype: string
    - name: model
      dtype: string
    - name: question
      dtype: string
    - name: question_id
      dtype: string
    - name: source_paragraphs_ids
      list: int64
    - name: source_questions_ids
      list: int64
    - name: target
      struct:
      - name: label
        dtype: string
      - name: text
        dtype: string
  - name: metadata
    struct:
    - name: author
      dtype: string
    - name: genres
      list: string
    - name: qa_paragraphs
      list: string
    - name: source_link
      dtype: string
    - name: subgenres
      list: string
    - name: summary
      dtype: string
    - name: summary_length
      dtype: int64
    - name: summary_link
      dtype: string
    - name: text_length
      dtype: int64
    - name: title
      dtype: string
    - name: year
      dtype: int64
  splits:
  - name: summary_level
    num_bytes: 64898361
    num_examples: 361
  - name: passage_level
    num_bytes: 48887194
    num_examples: 99
  download_size: 68897304
  dataset_size: 113785555
configs:
- config_name: default
  data_files:
  - split: summary_level
    path: data/summary_level-*
  - split: passage_level
    path: data/passage_level-*
---

<div align="center"><img src="assets/indaqa2.crop.alpha.png" width="700"></div>

<div style="display: flex; justify-content: center; align-items: center; gap: 8px;">
<a href="https://www.evalita.it/campaigns/evalita-2026/" style="line-height: 0;"><img src="http://img.shields.io/badge/EVALITA-2026-forestgreen.svg" style="display: block; margin: 0;"/></a>
<a href="https://github.com/Andrew-Wyn/INDAQA_CALAMITA" style="line-height: 0;"><img src="https://img.shields.io/badge/GitHub-INDAQA2-purple" style="display: block; margin: 0;"/></a>
</div>

# Dataset Card for INDAQA 2

**INDAQA 2 (CALAMITA update)** is a large-scale Italian reading-comprehension and question-answering benchmark built from classic narrative works.
The dataset is designed to support research in Italian NLP, reading comprehension, information retrieval, and language model evaluation on medium- and long-context narratives. It is released as part of the CALAMITA 2026 edition.

## Dataset Details

### Dataset Description

**INDAQA 2** is a substantial revision and extension of the previous version, [INDAQA](https://huggingface.co/datasets/sapienzanlp/indaqa). It contains two complementary splits:

- **summary_level**: **362** books; **13,661** open-ended QA items generated from book summaries.
- **passage_level**: **99** books; **11,560** open-ended QA items generated from single passages or clusters of passages tied to target entities. QA items from single passages are also provided in a multiple-choice format (MCQA).

### Summary-level Section

This section includes **13,661 open-ended QA items** generated from the summary of each book, following the style of [NarrativeQA](https://aclanthology.org/Q18-1023/). For further details, please refer to the previous dataset, [INDAQA](https://huggingface.co/datasets/sapienzanlp/indaqa). We retain all original QA items from INDAQA (previously validated and confirmed to be of high quality), while re-downloading all source texts from scratch to fix inconsistencies and broken texts found in the original dataset.

### Passage-level Section

This section includes **11,560 QA items** organized in three question sets:

- **Local Questions**: 7,854 items - These questions are generated from a single passage (~20 sentences) randomly selected at runtime. They typically focus on specific details explicitly stated in the text.
- **Alternative Local Questions**: 2,308 items - These questions are also generated from a single passage, but the LLM is additionally given the sample previously generated for the first set, which encourages less obvious questions.
- **Entity Questions**: 1,388 items - These questions are generated from three passages in which an entity consistently appears, selected from the beginning, middle, and ending sections of the document. The passages, together with the questions about that entity, are provided as input to generate final samples that target overarching plot elements and character development across the entire narrative.

### Dataset Statistics

**General dataset statistics for the summary-level and passage-level sets**

| Metric | Summary-level | Passage-level |
|--------|:---------------:|:-------------:|
| Number of documents | 362 | 99 |
| Total QA items | 13,661 | 11,560 |
| QA items per document (Mean ± Std) | 38 ± 2 | 117 ± 20 |
| **Document length (words)** | | |
| Min-Max | 0.5K - 242K | 8K - 188K |
| Mean ± Std | 26K ± 33K | 58K ± 31K |

**QA item length statistics by question type (word count, Mean ± Std)**

| Question Type | Items per document | Question length | Answer length |
|---------------|:-----------------:|:-----------------:|:---------------:|
| Summary Question | 38 ± 2 | 7 ± 2 | 5 ± 3 |
| Local Question | 80 ± 14 | 8 ± 2 | 4 ± 2 |
| Local Question (Alternative) | 23 ± 5 | 9 ± 3 | 6 ± 4 |
| Entity Question | 14 ± 6 | 13 ± 3 | 24 ± 8 |

### Dataset Structure

The dataset is released as a `DatasetDict` with two splits:

```python
DatasetDict({
    summary_level: Dataset({
        features: ['id', 'qas', 'text', 'metadata'],
        num_rows: 361
    })
    passage_level: Dataset({
        features: ['id', 'qas', 'text', 'metadata'],
        num_rows: 99
    })
})
```

<details>
<summary><b>Data schema</b></summary>

Each split uses the same schema:

- **id** `str`: unique identifier for the book or text unit.
- **text** `str`: text of the document.
- **qas** `list[dict]`: QA entries associated with the document.
  - **question_id** `str`: unique ID for the QA item.
  - **question** `str`: the question text.
  - **answers** `list`: list of free-form reference answers.
  - **choices** `list`: list of MCQA options (present for MCQA items).
  - **target** `dict`:
    - **label** `str`: correct MCQA label (e.g., `"C"`).
    - **text** `str`: canonical correct answer.
  - **entity** `str`: entity targeted by the question (nullable).
  - **model** `str`: generator model used (e.g., `"gemini-2.5-flash"`).
  - **kind** `str`: question type (e.g., `"summary_question"`).
  - **source_paragraphs_ids** `list`: list of paragraph indices used to generate the sample.
  - **source_questions_ids** `list`: list of question indices used to generate the sample.
- **metadata** `dict`: book-level metadata.
  - **title** `str`: title of the work.
  - **author** `str`: author name.
  - **year** `int`: publication year (when available).
  - **genres** `list[str]`: main literary genres.
  - **subgenres** `list[str]`: granular genre tags.
  - **summary** `str`: book summary used in `summary_level`.
  - **summary_length** `int`: length of the summary (in words).
  - **text_length** `int`: length of the text (in words).
  - **source_link** `str`: link to the text source.
  - **summary_link** `str`: link to the summary source.
  - **qa_paragraphs** `list[str]`: list of text chunks used to generate the QAs.

</details>

**Note:** some fields are not available for certain question kinds. The examples below show the most important differences, with one sample for each kind of QA item.

<details>
<summary><b>Data examples</b></summary>

```json
// summary_level
{
  "answers": [
    "In un villaggio della Foresta Nera.",
    "Nella Foresta Nera, in un villaggio."
  ],
  "choices": [], // not available
  "entity": null, // not available
  "kind": "summary_question",
  "model": "gemini2-flash",
  "question": "Dove si svolge la festa di fidanzamento iniziale?",
  "question_id": "000_le_villi.summary.0",
  "source_paragraphs_ids": [], // not available
  "source_questions_ids": [], // not available
  "target": { // not available
    "label": null,
    "text": null
  }
}

// passage_level - Local Question sample
{
  "answers": [
    "Giacometta Maldi",
    "Giacometta"
  ],
  "choices": [
    "A. Carolina",
    "B. Elena",
    "C. Giacometta Maldi",
    "D. Geltrude"
  ],
  "entity": null,
  "kind": "local_question",
  "model": "gemini-2.5-flash",
  "question": "Come si chiama la giovane donna al centro delle attenzioni per il matrimonio?",
  "question_id": "00_ahi_giacometta_la_tua_ghirlandella.set-a.1",
  "source_paragraphs_ids": [0],
  "source_questions_ids": [], // not available
  "target": {
    "label": "C",
    "text": "Giacometta Maldi"
  }
}

// passage_level - Alternative Local Question sample
{
  "answers": [
    "Biondi",
    "Erano biondi"
  ],
  "choices": [
    "A. Neri",
    "B. Biondi",
    "C. Castani",
    "D. Rossi"
  ],
  "entity": null,
  "kind": "local_question_alt",
  "model": "gemini-2.5-flash",
  "question": "Di che colore erano i capelli di Giacometta?",
  "question_id": "00_ahi_giacometta_la_tua_ghirlandella.set-b.1",
  "source_paragraphs_ids": [0],
  "source_questions_ids": [], // not available
  "target": {
    "label": "B",
    "text": "Biondi"
  }
}

// passage_level - Entity Question sample
{
  "answers": ["La sua eccentricità e la tendenza a comportarsi in modo inappropriato o fuori luogo."],
  "choices": [], // not available
  "entity": "adalgisa",
  "kind": "entity_question",
  "model": "gemini-2.5-flash",
  "question": "Qual è una caratteristica distintiva del personaggio di Adalgisa?",
  "question_id": "00_ahi_giacometta_la_tua_ghirlandella.",
  "source_paragraphs_ids": [4, 8],
  "source_questions_ids": [0, 2, 4],
  "target": { // not available
    "label": null,
    "text": null
  }
}
```

</details>

### Dataset Creation

The dataset is built from a total of **461** narrative works (novels, novellas, short stories, screenplays, poems) written in Italian. These texts were selected from public domain collections to ensure legal availability. The majority of these texts were published between 1827 and 1948. All QA items were generated using a specific version of **Gemini**; the version is stored, along with other information, in each QA sample.
QA items were deduplicated, filtered (using RegExp- and LLM-based approaches) to remove low-quality questions, and finally validated by two expert annotators (either native or proficient in Italian).

### Dataset Sources

- [**GitHub Repository**](https://github.com/Andrew-Wyn/INDAQA_CALAMITA)
- **Paper**: [More Information Needed]

Source texts were gathered from the following sources:

- [Project Gutenberg](https://www.gutenberg.org/)
- [Wikisource](https://it.wikisource.org/wiki/)
- [LiberLiber](https://liberliber.it/)

Summaries and additional metadata were sourced from:

- [Wikipedia](https://it.wikipedia.org/)

## Uses

This dataset is intended for:

1. **Reading Comprehension Research**: Training and evaluating models on Italian reading comprehension
2. **Question Answering Systems**: Developing and benchmarking QA models for Italian
3. **Information Retrieval**: Evaluating semantic search and ranking systems (e.g., E5, BM25)
4. **Language Model Evaluation**: Benchmarking LLMs on Italian understanding tasks
5. **Literary Analysis**: Studying narrative structures and character development
6. **NLP Downstream Tasks**: Fine-tuning language models on Italian

### Not Recommended Use

- This dataset should not be used for commercial reproduction of copyrighted literary works (though the source texts are in the public domain)
- Not suitable for tasks requiring contemporary language (the texts are historical)

## Bias, Risks, and Limitations

1. **Historical Background**: Texts use Italian from 1827-1948, which includes archaic vocabulary and grammatical forms.
   Moreover, the collection consists primarily of works by male authors from this era (reflecting historical publication patterns), so the narratives may contain outdated attitudes and perspectives from the 19th and 20th centuries.
2. **LLM-generated**: While quality-controlled, QA items are generated by an LLM and may contain hallucinations or reflect the LLM's biases.

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information is needed for further recommendations.

## Citation

**BibTeX:**

```bibtex
TBA
```
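
As a rough illustration of how the schema described under Dataset Structure might be consumed, the sketch below walks the nested `qas` entries, counts them by `kind`, and scores a multiple-choice predictor against `target.label`. It assumes each row exposes `qas` as a list of dicts, as the schema suggests. The helper names (`iter_qas`, `count_by_kind`, `mcqa_accuracy`, `always_b`) and the document `id` value are hypothetical and not part of the dataset or its tooling. The sample records are abridged from the card's own examples so the code runs offline.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, Sequence, Tuple

# With network access, a split could be loaded with the Hugging Face
# `datasets` library instead of the in-memory sample below, e.g.:
#   from datasets import load_dataset
#   records = load_dataset("sapienzanlp/INDAQA_CALAMITA", split="passage_level")

# Abridged records based on the card's examples; only the fields used here.
SAMPLE_RECORDS = [
    {
        "id": "00_ahi_giacometta_la_tua_ghirlandella",  # assumed id, for illustration
        "qas": [
            {
                "question_id": "00_ahi_giacometta_la_tua_ghirlandella.set-a.1",
                "kind": "local_question",
                "question": "Come si chiama la giovane donna al centro "
                            "delle attenzioni per il matrimonio?",
                "choices": ["A. Carolina", "B. Elena",
                            "C. Giacometta Maldi", "D. Geltrude"],
                "target": {"label": "C", "text": "Giacometta Maldi"},
            },
            {
                "question_id": "00_ahi_giacometta_la_tua_ghirlandella.set-b.1",
                "kind": "local_question_alt",
                "question": "Di che colore erano i capelli di Giacometta?",
                "choices": ["A. Neri", "B. Biondi", "C. Castani", "D. Rossi"],
                "target": {"label": "B", "text": "Biondi"},
            },
            {
                "question_id": "00_ahi_giacometta_la_tua_ghirlandella.",
                "kind": "entity_question",
                "question": "Qual è una caratteristica distintiva del "
                            "personaggio di Adalgisa?",
                "choices": [],  # entity questions have no MCQA options
                "target": {"label": None, "text": None},
            },
        ],
    }
]


def iter_qas(records: Iterable[Dict]) -> Iterator[Tuple[str, Dict]]:
    """Yield (document id, QA dict) pairs from document-level records."""
    for record in records:
        for qa in record["qas"]:
            yield record["id"], qa


def count_by_kind(records: Iterable[Dict]) -> Counter:
    """Count QA items per question kind."""
    return Counter(qa["kind"] for _, qa in iter_qas(records))


def mcqa_accuracy(
    records: Iterable[Dict],
    predict: Callable[[str, Sequence[str]], str],
) -> float:
    """Accuracy of `predict` over items that have choices and a target label."""
    correct = 0
    total = 0
    for _, qa in iter_qas(records):
        label = (qa.get("target") or {}).get("label")
        if not qa.get("choices") or label is None:
            continue  # skip open-ended items (summary and entity questions)
        total += 1
        if predict(qa["question"], qa["choices"]) == label:
            correct += 1
    return correct / total if total else 0.0


def always_b(question: str, choices: Sequence[str]) -> str:
    """Trivial baseline that ignores its input and answers 'B'."""
    return "B"


kind_counts = count_by_kind(SAMPLE_RECORDS)
baseline_accuracy = mcqa_accuracy(SAMPLE_RECORDS, always_b)
```

On the three abridged samples, the always-`B` baseline is scored only on the two items that carry choices, which is why open-ended kinds are skipped rather than counted as errors. A real evaluation would replace `always_b` with a model call that returns one of the option letters.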
Provider:
sapienzanlp