CZLC/history_retrieval

Name: CZLC/history_retrieval
Creator: CZLC
Published: 2024-08-21 13:42:37
License: no description provided

Hugging Face, updated 2024-08-21, indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/CZLC/history_retrieval


Resource description:

---
license: cc-by-4.0
---

## Introduction

The HistoryIR dataset was annotated on top of the historical part of the [BUT-LCC](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) corpus. We asked annotators to search for historical events (either from memory or using our inspirator; more details in the upcoming paper) with the semantic search tool we developed (a translation service combined with an English contriever model). The annotators then labeled the top retrieved passages as relevant or irrelevant.

We applied an additional filtering step that included manual verification of several annotations from every annotator. We only included annotations from the time period (specific to each annotator) that passed the manual verification test.

## Creating Multi-Choice format

To create the multi-choice format, we sample quadruplets of relevant/irrelevant documents for each query such that either

- 3 documents are irrelevant and 1 document is relevant, or
- 3 documents are relevant and 1 document is irrelevant.

The task is then to identify the inconsistent document (e.g. identify the single relevant document among 4, or the single irrelevant document among 4).

The multi-choice format conversion script is available in the repository as [convert_histir_filtered.py](https://huggingface.co/datasets/CZLC/history_retrieval/blob/main/convert_histir_filtered.py).

## Licensing

The historical documents are not owned by CZLC affiliated members and belong to the original authors. The annotations are released under the [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license, and the extra code is released under the [Apache-2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
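The quadruplet sampling described in the multi-choice section can be illustrated with a minimal sketch. This is not the repository's `convert_histir_filtered.py`; the function name `sample_quadruplet`, its arguments, and the mode labels are hypothetical and chosen only for illustration. It assumes each query comes with a list of relevant and a list of irrelevant passages.

```python
import random


def sample_quadruplet(relevant, irrelevant, rng):
    """Build one 4-document example with a single odd-one-out document.

    Hypothetical helper, not the dataset's actual conversion script.
    Returns (documents, odd_index, mode), or None when the query has
    too few documents for either configuration.
    """
    modes = []
    if len(relevant) >= 1 and len(irrelevant) >= 3:
        modes.append("1pos_3neg")  # find the single relevant document
    if len(relevant) >= 3 and len(irrelevant) >= 1:
        modes.append("3pos_1neg")  # find the single irrelevant document
    if not modes:
        return None

    mode = rng.choice(modes)
    if mode == "1pos_3neg":
        odd = rng.choice(relevant)
        majority = rng.sample(irrelevant, 3)
    else:
        odd = rng.choice(irrelevant)
        majority = rng.sample(relevant, 3)

    # Place the odd document at a random position among the four.
    odd_index = rng.randrange(4)
    documents = list(majority)
    documents.insert(odd_index, odd)
    return documents, odd_index, mode
```

Passing an explicit `random.Random(seed)` instance keeps the sampling reproducible, which matters if the multi-choice split is to be regenerated later.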
## Citation

If you use this dataset, please cite the following BibTeX entry:

```
@misc{fajcik2024czech,
  author = {Martin Fajčík and Martin Dočekal and Jakub Štetina and Michal Hradiš},
  title = {HistIrCzech: A Czech Corpus for Historical Document Retrieval},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/CZLC/history_retrieval}},
  note = {Dataset published on Hugging Face}
}
```

We still seek to extend the size and variability of our IR dataset before its final publication.

## Additional statistics

```
Dataset size: 1196
Average positive (1 pos, 3 neg) word length: 100.05287713841369
Average negative (1 pos, 3 neg) word length: 84.81804043545878
Average positive (3 pos, 1 neg) word length: 95.6931886678722
Average negative (3 pos, 1 neg) word length: 85.69620253164557
3pos_1neg examples: 553
1pos_3neg examples: 643
```

<img src="document_lengths_histogram.png" width="900"/>
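As a rough illustration of how figures like the ones above could be computed, here is a small sketch. It assumes that "word length" means the number of whitespace-separated words per document, which the dataset card does not state explicitly; the helper name `average_word_length` is hypothetical.

```python
def average_word_length(texts):
    """Mean number of whitespace-separated words per text.

    Assumption: "word length" is a simple whitespace token count;
    the dataset's own statistics may use a different tokenization.
    """
    if not texts:
        return 0.0
    return sum(len(text.split()) for text in texts) / len(texts)


# The two example counts reported above add up to the dataset size.
counts = {"3pos_1neg": 553, "1pos_3neg": 643}
total_examples = sum(counts.values())
```

Here `total_examples` evaluates to 1196, matching the reported dataset size.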

Provider:

CZLC
