
chonkie-ai/gacha

Hugging Face | Updated 2025-05-01 | Indexed 2025-09-13
Download link:
https://hf-mirror.com/datasets/chonkie-ai/gacha
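Description:
Gacha is a corpus of 100 texts collected from Project Gutenberg for evaluating chunking algorithms. It is inspired by the LumberChunker dataset but includes the original texts, which makes it easier to extend to new chunking methods and models. The dataset has three parts: the raw texts, the corpus of 100 texts with metadata, and a set of questions and answers from the GutenQA corpus. It is currently released under the CC-BY-NC-SA-4.0 license, with plans to adopt a more permissive license in the future.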

Gacha is a corpus of 100 texts from Project Gutenberg, collected to evaluate chunking algorithms. It is heavily inspired by LumberChunker. The paper "LumberChunker: Long-Form Narrative Document Segmentation" released a dataset named GutenQA, which covers the same 100 texts as this corpus. However, because GutenQA does not contain the original texts, it is hard to extend to newer chunking methods and newer models as they come out. Chunking is an active area of research, and new methods are being developed all the time. We wanted to evaluate the performance of chunking algorithms, and a dataset that contains the original texts is crucial for this. To this end, we spent a few weeks creating this dataset: the exact books were extracted from Project Gutenberg and manually cleaned of formatting issues, metadata (authors, titles, publication dates, etc.) was extracted via LLMs, and the questions from the GutenQA corpus were aligned to fit the new texts. Given the time and work that went into it, we decided to share the corpus with the community as Gacha.
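To illustrate why having the original texts matters for evaluating chunkers, here is a minimal sketch in Python. It is not the evaluation procedure used by Gacha, GutenQA, or LumberChunker: the `QAItem` record shape, the fixed-size character chunker, and the `evidence_recall` metric are all assumptions made for illustration, and the sample text is invented rather than taken from the corpus. The idea is simply that, given a raw text plus question/evidence pairs, any new chunking method can be scored by checking whether each evidence passage survives intact inside a single chunk.

```python
from dataclasses import dataclass


@dataclass
class QAItem:
    # Hypothetical record shape; the real Gacha/GutenQA field names may differ.
    question: str
    evidence: str  # passage from the source text that answers the question


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into character windows of `size` that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def evidence_recall(chunks: list[str], items: list[QAItem]) -> float:
    """Fraction of questions whose evidence passage lies entirely inside one chunk."""
    if not items:
        return 0.0
    hits = sum(any(item.evidence in chunk for chunk in chunks) for item in items)
    return hits / len(items)


# Invented sample text and questions, standing in for a book and its QA pairs.
text = "The quick brown fox jumps over the lazy dog. It then sleeps under the old oak tree."
items = [
    QAItem("What does the fox jump over?", "the lazy dog"),
    QAItem("Where does it sleep?", "under the old oak tree"),
]

small = fixed_size_chunks(text, size=20)
large = fixed_size_chunks(text, size=60, overlap=20)
print(evidence_recall(small, items), evidence_recall(large, items))
```

In this toy case the 20-character windows split both evidence passages across chunk boundaries, while the larger overlapping windows keep both intact. Swapping `fixed_size_chunks` for another chunker and rerunning the same check is only possible when the raw text is available.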
Provided by:
chonkie-ai