Laz4rz/wikipedia_science_chunked_small_rag_512
Source: Hugging Face. Updated 2024-06-12; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/Laz4rz/wikipedia_science_chunked_small_rag_512
Resource description:
The ScienceWikiSmallChunk dataset is a processed version of millawell/wikipedia_field_of_science, designed for RAG systems with small context lengths. The exact length of each chunk depends on the tokenizer, but each chunk should be around 512 tokens. Longer Wikipedia pages have been split into smaller entries, each with the page title added as a prefix. A 256-token version is also available: Laz4rz/wikipedia_science_chunked_small_rag_256. To prepare chunks of a different length, start from millawell/wikipedia_field_of_science and adapt the chunker function.
Provided by:
Laz4rz
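Original information summary
ScienceWikiSmallChunk
Overview
- Name: ScienceWikiSmallChunk
- Tags: RAG, Retrieval Augmented Generation, Small Chunks, Wikipedia, Science, Scientific, Scientific Wikipedia, Science Wikipedia, 512 tokens
- License: cc-by-sa-3.0
- Task categories: text-generation, text-classification, question-answering
Description
- Dataset source: processed from millawell/wikipedia_field_of_science
- Intended use: RAG systems with small context lengths
- Chunk length: tokenizer-dependent; each chunk is around 512 tokens
- Processing: longer Wikipedia pages have been split into smaller entries, each with the page title added as a prefix
Other information
- 256-token dataset: Laz4rz/wikipedia_science_chunked_small_rag_256
- Custom chunk length: use millawell/wikipedia_field_of_science and adapt the chunker function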
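
The dataset's own chunker function is not reproduced on this page, so the following is only a minimal sketch of the general idea: split a page into pieces that fit a token budget and prefix each piece with the page title. The name `chunk_page`, its parameters, the `"Title: "` separator, and the default whitespace tokenizer are all assumptions made for illustration. Because chunk length is tokenizer-dependent, `tokenize` and `detokenize` should be replaced with the tokenizer of the target model.

```python
from typing import Callable, List


def chunk_page(
    title: str,
    text: str,
    max_tokens: int = 512,
    # Assumption: whitespace splitting stands in for a real model tokenizer.
    tokenize: Callable[[str], List[str]] = str.split,
    detokenize: Callable[[List[str]], str] = " ".join,
) -> List[str]:
    """Split one page into title-prefixed chunks of at most `max_tokens` tokens."""
    # Assumption: the prefix format "Title: " is illustrative only.
    prefix = f"{title}: "
    # Reserve room for the prefix so the whole chunk stays within the budget.
    budget = max_tokens - len(tokenize(prefix))
    if budget <= 0:
        raise ValueError("max_tokens is too small to fit the title prefix")
    tokens = tokenize(text)
    # Walk the token list in steps of `budget`; the last chunk may be shorter.
    return [
        prefix + detokenize(tokens[start:start + budget])
        for start in range(0, len(tokens), budget)
    ]


if __name__ == "__main__":
    page = " ".join(f"w{i}" for i in range(1200))
    for chunk in chunk_page("Physics", page, max_tokens=512):
        print(len(chunk.split()), chunk[:40])
```

Changing `max_tokens` to 256 would produce chunks comparable in size to the 256-token variant, although the real datasets may split on different boundaries (for example sentences or paragraphs) than this fixed-window sketch does.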



