MongoDB/cosmopedia-wikihow-chunked
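Dataset Overview
This dataset is a chunked version of a curated subset of the WikiHow articles in the Cosmopedia dataset. Each article is split into chunks of no more than two paragraphs.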
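Dataset Structure
Each record represents one chunk of a larger article and contains the following fields:
- doc_id: unique identifier of the parent article
- chunk_id: unique identifier of the chunk
- text_token_length: number of tokens in the chunk text
- text: raw text of the chunk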
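As a minimal sketch of how these fields fit together, the snippet below models a record as a small Python dataclass and groups chunks back under their parent article. The names `Chunk` and `group_by_article` are hypothetical, and ordering chunks by `chunk_id` assumes that the identifier reflects a chunk's position in the article, which the dataset description does not state.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One record of the dataset (field names taken from the list above)."""
    doc_id: int             # identifier of the parent article
    chunk_id: int           # identifier of this chunk
    text_token_length: int  # number of tokens in `text`
    text: str               # raw chunk text


def group_by_article(chunks):
    """Map each doc_id to its chunks, sorted by chunk_id.

    Assumption: chunk_id order matches the order of chunks in the article.
    """
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.doc_id].append(chunk)
    return {
        doc_id: sorted(items, key=lambda c: c.chunk_id)
        for doc_id, items in grouped.items()
    }
```

Grouping by `doc_id` is useful when a retrieved chunk needs its neighbouring chunks as extra context.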
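Use Cases
This dataset can be used to evaluate and test:
- Embedding model performance and RAG
- Retrieval quality of semantic search
- Question-answering performance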
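One possible way to score retrieval quality is sketched below, under stated assumptions: the dataset itself does not ship queries or relevance labels, so the `retrieved` and `relevant` mappings are hypothetical inputs that you would build yourself (for example, by writing one question per chunk). The functions compute hit rate at k and mean reciprocal rank over `chunk_id` values.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose relevant chunk_id is in the top-k results.

    retrieved: dict mapping query -> ranked list of chunk_ids (best first)
    relevant:  dict mapping query -> the single relevant chunk_id
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for q, ranked in retrieved.items() if relevant[q] in ranked[:k])
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the relevant chunk; a miss contributes 0."""
    if not retrieved:
        return 0.0
    total = 0.0
    for q, ranked in retrieved.items():
        if relevant[q] in ranked:
            total += 1.0 / (ranked.index(relevant[q]) + 1)
    return total / len(retrieved)
```

The ranked lists can come from any retriever, so the same two functions allow different embedding models to be compared on identical queries.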
Example Document
A document in MongoDB should look like this:
{
  "_id": { "$oid": "65d93cb0653af71f15a888ae" },
  "doc_id": { "$numberInt": "0" },
  "chunk_id": { "$numberInt": "1" },
  "text_token_length": { "$numberInt": "111" },
  "text": "**Step 1: Choose a Location ** Select a well-draining spot in your backyard, away from your house or other structures, as compost piles can produce odors. Ideally, locate the pile in partial shade or a location with morning sun only. This allows the pile to retain moisture while avoiding overheating during peak sunlight hours.\nKey tip: Aim for a minimum area of 3 x 3 feet (0.9m x 0.9m) for proper decomposition; smaller piles may not generate enough heat for optimal breakdown of materials."
}
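The sample above uses MongoDB Extended JSON wrappers such as `$oid` and `$numberInt`. As a rough sketch, the wrappers can be collapsed into plain Python values with the standard library alone. The helper name `unwrap_extended_json` is hypothetical, it handles only the two wrapper types that appear in the sample, and the `text` value is shortened here for brevity.

```python
import json


def unwrap_extended_json(obj):
    """object_hook that collapses single-key Extended JSON wrappers.

    Only $numberInt and $oid are handled, because those are the only
    wrappers present in the sample document.
    """
    if len(obj) == 1:
        if "$numberInt" in obj:
            return int(obj["$numberInt"])
        if "$oid" in obj:
            return obj["$oid"]  # kept as a plain hex string
    return obj


# Shortened copy of the sample document (the text field is truncated).
raw = """
{
  "_id": { "$oid": "65d93cb0653af71f15a888ae" },
  "doc_id": { "$numberInt": "0" },
  "chunk_id": { "$numberInt": "1" },
  "text_token_length": { "$numberInt": "111" },
  "text": "Select a well-draining spot in your backyard."
}
"""

record = json.loads(raw, object_hook=unwrap_extended_json)
print(record["doc_id"], record["chunk_id"], record["text_token_length"])
```

If PyMongo is installed, `bson.json_util.loads` is an alternative that also covers the remaining Extended JSON types.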



