mohit-raghavendra/self-instruct-wikipedia
Dataset Details
Dataset Description
- Dataset name: not stated explicitly in the card; judging by the content, it is a search-term annotation dataset derived from TriviaQA.
- Data source: a subsample of the TriviaQA dataset, specifically the first 1% of its training split.
- Dataset size: 116,267 bytes
- Number of samples: 1,384
- Download size: 82,027 bytes
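- Features:
  - question: the question, stored as a string.
  - query_terms: the search terms, stored as a string.
- Data splits:
  - train: the training split, containing 1,384 samples (116,267 bytes).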
Dataset Creation
Data Collection and Processing
- Data subsample: the first 1% of samples from the TriviaQA training split.
- Loading code (Python): `datasets.load_dataset("trivia_qa", "rc.nocontext", split="train[:1%]")`
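The `train[:1%]` slice can be sanity-checked against the sample count reported in this card. The sketch below is only illustrative: it assumes the `rc.nocontext` training split has 138,384 rows (a figure not stated in this card) and that percent boundaries are rounded to the closest row, which is the default behavior of the `datasets` library as far as I know. The helper name `percent_slice_size` is hypothetical.

```python
def percent_slice_size(num_rows: int, percent: float) -> int:
    """Number of rows selected by a `[:percent%]` slice, rounding to the closest row."""
    return int(round(num_rows * percent / 100.0))


# Assumption: size of the TriviaQA rc.nocontext train split (not given in this card).
ASSUMED_TRAIN_ROWS = 138_384

subsample_rows = percent_slice_size(ASSUMED_TRAIN_ROWS, 1)
print(subsample_rows)  # 1384 under the assumed split size
```

Under these assumptions the result agrees with the 1,384 samples listed for the train split.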
Annotation
- Initial annotation: the first 30 samples were labeled manually by the author.
- Model annotation: the rest of the dataset was labeled with the Gemini-Pro model, using the first 30 samples as k-shot examples (k=10).
- System message (Python):
SYSTEM_MESSAGE = f"""There exists a wikipedia summarizer that can return a summary for a topic. Your job is to act as an aid to a question answering tool. Whenever you are asked about a question related to general knowledge, instead of using your internal knowledge (which can be faulty or out of date), format a Wikipedia search query string that can help answer the question.
Wikipedia Entries are usually about a simple entity or event, so keep the query short, and about the entity being asked about. Also, dont use your knowledge to ask about the answer. Instead form queries about the entity in the question. This will help you get the right wikipedia entries for questions when you dont know the answer """
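A k-shot prompt of this kind could be assembled roughly as follows. This is a minimal sketch, not the author's pipeline: the card does not say how 10 shots are chosen from the 30 hand-labeled samples, so random sampling is an assumption here, as are the prompt layout, the function name `build_kshot_prompt`, and the placeholder seed rows. The call to the model itself is deliberately left out.

```python
import random


def build_kshot_prompt(system_message, seed_examples, question, k=10, seed=0):
    """Build a text prompt from a system message, k labeled shots, and a new question."""
    rng = random.Random(seed)
    # Assumption: shots are drawn at random from the hand-labeled pool.
    shots = rng.sample(seed_examples, k)
    lines = [system_message.strip(), ""]
    for shot in shots:
        lines.append(f"Question: {shot['question']}")
        lines.append(f"Query terms: {shot['query_terms']}")
        lines.append("")
    # The model is expected to complete the final, unanswered "Query terms:" line.
    lines.append(f"Question: {question}")
    lines.append("Query terms:")
    return "\n".join(lines)


# Hypothetical stand-ins for the 30 hand-labeled rows (not real dataset content).
seed_examples = [
    {"question": f"Example question {i}?", "query_terms": f"example entity {i}"}
    for i in range(30)
]

prompt = build_kshot_prompt(
    "System message goes here.",  # placeholder for SYSTEM_MESSAGE
    seed_examples,
    "Which river flows through Paris?",
)
print(prompt)
```

Fixing the random seed keeps the selected shots reproducible across runs, which makes it easier to audit model-generated labels later.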
Use Cases
- Application: fine-tuning an agent that, given a question, produces relevant terms to search for in Wikipedia.
Dataset Author
- Author: Mohit Raghavendra
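For fine-tuning, each row could be turned into a prompt/completion pair. The sketch below assumes a generic JSONL format with `prompt` and `completion` keys; the card does not specify a training format, so the key names, the prompt wording, and the helper `to_training_pair` are all hypothetical. The two rows are invented for illustration and are not taken from the dataset.

```python
import json


def to_training_pair(row):
    """Map a {question, query_terms} row to a prompt/completion pair (assumed format)."""
    return {
        "prompt": f"Question: {row['question']}\nWikipedia search query:",
        "completion": " " + row["query_terms"],
    }


# Invented example rows using the card's two string features.
rows = [
    {"question": "Who wrote the novel Dracula?", "query_terms": "Dracula (novel)"},
    {"question": "In which country is Mount Kilimanjaro?", "query_terms": "Mount Kilimanjaro"},
]

# One JSON object per line, the usual JSONL convention.
jsonl = "\n".join(json.dumps(to_training_pair(r)) for r in rows)
print(jsonl)
```

Note that the query terms name the entity in the question rather than the answer, in line with the system message used for annotation.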



