andersonbcdefg/synthetic_retrieval_tasks

Name: andersonbcdefg/synthetic_retrieval_tasks
Creator: andersonbcdefg
Published: 2024-02-03 04:30:28
License: No description provided

Source: Hugging Face. Updated: 2024-02-03. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/andersonbcdefg/synthetic_retrieval_tasks


Resource description:

This dataset contains synthetic data for generating embedding training data. It was produced over several iterations, each of which used previously generated tasks as seed tasks and prompted a GPT model to generate more. The first iteration started from an initial set of seed tasks; the second used the roughly 40,000 tasks generated in the first iteration; the third used the roughly 80,000 tasks generated across the first two iterations.

Provider:

andersonbcdefg

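Summary of Original Information

Dataset Overview

This dataset contains synthetic retrieval prompts for generating embedding training data. The "iteration" column indicates the stage of the generation process in which the data was produced.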
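As a rough illustration of how the "iteration" column might be used, the sketch below counts and filters rows by iteration. It works on a small in-memory list so that it is self-contained; only the "iteration" column name comes from the description above. The "task" field, the integer iteration values, and the helper names `count_by_iteration` and `select_iteration` are assumptions made for this example, and the real dataset schema may differ.

```python
from collections import Counter

# Hypothetical rows: only the "iteration" column is named in the source.
# The "task" field and the integer values are assumptions for illustration.
rows = [
    {"iteration": 1, "task": "Given a question, retrieve Wikipedia passages that answer the question."},
    {"iteration": 1, "task": "Find Amazon reviews similar to the input review."},
    {"iteration": 2, "task": "(placeholder for a task generated in iteration 2)"},
    {"iteration": 3, "task": "(placeholder for a task generated in iteration 3)"},
]


def count_by_iteration(rows):
    """Return a Counter mapping each iteration value to its row count."""
    return Counter(row["iteration"] for row in rows)


def select_iteration(rows, iteration):
    """Return only the rows produced in the given iteration."""
    return [row for row in rows if row["iteration"] == iteration]


counts = count_by_iteration(rows)
first_round = select_iteration(rows, 1)
```

The same two helpers should apply unchanged to rows read from the hosted dataset, provided each row behaves like a dictionary with an "iteration" key.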

Data Generation Iterations

Iteration 1:
- GPT-3.5-Turbo was prompted to generate additional tasks, using the following pool of seed tasks.
- Seed task examples (Python):

      RETRIEVAL_EXAMPLES = [
          "Provide a scientific claim as query, retrieve documents that help verify or refute the claim.",
          "Search for documents that answers a FAQ-style query on children's nutrition.",
          "Retrieve company's financial reports for a given stock ticker symbol.",
          "Given a book name as a query, retrieve reviews, ratings and summaries of that book.",
          "Search for scientific research papers supporting a medical diagnosis for a specified disease.",
          "Given a question, retrieve Wikipedia passages that answer the question.",
          "Provided a user question, retrieve the highest voted answers on Reddit ELI5 forum.",
          "Given a web search engine query, retrieve relevant passages that answer the query.",
          "Find Amazon reviews similar to the input review.",
          "Find the song lyrics most related to the user's search.",
          "Given a multi-hop question, retrieve documents that can help answer the question.",
          "Retrieve tweets that are semantically similar to the given tweet",
          "Given a news summary, retrieve other semantically similar summaries",
          "Given a question, retrieve relevant answers from Stackexchange",
          "Given a scientific paper title, retrieve paper abstracts that are cited by the given paper.",
      ]
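Iteration 2:
- GPT-3.5-Turbo was prompted to generate additional tasks, using the roughly 40,000 tasks generated in Iteration 1 as seed tasks.
Iteration 3:
- GPT-4-Turbo was prompted to generate additional tasks, using the roughly 80,000 tasks generated in Iterations 1-2 as seed tasks.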
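The bootstrapping loop described above can be sketched roughly as follows. This is not the author's actual pipeline: the prompt wording, the number of examples sampled per prompt, the exact-duplicate filtering, and the function names `build_generation_prompt` and `next_seed_pool` are all assumptions for illustration. The sketch only builds the prompt text and the next seed pool; sending the prompt to a model is left out.

```python
import random


def build_generation_prompt(seed_pool, num_examples=3, rng=None):
    """Sample a few seed tasks and format them into a prompt string.

    The prompt wording is hypothetical; the source does not give the real one.
    """
    rng = rng or random.Random()
    examples = rng.sample(seed_pool, k=min(num_examples, len(seed_pool)))
    lines = ["Here are some example retrieval tasks:"]
    lines += [f"- {task}" for task in examples]
    lines.append("Write new retrieval tasks in the same style, one per line.")
    return "\n".join(lines)


def next_seed_pool(*generated_rounds):
    """Merge tasks from earlier rounds into one pool, keeping first-seen order.

    Dropping exact duplicates is an assumption, not something the source states.
    """
    seen = set()
    pool = []
    for tasks in generated_rounds:
        for task in tasks:
            if task not in seen:
                seen.add(task)
                pool.append(task)
    return pool


# Tiny stand-ins for the outputs of Iteration 1 and Iteration 2.
round_1 = [
    "Given a question, retrieve Wikipedia passages that answer the question.",
    "Find Amazon reviews similar to the input review.",
]
round_2 = [
    "Find Amazon reviews similar to the input review.",
    "Given a news summary, retrieve other semantically similar summaries",
]

# Iteration 3 seeds come from Iterations 1-2 combined.
pool_for_round_3 = next_seed_pool(round_1, round_2)
prompt = build_generation_prompt(pool_for_round_3, num_examples=2, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sampled examples reproducible, which helps when comparing prompts across runs.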
