five

Locutusque/dibt-instruct

收藏
Hugging Face2024-03-15 更新2024-06-22 收录
下载链接:
https://hf-mirror.com/datasets/Locutusque/dibt-instruct
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: question dtype: string - name: answer dtype: string splits: - name: train num_bytes: 6743305 num_examples: 3340 download_size: 3803007 dataset_size: 6743305 configs: - config_name: default data_files: - split: train path: data/train-* license: other task_categories: - text-generation - text2text-generation - fill-mask language: - en size_categories: - 1K<n<10K --- # Dataset Card for dibt-instruct `Locutusque/dibt-instruct` is a dataset derived from the `10k_prompts_ranked` dataset, where an answer has been generated for each prompt using Google's Gemini Pro language model. Before augmenting, 5,000 prompts were sampled from the original 10,000 prompts, and those with a quality score less than or equal to 3.5 were removed, resulting in 3,340 prompt-answer pairs. ## Dataset Details - **Curated by:** Derived from the `10k_prompts_ranked` dataset created by Argilla, Hugging Face, and the Prompt Collective community. - **Language:** English - **License:** Inherited from `10k_prompts_ranked` dataset [More Information Needed] ## Dataset Description This augmented dataset contains 3,340 examples, each consisting of a prompt from the original `10k_prompts_ranked` dataset and a generated answer using Google's Gemini Pro language model. The prompts were filtered to only include those with an average quality rating greater than 3.5 out of 5 in the original dataset. ## Dataset Creation ### Source Data The source data is the `10k_prompts_ranked` dataset, which contains 10,331 prompts with quality rankings from 314 community members. ### Data Augmentation 1. 5,000 prompts were randomly sampled from the `10k_prompts_ranked` dataset. 2. Prompts with an average quality score <= 3.5 were removed, leaving 3,340 prompts. 3. For each remaining prompt, an answer was generated using Google's Gemini Pro language model. 4. The generated answers were combined with the corresponding prompts to create the augmented dataset. ## Dataset Structure Each example in the augmented dataset is a dictionary with the following keys: - `question`: The original prompt text from `10k_prompts_ranked`. - `answer`: The generated answer text from Gemini Pro for this prompt. ## Intended Use This augmented dataset can be used for tasks such as: - Training language models on prompt-answer pairs - Evaluating the quality of generated answers - Analyzing biases or limitations in Gemini Pro's outputs - Data augmentation for other language tasks ## Limitations - The generated answers come from a single language model (Gemini Pro) and may reflect biases of that model. - The quality of the generated answers has not been manually verified. - The prompts were filtered based only on the average quality score, other filtering criteria could be applied. ## Maintenance This is currently a static dataset with no plans for updates. However, the process of generating answers could be repeated with different language models or prompts from the original `10k_prompts_ranked` dataset.
提供机构:
Locutusque
原始信息汇总

数据集卡片 for dibt-instruct

Locutusque/dibt-instruct 是从 10k_prompts_ranked 数据集中派生的数据集,其中每个提示都使用 Google 的 Gemini Pro 语言模型生成了答案。在增强之前,从原始的 10,000 个提示中抽样了 5,000 个提示,并移除了质量评分小于或等于 3.5 的提示,最终得到了 3,340 个提示-答案对。

数据集详情

  • 语言: 英语
  • 许可: 继承自 10k_prompts_ranked 数据集 [更多信息需要]

数据集描述

这个增强的数据集包含 3,340 个示例,每个示例由原始 10k_prompts_ranked 数据集中的一个提示和使用 Google 的 Gemini Pro 语言模型生成的答案组成。

提示经过筛选,仅包括原始数据集中平均质量评分大于 3.5 的提示。

数据集创建

源数据

源数据是 10k_prompts_ranked 数据集,其中包含 10,331 个带有质量评分的提示,由 314 名社区成员评分。

数据增强

  1. 10k_prompts_ranked 数据集中随机抽样了 5,000 个提示。
  2. 移除了平均质量评分 <= 3.5 的提示,剩下 3,340 个提示。
  3. 对于每个剩余的提示,使用 Google 的 Gemini Pro 语言模型生成答案。
  4. 将生成的答案与相应的提示组合,创建了增强的数据集。

数据集结构

增强数据集中的每个示例都是一个字典,包含以下键:

  • question:来自 10k_prompts_ranked 的原始提示文本。
  • answer:针对该提示从 Gemini Pro 生成的答案文本。

预期用途

这个增强的数据集可用于以下任务:

  • 在提示-答案对上训练语言模型
  • 评估生成答案的质量
  • 分析 Gemini Pro 输出的偏差或限制
  • 其他语言任务的数据增强

局限性

  • 生成的答案来自单一语言模型(Gemini Pro),可能反映该模型的偏差。
  • 生成答案的质量尚未经过人工验证。
  • 提示仅根据平均质量评分进行筛选,其他筛选标准可能适用。

维护

目前这是一个静态数据集,没有更新计划。然而,生成答案的过程可以用不同的语言模型或来自原始 10k_prompts_ranked 数据集的提示重复进行。

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作