starfishmedical/webGPT_x_dolly

Name: starfishmedical/webGPT_x_dolly
Creator: starfishmedical
Published: 2023-05-30 19:47:30
License: 暂无描述

Hugging Face2023-05-30 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/starfishmedical/webGPT_x_dolly

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-sa-3.0 task_categories: - question-answering size_categories: - 10K<n<100K --- This dataset contains a selection of Q&A-related tasks gathered and cleaned from the webGPT_comparisons set and the databricks-dolly-15k set. Unicode escapes were explicitly removed, and wikipedia citations in the "output" were stripped through regex to hopefully help any end-product model ignore these artifacts within their input context. This data is formatted for use in the alpaca instruction format, however the instruction, input, and output columns are kept separate in the raw data to allow for other configurations. The data has been filtered so that every entry is less than our chosen truncation length of 1024 (LLaMA-style) tokens with the format: ``` """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: {instruction} ### Input: {inputt} ### Response: {output}""" ``` <h3>webGPT</h3> This set was filtered from the webGPT_comparisons data by taking any Q&A option that was positively or neutrally-rated by humans (e.g. "score" >= 0). This might not provide the ideal answer, but this dataset was assembled specifically for extractive Q&A with less regard for how humans feel about the results. This selection comprises 23826 of the total entries in the data. <h3>Dolly</h3> The dolly data was selected primarily to focus on closed-qa tasks. For this purpose, only entries in the "closed-qa", "information_extraction", "summarization", "classification", and "creative_writing" were used. While not all of these include a context, they were judged to help flesh out the training set. This selection comprises 5362 of the total entries in the data.

提供机构：

starfishmedical

原始信息汇总

数据集概述

数据来源

webGPT_comparisons: 筛选自webGPT_comparisons数据集，包含人类评分非负的Q&A选项，共计23826条。
databricks-dolly-15k: 主要选取了“closed-qa”, “information_extraction”, “summarization”, “classification”, 和 “creative_writing”类别的数据，共计5362条。

数据处理

Unicode转义已被移除。
Wikipedia引用已通过正则表达式去除。
数据格式遵循alpaca指令格式，但原始数据中的指令、输入和输出列是分开的，以适应其他配置。
每条数据已过滤，确保长度不超过1024个LLaMA风格的令牌。

数据格式

"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:

{instruction}

Input:

{inputt}

Response:

{output}"""

许可

本数据集遵循CC-BY-SA-3.0许可。

数据规模

数据集大小介于10K到100K之间。

5,000+

优质数据集

54 个

任务类型

进入经典数据集