lucasmccabe-lmi/oig_small_chip2_noncode
收藏Hugging Face2023-05-15 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/lucasmccabe-lmi/oig_small_chip2_noncode
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: instruction
dtype: string
- name: output
dtype: string
- name: input
dtype: string
splits:
- name: train
num_bytes: 54969398.0
num_examples: 150453
download_size: 34598598
dataset_size: 54969398.0
---
# Dataset Card for "oig_small_chip2_noncode"
From LAION's Open Instruction Generalist (OIG) dataset, we provide a subset whose samples are not code-related. OIG text elements are formatted as dialogue exerpts between a "human" and "bot" agent. The code generation prompt is parsed from the initial "human" agent's statement and the resultant response from the "bot" agent's statement. We then reformat the text/response pairs according to the format of the original Alpaca dataset; that is, instruction/input/output triplets.
The OIG dataset was prepared by LAION, and released under the Apache 2.0 license.
Numbers:
Prompts: 150453
Tokens: 11522004 using the EleutherAI/gpt-neox-20b tokenizer (counting instruction+input+output)
提供机构:
lucasmccabe-lmi
原始信息汇总
数据集概述
数据集名称
"oig_small_chip2_noncode"
数据集来源
源自LAION的Open Instruction Generalist (OIG)数据集的一个子集,该子集不包含与代码相关的样本。
数据集内容
数据集包含对话片段,格式为“人类”和“机器人”代理之间的对话。数据集中的文本/响应对被重新格式化为指令/输入/输出三元组,与原始Alpaca数据集的格式一致。
数据集特征
- instruction: 数据类型为字符串
- output: 数据类型为字符串
- input: 数据类型为字符串
数据集分割
- train: 包含150453个样本,总字节数为54969398.0
数据集大小
- 下载大小: 34598598字节
- 数据集大小: 54969398.0字节
数据集许可证
Apache 2.0许可证
数据集统计
- 提示数量: 150453
- 令牌数量: 11522004(使用EleutherAI/gpt-neox-20b分词器计数指令+输入+输出)



