
jayelm/natural-instructions

Source: Hugging Face. Updated 2023-01-29; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jayelm/natural-instructions
Resource description:
---
annotations_creators:
- crowdsourced
- expert-generated
language:
- en
multilinguality:
- monolingual
size_categories:
- 100M<n<1B
task_categories:
- other
---
Preprocessed version of Super-Natural-Instructions from https://github.com/allenai/natural-instructions/tree/master/splits. The same inputs may appear with different outputs; to avoid duplicate inputs, you can deduplicate by the `id` or the `inputs` field.

This is modified from https://huggingface.co/datasets/Muennighoff/natural-instructions with a few improvements:

1. Adds positive/negative examples, outputs, and explanations for each task, to support different task definitions.
2. Adds an "eval" field, which is True for the first 100 examples of each test task (119 * 100 = 11900 examples). This field indicates whether an example is part of the abbreviated and balanced test split. See https://github.com/allenai/natural-instructions/blob/master/src/reorder_instances_for_testing.py.
3. Adds an "eval" field to the training dataset, which can be used as an in-domain evaluation set. To do so, we sample a balanced set of the first 15 examples of each train split (757 * 15 = 11355 examples) and mark the "eval" field as true.
Provided by:
jayelm
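Summary of original information

Dataset overview

Basic dataset information

  • Annotation creators: crowdsourced, expert-generated
  • Language: English
  • Multilinguality: monolingual
  • Dataset size: 100M<n<1B
  • Task category: other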

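Dataset modifications and improvements

  • Source: a preprocessed version of Super-Natural-Instructions, modified from Muennighoff/natural-instructions
  • Improvements:
    1. Positive/negative examples, outputs, and explanations are added for each task, to support different task definitions.
    2. An "eval" field is added that is true for the first 100 examples of each test task (119 * 100 = 11900 examples); it indicates whether an example belongs to the abbreviated, balanced test split.
    3. An "eval" field is added to the training dataset, which can be used as an in-domain evaluation set. It is set to true for a balanced sample made up of the first 15 examples of each train split (757 * 15 = 11355 examples).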
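The "first k examples per task" marking described above can be sketched in plain Python on toy records. This is only an illustration under stated assumptions: the grouping key `task_name`, the helper `mark_eval`, and the toy data are hypothetical, and the sketch does not reproduce the reordering that the original test split applies before taking the first 100 examples.

```python
from collections import defaultdict


def mark_eval(examples, k, task_key="task_name"):
    """Return copies of `examples` with eval=True for the first k examples of each task.

    `task_key` is an assumed field name used to group examples by task.
    """
    seen = defaultdict(int)  # number of examples already visited per task
    marked = []
    for ex in examples:
        task = ex[task_key]
        marked.append({**ex, "eval": seen[task] < k})
        seen[task] += 1
    return marked


# Toy data: 3 hypothetical tasks with 5 examples each.
toy = [
    {"task_name": f"task{t}", "id": f"task{t}-{i}"}
    for t in range(3)
    for i in range(5)
]

marked = mark_eval(toy, k=2)
eval_subset = [ex for ex in marked if ex["eval"]]
print(len(eval_subset))  # 3 tasks * 2 examples each

# The subset sizes quoted in the description follow the same arithmetic.
test_eval_size = 119 * 100   # 11900
train_eval_size = 757 * 15   # 11355
```

Because every task contributes the same number of examples, the resulting subset is balanced across tasks regardless of how large each task is.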

Dataset processing

  • Deduplication: to avoid duplicate inputs, deduplicate by the `id` or the `inputs` field.
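A minimal sketch of this deduplication on in-memory records follows. The `id` and `inputs` field names come from the description; the helper `deduplicate`, the `output` field, and the sample rows are hypothetical. Keeping the first occurrence of each key is a design choice of this sketch, not something the dataset card prescribes.

```python
def deduplicate(examples, key="inputs"):
    """Keep only the first example seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for ex in examples:
        value = ex[key]
        if value not in seen:
            seen.add(value)
            unique.append(ex)
    return unique


# Hypothetical rows: the same input appears twice with different outputs.
rows = [
    {"id": "a-1", "inputs": "Translate: hello", "output": "bonjour"},
    {"id": "a-1", "inputs": "Translate: hello", "output": "salut"},
    {"id": "b-7", "inputs": "Add 2 and 3", "output": "5"},
]

by_inputs = deduplicate(rows, key="inputs")
by_id = deduplicate(rows, key="id")
print(len(by_inputs), len(by_id))
```

Note that deduplicating discards the alternative outputs for a repeated input; if those references matter for evaluation, grouping outputs by input may be preferable to dropping rows.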