wisenut-nlp-team/natural-instructions
Source: Hugging Face. Updated: 2024-05-10. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/wisenut-nlp-team/natural-instructions
Description:
---
dataset_info:
- config_name: Answer Verification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 36921031
num_examples: 32417
download_size: 13174403
dataset_size: 36921031
- config_name: Answerability Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 67465958
num_examples: 51795
download_size: 26434426
dataset_size: 67465958
- config_name: Cause Effect Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 17411444
num_examples: 32038
download_size: 2820959
dataset_size: 17411444
- config_name: Code to Text
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 38548880
num_examples: 21328
download_size: 1326411
dataset_size: 38548880
- config_name: Coherence Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 21499865
num_examples: 30077
download_size: 4510956
dataset_size: 21499865
- config_name: Commonsense Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 110982684
num_examples: 130524
download_size: 4324986
dataset_size: 110982684
- config_name: Coreference Resolution
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 33082423
num_examples: 36990
download_size: 7467468
dataset_size: 33082423
- config_name: Data to Text
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 21602165
num_examples: 37695
download_size: 3668525
dataset_size: 21602165
- config_name: Dialogue Act Recognition
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 18637792
num_examples: 23085
download_size: 5199897
dataset_size: 18637792
- config_name: Dialogue Generation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 50010131
num_examples: 54672
download_size: 13086755
dataset_size: 50010131
- config_name: Dialogue State Tracking
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 6744188
num_examples: 6810
download_size: 1123066
dataset_size: 6744188
- config_name: Discourse Connective Identification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 449631
num_examples: 1000
download_size: 150761
dataset_size: 449631
- config_name: Discourse Relation Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 820634
num_examples: 1000
download_size: 127572
dataset_size: 820634
- config_name: Entity Generation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 1286682
num_examples: 3095
download_size: 75778
dataset_size: 1286682
- config_name: Entity Relation Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 2868956
num_examples: 5903
download_size: 134753
dataset_size: 2868956
- config_name: Ethics Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 31848628
num_examples: 25289
download_size: 14818576
dataset_size: 31848628
- config_name: Explanation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 26977111
num_examples: 21352
download_size: 11447730
dataset_size: 26977111
- config_name: Fact Verification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 5230254
num_examples: 6553
download_size: 1250181
dataset_size: 5230254
- config_name: Fill in The Blank
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 112129292
num_examples: 93210
download_size: 34635798
dataset_size: 112129292
- config_name: Gender Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 7708700
num_examples: 19119
download_size: 675851
dataset_size: 7708700
- config_name: Grammar Error Correction
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 17186402
num_examples: 7239
download_size: 4559526
dataset_size: 17186402
- config_name: Grammar Error Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 22339465
num_examples: 18015
download_size: 3319971
dataset_size: 22339465
- config_name: Information Extraction
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 64231233
num_examples: 91850
download_size: 14674014
dataset_size: 64231233
- config_name: Intent Identification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 18754320
num_examples: 16016
download_size: 6099376
dataset_size: 18754320
- config_name: Irony Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 1867373
num_examples: 2854
download_size: 224057
dataset_size: 1867373
- config_name: Keyword Tagging
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 6278639
num_examples: 11083
download_size: 1751069
dataset_size: 6278639
- config_name: Language Identification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 21757044
num_examples: 43237
download_size: 5556218
dataset_size: 21757044
- config_name: Linguistic Probing
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 20518412
num_examples: 47482
download_size: 2502783
dataset_size: 20518412
- config_name: Mathematics
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 12388128
num_examples: 30317
download_size: 996758
dataset_size: 12388128
- config_name: Misc.
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 32043502
num_examples: 66066
download_size: 5041224
dataset_size: 32043502
- config_name: Named Entity Recognition
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 29316975
num_examples: 40001
download_size: 9514084
dataset_size: 29316975
- config_name: Negotiation Strategy Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 7430813
num_examples: 7080
download_size: 1359096
dataset_size: 7430813
- config_name: Number Conversion
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 532830
num_examples: 998
download_size: 130491
dataset_size: 532830
- config_name: Overlap Extraction
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 4545110
num_examples: 7975
download_size: 924076
dataset_size: 4545110
- config_name: Paper Review
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 441145
num_examples: 157
download_size: 225924
dataset_size: 441145
- config_name: Paraphrasing
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 5812603
num_examples: 15439
download_size: 1392782
dataset_size: 5812603
- config_name: Poem Generation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 4654770
num_examples: 6442
download_size: 1128202
dataset_size: 4654770
- config_name: Pos Tagging
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 96998743
num_examples: 62118
download_size: 6094397
dataset_size: 96998743
- config_name: Preposition Prediction
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 392546
num_examples: 926
download_size: 19026
dataset_size: 392546
- config_name: Program Execution
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 196128955
num_examples: 433157
download_size: 29857095
dataset_size: 196128955
- config_name: Punctuation Error Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 42046
num_examples: 100
download_size: 12781
dataset_size: 42046
- config_name: Question Answering
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 632225857
num_examples: 470108
download_size: 250262921
dataset_size: 632225857
- config_name: Question Decomposition
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 24960993
num_examples: 9521
download_size: 1727434
dataset_size: 24960993
- config_name: Question Generation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 404077154
num_examples: 230103
download_size: 148512423
dataset_size: 404077154
- config_name: Question Rewriting
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 50260852
num_examples: 42596
download_size: 6059105
dataset_size: 50260852
- config_name: Question Understanding
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 61204653
num_examples: 63448
download_size: 7814957
dataset_size: 61204653
- config_name: Section Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 13181798
num_examples: 11975
download_size: 1160774
dataset_size: 13181798
- config_name: Sentence Composition
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 59064347
num_examples: 72496
download_size: 13871750
dataset_size: 59064347
- config_name: Sentence Compression
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 2250408
num_examples: 4934
download_size: 1155675
dataset_size: 2250408
- config_name: Sentence Expansion
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 1035515
num_examples: 1761
download_size: 391975
dataset_size: 1035515
- config_name: Sentence Ordering
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 14605282
num_examples: 20184
download_size: 3063937
dataset_size: 14605282
- config_name: Sentence Perturbation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 47077880
num_examples: 80789
download_size: 7584636
dataset_size: 47077880
- config_name: Sentiment Analysis
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 223188120
num_examples: 253432
download_size: 87290295
dataset_size: 223188120
- config_name: Spam Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 581151
num_examples: 1065
download_size: 83786
dataset_size: 581151
- config_name: Speaker Identification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 13932578
num_examples: 19800
download_size: 3558362
dataset_size: 13932578
- config_name: Speaker Relation Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 206205
num_examples: 153
download_size: 89490
dataset_size: 206205
- config_name: Spelling Error Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 3931890
num_examples: 6499
download_size: 298131
dataset_size: 3931890
- config_name: Stance Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 4187984
num_examples: 6593
download_size: 824202
dataset_size: 4187984
- config_name: Stereotype Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 10079370
num_examples: 17351
download_size: 956839
dataset_size: 10079370
- config_name: Story Composition
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 59456550
num_examples: 45866
download_size: 17961987
dataset_size: 59456550
- config_name: Style Transfer
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 313982
num_examples: 985
download_size: 52693
dataset_size: 313982
- config_name: Summarization
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 207756651
num_examples: 59200
download_size: 107041304
dataset_size: 207756651
- config_name: Text Categorization
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 123784961
num_examples: 154556
download_size: 30374117
dataset_size: 123784961
- config_name: Text Completion
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 69736698
num_examples: 86145
download_size: 20698942
dataset_size: 69736698
- config_name: Text Matching
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 123645584
num_examples: 173171
download_size: 35143082
dataset_size: 123645584
- config_name: Text Quality Evaluation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 13493797
num_examples: 23712
download_size: 2909390
dataset_size: 13493797
- config_name: Text Simplification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 8114719
num_examples: 12619
download_size: 1862246
dataset_size: 8114719
- config_name: Text to Code
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 109095054
num_examples: 49441
download_size: 3322639
dataset_size: 109095054
- config_name: Textual Entailment
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 66465227
num_examples: 92651
download_size: 11079779
dataset_size: 66465227
- config_name: Title Generation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 114510156
num_examples: 80696
download_size: 56428683
dataset_size: 114510156
- config_name: Toxic Language Detection
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 66096426
num_examples: 115584
download_size: 14014796
dataset_size: 66096426
- config_name: Translation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 606382471
num_examples: 1182213
download_size: 209410558
dataset_size: 606382471
- config_name: Word Analogy
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 2738270
num_examples: 6271
download_size: 82373
dataset_size: 2738270
- config_name: Word Relation Classification
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 5756454
num_examples: 8872
download_size: 161724
dataset_size: 5756454
- config_name: Word Semantics
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 5795348
num_examples: 19294
download_size: 745023
dataset_size: 5795348
- config_name: Wrong Candidate Generation
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: domain
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 81531096
num_examples: 73546
download_size: 24808989
dataset_size: 81531096
configs:
- config_name: Answer Verification
data_files:
- split: train
path: Answer Verification/train-*
- config_name: Answerability Classification
data_files:
- split: train
path: Answerability Classification/train-*
- config_name: Cause Effect Classification
data_files:
- split: train
path: Cause Effect Classification/train-*
- config_name: Code to Text
data_files:
- split: train
path: Code to Text/train-*
- config_name: Coherence Classification
data_files:
- split: train
path: Coherence Classification/train-*
- config_name: Commonsense Classification
data_files:
- split: train
path: Commonsense Classification/train-*
- config_name: Coreference Resolution
data_files:
- split: train
path: Coreference Resolution/train-*
- config_name: Data to Text
data_files:
- split: train
path: Data to Text/train-*
- config_name: Dialogue Act Recognition
data_files:
- split: train
path: Dialogue Act Recognition/train-*
- config_name: Dialogue Generation
data_files:
- split: train
path: Dialogue Generation/train-*
- config_name: Dialogue State Tracking
data_files:
- split: train
path: Dialogue State Tracking/train-*
- config_name: Discourse Connective Identification
data_files:
- split: train
path: Discourse Connective Identification/train-*
- config_name: Discourse Relation Classification
data_files:
- split: train
path: Discourse Relation Classification/train-*
- config_name: Entity Generation
data_files:
- split: train
path: Entity Generation/train-*
- config_name: Entity Relation Classification
data_files:
- split: train
path: Entity Relation Classification/train-*
- config_name: Ethics Classification
data_files:
- split: train
path: Ethics Classification/train-*
- config_name: Explanation
data_files:
- split: train
path: Explanation/train-*
- config_name: Fact Verification
data_files:
- split: train
path: Fact Verification/train-*
- config_name: Fill in The Blank
data_files:
- split: train
path: Fill in The Blank/train-*
- config_name: Gender Classification
data_files:
- split: train
path: Gender Classification/train-*
- config_name: Grammar Error Correction
data_files:
- split: train
path: Grammar Error Correction/train-*
- config_name: Grammar Error Detection
data_files:
- split: train
path: Grammar Error Detection/train-*
- config_name: Information Extraction
data_files:
- split: train
path: Information Extraction/train-*
- config_name: Intent Identification
data_files:
- split: train
path: Intent Identification/train-*
- config_name: Irony Detection
data_files:
- split: train
path: Irony Detection/train-*
- config_name: Keyword Tagging
data_files:
- split: train
path: Keyword Tagging/train-*
- config_name: Language Identification
data_files:
- split: train
path: Language Identification/train-*
- config_name: Linguistic Probing
data_files:
- split: train
path: Linguistic Probing/train-*
- config_name: Mathematics
data_files:
- split: train
path: Mathematics/train-*
- config_name: Misc.
data_files:
- split: train
path: Misc./train-*
- config_name: Named Entity Recognition
data_files:
- split: train
path: Named Entity Recognition/train-*
- config_name: Negotiation Strategy Detection
data_files:
- split: train
path: Negotiation Strategy Detection/train-*
- config_name: Number Conversion
data_files:
- split: train
path: Number Conversion/train-*
- config_name: Overlap Extraction
data_files:
- split: train
path: Overlap Extraction/train-*
- config_name: Paper Review
data_files:
- split: train
path: Paper Review/train-*
- config_name: Paraphrasing
data_files:
- split: train
path: Paraphrasing/train-*
- config_name: Poem Generation
data_files:
- split: train
path: Poem Generation/train-*
- config_name: Pos Tagging
data_files:
- split: train
path: Pos Tagging/train-*
- config_name: Preposition Prediction
data_files:
- split: train
path: Preposition Prediction/train-*
- config_name: Program Execution
data_files:
- split: train
path: Program Execution/train-*
- config_name: Punctuation Error Detection
data_files:
- split: train
path: Punctuation Error Detection/train-*
- config_name: Question Answering
data_files:
- split: train
path: Question Answering/train-*
- config_name: Question Decomposition
data_files:
- split: train
path: Question Decomposition/train-*
- config_name: Question Generation
data_files:
- split: train
path: Question Generation/train-*
- config_name: Question Rewriting
data_files:
- split: train
path: Question Rewriting/train-*
- config_name: Question Understanding
data_files:
- split: train
path: Question Understanding/train-*
- config_name: Section Classification
data_files:
- split: train
path: Section Classification/train-*
- config_name: Sentence Composition
data_files:
- split: train
path: Sentence Composition/train-*
- config_name: Sentence Compression
data_files:
- split: train
path: Sentence Compression/train-*
- config_name: Sentence Expansion
data_files:
- split: train
path: Sentence Expansion/train-*
- config_name: Sentence Ordering
data_files:
- split: train
path: Sentence Ordering/train-*
- config_name: Sentence Perturbation
data_files:
- split: train
path: Sentence Perturbation/train-*
- config_name: Sentiment Analysis
data_files:
- split: train
path: Sentiment Analysis/train-*
- config_name: Spam Classification
data_files:
- split: train
path: Spam Classification/train-*
- config_name: Speaker Identification
data_files:
- split: train
path: Speaker Identification/train-*
- config_name: Speaker Relation Classification
data_files:
- split: train
path: Speaker Relation Classification/train-*
- config_name: Spelling Error Detection
data_files:
- split: train
path: Spelling Error Detection/train-*
- config_name: Stance Detection
data_files:
- split: train
path: Stance Detection/train-*
- config_name: Stereotype Detection
data_files:
- split: train
path: Stereotype Detection/train-*
- config_name: Story Composition
data_files:
- split: train
path: Story Composition/train-*
- config_name: Style Transfer
data_files:
- split: train
path: Style Transfer/train-*
- config_name: Summarization
data_files:
- split: train
path: Summarization/train-*
- config_name: Text Categorization
data_files:
- split: train
path: Text Categorization/train-*
- config_name: Text Completion
data_files:
- split: train
path: Text Completion/train-*
- config_name: Text Matching
data_files:
- split: train
path: Text Matching/train-*
- config_name: Text Quality Evaluation
data_files:
- split: train
path: Text Quality Evaluation/train-*
- config_name: Text Simplification
data_files:
- split: train
path: Text Simplification/train-*
- config_name: Text to Code
data_files:
- split: train
path: Text to Code/train-*
- config_name: Textual Entailment
data_files:
- split: train
path: Textual Entailment/train-*
- config_name: Title Generation
data_files:
- split: train
path: Title Generation/train-*
- config_name: Toxic Language Detection
data_files:
- split: train
path: Toxic Language Detection/train-*
- config_name: Translation
data_files:
- split: train
path: Translation/train-*
- config_name: Word Analogy
data_files:
- split: train
path: Word Analogy/train-*
- config_name: Word Relation Classification
data_files:
- split: train
path: Word Relation Classification/train-*
- config_name: Word Semantics
data_files:
- split: train
path: Word Semantics/train-*
- config_name: Wrong Candidate Generation
data_files:
- split: train
path: Wrong Candidate Generation/train-*
---
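The metadata above lists one config per task, each with a single `train` split and five string fields (`instruction`, `input`, `output`, `domain`, `lang`). The sketch below shows one way such a config might be loaded with the Hugging Face `datasets` library and how a record could be turned into a prompt. The repository ID is inferred from the download link, and `load_config`, `build_prompt`, the prompt template, and `compression_ratio` are hypothetical helpers introduced for illustration; they are not part of the dataset.

```python
from typing import Dict

# Assumption: repository ID taken from the download link on this page.
REPO_ID = "wisenut-nlp-team/natural-instructions"


def load_config(config_name: str, split: str = "train"):
    """Load one task config, e.g. "Answer Verification".

    Requires the `datasets` package and network access. The import is
    done lazily so the pure helpers below work without it.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)


def build_prompt(record: Dict[str, str]) -> str:
    """Join `instruction` and `input` into one prompt string.

    The template (blank line between the two parts) is an assumption,
    not something the dataset card prescribes.
    """
    instruction = record["instruction"].strip()
    task_input = record["input"].strip()
    if not task_input:
        return instruction
    return f"{instruction}\n\n{task_input}"


def compression_ratio(download_size: int, dataset_size: int) -> float:
    """Ratio of compressed download size to in-memory dataset size."""
    return download_size / dataset_size
```

For example, `compression_ratio(13174403, 36921031)` uses the `download_size` and `dataset_size` values listed for the Answer Verification config. Calling `load_config("Answer Verification")` would download that config's train split.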
Provider:
wisenut-nlp-team
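Summary of original information
Dataset overview
This dataset comprises multiple subsets, each focused on a different natural language processing task. Details for each subset follow: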
1. Answer Verification
- Features: instruction, input, output, domain, lang
- Train split: 32417 examples, 36921031 bytes in total
- Download size: 13174403 bytes
2. Answerability Classification
- Features: instruction, input, output, domain, lang
- Train split: 51795 examples, 67465958 bytes in total
- Download size: 26434426 bytes
3. Cause Effect Classification
- Features: instruction, input, output, domain, lang
- Train split: 32038 examples, 17411444 bytes in total
- Download size: 2820959 bytes
4. Code to Text
- Features: instruction, input, output, domain, lang
- Train split: 21328 examples, 38548880 bytes in total
- Download size: 1326411 bytes
5. Coherence Classification
- Features: instruction, input, output, domain, lang
- Train split: 30077 examples, 21499865 bytes in total
- Download size: 4510956 bytes
6. Commonsense Classification
- Features: instruction, input, output, domain, lang
- Train split: 130524 examples, 110982684 bytes in total
- Download size: 4324986 bytes
7. Coreference Resolution
- Features: instruction, input, output, domain, lang
- Train split: 36990 examples, 33082423 bytes in total
- Download size: 7467468 bytes
8. Data to Text
- Features: instruction, input, output, domain, lang
- Train split: 37695 examples, 21602165 bytes in total
- Download size: 3668525 bytes
9. Dialogue Act Recognition
- Features: instruction, input, output, domain, lang
- Train split: 23085 examples, 18637792 bytes in total
- Download size: 5199897 bytes
10. Dialogue Generation
- Features: instruction, input, output, domain, lang
- Train split: 54672 examples, 50010131 bytes in total
- Download size: 13086755 bytes
11. Dialogue State Tracking
- Features: instruction, input, output, domain, lang
- Train split: 6810 examples, 6744188 bytes in total
- Download size: 1123066 bytes
12. Discourse Connective Identification
- Features: instruction, input, output, domain, lang
- Train split: 1000 examples, 449631 bytes in total
- Download size: 150761 bytes
13. Discourse Relation Classification
- Features: instruction, input, output, domain, lang
- Train split: 1000 examples, 820634 bytes in total
- Download size: 127572 bytes
14. Entity Generation
- 特征: instruction, input, output, domain, lang
- 训练集: 3095个样本,总大小1286682字节
- 下载大小: 75778字节
15. Entity Relation Classification
- 特征: instruction, input, output, domain, lang
- 训练集: 5903个样本,总大小2868956字节
- 下载大小: 134753字节
16. Ethics Classification
- 特征: instruction, input, output, domain, lang
- 训练集: 25289个样本,总大小31848628字节
- 下载大小: 14818576字节
17. Explanation
- 特征: instruction, input, output, domain, lang
- 训练集: 21352个样本,总大小26977111字节
- 下载大小: 11447730字节
18. Fact Verification
- 特征: instruction, input, output, domain, lang
- 训练集: 6553个样本,总大小5230254字节
- 下载大小: 1250181字节
19. Fill in The Blank
- 特征: instruction, input, output, domain, lang
- 训练集: 93210个样本,总大小112129292字节
- 下载大小: 34635798字节
20. Gender Classification
- 特征: instruction, input, output, domain, lang
- 训练集: 19119个样本,总大小7708700字节
- 下载大小: 675851字节
21. Grammar Error Correction
- Features: instruction, input, output, domain, lang
- Train split: 7239 examples, 17186402 bytes in total
- Download size: 4559526 bytes
22. Grammar Error Detection
- Features: instruction, input, output, domain, lang
- Train split: 18015 examples, 22339465 bytes in total
- Download size: 3319971 bytes
23. Information Extraction
- Features: instruction, input, output, domain, lang
- Train split: 91850 examples, 64231233 bytes in total
- Download size: 14674014 bytes
24. Intent Identification
- Features: instruction, input, output, domain, lang
- Train split: 16016 examples, 18754320 bytes in total
- Download size: 6099376 bytes
25. Irony Detection
- Features: instruction, input, output, domain, lang
- Train split: 2854 examples, 1867373 bytes in total
- Download size: 224057 bytes
26. Keyword Tagging
- Features: instruction, input, output, domain, lang
- Train split: 11083 examples, 6278639 bytes in total
- Download size: 1751069 bytes
27. Language Identification
- Features: instruction, input, output, domain, lang
- Train split: 43237 examples, 21757044 bytes in total
- Download size: 5556218 bytes
28. Linguistic Probing
- Features: instruction, input, output, domain, lang
- Train split: 47482 examples, 20518412 bytes in total
- Download size: 2502783 bytes
29. Mathematics
- Features: instruction, input, output, domain, lang
- Train split: 30317 examples, 12388128 bytes in total
- Download size: 996758 bytes
30. Misc.
- Features: instruction, input, output, domain, lang
- Train split: 66066 examples, 32043502 bytes in total
- Download size: 5041224 bytes
31. Named Entity Recognition
- Features: instruction, input, output, domain, lang
- Train split: 40001 examples, 29316975 bytes in total
- Download size: 9514084 bytes
32. Negotiation Strategy Detection
- Features: instruction, input, output, domain, lang
- Train split: 7080 examples, 7430813 bytes in total
- Download size: 1359096 bytes
33. Number Conversion
- Features: instruction, input, output, domain, lang
- Train split: 998 examples, 532830 bytes in total
- Download size: 130491 bytes
34. Overlap Extraction
- Features: instruction, input, output, domain, lang
- Train split: 7975 examples, 4545110 bytes in total
- Download size: 924076 bytes
35. Paper Review
- Features: instruction, input, output, domain, lang
- Train split: 157 examples, 441145 bytes in total
- Download size: 225924 bytes
36. Paraphrasing
- Features: instruction, input, output, domain, lang
- Train split: 15439 examples, 5812603 bytes in total
- Download size: 1392782 bytes
37. Poem Generation
- Features: instruction, input, output, domain, lang
- Train split: 6442 examples, 4654770 bytes in total
- Download size: 1128202 bytes
38. Pos Tagging
- Features: instruction, input, output, domain, lang
- Train split: 62118 examples, 96998743 bytes in total
- Download size: 6094397 bytes
39. Preposition Prediction
- Features: instruction, input, output, domain, lang
- Train split: 926 examples, 392546 bytes in total
- Download size: 19026 bytes
40. Program Execution
- Features: instruction, input, output, domain, lang
- Train split: 433157 examples, 196128955 bytes in total
- Download size: 29857095 bytes
41. Punctuation Error Detection
- Features: instruction, input, output, domain, lang
- Train split: 100 examples, 42046 bytes in total
- Download size: 12781 bytes
42. Question Answering
- Features: instruction, input, output, domain, lang
- Train split: 470108 examples, 632225857 bytes in total
- Download size: 250262921 bytes
43. Question Decomposition
- Features: instruction, input, output, domain, lang
- Train split: 9521 examples, 24960993 bytes in total
- Download size: 1727434 bytes
44. Question Generation
- Features: instruction, input, output, domain, lang
- Train split: 230103 examples, 404077154 bytes in total
- Download size: 148512423 bytes
45. Question Rewriting
- Features: instruction, input, output, domain, lang
- Train split: 42596 examples, 50260852 bytes in total
- Download size: 6059105 bytes
46. Question Understanding
- Features: instruction, input, output, domain, lang
- Train split: 63448 examples, 61204653 bytes in total
- Download size: 7814957 bytes
47. Section Classification
- Features: instruction, input, output, domain, lang
- Train split: 11975 examples, 13181798 bytes in total
- Download size: 1160774 bytes
48. Sentence Composition
- Features: instruction, input, output, domain, lang
- Train split: 72496 examples, 59064347 bytes in total
- Download size: 13871750 bytes
49. Sentence Compression
- Features: instruction, input, output, domain, lang
- Train split: 4934 examples, 2250408 bytes in total
- Download size: 1155675 bytes
50. Sentence Expansion
- Features: instruction, input, output, domain, lang
- Train split: 1761 examples, 1035515 bytes in total
- Download size: 391975 bytes
51. Sentence Ordering
- Features: instruction, input, output, domain, lang
- Train split: 20184 examples, 14605282 bytes in total
- Download size: 3063937 bytes
52. Sentence Perturbation
- Features: instruction, input, output, domain, lang
- Train split: 80789 examples, 47077880 bytes in total
- Download size: 7584636 bytes
53. Sentiment Analysis
- Features: instruction, input, output, domain, lang
- Train split: 253432 examples, 223188120 bytes in total
- Download size: 87290295 bytes
54. Spam Classification
- Features: instruction, input, output, domain, lang
- Train split: 1065 examples, 581151 bytes in total
- Download size: 83786 bytes
55. Speaker Identification
- Features: instruction, input, output, domain, lang
- Train split: 19800 examples, 13932578 bytes in total
- Download size: 3558362 bytes
56. Speaker Relation Classification
- Features: instruction, input, output, domain, lang
- Train split: 153 examples, 206205 bytes in total
- Download size: 89490 bytes
57. Spelling Error Detection
- Features: instruction, input, output, domain, lang
- Train split: 6499 examples, 3931890 bytes in total
- Download size: 298131 bytes
58. Stance Detection
- Features: instruction, input, output, domain, lang
- Train split: 2854 examples, 1867373 bytes in total
- Download size: 224057 bytes
Every subset lists its features and its size, so users can pick the subsets that fit their research or application.
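The size figures above can be used to shortlist subsets before downloading anything. The following is a minimal sketch, not part of the dataset's tooling: the four rows are copied from the list above, while the helper names `summarize` and `pick_configs` and the thresholds are illustrative assumptions.

```python
# (config name, num_examples, dataset_size in bytes, download_size in bytes),
# copied from the list above; only four rows are shown for brevity.
CONFIGS = [
    ("Answer Verification", 32417, 36921031, 13174403),
    ("Code to Text", 21328, 38548880, 1326411),
    ("Punctuation Error Detection", 100, 42046, 12781),
    ("Question Answering", 470108, 632225857, 250262921),
]


def summarize(name, num_examples, dataset_bytes, download_bytes):
    """Derive two simple statistics from the listed sizes (hypothetical helper)."""
    return {
        "config": name,
        # Average in-memory size of one example.
        "bytes_per_example": dataset_bytes / num_examples,
        # Download size as a fraction of the dataset size.
        "download_ratio": download_bytes / dataset_bytes,
    }


def pick_configs(configs, min_examples=0, max_download_bytes=float("inf")):
    """Return names of configs that are large enough and cheap enough to download."""
    return [
        name
        for name, num_examples, _, download_bytes in configs
        if num_examples >= min_examples and download_bytes <= max_download_bytes
    ]
```

For instance, `pick_configs(CONFIGS, min_examples=20000, max_download_bytes=20_000_000)` keeps only the subsets with at least 20000 examples whose download is under 20 MB.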
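Collected summary
Dataset introduction

Construction
In natural language processing, an instruction dataset that can guide a model across diverse tasks is essential. The Natural Instructions dataset brings together more than 60 NLP task categories under a unified instruction-input-output triple structure. Each task config contains an explicit instruction, an input text, and the expected output, and is annotated with domain and language information (the `domain` and `lang` fields). The data were built by systematically collecting existing task datasets and converting them to a standard format, with the aim of giving models a general, cross-task instruction-following ability. Construction emphasizes breadth of task coverage and consistency of data format, which makes the dataset a solid basis for instruction-tuning research.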
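Usage
For research on instruction following, the dataset can serve as a core resource for multi-task instruction tuning. Researchers load a specific task config (for example, Answer Verification or Question Answering) to obtain its training data; each record contains an instruction, an input, and the expected output. A model can be fine-tuned on a single task, or several tasks can be mixed to train a general instruction-following model. Because all configs share one format, the data plug directly into existing training pipelines, and the instruction-input-output mapping teaches the model to carry out a task described in natural language. This usage also supports studying task generalization and zero-shot performance.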
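A minimal sketch of this workflow in Python, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The repository id and config names come from the card; the helper names (`load_task`, `mix_tasks`, `format_example`) and the prompt layout are illustrative assumptions rather than anything the dataset prescribes.

```python
REPO_ID = "wisenut-nlp-team/natural-instructions"


def load_task(config_name, split="train"):
    """Load one task config, e.g. "Answer Verification" (needs network access)."""
    # Imported lazily so format_example below can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)


def mix_tasks(config_names, seed=0):
    """Concatenate several task configs and shuffle them into one training set."""
    from datasets import concatenate_datasets

    parts = [load_task(name) for name in config_names]
    # All configs share the same five string features, so they concatenate directly.
    return concatenate_datasets(parts).shuffle(seed=seed)


def format_example(record):
    """Turn one record into a prompt/completion pair (assumed, simple template)."""
    prompt = f"{record['instruction'].strip()}\n\n{record['input'].strip()}\n\n"
    return {"prompt": prompt, "completion": record["output"].strip()}


# Example (not executed here because it downloads data):
#   train = mix_tasks(["Answer Verification", "Question Answering"])
#   pairs = train.map(format_example)
```

Keeping the template in one small function makes it easy to swap in whatever prompt format a given training pipeline expects.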
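Background and challenges
Background
Development of general instruction-following models has long been held back by the fragmentation and heterogeneity of task-specific datasets. To address this, the WiseNut NLP team released the Natural Instructions dataset in 2022. It gathers diverse instruction-output pairs covering more than 60 kinds of natural language tasks, including question answering, text classification, generation, and reasoning. Its core research goal is a unified, large-scale benchmark that improves model generalization and task adaptation in zero-shot and few-shot settings. The resource has advanced research on instruction tuning and meta-learning, and it supplies key data for studying how models understand and execute natural language instructions.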
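Current challenges
The domain challenge this dataset targets is getting a single model to generalize, through natural language instructions, to a large number of unseen and complex tasks, which demands deep semantic understanding and reasoning. Construction raises challenges of its own. First, task descriptions and instances must be extracted from a large pool of heterogeneous existing datasets and standardized so that instructions are clear and consistent. Second, the broad coverage across tasks, domains, and languages makes quality control and uniform annotation guidelines very hard to maintain. Finally, the tasks must stay balanced and representative so that data bias does not distort model evaluation.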
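The imbalance is visible in the subset sizes listed earlier: Question Answering has 470108 training examples while Punctuation Error Detection has 100. One common way to limit the influence of the largest tasks when mixing is a per-task cap on the number of examples. The sketch below illustrates that idea only; it is not the method used by the dataset builders, and the cap value and helper names are assumptions.

```python
# Training-set sizes copied from the subset list in this document.
TASK_COUNTS = {
    "Question Answering": 470108,
    "Program Execution": 433157,
    "Paper Review": 157,
    "Punctuation Error Detection": 100,
}


def capped_sizes(counts, cap):
    """Number of examples to draw per task: everything, up to `cap`."""
    return {task: min(n, cap) for task, n in counts.items()}


def mixing_shares(sizes):
    """Fraction of the mixed training set contributed by each task."""
    total = sum(sizes.values())
    return {task: n / total for task, n in sizes.items()}


raw_shares = mixing_shares(TASK_COUNTS)
capped_shares = mixing_shares(capped_sizes(TASK_COUNTS, cap=10000))
```

With the cap, small tasks keep all of their examples and their share of the mix grows, while the two largest tasks no longer account for almost all of the data.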
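Common scenarios
Typical use
Instruction tuning has become a key paradigm for improving model generalization in natural language processing. With more than sixty task types in a unified instruction-input-output format, the Natural Instructions dataset is typically used to train and evaluate a model's ability to follow natural language instructions across diverse tasks. By combining question answering, text generation, classification, and other tasks, it forms a multi-task benchmark in which a model handles cross-domain language understanding and generation within a single framework, supporting systematic evaluation and development of instruction-following models.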
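Academic problems addressed
Traditional NLP models are limited in task generalization and zero-shot learning. With its large, multi-task collection of instructions, the Natural Instructions dataset addresses the research problem of how models adapt to new tasks. It gives instruction-tuning research a standardized evaluation benchmark and supports analysis of generalization to unseen tasks, which deepens understanding of how models transfer what they have learned. Its influence is seen in instruction-tuning work in the line of T0 and FLAN, and it provides a data foundation for building general-purpose language systems.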
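Practical applications
In practice, the Natural Instructions dataset supports the development of systems such as intelligent assistants and automated text processing. Models trained on it can understand varied instructions that users give in natural language, for example sentiment analysis, information extraction, or dialogue generation, and can therefore be embedded in customer-service systems, content-moderation tools, and similar applications. This lowers the barrier to deployment: non-expert users can complete complex tasks through natural interaction, which makes human-machine collaboration more efficient and more natural.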
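Recent research on the dataset
Latest research directions
Instruction tuning has become a key route to better model generalization in natural language processing. Because it spans question answering, text generation, classification, and other tasks, the Natural Instructions dataset is a rich resource for studying instruction following and zero-shot learning. Current work focuses on using large-scale instruction data of this kind to improve the cross-task adaptability of large language models, especially their ability to reason about unseen tasks in few-shot settings. This research is moving models from single-task specialists toward general task executors, a trend with clear implications for building more capable and flexible dialogue systems and automation tools.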
The content above was collected and summarized by 遇见数据集.



