wisenut-nlp-team/natural-instructions

Source: Hugging Face. Updated 2024-05-10; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/wisenut-nlp-team/natural-instructions
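As a minimal sketch of how one task subset might be fetched, the snippet below uses the `load_dataset` function from the Hugging Face `datasets` library with the repository id from this page and a config name such as "Answer Verification". The helper names `load_task` and `format_example` are hypothetical, and routing requests through the mirror by setting the `HF_ENDPOINT` environment variable is an assumption about the client setup, not something this page states.

```python
import os

# Repository id as shown on this page; config names are the task names listed here.
REPO_ID = "wisenut-nlp-team/natural-instructions"


def load_task(config_name, use_mirror=False):
    """Load the train split of one task config (needs network access and `datasets`)."""
    if use_mirror:
        # Assumption: the Hugging Face client reads HF_ENDPOINT, so it is set
        # before `datasets` is imported.
        os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
    from datasets import load_dataset  # imported lazily so format_example works offline
    return load_dataset(REPO_ID, config_name, split="train")


def format_example(record):
    """Hypothetical helper: join instruction and optional input into one prompt.

    The `output` field is left out because it is the target text.
    """
    parts = [record["instruction"].strip()]
    if record["input"].strip():
        parts.append(record["input"].strip())
    return "\n\n".join(parts)
```

With network access, `format_example(load_task("Answer Verification")[0])` would build a prompt from the first record of that subset.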
Resource description:
---
dataset_info:
- config_name: Answer Verification
  features: &features
  - {name: instruction, dtype: string}
  - {name: input, dtype: string}
  - {name: output, dtype: string}
  - {name: domain, dtype: string}
  - {name: lang, dtype: string}
  splits: [{name: train, num_bytes: 36921031, num_examples: 32417}]
  download_size: 13174403
  dataset_size: 36921031
- {config_name: Answerability Classification, features: *features, splits: [{name: train, num_bytes: 67465958, num_examples: 51795}], download_size: 26434426, dataset_size: 67465958}
- {config_name: Cause Effect Classification, features: *features, splits: [{name: train, num_bytes: 17411444, num_examples: 32038}], download_size: 2820959, dataset_size: 17411444}
- {config_name: Code to Text, features: *features, splits: [{name: train, num_bytes: 38548880, num_examples: 21328}], download_size: 1326411, dataset_size: 38548880}
- {config_name: Coherence Classification, features: *features, splits: [{name: train, num_bytes: 21499865, num_examples: 30077}], download_size: 4510956, dataset_size: 21499865}
- {config_name: Commonsense Classification, features: *features, splits: [{name: train, num_bytes: 110982684, num_examples: 130524}], download_size: 4324986, dataset_size: 110982684}
- {config_name: Coreference Resolution, features: *features, splits: [{name: train, num_bytes: 33082423, num_examples: 36990}], download_size: 7467468, dataset_size: 33082423}
- {config_name: Data to Text, features: *features, splits: [{name: train, num_bytes: 21602165, num_examples: 37695}], download_size: 3668525, dataset_size: 21602165}
- {config_name: Dialogue Act Recognition, features: *features, splits: [{name: train, num_bytes: 18637792, num_examples: 23085}], download_size: 5199897, dataset_size: 18637792}
- {config_name: Dialogue Generation, features: *features, splits: [{name: train, num_bytes: 50010131, num_examples: 54672}], download_size: 13086755, dataset_size: 50010131}
- {config_name: Dialogue State Tracking, features: *features, splits: [{name: train, num_bytes: 6744188, num_examples: 6810}], download_size: 1123066, dataset_size: 6744188}
- {config_name: Discourse Connective Identification, features: *features, splits: [{name: train, num_bytes: 449631, num_examples: 1000}], download_size: 150761, dataset_size: 449631}
- {config_name: Discourse Relation Classification, features: *features, splits: [{name: train, num_bytes: 820634, num_examples: 1000}], download_size: 127572, dataset_size: 820634}
- {config_name: Entity Generation, features: *features, splits: [{name: train, num_bytes: 1286682, num_examples: 3095}], download_size: 75778, dataset_size: 1286682}
- {config_name: Entity Relation Classification, features: *features, splits: [{name: train, num_bytes: 2868956, num_examples: 5903}], download_size: 134753, dataset_size: 2868956}
- {config_name: Ethics Classification, features: *features, splits: [{name: train, num_bytes: 31848628, num_examples: 25289}], download_size: 14818576, dataset_size: 31848628}
- {config_name: Explanation, features: *features, splits: [{name: train, num_bytes: 26977111, num_examples: 21352}], download_size: 11447730, dataset_size: 26977111}
- {config_name: Fact Verification, features: *features, splits: [{name: train, num_bytes: 5230254, num_examples: 6553}], download_size: 1250181, dataset_size: 5230254}
- {config_name: Fill in The Blank, features: *features, splits: [{name: train, num_bytes: 112129292, num_examples: 93210}], download_size: 34635798, dataset_size: 112129292}
- {config_name: Gender Classification, features: *features, splits: [{name: train, num_bytes: 7708700, num_examples: 19119}], download_size: 675851, dataset_size: 7708700}
- {config_name: Grammar Error Correction, features: *features, splits: [{name: train, num_bytes: 17186402, num_examples: 7239}], download_size: 4559526, dataset_size: 17186402}
- {config_name: Grammar Error Detection, features: *features, splits: [{name: train, num_bytes: 22339465, num_examples: 18015}], download_size: 3319971, dataset_size: 22339465}
- {config_name: Information Extraction, features: *features, splits: [{name: train, num_bytes: 64231233, num_examples: 91850}], download_size: 14674014, dataset_size: 64231233}
- {config_name: Intent Identification, features: *features, splits: [{name: train, num_bytes: 18754320, num_examples: 16016}], download_size: 6099376, dataset_size: 18754320}
- {config_name: Irony Detection, features: *features, splits: [{name: train, num_bytes: 1867373, num_examples: 2854}], download_size: 224057, dataset_size: 1867373}
- {config_name: Keyword Tagging, features: *features, splits: [{name: train, num_bytes: 6278639, num_examples: 11083}], download_size: 1751069, dataset_size: 6278639}
- {config_name: Language Identification, features: *features, splits: [{name: train, num_bytes: 21757044, num_examples: 43237}], download_size: 5556218, dataset_size: 21757044}
- {config_name: Linguistic Probing, features: *features, splits: [{name: train, num_bytes: 20518412, num_examples: 47482}], download_size: 2502783, dataset_size: 20518412}
- {config_name: Mathematics, features: *features, splits: [{name: train, num_bytes: 12388128, num_examples: 30317}], download_size: 996758, dataset_size: 12388128}
- {config_name: Misc., features: *features, splits: [{name: train, num_bytes: 32043502, num_examples: 66066}], download_size: 5041224, dataset_size: 32043502}
- {config_name: Named Entity Recognition, features: *features, splits: [{name: train, num_bytes: 29316975, num_examples: 40001}], download_size: 9514084, dataset_size: 29316975}
- {config_name: Negotiation Strategy Detection, features: *features, splits: [{name: train, num_bytes: 7430813, num_examples: 7080}], download_size: 1359096, dataset_size: 7430813}
- {config_name: Number Conversion, features: *features, splits: [{name: train, num_bytes: 532830, num_examples: 998}], download_size: 130491, dataset_size: 532830}
- {config_name: Overlap Extraction, features: *features, splits: [{name: train, num_bytes: 4545110, num_examples: 7975}], download_size: 924076, dataset_size: 4545110}
- {config_name: Paper Review, features: *features, splits: [{name: train, num_bytes: 441145, num_examples: 157}], download_size: 225924, dataset_size: 441145}
- {config_name: Paraphrasing, features: *features, splits: [{name: train, num_bytes: 5812603, num_examples: 15439}], download_size: 1392782, dataset_size: 5812603}
- {config_name: Poem Generation, features: *features, splits: [{name: train, num_bytes: 4654770, num_examples: 6442}], download_size: 1128202, dataset_size: 4654770}
- {config_name: Pos Tagging, features: *features, splits: [{name: train, num_bytes: 96998743, num_examples: 62118}], download_size: 6094397, dataset_size: 96998743}
- {config_name: Preposition Prediction, features: *features, splits: [{name: train, num_bytes: 392546, num_examples: 926}], download_size: 19026, dataset_size: 392546}
- {config_name: Program Execution, features: *features, splits: [{name: train, num_bytes: 196128955, num_examples: 433157}], download_size: 29857095, dataset_size: 196128955}
- {config_name: Punctuation Error Detection, features: *features, splits: [{name: train, num_bytes: 42046, num_examples: 100}], download_size: 12781, dataset_size: 42046}
- {config_name: Question Answering, features: *features, splits: [{name: train, num_bytes: 632225857, num_examples: 470108}], download_size: 250262921, dataset_size: 632225857}
- {config_name: Question Decomposition, features: *features, splits: [{name: train, num_bytes: 24960993, num_examples: 9521}], download_size: 1727434, dataset_size: 24960993}
- {config_name: Question Generation, features: *features, splits: [{name: train, num_bytes: 404077154, num_examples: 230103}], download_size: 148512423, dataset_size: 404077154}
- {config_name: Question Rewriting, features: *features, splits: [{name: train, num_bytes: 50260852, num_examples: 42596}], download_size: 6059105, dataset_size: 50260852}
- {config_name: Question Understanding, features: *features, splits: [{name: train, num_bytes: 61204653, num_examples: 63448}], download_size: 7814957, dataset_size: 61204653}
- {config_name: Section Classification, features: *features, splits: [{name: train, num_bytes: 13181798, num_examples: 11975}], download_size: 1160774, dataset_size: 13181798}
- {config_name: Sentence Composition, features: *features, splits: [{name: train, num_bytes: 59064347, num_examples: 72496}], download_size: 13871750, dataset_size: 59064347}
- {config_name: Sentence Compression, features: *features, splits: [{name: train, num_bytes: 2250408, num_examples: 4934}], download_size: 1155675, dataset_size: 2250408}
- {config_name: Sentence Expansion, features: *features, splits: [{name: train, num_bytes: 1035515, num_examples: 1761}], download_size: 391975, dataset_size: 1035515}
- {config_name: Sentence Ordering, features: *features, splits: [{name: train, num_bytes: 14605282, num_examples: 20184}], download_size: 3063937, dataset_size: 14605282}
- {config_name: Sentence Perturbation, features: *features, splits: [{name: train, num_bytes: 47077880, num_examples: 80789}], download_size: 7584636, dataset_size: 47077880}
- {config_name: Sentiment Analysis, features: *features, splits: [{name: train, num_bytes: 223188120, num_examples: 253432}], download_size: 87290295, dataset_size: 223188120}
- {config_name: Spam Classification, features: *features, splits: [{name: train, num_bytes: 581151, num_examples: 1065}], download_size: 83786, dataset_size: 581151}
- {config_name: Speaker Identification, features: *features, splits: [{name: train, num_bytes: 13932578, num_examples: 19800}], download_size: 3558362, dataset_size: 13932578}
- {config_name: Speaker Relation Classification, features: *features, splits: [{name: train, num_bytes: 206205, num_examples: 153}], download_size: 89490, dataset_size: 206205}
- {config_name: Spelling Error Detection, features: *features, splits: [{name: train, num_bytes: 3931890, num_examples: 6499}], download_size: 298131, dataset_size: 3931890}
- {config_name: Stance Detection, features: *features, splits: [{name: train, num_bytes: 4187984, num_examples: 6593}], download_size: 824202, dataset_size: 4187984}
- {config_name: Stereotype Detection, features: *features, splits: [{name: train, num_bytes: 10079370, num_examples: 17351}], download_size: 956839, dataset_size: 10079370}
- {config_name: Story Composition, features: *features, splits: [{name: train, num_bytes: 59456550, num_examples: 45866}], download_size: 17961987, dataset_size: 59456550}
- {config_name: Style Transfer, features: *features, splits: [{name: train, num_bytes: 313982, num_examples: 985}], download_size: 52693, dataset_size: 313982}
- {config_name: Summarization, features: *features, splits: [{name: train, num_bytes: 207756651, num_examples: 59200}], download_size: 107041304, dataset_size: 207756651}
- {config_name: Text Categorization, features: *features, splits: [{name: train, num_bytes: 123784961, num_examples: 154556}], download_size: 30374117, dataset_size: 123784961}
- {config_name: Text Completion, features: *features, splits: [{name: train, num_bytes: 69736698, num_examples: 86145}], download_size: 20698942, dataset_size: 69736698}
- {config_name: Text Matching, features: *features, splits: [{name: train, num_bytes: 123645584, num_examples: 173171}], download_size: 35143082, dataset_size: 123645584}
- {config_name: Text Quality Evaluation, features: *features, splits: [{name: train, num_bytes: 13493797, num_examples: 23712}], download_size: 2909390, dataset_size: 13493797}
- {config_name: Text Simplification, features: *features, splits: [{name: train, num_bytes: 8114719, num_examples: 12619}], download_size: 1862246, dataset_size: 8114719}
- {config_name: Text to Code, features: *features, splits: [{name: train, num_bytes: 109095054, num_examples: 49441}], download_size: 3322639, dataset_size: 109095054}
- {config_name: Textual Entailment, features: *features, splits: [{name: train, num_bytes: 66465227, num_examples: 92651}], download_size: 11079779, dataset_size: 66465227}
- {config_name: Title Generation, features: *features, splits: [{name: train, num_bytes: 114510156, num_examples: 80696}], download_size: 56428683, dataset_size: 114510156}
- {config_name: Toxic Language Detection, features: *features, splits: [{name: train, num_bytes: 66096426, num_examples: 115584}], download_size: 14014796, dataset_size: 66096426}
- {config_name: Translation, features: *features, splits: [{name: train, num_bytes: 606382471, num_examples: 1182213}], download_size: 209410558, dataset_size: 606382471}
- {config_name: Word Analogy, features: *features, splits: [{name: train, num_bytes: 2738270, num_examples: 6271}], download_size: 82373, dataset_size: 2738270}
- {config_name: Word Relation Classification, features: *features, splits: [{name: train, num_bytes: 5756454, num_examples: 8872}], download_size: 161724, dataset_size: 5756454}
- {config_name: Word Semantics, features: *features, splits: [{name: train, num_bytes: 5795348, num_examples: 19294}], download_size: 745023, dataset_size: 5795348}
- {config_name: Wrong Candidate Generation, features: *features, splits: [{name: train, num_bytes: 81531096, num_examples: 73546}], download_size: 24808989, dataset_size: 81531096}
configs:
- {config_name: Answer Verification, data_files: [{split: train, path: "Answer Verification/train-*"}]}
- {config_name: Answerability Classification, data_files: [{split: train, path: "Answerability Classification/train-*"}]}
- {config_name: Cause Effect Classification, data_files: [{split: train, path: "Cause Effect Classification/train-*"}]}
- {config_name: Code to Text, data_files: [{split: train, path: "Code to Text/train-*"}]}
- {config_name: Coherence Classification, data_files: [{split: train, path: "Coherence Classification/train-*"}]}
- {config_name: Commonsense Classification, data_files: [{split: train, path: "Commonsense Classification/train-*"}]}
- {config_name: Coreference Resolution, data_files: [{split: train, path: "Coreference Resolution/train-*"}]}
- {config_name: Data to Text, data_files: [{split: train, path: "Data to Text/train-*"}]}
- {config_name: Dialogue Act Recognition, data_files: [{split: train, path: "Dialogue Act Recognition/train-*"}]}
- {config_name: Dialogue Generation, data_files: [{split: train, path: "Dialogue Generation/train-*"}]}
- {config_name: Dialogue State Tracking, data_files: [{split: train, path: "Dialogue State Tracking/train-*"}]}
- {config_name: Discourse Connective Identification, data_files: [{split: train, path: "Discourse Connective Identification/train-*"}]}
- {config_name: Discourse Relation Classification, data_files: [{split: train, path: "Discourse Relation Classification/train-*"}]}
- {config_name: Entity Generation, data_files: [{split: train, path: "Entity Generation/train-*"}]}
- {config_name: Entity Relation Classification, data_files: [{split: train, path: "Entity Relation Classification/train-*"}]}
- {config_name: Ethics Classification, data_files: [{split: train, path: "Ethics Classification/train-*"}]}
- {config_name: Explanation, data_files: [{split: train, path: "Explanation/train-*"}]}
- {config_name: Fact Verification, data_files: [{split: train, path: "Fact Verification/train-*"}]}
- {config_name: Fill in The Blank, data_files: [{split: train, path: "Fill in The Blank/train-*"}]}
- {config_name: Gender Classification, data_files: [{split: train, path: "Gender Classification/train-*"}]}
- {config_name: Grammar Error Correction, data_files: [{split: train, path: "Grammar Error Correction/train-*"}]}
- {config_name: Grammar Error Detection, data_files: [{split: train, path: "Grammar Error Detection/train-*"}]}
- {config_name: Information Extraction, data_files: [{split: train, path: "Information Extraction/train-*"}]}
- {config_name: Intent Identification, data_files: [{split: train, path: "Intent Identification/train-*"}]}
- {config_name: Irony Detection, data_files: [{split: train, path: "Irony Detection/train-*"}]}
- {config_name: Keyword Tagging, data_files: [{split: train, path: "Keyword Tagging/train-*"}]}
- {config_name: Language Identification, data_files: [{split: train, path: "Language Identification/train-*"}]}
- {config_name: Linguistic Probing, data_files: [{split: train, path: "Linguistic Probing/train-*"}]}
- {config_name: Mathematics, data_files: [{split: train, path: "Mathematics/train-*"}]}
- {config_name: Misc., data_files: [{split: train, path: "Misc./train-*"}]}
- {config_name: Named Entity Recognition, data_files: [{split: train, path: "Named Entity Recognition/train-*"}]}
- {config_name: Negotiation Strategy Detection, data_files: [{split: train, path: "Negotiation Strategy Detection/train-*"}]}
- {config_name: Number Conversion, data_files: [{split: train, path: "Number Conversion/train-*"}]}
- {config_name: Overlap Extraction, data_files: [{split: train, path: "Overlap Extraction/train-*"}]}
- {config_name: Paper Review, data_files: [{split: train, path: "Paper Review/train-*"}]}
- {config_name: Paraphrasing, data_files: [{split: train, path: "Paraphrasing/train-*"}]}
- {config_name: Poem Generation, data_files: [{split: train, path: "Poem Generation/train-*"}]}
- {config_name: Pos Tagging, data_files: [{split: train, path: "Pos Tagging/train-*"}]}
- {config_name: Preposition Prediction, data_files: [{split: train, path: "Preposition Prediction/train-*"}]}
- {config_name: Program Execution, data_files: [{split: train, path: "Program Execution/train-*"}]}
- {config_name: Punctuation Error Detection, data_files: [{split: train, path: "Punctuation Error Detection/train-*"}]}
- {config_name: Question Answering, data_files: [{split: train, path: "Question Answering/train-*"}]}
- {config_name: Question Decomposition, data_files: [{split: train, path: "Question Decomposition/train-*"}]}
- {config_name: Question Generation, data_files: [{split: train, path: "Question Generation/train-*"}]}
- {config_name: Question Rewriting, data_files: [{split: train, path: "Question Rewriting/train-*"}]}
- {config_name: Question Understanding, data_files: [{split: train, path: "Question Understanding/train-*"}]}
- {config_name: Section Classification, data_files: [{split: train, path: "Section Classification/train-*"}]}
- {config_name: Sentence Composition, data_files: [{split: train, path: "Sentence Composition/train-*"}]}
- {config_name: Sentence Compression, data_files: [{split: train, path: "Sentence Compression/train-*"}]}
- {config_name: Sentence Expansion, data_files: [{split: train, path: "Sentence Expansion/train-*"}]}
- {config_name: Sentence Ordering, data_files: [{split: train, path: "Sentence Ordering/train-*"}]}
- {config_name: Sentence Perturbation, data_files: [{split: train, path: "Sentence Perturbation/train-*"}]}
- {config_name: Sentiment Analysis, data_files: [{split: train, path: "Sentiment Analysis/train-*"}]}
- {config_name: Spam Classification, data_files: [{split: train, path: "Spam Classification/train-*"}]}
- {config_name: Speaker Identification, data_files: [{split: train, path: "Speaker Identification/train-*"}]}
- {config_name: Speaker Relation Classification, data_files: [{split: train, path: "Speaker Relation Classification/train-*"}]}
- {config_name: Spelling Error Detection, data_files: [{split: train, path: "Spelling Error Detection/train-*"}]}
- {config_name: Stance Detection, data_files: [{split: train, path: "Stance Detection/train-*"}]}
- {config_name: Stereotype Detection, data_files: [{split: train, path: "Stereotype Detection/train-*"}]}
- {config_name: Story Composition, data_files: [{split: train, path: "Story Composition/train-*"}]}
- {config_name: Style Transfer, data_files: [{split: train, path: "Style Transfer/train-*"}]}
- {config_name: Summarization, data_files: [{split: train, path: "Summarization/train-*"}]}
- {config_name: Text Categorization, data_files: [{split: train, path: "Text Categorization/train-*"}]}
- {config_name: Text Completion, data_files: [{split: train, path: "Text Completion/train-*"}]}
- {config_name: Text Matching, data_files: [{split: train, path: "Text Matching/train-*"}]}
- {config_name: Text Quality Evaluation, data_files: [{split: train, path: "Text Quality Evaluation/train-*"}]}
- {config_name: Text Simplification, data_files: [{split: train, path: "Text Simplification/train-*"}]}
- {config_name: Text to Code, data_files: [{split: train, path: "Text to Code/train-*"}]}
- {config_name: Textual Entailment, data_files: [{split: train, path: "Textual Entailment/train-*"}]}
- {config_name: Title Generation, data_files: [{split: train, path: "Title Generation/train-*"}]}
- {config_name: Toxic Language Detection, data_files: [{split: train, path: "Toxic Language Detection/train-*"}]}
- {config_name: Translation, data_files: [{split: train, path: "Translation/train-*"}]}
- {config_name: Word Analogy, data_files: [{split: train, path: "Word Analogy/train-*"}]}
- {config_name: Word Relation Classification, data_files: [{split: train, path: "Word Relation Classification/train-*"}]}
- {config_name: Word Semantics, data_files: [{split: train, path: "Word Semantics/train-*"}]}
- {config_name: Wrong Candidate Generation, data_files: [{split: train, path: "Wrong Candidate Generation/train-*"}]}
---
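The per-config numbers in the metadata above can be turned into quick size estimates. The sketch below copies three entries verbatim and derives the average bytes per example and the ratio of `num_bytes` to `download_size`. The helper name `summarize` is hypothetical, and reading that ratio as an on-disk compression factor is an assumption, since the metadata does not say how the two sizes are measured.

```python
# Three entries copied from the metadata above (all values in bytes or examples).
CONFIGS = {
    "Answer Verification": {"num_bytes": 36921031, "num_examples": 32417, "download_size": 13174403},
    "Code to Text": {"num_bytes": 38548880, "num_examples": 21328, "download_size": 1326411},
    "Translation": {"num_bytes": 606382471, "num_examples": 1182213, "download_size": 209410558},
}


def summarize(stats):
    """Hypothetical helper: derive per-example size and a size ratio for one config."""
    return {
        # Average size of one record in the train split.
        "bytes_per_example": stats["num_bytes"] / stats["num_examples"],
        # Assumption: num_bytes / download_size approximates how well the files compress.
        "size_ratio": stats["num_bytes"] / stats["download_size"],
    }


if __name__ == "__main__":
    for name, stats in CONFIGS.items():
        summary = summarize(stats)
        print(f"{name}: {summary['bytes_per_example']:.0f} B/example, ratio {summary['size_ratio']:.1f}")
```

A large ratio, as for "Code to Text", suggests highly repetitive text, for example a long instruction repeated in every record, although the metadata alone cannot confirm that.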
Provider:
wisenut-nlp-team
Summary of original information

Dataset overview

This dataset comprises multiple subsets, each devoted to a different natural language processing task. Details of each subset follow:

1. Answer Verification

  • Features: instruction, input, output, domain, lang
  • Training set: 32417 examples, 36921031 bytes in total
  • Download size: 13174403 bytes

2. Answerability Classification

  • Features: instruction, input, output, domain, lang
  • Training set: 51795 examples, 67465958 bytes in total
  • Download size: 26434426 bytes

3. Cause Effect Classification

  • Features: instruction, input, output, domain, lang
  • Training set: 32038 examples, 17411444 bytes in total
  • Download size: 2820959 bytes

4. Code to Text

  • Features: instruction, input, output, domain, lang
  • Training set: 21328 examples, 38548880 bytes in total
  • Download size: 1326411 bytes

5. Coherence Classification

  • Features: instruction, input, output, domain, lang
  • Training set: 30077 examples, 21499865 bytes in total
  • Download size: 4510956 bytes

6. Commonsense Classification

  • Features: instruction, input, output, domain, lang
  • Training set: 130524 examples, 110982684 bytes in total
  • Download size: 4324986 bytes

7. Coreference Resolution

  • Features: instruction, input, output, domain, lang
  • Training set: 36990 examples, 33082423 bytes in total
  • Download size: 7467468 bytes

8. Data to Text

  • Features: instruction, input, output, domain, lang
  • Training set: 37695 examples, 21602165 bytes in total
  • Download size: 3668525 bytes

9. Dialogue Act Recognition

  • Features: instruction, input, output, domain, lang
  • Training set: 23085 examples, 18637792 bytes in total
  • Download size: 5199897 bytes

10. Dialogue Generation

  • Features: instruction, input, output, domain, lang
  • Training set: 54672 examples, 50010131 bytes in total
  • Download size: 13086755 bytes

11. Dialogue State Tracking

  • Features: instruction, input, output, domain, lang
  • Training set: 6810 examples, 6744188 bytes in total
  • Download size: 1123066 bytes

12. Discourse Connective Identification

  • Features: instruction, input, output, domain, lang
  • Training set: 1000 examples, 449631 bytes in total
  • Download size: 150761 bytes

13. Discourse Relation Classification

  • Features: instruction, input, output, domain, lang
  • Training split: 1000 examples, 820634 bytes in total
  • Download size: 127572 bytes

14. Entity Generation

  • Features: instruction, input, output, domain, lang
  • Training split: 3095 examples, 1286682 bytes in total
  • Download size: 75778 bytes

15. Entity Relation Classification

  • Features: instruction, input, output, domain, lang
  • Training split: 5903 examples, 2868956 bytes in total
  • Download size: 134753 bytes

16. Ethics Classification

  • Features: instruction, input, output, domain, lang
  • Training split: 25289 examples, 31848628 bytes in total
  • Download size: 14818576 bytes

17. Explanation

  • Features: instruction, input, output, domain, lang
  • Training split: 21352 examples, 26977111 bytes in total
  • Download size: 11447730 bytes

18. Fact Verification

  • Features: instruction, input, output, domain, lang
  • Training split: 6553 examples, 5230254 bytes in total
  • Download size: 1250181 bytes

19. Fill in The Blank

  • Features: instruction, input, output, domain, lang
  • Training split: 93210 examples, 112129292 bytes in total
  • Download size: 34635798 bytes

20. Gender Classification

  • Features: instruction, input, output, domain, lang
  • Training split: 19119 examples, 7708700 bytes in total
  • Download size: 675851 bytes

21. Grammar Error Correction

  • Features: instruction, input, output, domain, lang
  • Training split: 7239 examples, 17186402 bytes in total
  • Download size: 4559526 bytes

22. Grammar Error Detection

  • Features: instruction, input, output, domain, lang
  • Training split: 18015 examples, 22339465 bytes in total
  • Download size: 3319971 bytes

23. Information Extraction

  • Features: instruction, input, output, domain, lang
  • Training split: 91850 examples, 64231233 bytes in total
  • Download size: 14674014 bytes

24. Intent Identification

  • Features: instruction, input, output, domain, lang
  • Training split: 16016 examples, 18754320 bytes in total
  • Download size: 6099376 bytes

25. Irony Detection

  • Features: instruction, input, output, domain, lang
  • Training split: 2854 examples, 1867373 bytes in total
  • Download size: 224057 bytes

26. Keyword Tagging

  • Features: instruction, input, output, domain, lang
  • Training split: 11083 examples, 6278639 bytes in total
  • Download size: 1751069 bytes

27. Language Identification

  • Features: instruction, input, output, domain, lang
  • Training split: 43237 examples, 21757044 bytes in total
  • Download size: 5556218 bytes

28. Linguistic Probing

  • Features: instruction, input, output, domain, lang
  • Training split: 47482 examples, 20518412 bytes in total
  • Download size: 2502783 bytes

29. Mathematics

  • Features: instruction, input, output, domain, lang
  • Training split: 30317 examples, 12388128 bytes in total
  • Download size: 996758 bytes

30. Misc.

  • Features: instruction, input, output, domain, lang
  • Training split: 66066 examples, 32043502 bytes in total
  • Download size: 5041224 bytes

31. Named Entity Recognition

  • Features: instruction, input, output, domain, lang
  • Training split: 40001 examples, 29316975 bytes in total
  • Download size: 9514084 bytes

32. Negotiation Strategy Detection

  • Features: instruction, input, output, domain, lang
  • Training split: 7080 examples, 7430813 bytes in total
  • Download size: 1359096 bytes

33. Number Conversion

  • Features: instruction, input, output, domain, lang
  • Training split: 998 examples, 532830 bytes in total
  • Download size: 130491 bytes

34. Overlap Extraction

  • Features: instruction, input, output, domain, lang
  • Training split: 7975 examples, 4545110 bytes in total
  • Download size: 924076 bytes

35. Paper Review

  • Features: instruction, input, output, domain, lang
  • Training split: 157 examples, 441145 bytes in total
  • Download size: 225924 bytes

36. Paraphrasing

  • Features: instruction, input, output, domain, lang
  • Training split: 15439 examples, 5812603 bytes in total
  • Download size: 1392782 bytes

37. Poem Generation

  • Features: instruction, input, output, domain, lang
  • Training split: 6442 examples, 4654770 bytes in total
  • Download size: 1128202 bytes

38. Pos Tagging

  • Features: instruction, input, output, domain, lang
  • Training split: 62118 examples, 96998743 bytes in total
  • Download size: 6094397 bytes

39. Preposition Prediction

  • Features: instruction, input, output, domain, lang
  • Training split: 926 examples, 392546 bytes in total
  • Download size: 19026 bytes

40. Program Execution

  • Features: instruction, input, output, domain, lang
  • Training split: 433157 examples, 196128955 bytes in total
  • Download size: 29857095 bytes

41. Punctuation Error Detection

  • Features: instruction, input, output, domain, lang
  • Training split: 100 examples, 42046 bytes in total
  • Download size: 12781 bytes

42. Question Answering

  • Features: instruction, input, output, domain, lang
  • Training split: 470108 examples, 632225857 bytes in total
  • Download size: 250262921 bytes

43. Question Decomposition

  • Features: instruction, input, output, domain, lang
  • Training split: 9521 examples, 24960993 bytes in total
  • Download size: 1727434 bytes

44. Question Generation

  • Features: instruction, input, output, domain, lang
  • Training split: 230103 examples, 404077154 bytes in total
  • Download size: 148512423 bytes

45. Question Rewriting

  • Features: instruction, input, output, domain, lang
  • Training split: 42596 examples, 50260852 bytes in total
  • Download size: 6059105 bytes

46. Question Understanding

  • Features: instruction, input, output, domain, lang
  • Training split: 63448 examples, 61204653 bytes in total
  • Download size: 7814957 bytes

47. Section Classification

  • Features: instruction, input, output, domain, lang
  • Training split: 11975 examples, 13181798 bytes in total
  • Download size: 1160774 bytes

48. Sentence Composition

  • Features: instruction, input, output, domain, lang
  • Training split: 72496 examples, 59064347 bytes in total
  • Download size: 13871750 bytes

49. Sentence Compression

  • Features: instruction, input, output, domain, lang
  • Training split: 4934 examples, 2250408 bytes in total
  • Download size: 1155675 bytes

50. Sentence Expansion

  • Features: instruction, input, output, domain, lang
  • Training split: 1761 examples, 1035515 bytes in total
  • Download size: 391975 bytes

51. Sentence Ordering

  • Features: instruction, input, output, domain, lang
  • Training split: 20184 examples, 14605282 bytes in total
  • Download size: 3063937 bytes

52. Sentence Perturbation

  • Features: instruction, input, output, domain, lang
  • Training split: 80789 examples, 47077880 bytes in total
  • Download size: 7584636 bytes

53. Sentiment Analysis

  • Features: instruction, input, output, domain, lang
  • Training split: 253432 examples, 223188120 bytes in total
  • Download size: 87290295 bytes

54. Spam Classification

  • Features: instruction, input, output, domain, lang
  • Training split: 1065 examples, 581151 bytes in total
  • Download size: 83786 bytes

55. Speaker Identification

  • Features: instruction, input, output, domain, lang
  • Training split: 19800 examples, 13932578 bytes in total
  • Download size: 3558362 bytes

56. Speaker Relation Classification

  • Features: instruction, input, output, domain, lang
  • Training split: 153 examples, 206205 bytes in total
  • Download size: 89490 bytes

57. Spelling Error Detection

  • Features: instruction, input, output, domain, lang
  • Training split: 6499 examples, 3931890 bytes in total
  • Download size: 298131 bytes

58. Stance Detection

  • Features: instruction, input, output, domain, lang
  • Training split: 2854 examples, 1867373 bytes in total
  • Download size: 224057 bytes

Each sub-dataset comes with its feature list and size information, so users can pick the subsets that fit their research or application needs.
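As a minimal sketch of choosing subsets from these figures, the snippet below filters a handful of configurations by example count and download size. The numbers are copied from the list above; the `SUBSET_STATS` table and the `pick_subsets` and `bytes_per_example` helpers are hypothetical names introduced only for illustration and are not part of the dataset.

```python
# Figures copied from the list above: (train examples, dataset bytes, download bytes).
# Only a few configurations are included to keep the sketch short.
SUBSET_STATS = {
    "Answer Verification": (32417, 36921031, 13174403),
    "Code to Text": (21328, 38548880, 1326411),
    "Paper Review": (157, 441145, 225924),
    "Program Execution": (433157, 196128955, 29857095),
    "Question Answering": (470108, 632225857, 250262921),
}


def pick_subsets(stats, min_examples=0, max_download_bytes=None):
    """Return config names that meet both thresholds, largest example count first."""
    chosen = [
        name
        for name, (examples, _size, download) in stats.items()
        if examples >= min_examples
        and (max_download_bytes is None or download <= max_download_bytes)
    ]
    return sorted(chosen, key=lambda name: stats[name][0], reverse=True)


def bytes_per_example(stats, name):
    """Average uncompressed size of one example, in bytes."""
    examples, size, _download = stats[name]
    return size / examples


selected = pick_subsets(SUBSET_STATS, min_examples=20000, max_download_bytes=30_000_000)
```

With these thresholds, Question Answering is excluded by its download size and Paper Review by its example count.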

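Collected summary
Dataset introduction
Construction method
In natural language processing, an instruction dataset that can guide models through diverse tasks is essential. The Natural Instructions dataset brings together more than 60 natural language processing tasks in a uniform instruction-input-output triple structure. Each task configuration contains an explicit instruction, an input text, and the expected output, and is annotated with domain and language. The data was built by systematically collecting existing task datasets and converting them to a standard format, with the aim of giving models general, cross-task instruction-following ability. Construction emphasized breadth of task coverage and consistency of data format, which makes the dataset a solid base for instruction fine-tuning research.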
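Usage
For research on instruction following, the dataset can serve as a core resource for multi-task instruction fine-tuning. Researchers load a specific task configuration (such as Answer Verification or Question Answering) to get its training data; each record contains an instruction, an input, and the expected output. A model can be fine-tuned on a single task, or data from several tasks can be mixed to train a general instruction-following model. The standardized format slots directly into existing training pipelines, and the instruction-input-output mapping teaches the model to carry out a task from a natural-language instruction. Used this way, the dataset supports studies of task generalization and zero-shot performance.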
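Loading one task configuration and turning a record into a training pair might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository id and config names match the download link and dataset card above. The prompt template, the `load_task` and `format_example` helpers, and the sample record are hypothetical and only illustrate the instruction-input-output mapping.

```python
REPO_ID = "wisenut-nlp-team/natural-instructions"


def load_task(config_name):
    """Load the train split of one task configuration (needs `datasets` and network access)."""
    # Imported lazily so the formatting helper below works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train")


def format_example(record):
    """Turn one instruction/input/output record into a (prompt, target) pair."""
    # The template is an assumption; any consistent layout of the three fields would do.
    prompt = f"{record['instruction'].strip()}\n\nInput: {record['input'].strip()}\nOutput:"
    return prompt, record["output"].strip()


# Hypothetical record using the five fields listed in the dataset card.
sample = {
    "instruction": "Decide whether the answer to the question is correct.",
    "input": "Question: What is 2 + 2? Answer: 4",
    "output": "Yes",
    "domain": "Mathematics",
    "lang": "English",
}
prompt, target = format_example(sample)
```

For real data, `load_task("Answer Verification")` would return a dataset whose rows can be passed to `format_example` one at a time or through `Dataset.map`.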
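Background and challenges
Background overview
In natural language processing, the development of general instruction-following models has long been held back by fragmented, heterogeneous task-specific datasets. To address this, the WiseNut NLP team released the Natural Instructions dataset in 2022. It combines instruction-output pairs for more than 60 kinds of natural language tasks, covering question answering, text classification, generation, reasoning, and more. The core research goal is a unified, large-scale benchmark that improves model generalization and task adaptation in zero-shot and few-shot settings. The resource has advanced research on instruction tuning and meta-learning, and supplies key data for studying how models understand and carry out natural-language instructions.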
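Current challenges
The domain challenge the dataset targets is getting a single model to generalize, through natural-language instructions, to many complex tasks it has never seen, which requires deep semantic understanding and reasoning. Construction raised its own difficulties. First, task descriptions and instances had to be extracted from a large pool of heterogeneous existing datasets and standardized so that instructions stay clear and consistent. Second, broad coverage across tasks, domains, and languages makes quality control and uniform annotation guidelines very hard to maintain. Finally, the builders had to keep tasks balanced and representative so that data bias does not distort model evaluation.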
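The example counts listed earlier range from 100 to 470108 per task, so a naive mixture would be dominated by the largest tasks. One common way to soften this when mixing tasks is examples-proportional sampling with a per-task cap. The sketch below illustrates that general technique; it is not the dataset builders' method, and the cap of 50,000 and the helper names are assumptions.

```python
import random


def mixing_weights(example_counts, cap=50_000):
    """Examples-proportional weights where each task counts at most `cap` examples."""
    capped = {task: min(count, cap) for task, count in example_counts.items()}
    total = sum(capped.values())
    return {task: value / total for task, value in capped.items()}


def sample_tasks(weights, n, seed=0):
    """Draw n task names according to the mixing weights."""
    rng = random.Random(seed)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=n)


# Example counts copied from the list above.
counts = {
    "Question Answering": 470108,
    "Program Execution": 433157,
    "Paper Review": 157,
    "Punctuation Error Detection": 100,
}
weights = mixing_weights(counts, cap=50_000)
batch_tasks = sample_tasks(weights, n=8)
```

With the cap, the two largest tasks receive equal weight instead of scaling with their raw sizes, while the smallest tasks still contribute in proportion to their example counts.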
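Common scenarios
Classic use cases
In natural language processing, instruction fine-tuning has become a key paradigm for improving model generalization. With more than sixty task types in a uniform instruction-input-output format, the Natural Instructions dataset is classically used to train and evaluate a model's ability to follow natural-language instructions across diverse tasks. By combining question answering, text generation, classification, and other tasks, it provides a multi-task benchmark in which one model handles language understanding and generation across domains within a single framework, supporting systematic evaluation and development of instruction-following models.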
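Academic problems addressed
Traditional NLP models are limited in task generalization and zero-shot learning. The Natural Instructions dataset addresses the problem of adapting to new tasks with a large, multi-task collection of instructions. It gives instruction fine-tuning research a standardized evaluation benchmark and supports analysis of generalization to unseen tasks, which deepens understanding of how models transfer what they learn. It belongs to the same line of instruction-tuning work as T0 and FLAN, and provides a data foundation for building general-purpose language intelligence.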
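Practical applications
In practice, the Natural Instructions dataset supports the development of intelligent assistants, automated text processing, and similar systems. Models trained on it can understand varied instructions given in natural language, for example sentiment analysis, information extraction, or dialogue generation, and can be embedded in customer service systems, content moderation tools, and other real-world settings. This lowers the barrier to deployment, lets non-expert users complete complex tasks through natural interaction, and makes human-machine collaboration more efficient and natural.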
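Recent research on the dataset
Latest research directions
In natural language processing, instruction fine-tuning has become a key route to better model generalization. Because it covers question answering, text generation, classification, and other tasks, the Natural Instructions dataset is a rich resource for studying instruction following and zero-shot learning. Current work focuses on using large-scale instruction data of this kind to improve the cross-task adaptability of large language models, especially their ability to reason about unseen tasks in few-shot settings. This research is moving models from single-task specialists toward general task executors, a trend with far-reaching effects on building smarter, more flexible dialogue systems and automation tools.
The content above was collected and summarized by 遇见数据集.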