KomeijiForce/Cuckoo_C4_Super_Rainbow
Hugging Face · Updated 2025-02-19 · Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/KomeijiForce/Cuckoo_C4_Super_Rainbow
Description:
Cuckoo is an information extraction (IE) model that imitates the next-token prediction paradigm of large language models (LLMs). Instead of retrieving the next tokens from the vocabulary, it predicts them by tagging them in the given input context. This lets Cuckoo use any text resource for IE pre-training, in particular data curated for LLMs, which sets it apart from earlier IE pre-training approaches. The pre-training data is built up cumulatively: 100M next tokens extraction (NTE) instances converted from C4; Cuckoo-C4 plus 2.6M NTE instances from the supervised fine-tuning dataset TuluV3; Cuckoo-C4-Instruct plus MultiNERD, MetaIE, NuNER, and MRQA (excluding SQuAD and DROP); and Cuckoo-C4-Rainbow plus multiple NER datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, and SNLI.
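The idea of predicting next tokens by tagging them in the context can be illustrated with a toy sketch. The function name `tag_next_tokens`, the whitespace tokenization, and the B/I/O label set below are assumptions made for illustration only; they are not taken from the Cuckoo code or from this dataset's actual schema.

```python
from typing import List


def tag_next_tokens(context: List[str], continuation: List[str]) -> List[str]:
    """Label each context token with B/I/O, marking every place where the
    continuation span occurs in the context (hypothetical helper)."""
    tags = ["O"] * len(context)
    n = len(continuation)
    if n == 0:
        return tags  # nothing to extract: every token stays "O"
    i = 0
    while i <= len(context) - n:
        if context[i:i + n] == continuation:
            tags[i] = "B"                    # first token of the span
            for j in range(i + 1, i + n):
                tags[j] = "I"                # remaining tokens of the span
            i += n                           # skip past the matched span
        else:
            i += 1
    return tags


# Toy instance: the "next tokens" are found inside the context itself,
# so the target is a tag sequence rather than a vocabulary lookup.
context = "Tom lives in Los Angeles .".split()
continuation = "Los Angeles".split()
tags = tag_next_tokens(context, continuation)
for token, tag in zip(context, tags):
    print(f"{token}\t{tag}")
```

Under this framing the model's output is a labeling over the input tokens, which is why ordinary text and LLM instruction data can be converted into extraction training instances.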
Provided by:
KomeijiForce



