ctga-v1
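Dataset Overview
Dataset Name
- Name: ctga-v1
- Link: ctga-v1
Dataset Purpose
- Purpose: Generating synthetic instruction-tuning datasets by converting unannotated text into task-specific training datasets.
Supported Task Types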
- Task types:
  - extractive question answering (exqa)
  - multiple-choice question answering (mcqa)
  - question generation (qg)
  - question answering without choices (qa)
  - yes-no question answering (ynqa)
  - coreference resolution (coref)
  - paraphrase generation (paraphrase)
  - paraphrase identification (paraphrase_id)
  - sentence completion (sent_comp)
  - sentiment (sentiment)
  - summarization (summarization)
  - text generation (text_gen)
  - topic classification (topic_class)
  - word sense disambiguation (wsd)
  - textual entailment (te)
  - natural language inference (nli)
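The short names in parentheses are what the example code passes as `task_type`. As a minimal sketch, the list can be kept as a lookup table so that a full name or a short name is checked before any generation is started. The names `TASK_TYPES` and `resolve_task_type` are hypothetical helpers introduced here for illustration; they are not part of the Bonito package.

```python
# Short name -> full name, copied from the task-type list in this card.
# TASK_TYPES and resolve_task_type are illustrative names, not a Bonito API.
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def resolve_task_type(name: str) -> str:
    """Return the short name for a short or full task-type name."""
    key = name.strip().lower()
    if key in TASK_TYPES:
        return key
    for short, full in TASK_TYPES.items():
        if key == full:
            return short
    raise ValueError(f"Unsupported task type: {name!r}")
```

For example, `resolve_task_type("Natural Language Inference")` yields the short name `nli`, while an unknown name raises `ValueError`.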
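Dataset Generation Method
- Generation method: The Bonito model, used together with the Hugging Face transformers library and the vllm library, generates a synthetic dataset from unannotated text for a specified task type and set of sampling parameters.
Example Code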
```python
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

# Load a small sample of unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))

# Generate a synthetic NLI dataset from the unannotated text
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)
```
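The example code generates data for a single task type (`nli`). One possible way to cover several task types is to loop over their short names and collect one result per type. The sketch below is an assumption about how that loop might look, not a documented Bonito workflow: `generate_per_task_type` is a hypothetical helper, and `fake_generate` is a stand-in so the snippet runs without a GPU or the bonito package. In real use, the callable would presumably wrap the `bonito.generate_tasks(...)` call from the example code, with `task_type` taken from the loop.

```python
from typing import Any, Callable, Dict, Iterable


def generate_per_task_type(
    generate_fn: Callable[[str], Any],
    task_types: Iterable[str],
) -> Dict[str, Any]:
    """Call generate_fn once per task type and collect the results by name."""
    results: Dict[str, Any] = {}
    for task_type in task_types:
        results[task_type] = generate_fn(task_type)
    return results


# Stand-in generator (hypothetical) so the sketch is runnable on its own.
def fake_generate(task_type: str) -> list:
    return [f"synthetic {task_type} example"]


datasets_by_task = generate_per_task_type(fake_generate, ["nli", "exqa"])
```

Keeping the generation call behind a plain callable means the loop can be exercised with a stub first and pointed at the actual model afterwards.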
Citation
- Citation:
@inproceedings{bonito:aclfindings24,
  title = {Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation},
  author = {Nayak, Nihal V. and Nan, Yiyang and Trost, Avi and Bach, Stephen H.},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year = {2024}
}




