dapraws/college_alpaca_dataset
Source: Hugging Face. Updated: 2024-12-07. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/dapraws/college_alpaca_dataset
Description:
College_Alpaca_Dataset is a cleaned version of the original Alpaca dataset. It addresses several issues found in the original: hallucinations, merged instructions, empty outputs, empty code examples, instructions to generate images, N/A outputs, inconsistent input fields, wrong answers, nonsensical or unclear instructions, and extraneous escape and control characters. The dataset is intended mainly for instruction fine-tuning of pretrained language models so that they follow instructions better. It contains 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. The creation process used a new prompt, more aggressive batch decoding, and a simplified data generation pipeline. The dataset is in English, and its data fields are instruction, input, output, and text. It is licensed under CC BY-NC 4.0.
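As a rough illustration of how the four fields might relate, the sketch below builds a `text` string from `instruction`, `input`, and `output`. The prompt wording follows the template commonly associated with the original Alpaca project. Whether this dataset's `text` field uses exactly this wording is an assumption, and the `build_text` helper and the sample record are hypothetical.

```python
# Prompt templates in the style of the original Alpaca project.
# ASSUMPTION: this listing does not confirm the exact wording used for `text`.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def build_text(record: dict) -> str:
    """Combine instruction, input, and output into one training string.

    Hypothetical helper: picks the template based on whether `input` is empty.
    """
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"],
            input=record["input"],
            output=record["output"],
        )
    return PROMPT_NO_INPUT.format(
        instruction=record["instruction"],
        output=record["output"],
    )


# Made-up sample record, not taken from the dataset.
sample = {"instruction": "Add the two numbers.", "input": "2 and 3", "output": "5"}
sample["text"] = build_text(sample)
print(sample["text"])
```

Records with an empty `input` fall back to the shorter template, so the `### Input:` section appears only when there is context to show.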
Provider:
dapraws



