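robinsmits/ChatAlpaca-20K | Dialogue system dataset | Natural language processing dataset
Dataset card: ChatAlpaca 20K
Dataset description
ChatAlpaca is a chat dataset intended to help researchers develop models for instruction following in multi-turn conversations. It extends the Stanford Alpaca data with multi-turn instructions and their corresponding responses.
ChatAlpaca was developed by the Chinese Information Processing Laboratory at the Institute of Software, Chinese Academy of Sciences.
In this dataset, we use ChatGPT (GPT-3.5-turbo) to generate follow-up utterances and continue the conversation with ChatGPT. This process produces multi-turn conversations in which a simulated user provides instructions and ChatGPT responds accordingly.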
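The generation procedure described above can be sketched as a loop that alternates between a simulated user and the assistant. This is only an illustration: `simulate_user` and `respond` are hypothetical callables standing in for the two ChatGPT calls, and `max_turns` and the "Goodbye." stop phrase are assumptions. The authors' actual prompts and stopping rule are not described in this card.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def extend_conversation(
    seed: List[Message],
    simulate_user: Callable[[List[Message]], str],  # hypothetical: writes the next user turn
    respond: Callable[[List[Message]], str],        # hypothetical: writes the assistant reply
    max_turns: int = 3,                             # assumed cap on follow-up turns
    stop_phrase: str = "Goodbye.",                  # assumed end-of-dialogue marker
) -> List[Message]:
    """Extend a seed (instruction, response) pair with simulated follow-up turns."""
    messages = list(seed)  # copy so the caller's seed is left unchanged
    for _ in range(max_turns):
        follow_up = simulate_user(messages)
        messages.append({"role": "user", "content": follow_up})
        if follow_up.strip() == stop_phrase:
            break  # the simulated user ended the conversation; no reply is generated
        messages.append({"role": "assistant", "content": respond(messages)})
    return messages


# Stub callables so the sketch runs without any API access.
seed = [
    {"role": "user", "content": "Identify the odd one out: Twitter, Instagram, Telegram"},
    {"role": "assistant", "content": "Telegram"},
]
scripted = iter(["What makes Telegram different from Twitter and Instagram?", "Goodbye."])
dialogue = extend_conversation(
    seed,
    simulate_user=lambda history: next(scripted),
    respond=lambda history: "stub reply",
)
```

With these stubs the dialogue ends on an unanswered user turn, which is one way a conversation could come to finish with a user message.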
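Dataset structure
The dataset consists of a single JSON file, chatalpaca-20k.json, which holds multi-turn instruction-following conversations. Each entry represents a distinct conversation between a simulated user and ChatGPT. The dataset currently contains 20,000 conversations in total.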
The dataset is divided into three splits:
original: all data from the original chatalpaca-20k.json
train: 19,600 rows for training
test: 400 rows for testing or validation
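The 19,600/400 sizes correspond to holding out 2% of the 20,000 conversations. The card does not say how rows were assigned to each split; the sketch below assumes a seeded random shuffle, and `split_entries`, `test_size`, and `seed` are illustrative names rather than anything taken from the dataset's tooling.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_entries(
    entries: Sequence[T], test_size: int = 400, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Return (train, test) after a reproducible shuffle (assumed strategy)."""
    if not 0 <= test_size <= len(entries):
        raise ValueError("test_size must be between 0 and len(entries)")
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    return shuffled[test_size:], shuffled[:test_size]


# Integers stand in for the 20,000 conversation entries.
train, test = split_entries(range(20_000))
```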
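Each conversation begins with an instruction followed by a response from ChatGPT; both come from the Stanford Alpaca data. The conversation then continues with the user providing further instructions and ChatGPT responding accordingly.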
The data is a list of entries; each entry is a dictionary with the following fields:
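id: integer, the identifier
messages: list of dictionaries representing the conversation
Each dictionary in the messages list contains the following fields:
role: string, either "user" or "assistant"
content: string, the text of the message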
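The field descriptions above can be turned into a small schema check. The following is a minimal sketch that relies only on the fields listed in this card; `validate_entry` and `VALID_ROLES` are hypothetical names, and the inline JSON strings are made-up stand-ins for real entries.

```python
import json
from typing import Any, List

VALID_ROLES = {"user", "assistant"}


def validate_entry(entry: Any) -> List[str]:
    """Return a list of schema problems; an empty list means the entry looks valid."""
    if not isinstance(entry, dict):
        return ["entry is not a dictionary"]
    problems = []
    if not isinstance(entry.get("id"), int):
        problems.append("id is missing or not an integer")
    messages = entry.get("messages")
    if not isinstance(messages, list):
        problems.append("messages is missing or not a list")
        return problems
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}] is not a dictionary")
            continue
        role = msg.get("role")
        if not (isinstance(role, str) and role in VALID_ROLES):
            problems.append(f"messages[{i}].role is not 'user' or 'assistant'")
        if not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}].content is not a string")
    return problems


# Made-up entries for illustration only.
good = json.loads('{"id": 1, "messages": [{"role": "user", "content": "Hi"}]}')
bad = json.loads('{"id": "1", "messages": [{"role": "system", "content": 3}]}')
```

Here `validate_entry(good)` returns an empty list, while `validate_entry(bad)` reports the string id, the unexpected role, and the non-string content.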
Example conversation
```json
[
  {
    "role": "user",
    "content": "Identify the odd one out: Twitter, Instagram, Telegram"
  },
  {
    "role": "assistant",
    "content": "Telegram"
  },
  {
    "role": "user",
    "content": "What makes Telegram different from Twitter and Instagram?"
  },
  {
    "role": "assistant",
    "content": "Telegram is a cloud-based instant messaging app that focuses on privacy and security. Unlike Twitter and Instagram which are mainly used for following news, celebrities, and sharing images, Telegram was created as a secure messaging app for private and group communication. Telegram also offers more advanced features than Twitter and Instagram, such as the ability to schedule messages, create bots, and send encrypted messages."
  },
  {
    "role": "user",
    "content": "Can you give me an example of how the scheduling messages feature can be useful on Telegram?"
  },
  {
    "role": "assistant",
    "content": "Sure! The scheduling messages feature on Telegram can be useful in many different situations. Here's an example:\n\nLet's say you have a workgroup on Telegram, and you want to send a message to everyone in the group to remind them of a meeting. However, you know that most of the people in the group are busy at work during the time you want to send the message.\n\nThat's where the scheduling messages feature comes in. You can schedule the message to be sent at a later time when people are more likely to be free, such as during their lunch break or after work. This way, everyone in the group will receive the message at the same time, and you won't have to worry about anyone missing the important reminder.\n\nOverall, the scheduling messages feature can be really helpful for coordinating communication with groups of people on Telegram, whether it's for work, social events, or anything else."
  },
  {
    "role": "user",
    "content": "Goodbye."
  }
]
```
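The example ends with a user turn ("Goodbye.") that has no assistant reply, so code that pairs instructions with responses has to tolerate an unanswered final message. The sketch below shows one way to do that; `to_pairs` is a hypothetical helper, and dropping the unanswered turn is an assumption about what a training pipeline would want, not something the card prescribes.

```python
from typing import Dict, List, Tuple


def to_pairs(messages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair each user message with the assistant reply that follows it."""
    pairs = []
    pending = None  # most recent user message still waiting for a reply
    for msg in messages:
        if msg["role"] == "user":
            pending = msg["content"]
        elif msg["role"] == "assistant" and pending is not None:
            pairs.append((pending, msg["content"]))
            pending = None
    return pairs  # a trailing unanswered user turn is dropped


# Shortened version of the example conversation.
conversation = [
    {"role": "user", "content": "Identify the odd one out: Twitter, Instagram, Telegram"},
    {"role": "assistant", "content": "Telegram"},
    {"role": "user", "content": "What makes Telegram different from Twitter and Instagram?"},
    {"role": "assistant", "content": "Telegram is a cloud-based instant messaging app."},
    {"role": "user", "content": "Goodbye."},
]
pairs = to_pairs(conversation)
```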
Citation
Please cite the original repository when using the dataset.
@misc{ChatAlpaca,
  author = {Ning Bian and Hongyu Lin and Yaojie Lu and Xianpei Han and Le Sun and Ben He},
  title = {ChatAlpaca: A Multi-Turn Dialogue Corpus based on Alpaca Instructions},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cascip/ChatAlpaca}},
}