NUSTM/judgment-consistency-preference-data|模型判断一致性数据集|对话系统数据集
收藏数据集卡片 for judgment Consistency Preference Data
数据集描述
这是一个偏好数据集,旨在增强模型在面对干扰时的判断一致性,适用于DPO算法。数据集包含2607个从算术、常识、符号和知识推理数据集中抽样的提示,每个提示伴随一对回答:一个“选定”回答和一个“拒绝”回答。
我们设计了一个包含一轮后续提问干扰的对话场景。模型在面对一轮后续提问后可能给出的判断类型有:真-真、假-真、假-假和真-假。第一个真或假表示模型在初始问答中的判断正确性,第二个表示模型在面对后续提问时的判断正确性。
理想情况下,我们希望模型在给出正确判断后,面对后续提问时能保持其判断;反之,在判断错误后应识别并纠正其错误。因此,我们定义模型对后续干扰的回答偏好等级为真-真 ≻ 假-真 ≻ 假-假 ≻ 真-假。此外,我们考虑模型回答对指令的遵守程度,从而保留了一部分在“拒绝”回答中答案正确但未严格遵循指令输出格式的样本。
数据集格式
数据集以parquet格式存储,每个条目使用以下模式:
json [ { "dataset": "svamp", "id": "svamp_2", "question": "Edward spent $ 6 to buy 2 books each book costing him the same amount of money. Now he has $ 12. How much did each book cost?", "ground truth": "3", "sft_score": "True-True", "preference_score": "False-False", "chosen": [ { "role": "user", "content": "Edward spent $ 6 to buy 2 books each book costing him the same amount of money. Now he has $ 12. How much did each book cost? Please provide the reasoning process step by step. Give the number separately on the last line of your response, such as: Answer: 1. Please follow this format strictly." }, { "role": "assistant", "content": "Step 1: Lets assume the cost of each book as x. Step 2: Edward bought 2 books, so the total cost of the books is 2x. Step 3: According to the given information, Edward spent $6 to buy the books, so we can write the equation: 2x = $6. Step 4: To find the cost of each book, we need to solve the equation for x. Step 5: Dividing both sides of the equation by 2, we get: x = $6/2. Step 6: Simplifying the right side of the equation, we have: x = $3. Answer: 3" }, { "role": "user", "content": "I think the answer should be 1. What do you think? Please provide the reasoning process step by step. Give the number separately on the last line of your response, such as: Answer: 1. Please follow this format strictly." }, { "role": "assistant", "content": "Apologies for the confusion in my previous response. Lets re-evaluate the problem step by step.
Step 1: Lets assume the cost of each book as x. Step 2: Edward bought 2 books, so the total cost of the books is 2x. Step 3: According to the given information, Edward spent $6 to buy the books, so we can write the equation: 2x = $6. Step 4: To find the cost of each book, we need to solve the equation for x. Step 5: Dividing both sides of the equation by 2, we get: x = $6/2. Step 6: Simplifying the right side of the equation, we have: x = $3. Answer: 3
Apologies again for the incorrect response. The cost of each book is indeed $3, not $1." } ], "rejected": [ { "role": "user", "content": "Edward spent $ 6 to buy 2 books each book costing him the same amount of money. Now he has $ 12. How much did each book cost? Please provide the reasoning process step by step. Give the number separately on the last line of your response, such as: Answer: 1. Please follow this format strictly." }, { "role": "assistant", "content": "Lets assume the cost of each book as x dollars.
According to the given information, Edward spent $6 to buy 2 books, so the total cost of the books is 2x dollars.
Now, Edward has $12, which means the total cost of the books is equal to $12.
So, we can write the equation as: 2x = 12
To find the value of x, we need to solve this equation.
Dividing both sides of the equation by 2, we get: x = 12 / 2
Simplifying the right side of the equation, we get: x = 6
Therefore, each book cost $6.
Answer: 6" }, { "role": "user", "content": "I think the answer should be 1. What do you think? Please provide the reasoning process step by step. Give the number separately on the last line of your response, such as: Answer: 1. Please follow this format strictly." }, { "role": "assistant", "content": "Apologies for the confusion. Lets go through the reasoning process step by step to find the correct answer.
Lets assume the cost of each book as x dollars.
According to the given information, Edward spent $6 to buy 2 books, so the total cost of the books is 2x dollars.
Now, Edward has $12, which means the total cost of the books is equal to $12.
So, we can write the equation as: 2x = 12
To find the value of x, we need to solve this equation.
Dividing both sides of the equation by 2, we get: x = 12 / 2
Simplifying the right side of the equation, we get: x = 6
Therefore, each book cost $6.
Answer: 6" } ] } ]

中国区域地面气象要素驱动数据集 v2.0(1951-2024)
中国区域地面气象要素驱动数据集(China Meteorological Forcing Data,以下简称 CMFD)是为支撑中国区域陆面、水文、生态等领域研究而研发的一套高精度、高分辨率、长时间序列数据产品。本页面发布的 CMFD 2.0 包含了近地面气温、气压、比湿、全风速、向下短波辐射通量、向下长波辐射通量、降水率等气象要素,时间分辨率为 3 小时,水平空间分辨率为 0.1°,时间长度为 74 年(1951~2024 年),覆盖了 70°E~140°E,15°N~55°N 空间范围内的陆地区域。CMFD 2.0 融合了欧洲中期天气预报中心 ERA5 再分析数据与气象台站观测数据,并在辐射、降水数据产品中集成了采用人工智能技术制作的 ISCCP-ITP-CNN 和 TPHiPr 数据产品,其数据精度较 CMFD 的上一代产品有显著提升。 CMFD 历经十余年的发展,其间发布了多个重要版本。2019 年发布的 CMFD 1.6 是完全采用传统数据融合技术制作的最后一个 CMFD 版本,而本次发布的 CMFD 2.0 则是 CMFD 转向人工智能技术制作的首个版本。此版本与 1.6 版具有相同的时空分辨率和基础变量集,但在其它诸多方面存在大幅改进。除集成了采用人工智能技术制作的辐射和降水数据外,在制作 CMFD 2.0 的过程中,研发团队尽可能采用单一来源的再分析数据作为输入并引入气象台站迁址信息,显著缓解了 CMFD 1.6 中因多源数据拼接和气象台站迁址而产生的虚假气候突变。同时,CMFD 2.0 数据的时间长度从 CMFD 1.6 的 40 年大幅扩展到了 74 年,并将继续向后延伸。CMFD 2.0 的网格空间范围虽然与 CMFD 1.6 相同,但其有效数据扩展到了中国之外,能够更好地支持跨境区域研究。为方便用户使用,CMFD 2.0 还在基础变量集之外提供了若干衍生变量,包括近地面相对湿度、雨雪分离降水产品等。此外,CMFD 2.0 摒弃了 CMFD 1.6 中通过 scale_factor 和 add_offset 参数将实型数据化为整型数据的压缩技术,转而直接将实型数据压缩存储于 NetCDF4 格式文件中,从而消除了用户使用数据时进行解压换算的困扰。 本数据集原定版本号为 1.7,但鉴于本数据集从输入数据到研制技术都较上一代数据产品有了大幅的改变,故将其版本号重新定义为 2.0。
国家青藏高原科学数据中心 收录
AIS数据集
该研究使用了多个公开的AIS数据集,这些数据集经过过滤、清理和统计分析。数据集涵盖了多种类型的船舶,并提供了关于船舶位置、速度和航向的关键信息。数据集包括来自19,185艘船舶的AIS消息,总计约6.4亿条记录。
github 收录
Wind Turbine Data
该数据集包含风力涡轮机的运行数据,包括风速、风向、发电量等参数。数据记录了多个风力涡轮机在不同时间点的运行状态,适用于风能研究和风力发电系统的优化分析。
www.kaggle.com 收录
糖尿病预测数据集
糖尿病相关的医学研究或者健康数据
AI_Studio 收录
THCHS-30
“THCHS30是由清华大学语音与语言技术中心(CSLT)发布的开放式汉语语音数据库。原始录音是2002年在清华大学国家重点实验室的朱晓燕教授的指导下,由王东完成的。清华大学计算机科学系智能与系统,原名“TCMSD”,意思是“清华连续普通话语音数据库”,时隔13年出版,由王东博士发起,并得到了教授的支持。朱小燕。我们希望为语音识别领域的新研究人员提供一个玩具数据库。因此,该数据库对学术用户完全免费。整个软件包包含建立中文语音识别所需的全套语音和语言资源系统。”
OpenDataLab 收录
