
Orca

arXiv | Updated 2023-10-13 | Indexed 2024-06-21
Download link:
https://github.com/nuochenpku/Orca
Overview:
Orca is the first benchmark dataset for Chinese Conversational Machine Reading Comprehension (CMRC), developed by The Hong Kong University of Science and Technology. It contains 831 hot-topic-driven dialogues totaling 4,742 turns. Each turn is paired with a response-related passage, which is intended to allow a more reasonable evaluation of a model's comprehension ability. Topics are drawn from social media platforms and span 33 domains, so that the data stays close to real-world scenarios. Answers in Orca are carefully annotated natural responses rather than extracted spans or short phrases, so a model needs generation ability as well as comprehension ability. The dataset is suited to evaluating how well models adapt to new knowledge and questions and how well they generalize across domains.
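The structure described above (dialogues made of turns, each turn paired with a passage and a free-form response) can be sketched roughly as follows. This is only an illustrative in-memory layout: the class names, field names, and the helper functions are assumptions made for this sketch and do not reflect the actual file format in the repository. Only the counts 831 and 4,742 come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One dialogue turn (hypothetical field names, not the official schema)."""
    question: str  # the user query for this turn
    passage: str   # the response-related passage paired with this turn
    response: str  # annotated natural response, not an extracted span


@dataclass
class Dialogue:
    """One topic-driven dialogue (hypothetical field names)."""
    topic: str
    domain: str
    turns: List[Turn] = field(default_factory=list)


def is_extractive(turn: Turn) -> bool:
    """Return True if the response is a verbatim substring of the passage."""
    return turn.response in turn.passage


def average_turns(total_turns: int, total_dialogues: int) -> float:
    """Mean number of turns per dialogue."""
    return total_turns / total_dialogues


# Placeholder content purely for illustration; not taken from the dataset.
demo = Dialogue(
    topic="placeholder topic",
    domain="placeholder domain",
    turns=[
        Turn(
            question="placeholder question",
            passage="placeholder passage text",
            response="a free-form answer written by an annotator",
        )
    ],
)

# Counts quoted in the description: 4,742 turns over 831 dialogues.
avg = average_turns(4742, 831)
print(f"average turns per dialogue: {avg:.2f}")
print("demo response is extractive:", is_extractive(demo.turns[0]))
```

With the reported counts, the average works out to roughly 5.7 turns per dialogue. The `is_extractive` check illustrates why span-extraction models are a poor fit here: a natural response generally does not appear verbatim in its passage.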
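Provided by:
The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Created: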
2023-02-27