Large-Scale Chinese Semantic Similarity Data
Source: ModelScope community (魔搭社区). Updated 2026-05-22. Indexed 2024-05-15.
Download link:
https://modelscope.cn/datasets/DAMO_NLP/BQ_Corpus
Resource overview:
Sentence Semantic Equivalence Identification (SSEI) is a semantic matching task and a fundamental Natural Language Processing (NLP) task in Question Answering (QA), automated customer service, and chatbots. In customer service systems, two questions are defined as semantically equivalent if they convey the same intent or can be answered by the same response. We introduce the Bank Question (BQ) Corpus, a large-scale, domain-specific Chinese corpus for SSEI. The BQ Corpus contains 120,000 question pairs drawn from online banking customer service logs. It is split into three parts: 100,000 pairs for training, 10,000 pairs for validation, and 10,000 pairs for testing. We report baseline results for five SSEI models on the corpus, including state-of-the-art algorithms. As the largest manually annotated public Chinese SSEI corpus in the banking domain, the BQ Corpus is useful for research on Chinese question semantic matching and is also an important resource for cross-lingual and cross-domain SSEI research.
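The split described above can be sketched in a few lines of Python. This is a minimal illustration only: the `QuestionPair` field names and the 1/0 label encoding are assumptions made for the example, not the corpus's documented schema, and `accuracy` is a generic metric rather than the evaluation code behind the reported baselines.

```python
from dataclasses import dataclass

# Split sizes as stated in the corpus description.
SPLIT_SIZES = {"train": 100_000, "validation": 10_000, "test": 10_000}


@dataclass(frozen=True)
class QuestionPair:
    """One labeled example. Field names are hypothetical, not the official schema."""
    question1: str
    question2: str
    label: int  # assumed encoding: 1 = semantically equivalent, 0 = not equivalent


def split_fractions(sizes: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def accuracy(pairs: list[QuestionPair], predictions: list[int]) -> float:
    """Fraction of pairs whose predicted label matches the gold label."""
    if len(pairs) != len(predictions):
        raise ValueError("pairs and predictions must have the same length")
    if not pairs:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(pair.label == pred for pair, pred in zip(pairs, predictions))
    return correct / len(pairs)
```

With these sizes the training split accounts for five sixths of the 120,000 pairs, and validation and test account for one twelfth each.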
Provider:
maas
Created:
2022-09-28
Collected summary
Dataset introduction

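Background and challenges
Background overview
This dataset is the BQ Corpus, a large-scale Chinese banking-domain corpus for sentence semantic equivalence identification (SSEI). It contains 120,000 question pairs for training and evaluating semantic matching models. As the largest publicly available, manually annotated Chinese SSEI corpus in the banking domain, it also supports cross-lingual and cross-domain research.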
The above content was collected and summarized by 遇见数据集.



