LCQMC (Large-scale Chinese Question Matching Corpus)

Name: LCQMC (Large-scale Chinese Question Matching Corpus)
Creator: OpenDataLab
Published: 2026-05-24 04:30:13
License: No description available

OpenDataLab | Updated 2026-05-24 | Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/LCQMC

Resource description:

Question matching is a fundamental task in Question Answering (QA). It is usually treated as a semantic matching task and sometimes as a paraphrase identification task. The goal is to search an existing database for questions whose intent is similar to that of an input question. We introduce a large-scale Chinese question matching corpus named LCQMC. LCQMC is more general than a paraphrase corpus because it focuses on intent matching rather than paraphrasing. The corpus contains 260,068 manually annotated question pairs, split into three parts: a training set of 238,766 pairs, a development set of 8,802 pairs, and a test set of 12,500 pairs. We tested several well-known sentence matching methods on the corpus. The experimental results both demonstrate the quality of LCQMC and provide reliable baseline performance for further research on it.
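As a quick sanity check on the figures quoted above, the following minimal Python sketch confirms that the three split sizes add up to the stated total and reports each split's share of the corpus. The names `SPLITS`, `TOTAL_PAIRS`, and `split_fractions` are illustrative only and are not part of any LCQMC distribution or OpenDataLab tooling.

```python
# Split sizes as quoted in the corpus description (hypothetical variable names).
SPLITS = {"train": 238_766, "dev": 8_802, "test": 12_500}
TOTAL_PAIRS = 260_068


def split_fractions(splits):
    """Return each split's share of the total number of question pairs."""
    total = sum(splits.values())
    return {name: size / total for name, size in splits.items()}


# The three subsets should account for every annotated pair.
assert sum(SPLITS.values()) == TOTAL_PAIRS

for name, fraction in split_fractions(SPLITS).items():
    print(f"{name}: {fraction:.1%}")
```

The split is heavily weighted toward training data: roughly 92% of the pairs are for training, with the remainder divided between development and test.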

Provided by:

OpenDataLab

Created:

2022-06-07

Collected summary

Dataset introduction