Shenzhen Government Similar Question Retrieval Training and Test Sets

Name: Shenzhen Government Similar Question Retrieval Training and Test Sets
Creator: Science Data Bank
Published: 2025-12-02 08:50:59
License: Not specified

Source: DataCite Commons. Updated 2025-12-02; indexed 2026-05-05.

Download link:

https://www.scidb.cn/detail?dataSetId=9d1cb4d43952418285c26be63d5c8397

Description:

This dataset is sourced from the interactive consultation section of the "Shenzhen Government Online" website. Each consultation entry includes the question topic, the detailed content, and the corresponding response.

In the training set, each original crawled question qi and its answer ai are passed to an LLM with the prompt PromptT, which generates a semantically positive sample (a similar question) qi+ and a hard negative sample (a dissimilar question) qi-. Together these form the triplet (qi, qi+, qi-) required for contrastive training. Including the answer gives the LLM additional context and helps bridge the semantic gap between differently worded questions that share the same answer.

The test set does not use triplets. Instead it consists of question pairs (qi, qi') that exhibit stricter semantic equivalence, intended to simulate realistic similar question retrieval. To produce them, a second prompt, PromptS, creates qi' by rewriting the original question qi. Compared with reusing the (qi, qi+) pairs from the training set as test data, this separate generation strategy is reported to significantly reduce bias toward LLM-generated pseudo-data, making the evaluation fairer and more credible.
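As a rough illustration of how the triplets and test pairs described above might be consumed, here is a minimal Python sketch. Everything in it is an assumption rather than part of the dataset: the field names (`anchor`, `positive`, `negative`), the cosine-based margin loss, the margin value, and the recall@1 metric are hypothetical choices, and the embeddings are toy vectors standing in for the output of a real encoder. The dataset description does not state the file format, the loss, or the evaluation metric.

```python
from dataclasses import dataclass
import math


@dataclass
class Triplet:
    """One training example (qi, qi+, qi-); field names are hypothetical."""
    anchor: str
    positive: str
    negative: str


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss; zero once sim(anchor, positive) exceeds
    sim(anchor, negative) by at least `margin` (assumed value)."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def recall_at_1(query_vecs, candidate_vecs):
    """Fraction of queries whose most similar candidate sits at the same index,
    i.e. the rewritten question qi' retrieves its original qi."""
    hits = 0
    for i, q in enumerate(query_vecs):
        best = max(range(len(candidate_vecs)),
                   key=lambda j: cosine(q, candidate_vecs[j]))
        hits += best == i
    return hits / len(query_vecs)
```

With a real encoder, the three strings of a `Triplet` would be embedded and passed to `triplet_margin_loss` during training, while `recall_at_1` would be computed over the embedded (qi', qi) pairs of the test set.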

Provider:

Science Data Bank

Created:

2025-12-02
