
Shenzhen Government Similar Question Retrieval Training and Test Sets

Source: DataCite Commons. Updated: 2025-12-02. Indexed: 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=9d1cb4d43952418285c26be63d5c8397
Description:
This dataset is sourced from the interactive consultation section of the "Shenzhen Government Online" website. Each consultation entry includes the question topic, the detailed content, and the corresponding response. In the training set, starting from the original crawled questions qi and their answers ai, PromptT is used to generate semantically positive samples (similar questions) qi+ and hard negative samples (dissimilar questions) qi-, forming the complete triplets (qi, qi+, qi-) required for contrastive training. Including the answer gives the LLM additional background context and helps bridge the semantic gap between different questions that share the same answer. The test set does not use triplets. Instead, it consists of question pairs (qi, qi') that exhibit stricter semantic equivalence, in order to simulate realistic similar question retrieval scenarios. To this end, PromptS is designed to create qi' by rewriting the original question qi. Compared with directly reusing the original question and similar question pairs (qi, qi+) from the training set as test data, this generation strategy reduces bias toward LLM-generated pseudo-data, thereby improving the fairness and credibility of the evaluation.
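As a rough illustration of how the two splits described above might be consumed, the sketch below models a training triplet (qi, qi+, qi-) and a test pair (qi, qi'). The description does not state which loss or evaluation metric is used, so the cosine-based triplet margin loss and the recall@k function are assumptions chosen because they are common for contrastive retrieval. The class names, field names, margin value, and toy vectors are hypothetical and do not reflect the dataset's actual file schema.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One training example (hypothetical field names)."""
    anchor: str    # original question q_i
    positive: str  # generated similar question q_i+
    negative: str  # generated hard negative q_i-


@dataclass(frozen=True)
class TestPair:
    """One test example (hypothetical field names)."""
    query: str     # rewritten question q_i'
    target: str    # original question q_i


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_margin_loss(anchor_vec, pos_vec, neg_vec, margin=0.2):
    """Assumed loss: zero once sim(anchor, pos) exceeds sim(anchor, neg) by `margin`."""
    return max(0.0, margin - cosine(anchor_vec, pos_vec) + cosine(anchor_vec, neg_vec))


def recall_at_k(query_vecs, corpus_vecs, gold_indices, k=1):
    """Assumed metric: fraction of queries whose gold question ranks in the top k."""
    hits = 0
    for query_vec, gold in zip(query_vecs, gold_indices):
        ranked = sorted(
            range(len(corpus_vecs)),
            key=lambda j: cosine(query_vec, corpus_vecs[j]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


if __name__ == "__main__":
    # Toy 2-d "embeddings" standing in for encoder outputs (invented values).
    easy = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
    hard = triplet_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
    print(easy, hard)

    queries = [[1.0, 0.0], [0.0, 1.0]]      # vectors for rewritten questions q_i'
    corpus = [[0.9, 0.1], [0.1, 0.9]]       # vectors for original questions q_i
    print(recall_at_k(queries, corpus, gold_indices=[0, 1], k=1))
```

In practice the vectors would come from a sentence encoder, and the loss would be computed on batched tensors rather than Python lists. The pure-Python version is only meant to make the roles of qi, qi+, qi-, and qi' concrete.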
Provider:
Science Data Bank
Created:
2025-12-02