ProCQA: A Community-Based Programming Question Answering Dataset
Source: HyperAI (超神经). Updated 2024-04-11. Indexed 2024-05-15.
Download link:
https://hyper.ai/cn/datasets/30703
Overview:
ProCQA is a large-scale programming question answering dataset created by Beihang University. It contains approximately 5 million question-answer pairs spanning 11 programming languages, including Python, Java, and JavaScript. The pairs cover several knowledge areas, such as algorithms, frameworks, and library usage. The data comes from the StackOverflow community and was collected by web crawling. To ensure data quality and fairness, the researchers applied strict rule-based filtering, which includes discarding question-answer pairs that are too short or too long and keeping only answers accepted by the question asker. The pairs are naturally structured and mixed-modal: text and code are interleaved within the question and answer fields, which gives models a natural supervision signal for aligning the two modalities. The dataset can serve as both an evaluation benchmark and a pre-training corpus, making it a useful resource for code retrieval and question answering tasks.
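The rule-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`question`, `answer`, `language`, `accepted`) and the character thresholds are hypothetical, since the description does not state the actual schema or length limits.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical thresholds; the dataset description does not give the real limits.
MIN_CHARS = 20
MAX_CHARS = 4000


@dataclass
class QAPair:
    """One question-answer record (hypothetical schema for illustration)."""
    question: str   # question text, possibly interleaved with code
    answer: str     # answer text, possibly interleaved with code
    language: str   # programming language tag, e.g. "python"
    accepted: bool  # True if the asker accepted this answer


def within_length(text: str, lo: int = MIN_CHARS, hi: int = MAX_CHARS) -> bool:
    """Return True if the stripped text length lies in [lo, hi]."""
    n = len(text.strip())
    return lo <= n <= hi


def filter_pairs(pairs: Iterable[QAPair]) -> Iterator[QAPair]:
    """Keep only accepted answers whose question and answer pass the length check."""
    for pair in pairs:
        if not pair.accepted:
            continue  # drop answers the asker did not accept
        if not (within_length(pair.question) and within_length(pair.answer)):
            continue  # drop pairs that are too short or too long
        yield pair


if __name__ == "__main__":
    samples = [
        QAPair("How do I reverse a list in Python?",
               "Use `my_list[::-1]` or `list(reversed(my_list))`.",
               "python", True),
        QAPair("Why?", "Because.", "python", True),  # too short
    ]
    for kept in filter_pairs(samples):
        print(kept.language, "|", kept.question)
```

Writing the filter as a generator lets it stream over a large crawl without loading every pair into memory; any real pipeline would tune the thresholds and likely add further rules.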
Created:
2024-04-09
Collected Summary
Dataset Introduction

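Background and Challenges
Background Overview
ProCQA is a large-scale programming question answering dataset containing approximately 5 million question-answer pairs across 11 programming languages. The data comes from StackOverflow and has been strictly filtered. The dataset mixes text and code modalities and is suited to research on and applications of code retrieval and question answering.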
The content above was collected and summarized by 遇见数据集.



