op_dataset.pkl
MARS-Dataset Overview
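Dataset contents
- Main file: `op_dataset.pkl`
- Contents: formatted StackOverflow posts, including questions, answers, and post statistics (such as accepted status, star count, and vote count). For answers that contain code snippets, the program syntax tree structure and sketch are also parsed.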
Data structure
- Data structure: load the file with Python's `pickle` library. The data is laid out as follows:

```python
op_dataset = {
    "DPLYR": [
        (
            # post-level metadata
            {
                "url": str,
                "vote": int,
                "ansr": int,
                "acpt": int,
                "view": int,
                "title": str,
                "tags": list of str,
                "time": float,
            },
            # segments tagged 0 (text) or 1 (code)
            [ (0,txt),(1,code),... ],
            # one dict per answer
            [
                {
                    "acpt": bool,
                    "vote": int,
                    "ansr": [ (0,txt),(1,code),... ],
                    "ansr_parsed": list of dict,
                    "ansr_op": list of tuple,
                },
                {...}, {...}, ...
            ],
        ),
        (...), (...), ...
    ],
    "TIDYR": [ ... ]
}
```
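The layout above can be walked with a few small helpers. The following is a minimal sketch, not part of the dataset's own scripts: the function names (`load_dataset`, `iter_posts`, `code_segments`, `accepted_code`) are hypothetical, and the sample record is made up purely to mirror the documented schema. It assumes that tag `0` marks text and tag `1` marks code, as the `(0,txt),(1,code)` pairs suggest, and it leaves `ansr_parsed` and `ansr_op` empty because their inner format is not described here.

```python
import pickle
from typing import Dict, Iterator, List, Tuple

TEXT, CODE = 0, 1  # segment tags used in the (tag, content) pairs


def load_dataset(path: str) -> Dict[str, list]:
    """Load op_dataset.pkl. Only unpickle files from a source you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def iter_posts(dataset: Dict[str, list], library: str) -> Iterator[Tuple[dict, list, list]]:
    """Yield (metadata, segments, answers) for every post under one library key."""
    for meta, segments, answers in dataset.get(library, []):
        yield meta, segments, answers


def code_segments(segments: List[Tuple[int, str]]) -> List[str]:
    """Keep only the content of segments tagged as code."""
    return [content for tag, content in segments if tag == CODE]


def accepted_code(dataset: Dict[str, list], library: str) -> List[Tuple[str, List[str]]]:
    """Collect (post title, code snippets) for accepted answers that contain code."""
    results = []
    for meta, _segments, answers in iter_posts(dataset, library):
        for answer in answers:
            if answer["acpt"]:
                snippets = code_segments(answer["ansr"])
                if snippets:
                    results.append((meta["title"], snippets))
    return results


# Made-up sample that follows the documented schema (not real dataset content).
sample = {
    "DPLYR": [
        (
            {
                "url": "https://example.invalid/q/1",
                "vote": 3, "ansr": 2, "acpt": 1, "view": 120,
                "title": "Filter rows by group",
                "tags": ["r", "dplyr"],
                "time": 0.0,
            },
            [(TEXT, "How do I filter rows?"), (CODE, "df <- data.frame(x = 1:3)")],
            [
                {
                    "acpt": True, "vote": 5,
                    "ansr": [(TEXT, "Use filter:"), (CODE, "filter(df, x > 1)")],
                    "ansr_parsed": [], "ansr_op": [],  # inner format not documented here
                },
                {
                    "acpt": False, "vote": 1,
                    "ansr": [(TEXT, "Try base R subsetting instead.")],
                    "ansr_parsed": [], "ansr_op": [],
                },
            ],
        ),
    ],
    "TIDYR": [],
}

# Round-trip through pickle to stand in for reading the real file from disk.
dataset = pickle.loads(pickle.dumps(sample))
print(accepted_code(dataset, "DPLYR"))
```

With the real file, `dataset = load_dataset("op_dataset.pkl")` would replace the round-trip line. Because `pickle` can execute arbitrary code while loading, the file should only be opened when it comes from a trusted source.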
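Dataset generation scripts
- Scripts: `meta_scraper.py`, `content_scraper.py`, `code_parser.py`, `op_extractor.py`.
- Purpose: search for and collect posts from StackOverflow, parse code snippets, and extract valid components.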
Citation
- If you use this dataset, please cite the following paper:

@inproceedings{Chen:2019:MMS:3338906.3338951,
  author = {Chen, Yanju and Martins, Ruben and Feng, Yu},
  title = {Maximal Multi-layer Specification Synthesis},
  booktitle = {Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  series = {ESEC/FSE 2019},
  year = {2019},
  isbn = {978-1-4503-5572-8},
  location = {Tallinn, Estonia},
  pages = {602--612},
  numpages = {11},
  url = {http://doi.acm.org/10.1145/3338906.3338951},
  doi = {10.1145/3338906.3338951},
  acmid = {3338951},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {Max-SMT, machine learning, neural networks, program synthesis},
}




