LukasSonn/DoxygenStrings-Long
Source: Hugging Face. Updated 2023-11-06; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/LukasSonn/DoxygenStrings-Long
Resource description:
---
license: apache-2.0
---
# Dataset Info: C++ + Natural Description -> Doxygen Documentation
This dataset was created for my bachelor's thesis, which investigates how LLMs can be fine-tuned to generate Doxygen documentation. It was built from the “Source code analysis dataset”
by Gelman, Banjo Obayomi, Jessica Moore and David Slater (doi: 10.1016/j.dib.2019.104712).
The following SQL statement was used to pull the raw data from that dataset:
```
SELECT * FROM all_data
WHERE LENGTH(comment) > 300 AND LENGTH(code) > 100 AND LENGTH(code) < 80
AND code NOT LIKE '%//%' AND code NOT LIKE '%/*%' AND code NOT LIKE '%*/%'
AND filename LIKE '%.cpp%'
LIMIT 12000
```
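The selection step can be tried on a toy table with Python's built-in `sqlite3` module. This is only a sketch: the table and column names come from the query above, but the real schema of the source dataset is not shown in this card, so the `CREATE TABLE` statement and the sample rows are assumptions. Note that the two code-length conditions as written (`> 100` and `< 80`) cannot both hold, so the sketch takes the upper bound as a parameter instead of guessing the intended value.

```python
import sqlite3

# Same filters as the query above, except that the upper bound on the code
# length is a bind parameter (the value as written, 80, conflicts with > 100).
QUERY = """
SELECT * FROM all_data
WHERE LENGTH(comment) > 300 AND LENGTH(code) > 100 AND LENGTH(code) < ?
AND code NOT LIKE '%//%' AND code NOT LIKE '%/*%' AND code NOT LIKE '%*/%'
AND filename LIKE '%.cpp%'
LIMIT 12000
"""


def select_rows(conn, max_code_len):
    """Run the selection query with a caller-chosen upper bound on code length."""
    return conn.execute(QUERY, (max_code_len,)).fetchall()


def build_toy_db():
    """Create an in-memory table with an assumed three-column schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE all_data (filename TEXT, comment TEXT, code TEXT)")
    long_comment = "x" * 301
    clean_code = "int add(int a, int b) {\n" + "    a += b;\n" * 10 + "    return a;\n}"
    rows = [
        ("math.cpp", long_comment, clean_code),                # passes every filter
        ("math.cpp", long_comment, clean_code + " // note"),   # contains a comment
        ("math.cpp", "short", clean_code),                     # description too short
        ("math.py", long_comment, clean_code),                 # not a .cpp file
    ]
    conn.executemany("INSERT INTO all_data VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = build_toy_db()
    for bound in (80, 1000):  # 1000 is an arbitrary illustrative value
        print(bound, len(select_rows(conn, bound)))
```

With the bound at 80 the toy table yields no rows, because no string is both longer than 100 and shorter than 80 characters. Any larger bound lets the first sample row through.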
After selecting the data, Code LLaMa Instruct 34B was tasked with combining the human-written description of the functionality and the function code into a Doxygen comment. Any results that included the sample Doxygen string, or no Doxygen string at all, were filtered from the set.
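A post-filter along these lines might look as follows. It is a minimal sketch, not the script used for the thesis. The sample string, the regular expressions, and the function names are all hypothetical, since the card does not give the prompt or the exact detection rule. Here a result counts as containing a Doxygen string if it has a `/** ... */` or `/*! ... */` block with at least one common Doxygen command.

```python
import re

# Hypothetical stand-in for the example comment shown to the model in the
# prompt; the actual sample string is not given in this card.
SAMPLE_DOXYGEN = "/**\n * @brief Example brief.\n */"

# Assumption: a Doxygen comment is a /** ... */ or /*! ... */ block that
# contains at least one command such as @brief or \param.
DOXYGEN_BLOCK = re.compile(r"/\*[*!].*?\*/", re.DOTALL)
DOXYGEN_COMMAND = re.compile(r"[@\\](brief|param|return|returns|details)\b")


def has_doxygen(text):
    """Return True if any comment block in text carries a Doxygen command."""
    return any(
        DOXYGEN_COMMAND.search(match.group(0))
        for match in DOXYGEN_BLOCK.finditer(text)
    )


def keep_result(text):
    """Keep a model output only if it has a Doxygen block and did not copy the sample."""
    return has_doxygen(text) and SAMPLE_DOXYGEN not in text


if __name__ == "__main__":
    outputs = [
        "/**\n * @brief Adds two integers.\n * @param a First operand.\n */\nint add(int a, int b);",
        "int add(int a, int b);",
        SAMPLE_DOXYGEN + "\nint f();",
    ]
    kept = [out for out in outputs if keep_result(out)]
    print(len(kept))
```

An exact substring match only catches verbatim copies of the sample. A fuzzier comparison would be needed to catch lightly edited ones.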
Provider:
LukasSonn
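Summary of the original information
Dataset info: C++ code and natural description to Doxygen documentation
Dataset source
- The dataset is based on the “Source code analysis dataset” by Gelman, Banjo Obayomi, Jessica Moore and David Slater (doi: 10.1016/j.dib.2019.104712).
Data selection criteria
- Rows were selected from the source dataset with the SQL statement: `SELECT * FROM all_data WHERE LENGTH(comment) > 300 AND LENGTH(code) > 100 AND LENGTH(code) < 80 AND code NOT LIKE '%//%' AND code NOT LIKE '%/*%' AND code NOT LIKE '%*/%' AND filename LIKE '%.cpp%' LIMIT 12000`
Data processing
- The Code LLaMa Instruct 34B model combined the human-written description of the functionality with the function code to generate a Doxygen comment.
- Results that contained the sample Doxygen string, or no Doxygen string at all, were filtered out.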



