unum-cloud/ann-codesearch-4m
Hugging Face · Updated 2024-05-12 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/unum-cloud/ann-codesearch-4m
Description:
## Cleaning
Unlike in the original dataset, the `func_code_string` column has been updated to remove all comments and docstrings, keeping only the code.
The original version is still available in the `whole_func_string` column.
```py
import re


def remove_comments_docstrings(code, language):
    if language == 'python':
        # Remove docstrings
        code = re.sub(r'"""(.*?)"""', '', code, flags=re.DOTALL)
        code = re.sub(r"'''(.*?)'''", '', code, flags=re.DOTALL)
        # Remove comments
        code = re.sub(r'#.*', '', code)
    elif language == 'java' or language == 'javascript':
        # Remove multiline comments
        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
        # Remove single line comments
        code = re.sub(r'//.*', '', code)
    elif language == 'go':
        # Similar to Java/Javascript
        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
        code = re.sub(r'//.*', '', code)
    elif language == 'ruby':
        # Remove multiline comments
        code = re.sub(r'=begin.*?=end', '', code, flags=re.DOTALL)
        # Remove single line comments
        code = re.sub(r'#.*', '', code)
    elif language == 'php':
        # Remove multiline comments
        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
        # Remove single line and hash comments
        code = re.sub(r'//.*', '', code)
        code = re.sub(r'#.*', '', code)
    return code.strip()
```
The validity of that snippet can be tested with the following example:
```py
# Example DataFrame
import pandas as pd

example = {
    'language': ['python', 'java', 'javascript', 'go', 'ruby', 'php'],
    'func_code_string': [
        '"""Example docstring""" def foo(): # This is a comment\n return 1',
        '/** Java doc */ public class Test { // Comment\n public void method() {} }',
        '/* JS doc */ function test() { // Comment\n return true; }',
        '/* Go doc */ package main // Import comment\nimport "fmt"',
        '=begin Ruby doc =end def foo # Comment\n 1 + 1 end',
        '<?php /* PHP doc */ // Comment\necho "Hello"; # Another comment ?>'
    ]}
example_df = pd.DataFrame(example)
example_df['cleaned_code'] = example_df.apply(
    lambda x: remove_comments_docstrings(x['func_code_string'], x['language']),
    axis=1,
)
print(example_df[['language', 'cleaned_code']])
```
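The example above prints the cleaned strings for visual inspection. As a minimal sketch, the same check can be written with assertions. To stay self-contained, the block restates the substitutions as a lookup table; `C_STYLE`, `PATTERNS`, `clean`, and `cases` are hypothetical names that do not appear in the dataset card, and the expected strings were derived by applying the regexes above by hand.

```python
import re

# Hypothetical table-driven restatement of remove_comments_docstrings:
# each language maps to the (pattern, flags) pairs applied in order.
C_STYLE = [(r'/\*.*?\*/', re.DOTALL), (r'//.*', 0)]
PATTERNS = {
    'python': [(r'"""(.*?)"""', re.DOTALL),
               (r"'''(.*?)'''", re.DOTALL),
               (r'#.*', 0)],
    'java': C_STYLE,
    'javascript': C_STYLE,
    'go': C_STYLE,
    'ruby': [(r'=begin.*?=end', re.DOTALL), (r'#.*', 0)],
    'php': C_STYLE + [(r'#.*', 0)],
}


def clean(code, language):
    # Unknown languages fall through with only whitespace stripped,
    # matching the behaviour of the original if/elif chain.
    for pattern, flags in PATTERNS.get(language, []):
        code = re.sub(pattern, '', code, flags=flags)
    return code.strip()


# (language, raw input, expected cleaned output)
cases = [
    ('python',
     '"""Example docstring""" def foo(): # This is a comment\n return 1',
     'def foo(): \n return 1'),
    ('go',
     '/* Go doc */ package main // Import comment\nimport "fmt"',
     'package main \nimport "fmt"'),
    ('ruby',
     '=begin Ruby doc =end def foo # Comment\n 1 + 1 end',
     'def foo \n 1 + 1 end'),
]

for language, raw, expected in cases:
    assert clean(raw, language) == expected, language
```

Only the leading and trailing whitespace is stripped, so the space left in front of a removed trailing comment stays in the cleaned code.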
Provider:
unum-cloud
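Summary of the original information
Dataset cleaning notes
What was cleaned
- The `func_code_string` column was updated to remove all comments, keeping only the code.
- The original code, including comments, is available in the `whole_func_string` column.
Cleaning method
- Regular expressions remove comments and docstrings, with a separate set of patterns per programming language.
- Supported languages: Python, Java, JavaScript, Go, Ruby, PHP.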
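Because the cleaning relies on regular expressions rather than a parser, the patterns cannot tell a comment marker from the same character inside a string literal. The sketch below illustrates this for the Python patterns; `strip_python_comments` and `sample` are hypothetical names that repeat the three Python substitutions from the snippet above, not part of the dataset tooling.

```python
import re


def strip_python_comments(code):
    # The same three substitutions as the 'python' branch of
    # remove_comments_docstrings (hypothetical standalone copy).
    code = re.sub(r'"""(.*?)"""', '', code, flags=re.DOTALL)
    code = re.sub(r"'''(.*?)'''", '', code, flags=re.DOTALL)
    code = re.sub(r'#.*', '', code)
    return code.strip()


sample = 'def tag(name):\n    return "#" + name  # prefix with a hash'
cleaned = strip_python_comments(sample)

# The '#' inside the string literal is treated as the start of a comment,
# so everything after the opening quote on that line is removed.
assert cleaned == 'def tag(name):\n    return "'
```

Rows in `func_code_string` that look truncated in this way can be compared against `whole_func_string`, which keeps the unmodified source.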
Example
- An example DataFrame shows how to apply the cleaning function `remove_comments_docstrings` to code in each of the supported languages.



