unum-cloud/ann-codesearch-4m

Name: unum-cloud/ann-codesearch-4m
Creator: unum-cloud
Published: 2024-05-12 04:55:43
License: not specified

Source: Hugging Face. Updated 2024-05-12; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/unum-cloud/ann-codesearch-4m

Resource description:

## Cleaning

Unlike the original dataset, the `func_code_string` column was updated to remove any comments and keep just the code. The original version can still be found in the `whole_func_string`.

```py
import re

def remove_comments_docstrings(code, language):
    if language == 'python':
        # Remove docstrings
        code = re.sub(r'"""(.*?)"""', '', code, flags=re.DOTALL)
        code = re.sub(r"'''(.*?)'''", '', code, flags=re.DOTALL)
        # Remove comments
        code = re.sub(r'#.*', '', code)
    elif language == 'java' or language == 'javascript':
        # Remove multiline comments
        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
        # Remove single line comments
        code = re.sub(r'//.*', '', code)
    elif language == 'go':
        # Similar to Java/Javascript
        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
        code = re.sub(r'//.*', '', code)
    elif language == 'ruby':
        # Remove multiline comments
        code = re.sub(r'=begin.*?=end', '', code, flags=re.DOTALL)
        # Remove single line comments
        code = re.sub(r'#.*', '', code)
    elif language == 'php':
        # Remove multiline comments
        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
        # Remove single line and hash comments
        code = re.sub(r'//.*', '', code)
        code = re.sub(r'#.*', '', code)
    return code.strip()
```

The validity of that snippet can be tested with the following example:

```py
# Example DataFrame
import pandas as pd

example = {
    'language': ['python', 'java', 'javascript', 'go', 'ruby', 'php'],
    'func_code_string': [
        '"""Example docstring""" def foo(): # This is a comment\n return 1',
        '/** Java doc */ public class Test { // Comment\n public void method() {} }',
        '/* JS doc */ function test() { // Comment\n return true; }',
        '/* Go doc */ package main // Import comment\nimport "fmt"',
        '=begin Ruby doc =end def foo # Comment\n 1 + 1 end',
        '<?php /* PHP doc */ // Comment\necho "Hello"; # Another comment ?>'
    ]}
example_df = pd.DataFrame(example)
example_df['cleaned_code'] = example_df.apply(lambda x: remove_comments_docstrings(x['func_code_string'], x['language']), axis=1)
print(example_df[['language', 'cleaned_code']])
```

Provided by:

unum-cloud

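Summary of the original information

Dataset cleaning notes

What was cleaned

The `func_code_string` column has been updated: all comments were removed, leaving only the code.
The original code, including comments, is still available in the `whole_func_string` column.

Cleaning method

Comments and docstrings are removed with regular expressions tailored to each programming language.
Supported languages: Python, Java, JavaScript, Go, Ruby, PHP.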
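
The per-language regex cleaning described above can also be sketched as a lookup table. The following is an illustrative restructuring, not the dataset author's code: the names `COMMENT_PATTERNS` and `strip_comments` are hypothetical, and the sketch assumes the same patterns, applied in the same order, as the `remove_comments_docstrings` function quoted in the resource description.

```python
import re

# Shared (pattern, flags) pairs; the names are illustrative only.
BLOCK_COMMENT = (r'/\*.*?\*/', re.DOTALL)   # /* ... */ spanning lines
SLASH_COMMENT = (r'//.*', 0)                # // to end of line
HASH_COMMENT = (r'#.*', 0)                  # # to end of line

# Hypothetical table: language -> patterns applied in order.
COMMENT_PATTERNS = {
    'python': [(r'"""(.*?)"""', re.DOTALL),
               (r"'''(.*?)'''", re.DOTALL),
               HASH_COMMENT],
    'java': [BLOCK_COMMENT, SLASH_COMMENT],
    'javascript': [BLOCK_COMMENT, SLASH_COMMENT],
    'go': [BLOCK_COMMENT, SLASH_COMMENT],
    'ruby': [(r'=begin.*?=end', re.DOTALL), HASH_COMMENT],
    'php': [BLOCK_COMMENT, SLASH_COMMENT, HASH_COMMENT],
}

def strip_comments(code, language):
    """Apply each pattern for `language` in turn; unknown languages pass through."""
    for pattern, flags in COMMENT_PATTERNS.get(language, []):
        code = re.sub(pattern, '', code, flags=flags)
    return code.strip()

print(strip_comments('/* doc */ int x = 1; // note', 'java'))
print(strip_comments('x = 1  # set x', 'python'))
```

Keeping the patterns in a table makes it easier to see which rules each language shares. Because plain regular expressions do not track string literals, a comment marker inside a string is treated as a comment too: `strip_comments('url = "http://example.com"', 'javascript')` cuts the line at `//`.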

Example

An example DataFrame is provided to show how the cleaning function `remove_comments_docstrings` is applied to code in each of the supported languages.
