semeru/code-text-java

Name: semeru/code-text-java
Creator: semeru
Published: 2023-03-23 20:10:55
License: MIT

Source: Hugging Face. Updated: 2023-03-23. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/semeru/code-text-java

Resource description:

---
license: mit
Programminglanguage: "Java"
version: "N/A"
Date: "Codesearchnet(Jun 2020 - paper release date)"
Contaminated: "Very Likely"
Size: "Standard Tokenizer (TreeSitter)"
---

### Dataset is imported from CodeXGLUE and pre-processed using their script.

# Where to find in Semeru:

The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/java in Semeru

# CodeXGLUE -- Code-To-Text

## Task Definition

The task is to generate natural language comments for a given piece of code. Output is evaluated by the [smoothed BLEU-4](https://www.aclweb.org/anthology/C04-1072.pdf) score.

## Dataset

The dataset comes from [CodeSearchNet](https://arxiv.org/pdf/1909.09436.pdf) and is filtered as follows:

- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation has fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose documentation is not in English.

### Data Format

After preprocessing the dataset, you obtain three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.

In each file, each line of the uncompressed file represents one function. The fields of one row are described below.

- **repo:** the owner/repo
- **path:** the full path to the original file
- **func_name:** the function or method name
- **original_string:** the raw string before tokenization or parsing
- **language:** the programming language
- **code/function:** the part of the `original_string` that is code
- **code_tokens/function_tokens:** tokenized version of `code`
- **docstring:** the top-level comment or docstring, if it exists in the original string
- **docstring_tokens:** tokenized version of `docstring`

### Data Statistic

| Programming Language | Training |  Dev   |  Test  |
| :------------------- | :------: | :----: | :----: |
| Java                 | 164,923  | 5,183  | 10,955 |

## Reference

<pre><code>@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}</code></pre>

Provided by:

semeru

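Summary of original information

CodeXGLUE -- Code-To-Text

Task definition

The task is to generate natural language comments for code; results are evaluated with the smoothed BLEU-4 score.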
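
To make the metric concrete, here is a minimal sketch of a sentence-level BLEU-4 with add-one smoothing on the 2-gram to 4-gram precisions. It assumes whitespace tokenization and a single reference per hypothesis; the function names `ngram_counts` and `smoothed_bleu4` are illustrative, and the result is not guaranteed to match the official CodeXGLUE evaluation script digit for digit.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing for n > 1.

    Assumption: both inputs are plain strings tokenized on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref or not hyp:
        return 0.0

    log_precision = 0.0
    for n in range(1, 5):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        smooth = 1 if n > 1 else 0  # unigrams are left unsmoothed
        if matched + smooth == 0:
            return 0.0  # no unigram overlap at all
        log_precision += math.log(matched + smooth) - math.log(total + smooth)

    # Brevity penalty (in log space): penalize hypotheses shorter than the reference.
    log_bp = min(0.0, 1.0 - len(ref) / len(hyp))
    return math.exp(log_precision / 4 + log_bp)
```

For example, `smoothed_bleu4("returns the sum of two numbers", "returns the sum of two numbers")` evaluates to 1.0, while a hypothesis that shares no word with the reference scores 0.0.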

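Dataset

The dataset comes from CodeSearchNet and is filtered as follows:

Remove examples whose code cannot be parsed into an abstract syntax tree.
Remove examples whose documentation has fewer than 3 or more than 256 tokens.
Remove examples whose documentation contains special tokens (e.g. <img ...> or https:...).
Remove examples whose documentation is not in English.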
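
The documentation-side filters above can be approximated as follows. This is a rough sketch, not the CodeXGLUE preprocessing script: the name `keep_example` and the regular expression are illustrative, the English check is replaced by a crude ASCII-only heuristic (an assumption), and the abstract-syntax-tree check is omitted because it requires a Java parser.

```python
import re

# Assumption: "special tokens" are approximated by an <img tag or an http(s) URL
# prefix; optional whitespace before the colon covers tokenized URLs.
SPECIAL_PATTERN = re.compile(r"<\s*img\b|https?\s*:", re.IGNORECASE)


def keep_example(docstring_tokens):
    """Return True if a tokenized docstring passes the documentation filters."""
    n = len(docstring_tokens)
    if n < 3 or n > 256:
        return False  # too short or too long
    text = " ".join(docstring_tokens)
    if SPECIAL_PATTERN.search(text):
        return False  # contains an image tag or a URL
    if not text.isascii():
        return False  # crude stand-in for "not English"
    return True
```

A stricter language check would need a language-identification step, which the source does not specify.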

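Data format

After preprocessing, you obtain three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.

In each file, each line represents one function. Each line contains the following fields:

repo: the owner/repo
path: the full path to the original file
func_name: the function or method name
original_string: the raw string before tokenization or parsing
language: the programming language
code/function: the part of original_string that is code
code_tokens/function_tokens: tokenized version of code
docstring: the top-level comment or docstring, if it exists in the original string
docstring_tokens: tokenized version of docstring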
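
Since every line is a standalone JSON object, the files can be read with the standard library alone. The sketch below assumes the field names listed above; `iter_records` and `to_pair` are illustrative helper names, and the sample record is made up for demonstration rather than taken from the dataset.

```python
import json


def iter_records(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_pair(record):
    """Extract a (code, docstring) pair; the code field may be named
    either "code" or "function" according to the field list."""
    code = record.get("code", record.get("function"))
    return code, record.get("docstring")


# Hypothetical sample line, for illustration only.
sample_lines = [
    '{"repo": "owner/repo", "func_name": "MathUtil.add", "language": "java", '
    '"code": "int add(int a, int b) { return a + b; }", '
    '"docstring": "Adds two integers."}',
]

pairs = [to_pair(record) for record in iter_records(sample_lines)]
```

To process a real split, pass an open file handle instead of the sample list, for example `with open("train.jsonl", encoding="utf-8") as f: pairs = [to_pair(r) for r in iter_records(f)]`.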

Data statistics

Programming language	Training	Dev	Test
Java	164,923	5,183	10,955
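
As a quick sanity check on these counts, the split sizes and their shares of the whole can be computed directly. The dictionary keys and the helper name `split_shares` are illustrative.

```python
# Split sizes for Java, copied from the statistics table.
splits = {"train": 164_923, "dev": 5_183, "test": 10_955}


def split_shares(counts):
    """Return each split's fraction of the total number of examples."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


total = sum(splits.values())
shares = split_shares(splits)

for name, share in shares.items():
    print(f"{name}: {splits[name]:,} examples ({share:.1%})")
print(f"total: {total:,}")
```

The three splits add up to 181,061 functions, of which roughly 91% are training examples, 3% are dev, and 6% are test.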
