semeru/code-text-java
Source: Hugging Face. Updated 2023-03-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/semeru/code-text-java
Resource description:
---
license: mit
Programminglanguage: "Java"
version: "N/A"
Date: "Codesearchnet(Jun 2020 - paper release date)"
Contaminated: "Very Likely"
Size: "Standard Tokenizer (TreeSitter)"
---
### Dataset is imported from CodeXGLUE and pre-processed using their script.
# Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/java in Semeru
# CodeXGLUE -- Code-To-Text
## Task Definition
The task is to generate natural language comments for a piece of code. Output is evaluated by the [smoothed BLEU-4](https://www.aclweb.org/anthology/C04-1072.pdf) score.
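To make the metric concrete, here is a minimal sketch of a sentence-level BLEU-4 with add-one smoothing on n-gram orders above 1, which is one common form of smoothed BLEU. It assumes a single reference and pre-tokenized inputs; the function names are illustrative, and the official CodeXGLUE evaluation script may differ in tokenization and smoothing details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing for n >= 2.

    Assumptions: one reference per hypothesis, inputs already tokenized.
    """
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:
            # Add-one smoothing keeps higher-order precisions non-zero.
            matches += 1
            total += 1
        if matches == 0:
            return 0.0  # no unigram overlap at all
        log_precision_sum += math.log(matches / total)
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum / 4)
```

A corpus-level score would typically average this value over all test examples.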
## Dataset
The dataset comes from [CodeSearchNet](https://arxiv.org/pdf/1909.09436.pdf) and is filtered as follows:
- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation has fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose documentation is not in English.
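The four filters above can be sketched as a single predicate. This is an illustration under stated assumptions, not the CodeXGLUE preprocessing script: the regular expression, the ASCII-based English check, and the `can_parse` hook (a hypothetical stand-in for a real parser such as tree-sitter) are all placeholders.

```python
import re

# Assumption: this pattern approximates the "special tokens" rule;
# the actual script may match a different set of patterns.
SPECIAL_TOKEN_RE = re.compile(r"<img\b|https?:", re.IGNORECASE)


def keep_example(example, can_parse=lambda code: True):
    """Return True if an example passes all four filters.

    `can_parse` is a hypothetical hook: pass a function that returns
    False when the code cannot be parsed into an abstract syntax tree.
    """
    docstring = example["docstring"]
    doc_tokens = example["docstring_tokens"]
    # Filter 1: the code must parse.
    if not can_parse(example["code"]):
        return False
    # Filter 2: documentation length must be within [3, 256] tokens.
    if len(doc_tokens) < 3 or len(doc_tokens) > 256:
        return False
    # Filter 3: no special tokens such as image tags or URLs.
    if SPECIAL_TOKEN_RE.search(docstring):
        return False
    # Filter 4 (assumption): ASCII-only text as a crude proxy for English.
    if not docstring.isascii():
        return False
    return True
```

A real pipeline would likely replace the ASCII check with a language-identification step, since plenty of non-English text is plain ASCII.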
### Data Format
After preprocessing, you obtain three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.
Each line of an uncompressed file represents one function. The fields of each row are described below.
- **repo:** the owner/repo
- **path:** the full path to the original file
- **func_name:** the function or method name
- **original_string:** the raw string before tokenization or parsing
- **language:** the programming language
- **code/function:** the part of the `original_string` that is code
- **code_tokens/function_tokens:** tokenized version of `code`
- **docstring:** the top-level comment or docstring, if it exists in the original string
- **docstring_tokens:** tokenized version of `docstring`
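As a rough sketch of how these fields might be consumed, the snippet below reads JSON Lines records and turns each into a (code, comment) pair. It assumes the keys are named `code_tokens` and `docstring_tokens`; the helper names `read_jsonl` and `to_pair` are illustrative and not part of the dataset.

```python
import json


def read_jsonl(lines):
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_pair(record):
    """Join the token lists into a (source code, target comment) pair.

    Assumption: the record uses the keys `code_tokens` and
    `docstring_tokens` rather than the `function_*` variants.
    """
    source = " ".join(record["code_tokens"])
    target = " ".join(record["docstring_tokens"])
    return source, target


# Usage with one of the files described above:
# with open("train.jsonl", encoding="utf-8") as f:
#     pairs = [to_pair(record) for record in read_jsonl(f)]
```

Reading line by line keeps memory use flat, which matters for the training split.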
### Data Statistics
| Programming Language | Training | Dev | Test |
| :------------------- | :------: | :----: | :----: |
| Java | 164,923 | 5,183 | 10,955 |
## Reference
<pre><code>@article{husain2019codesearchnet,
title={Codesearchnet challenge: Evaluating the state of semantic code search},
author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
journal={arXiv preprint arXiv:1909.09436},
year={2019}
}</code></pre>
Provided by:
semeru



