semeru/code-text-ruby
Source: Hugging Face. Updated 2023-03-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/semeru/code-text-ruby
Resource description:
---
license: mit
Programminglanguage: "ruby"
version: "N/A"
Date: "Codesearchnet(Jun 2020 - paper release date)"
Contaminated: "Very Likely"
Size: "Standard Tokenizer (TreeSitter)"
---
### Dataset is imported from CodeXGLUE and pre-processed using their script.
# Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/ruby in Semeru
# CodeXGLUE -- Code-To-Text
## Task Definition
The task is to generate natural-language comments for a piece of code. Outputs are evaluated with the [smoothed BLEU-4](https://www.aclweb.org/anthology/C04-1072.pdf) score.
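To make the metric concrete, here is a minimal sketch of a sentence-level BLEU-4 with add-one smoothing on the 2-gram to 4-gram precisions. It is an illustration only: the function names `ngrams` and `smoothed_bleu4` are hypothetical, and the official evaluation script may differ in tokenization, smoothing, and brevity-penalty details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing for n >= 2 (assumed variant)."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        ref_counts = ngrams(reference, n)
        hyp_counts = ngrams(hypothesis, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:
            # Add-one smoothing keeps higher-order precisions non-zero.
            matches += 1
            total += 1
        if matches == 0:
            return 0.0  # no unigram overlap at all
        log_precision_sum += math.log(matches / total)
    # Brevity penalty punishes hypotheses shorter than the reference.
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity_penalty * math.exp(log_precision_sum / 4)
```

A generated comment identical to the reference scores 1.0, a comment with no shared tokens scores 0.0, and partial overlaps fall in between.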
## Dataset
The dataset comes from [CodeSearchNet](https://arxiv.org/pdf/1909.09436.pdf) and is filtered as follows:
- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation has fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose documentation is not in English.
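Two of the filters above, the token-count bounds and the special-token check, can be sketched in a few lines. This is not the CodeXGLUE preprocessing script: the function name `keep_example` and the regular expression are assumptions made for illustration, and the AST-parsing and English-language filters are left out because they need external tools. The `docstring` and `docstring_tokens` keys are the fields listed in this card's data format.

```python
import re

# Assumed reading of "special tokens": an HTML-like tag such as <img ...>
# or a URL scheme such as https: (http: is included as well).
SPECIAL_TOKEN_RE = re.compile(r"<[a-zA-Z][^>]*>|https?:")


def keep_example(example, min_tokens=3, max_tokens=256):
    """Return True if the example passes the length and special-token filters."""
    doc_tokens = example["docstring_tokens"]
    # Drop documentation with fewer than 3 or more than 256 tokens.
    if len(doc_tokens) < min_tokens or len(doc_tokens) > max_tokens:
        return False
    # Drop documentation that contains markup or links.
    if SPECIAL_TOKEN_RE.search(example["docstring"]):
        return False
    return True
```

Applied to a list of records, `[ex for ex in examples if keep_example(ex)]` keeps only the examples that pass both checks.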
### Data Format
After preprocessing the dataset, you obtain three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.
Each line of each uncompressed file represents one function. The fields of one row are described below.
- **repo:** the owner/repo
- **path:** the full path to the original file
- **func_name:** the function or method name
- **original_string:** the raw string before tokenization or parsing
- **language:** the programming language
- **code/function:** the part of the `original_string` that is code
- **code_tokens/function_tokens:** tokenized version of `code`
- **docstring:** the top-level comment or docstring, if it exists in the original string
- **docstring_tokens:** tokenized version of `docstring`
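As a sketch of how the fields above might be consumed, the following reads a .jsonl file line by line and joins the token lists into a (code, summary) pair. The helper names `read_jsonl` and `to_pair` are hypothetical, and the code assumes the token fields are stored under the keys `code_tokens` and `docstring_tokens`; if a file uses `function_tokens` instead, adjust the key accordingly.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def to_pair(record):
    """Join the token lists into a (code, summary) pair of strings."""
    # Assumption: the keys are "code_tokens" and "docstring_tokens".
    code = " ".join(record["code_tokens"])
    summary = " ".join(record["docstring_tokens"])
    return code, summary
```

For example, `pairs = [to_pair(r) for r in read_jsonl("train.jsonl")]` would build the source/target pairs for a summarization model, assuming `train.jsonl` is in the working directory.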
### Data Statistics
| Programming Language | Training | Dev | Test |
| :------------------- | :------: | :----: | :----: |
| Ruby | 24,927 | 1,400 | 1,261 |
## Reference
<pre><code>@article{husain2019codesearchnet,
title={Codesearchnet challenge: Evaluating the state of semantic code search},
author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
journal={arXiv preprint arXiv:1909.09436},
year={2019}
}</code></pre>
Provided by:
semeru
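Summary of the original information
Dataset overview
Dataset source and processing
- The dataset comes from CodeSearchNet and is pre-processed with the CodeXGLUE script.
Dataset storage location
- In the Semeru system, the dataset is located at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/ruby.
Task definition
- The goal is to generate natural-language comments for code; the evaluation metric is the smoothed BLEU-4 score.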
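Dataset filtering criteria
- Remove code examples that cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation has fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose documentation is not in English.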
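Data format
- After preprocessing, the dataset consists of three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.
- Each line of each file represents one function and contains the following fields:
  - repo: the repository owner/repository name
  - path: the full path to the original file
  - func_name: the function or method name
  - original_string: the raw string before tokenization or parsing
  - language: the programming language
  - code/function: the part of original_string that is code
  - code_tokens/function_tokens: tokenized version of code
  - docstring: the top-level comment or docstring in the original string, if it exists
  - docstring_tokens: tokenized version of docstring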
Data statistics
| Programming language | Training | Dev | Test |
|---|---|---|---|
| Ruby | 24,927 | 1,400 | 1,261 |



