semeru/code-text-ruby

Name: semeru/code-text-ruby
Creator: semeru
Published: 2023-03-23 20:02:18
License: MIT (per the dataset card metadata; the listing itself gives no description)

Source: Hugging Face. Updated: 2023-03-23. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/semeru/code-text-ruby


Resource description:

---
license: mit
Programminglanguage: "ruby"
version: "N/A"
Date: "Codesearchnet(Jun 2020 - paper release date)"
Contaminated: "Very Likely"
Size: "Standard Tokenizer (TreeSitter)"
---

### Dataset is imported from CodeXGLUE and pre-processed using their script.

# Where to find in Semeru:

The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/ruby in Semeru

# CodeXGLUE -- Code-To-Text

## Task Definition

The task is to generate natural language comments for code, evaluated by the [smoothed BLEU-4](https://www.aclweb.org/anthology/C04-1072.pdf) score.

## Dataset

The dataset comes from [CodeSearchNet](https://arxiv.org/pdf/1909.09436.pdf) and is filtered as follows:

- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation has fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose documentation is not in English.

### Data Format

After preprocessing the dataset, you obtain three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.

In each file, each line of the uncompressed file represents one function. The fields of one row are described below.

- **repo:** the owner/repo
- **path:** the full path to the original file
- **func_name:** the function or method name
- **original_string:** the raw string before tokenization or parsing
- **language:** the programming language
- **code/function:** the part of the `original_string` that is code
- **code_tokens/function_tokens:** tokenized version of `code`
- **docstring:** the top-level comment or docstring, if it exists in the original string
- **docstring_tokens:** tokenized version of `docstring`

### Data Statistics

| Programming Language | Training | Dev | Test |
| :------------------- | :------: | :----: | :----: |
| Ruby | 24,927 | 1,400 | 1,261 |

## Reference

<pre><code>@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}</code></pre>

Provider:

semeru

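Summary of original information

Dataset overview

Dataset source and processing

The dataset comes from CodeSearchNet and was pre-processed with the CodeXGLUE script.

Dataset storage location

On the Semeru system, the dataset is located at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/ruby.

Task definition

The task is to generate natural-language comments for code; the evaluation metric is the smoothed BLEU-4 score.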
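
To make the metric concrete, here is a minimal sketch of a sentence-level BLEU-4 with add-one smoothing on the 2-gram to 4-gram precisions, one common smoothing variant. It is an illustration only: the function names `ngrams` and `smoothed_bleu4` are hypothetical, a single reference is assumed, and the CodeXGLUE evaluation script may differ in tokenization and smoothing details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(candidate, reference):
    """Sentence-level BLEU-4 against one reference, add-one smoothed for n > 1."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if n > 1:
            # Add-one smoothing (assumed variant) keeps higher-order
            # precisions non-zero for short sentences.
            matches += 1
            total += 1
        if matches == 0:
            # No unigram overlap at all: the score is zero.
            return 0.0
        log_precision_sum += 0.25 * math.log(matches / total)
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum)


reference = "returns the user name".split()
print(smoothed_bleu4("returns the name".split(), reference))
```

Without smoothing, a short generated comment that shares no 4-gram with the reference would score zero even when most of its words are right, which is why a smoothed variant suits sentence-level evaluation.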

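Dataset filtering criteria

Remove examples whose code cannot be parsed into an abstract syntax tree.
Remove examples whose documentation has fewer than 3 or more than 256 tokens.
Remove examples whose documentation contains special tokens (e.g. <img ...> or https:...).
Remove examples whose documentation is not in English.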
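
The three documentation filters can be sketched roughly as follows. This is not the CodeXGLUE preprocessing script: the regular expression for special tokens and the ASCII-only check standing in for English detection are assumptions made for illustration, and `keep_example` is a hypothetical name. The syntax-tree filter is omitted because it requires a Ruby parser.

```python
import re

# Assumed pattern for "special tokens": HTML-like tags or URL schemes.
SPECIAL_TOKEN_RE = re.compile(r"<[^>]+>|https?:")


def keep_example(docstring_tokens):
    """Return True if a tokenized docstring passes the three documentation filters."""
    # Length filter: drop docstrings with fewer than 3 or more than 256 tokens.
    if len(docstring_tokens) < 3 or len(docstring_tokens) > 256:
        return False
    text = " ".join(docstring_tokens)
    # Special-token filter: drop docstrings containing tags or links.
    if SPECIAL_TOKEN_RE.search(text):
        return False
    # Language filter (crude assumption): treat non-ASCII text as non-English.
    if not text.isascii():
        return False
    return True


print(keep_example("Returns the user name".split()))
print(keep_example("See https://example.com for details".split()))
```

A real pipeline would likely use a proper language detector; the ASCII check rejects English text containing accented names and accepts other languages written in ASCII.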

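Data format

After preprocessing, the dataset consists of three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.
Each line of each file represents one function and contains the following fields:
- repo: the repository owner/repository name
- path: the full path to the original file
- func_name: the function or method name
- original_string: the raw string before tokenization or parsing
- language: the programming language
- code/function: the part of original_string that is code
- code_tokens/function_tokens: the tokenized version of code
- docstring: the top-level comment or docstring in the original string, if present
- docstring_tokens: the tokenized version of docstring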
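
Reading such a file takes only the standard library. The sketch below parses a JSON Lines stream and checks each record for the fields listed above. The record is made up for illustration, and the sketch assumes the keys are named `code` and `code_tokens` rather than `function` and `function_tokens`; check the actual files before relying on that.

```python
import io
import json

# Assumed key names; the listing gives "code/function" and
# "code_tokens/function_tokens" as alternatives.
EXPECTED_FIELDS = {
    "repo", "path", "func_name", "original_string", "language",
    "code", "code_tokens", "docstring", "docstring_tokens",
}


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# A made-up record standing in for one line of train.jsonl.
sample_record = {
    "repo": "example/repo",
    "path": "lib/user.rb",
    "func_name": "User#name",
    "original_string": "# Returns the user name\ndef name\n  @name\nend",
    "language": "ruby",
    "code": "def name\n  @name\nend",
    "code_tokens": ["def", "name", "@name", "end"],
    "docstring": "Returns the user name",
    "docstring_tokens": ["Returns", "the", "user", "name"],
}
stream = io.StringIO(json.dumps(sample_record) + "\n")

records = list(read_jsonl(stream))
missing = EXPECTED_FIELDS - set(records[0])
print(len(records), "record(s); missing fields:", sorted(missing))
```

To read a real split, replace the `io.StringIO` stream with `open("train.jsonl", encoding="utf-8")`.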

Data statistics

Programming language	Training set	Dev set	Test set
Ruby	24,927	1,400	1,261
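
As a quick sanity check on these counts, the short sketch below computes the total number of examples and each split's share. The dictionary `SPLIT_SIZES` is just a hypothetical container for the numbers in the table.

```python
# Split sizes copied from the statistics table.
SPLIT_SIZES = {"train": 24927, "dev": 1400, "test": 1261}

total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print("total examples:", total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The splits sum to 27,588 examples, with about 90.4% in training, 5.1% in dev, and 4.6% in test.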

Citation

@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
