
semeru/code-text-ruby

Hugging Face, updated 2023-03-23, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/semeru/code-text-ruby
Resource description:
---
license: mit
Programminglanguage: "ruby"
version: "N/A"
Date: "Codesearchnet(Jun 2020 - paper release date)"
Contaminated: "Very Likely"
Size: "Standar Tokenizer (TreeSitter)"
---

### Dataset is imported from CodeXGLUE and pre-processed using their script.

# Where to find in Semeru:

The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/ruby in Semeru

# CodeXGLUE -- Code-To-Text

## Task Definition

The task is to generate natural language comments for a piece of code. Output is evaluated by the [smoothed bleu-4](https://www.aclweb.org/anthology/C04-1072.pdf) score.

## Dataset

The dataset comes from [CodeSearchNet](https://arxiv.org/pdf/1909.09436.pdf) and is filtered as follows:

- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation has fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose documentation is not in English.

### Data Format

After preprocessing the dataset, you obtain three .jsonl files: train.jsonl, valid.jsonl, and test.jsonl.

In each file, each line of the uncompressed file represents one function. The fields of one row are listed below.

- **repo:** the owner/repo
- **path:** the full path to the original file
- **func_name:** the function or method name
- **original_string:** the raw string before tokenization or parsing
- **language:** the programming language
- **code/function:** the part of the `original_string` that is code
- **code_tokens/function_tokens:** tokenized version of `code`
- **docstring:** the top-level comment or docstring, if it exists in the original string
- **docstring_tokens:** tokenized version of `docstring`

### Data Statistic

| Programming Language | Training | Dev | Test |
| :------------------- | :------: | :----: | :----: |
| Ruby | 24,927 | 1,400 | 1,261 |

## Reference

<pre><code>@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}</code></pre>
Provided by:
semeru
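Summary of the original information

Dataset overview

Dataset source and processing

  • The dataset comes from CodeSearchNet and is pre-processed with the CodeXGLUE script.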

Dataset storage location

  • On the Semeru system, the dataset is located at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/ruby

Task definition

  • The goal is to generate natural language comments for code; the evaluation metric is the smoothed bleu-4 score.
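
To make the metric concrete, here is a minimal sketch of a sentence-level BLEU-4 with add-one smoothing on the higher-order n-gram precisions. It is an illustration only: the function names `ngrams` and `smoothed_bleu4` are hypothetical, and the exact smoothing used by the CodeXGLUE evaluation script may differ from the variant assumed here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing for n > 1.

    Assumption: a single reference, and add-one smoothing applied to the
    2-, 3-, and 4-gram precisions. This is a sketch, not the official script.
    """
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:
            matches += 1
            total += 1
        if matches == 0:
            return 0.0  # no unigram overlap at all
        log_precision_sum += math.log(matches / total)
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum / 4)
```

Both arguments are token lists, for example a `docstring_tokens` value as the reference and a model's generated comment, tokenized the same way, as the hypothesis.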

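Dataset filtering criteria

  • Remove examples whose code cannot be parsed into an abstract syntax tree.
  • Remove examples whose documentation has fewer than 3 or more than 256 tokens.
  • Remove examples whose documentation contains special tokens (e.g. <img ...> or https:...).
  • Remove examples whose documentation is not in English.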
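
The four criteria above can be sketched as a single predicate. This is a hedged illustration, not the CodeXGLUE preprocessing script: `keep_example` and `parses_ok` are hypothetical names, the parse check is passed in as a callable because the actual parser is not specified here, and the ASCII-only test is only a crude stand-in for real language detection.

```python
import re

# Assumption: "special tokens" are approximated by an <img ...> tag or an https: link.
SPECIAL_TOKEN_PATTERN = re.compile(r"<img\b|https:", re.IGNORECASE)


def keep_example(example, parses_ok=lambda code: True):
    """Return True if the example passes all four filtering criteria.

    `parses_ok` is a hypothetical callable that reports whether the code can be
    parsed into an abstract syntax tree; the default accepts everything.
    """
    docstring = example["docstring"]
    tokens = example["docstring_tokens"]
    if not parses_ok(example["code"]):
        return False
    if len(tokens) < 3 or len(tokens) > 256:
        return False
    if SPECIAL_TOKEN_PATTERN.search(docstring):
        return False
    # Crude heuristic (assumption): treat any non-ASCII docstring as non-English.
    if not docstring.isascii():
        return False
    return True
```

In practice one would pass a real Ruby parser check as `parses_ok` and apply the predicate to every record before writing the filtered split.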

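Data format

  • After preprocessing, the dataset consists of three .jsonl files: train.jsonl, valid.jsonl, test.jsonl
  • Each line of each file represents one function and contains the following fields:
    • repo: the owner/repo
    • path: the full path to the original file
    • func_name: the function or method name
    • original_string: the raw string before tokenization or parsing
    • language: the programming language
    • code/function: the part of original_string that is code
    • code_tokens/function_tokens: tokenized version of code
    • docstring: the top-level comment or docstring in the original string, if it exists
    • docstring_tokens: tokenized version of docstring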
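
Since each line is one JSON object, the files can be read with the standard library alone. The following is a minimal sketch; the helper names `read_jsonl` and `code_and_docstring` are hypothetical, and the fallback from `code` to `function` is an assumption based on the two field names listed above.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def code_and_docstring(record):
    """Return the (code, docstring) pair of one record.

    Assumption: the code is stored under "code", or under "function"
    in files that use the alternative field name.
    """
    code = record.get("code", record.get("function", ""))
    return code, record.get("docstring", "")
```

For example, iterating over `read_jsonl("train.jsonl")` and calling `code_and_docstring` on each record gives the input and target of the code-to-text task, assuming the file sits in the working directory.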

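Data statistics

| Programming language | Training | Dev | Test |
| :------------------- | :------: | :----: | :----: |
| Ruby | 24,927 | 1,400 | 1,261 |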

