semeru/code-code-translation-java-csharp
Source: Hugging Face. Updated 2023-03-27. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/semeru/code-code-translation-java-csharp
Resource description:
---
license: mit
Programminglanguage: "Java/C#"
version: "N/A"
Date: "Most likely 2020"
Contaminated: "Very Likely"
Size: "Standard Tokenizer"
---
### Dataset is imported from CodeXGLUE and pre-processed using their script.
# Where to find in Semeru:
In Semeru, the dataset is located at `/nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans`.
# CodeXGLUE -- Code2Code Translation
## Task Definition
Code translation aims to migrate legacy software from one programming language on a platform to another.
In CodeXGLUE, given a piece of Java (C#) code, the task is to translate it into the equivalent C# (Java) code.
Models are evaluated by BLEU score, accuracy (exact match), and [CodeBLEU](https://github.com/microsoft/CodeXGLUE/blob/main/code-to-code-trans/CodeBLEU.MD) score.
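
As an illustration of the exact-match metric, the sketch below counts a prediction as correct only when it equals the reference after whitespace tokenization. The function name `exact_match_accuracy` and the whitespace normalization are assumptions made for this example; the official CodeXGLUE evaluator may differ in its details.

```python
def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions that match their reference exactly.

    Assumption: two functions "match" when their whitespace-separated
    tokens are identical, so differences in spacing are ignored.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        pred.split() == ref.split()
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)
```

Exact match is a strict metric: a translation that is functionally equivalent but uses a different identifier or statement order counts as wrong, which is why it is reported alongside BLEU and CodeBLEU.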
## Dataset
The dataset is collected from several public repositories, including [Lucene](http://lucene.apache.org/), [POI](http://poi.apache.org/), [JGit](https://github.com/eclipse/jgit/) and [Antlr](https://github.com/antlr/).
We collect both the Java and C# versions of the code and identify the parallel functions. After removing duplicates and functions with an empty body, we split the dataset into training, validation, and test sets.
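
The filtering step described above could be sketched as follows. This is not the original pre-processing script: the helper names `has_empty_body` and `clean_pairs`, and the regular-expression heuristic for detecting an empty body, are hypothetical and serve only to make the idea concrete.

```python
import re

# Assumption: a function has an empty body when its source text ends with
# an opening brace followed only by whitespace and a closing brace.
_EMPTY_BODY = re.compile(r"\{\s*\}\s*$")


def has_empty_body(function_source):
    """Heuristically detect functions such as `void f() {}`."""
    return bool(_EMPTY_BODY.search(function_source.strip()))


def clean_pairs(pairs):
    """Drop duplicate (Java, C#) pairs and pairs where either body is empty.

    The first occurrence of each pair is kept and input order is preserved.
    """
    seen = set()
    kept = []
    for java_fn, cs_fn in pairs:
        key = (java_fn.strip(), cs_fn.strip())
        if key in seen:
            continue
        if has_empty_body(key[0]) or has_empty_body(key[1]):
            continue
        seen.add(key)
        kept.append(key)
    return kept
```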
### Data Format
The dataset is in the "data" folder. Each line of the files is a function, and the suffix of the file indicates the programming language.
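
A minimal sketch of reading one split in this format is shown below. It assumes that line *i* of the Java file and line *i* of the C# file form a parallel pair; the function names `read_functions` and `read_parallel` are hypothetical, and the file paths are left as parameters because the exact file names are not given here.

```python
def read_functions(path):
    """Read a file that stores one function per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(java_path, cs_path):
    """Return a list of (java_function, csharp_function) pairs.

    Assumption: the two files are line-aligned, so a length mismatch
    indicates that the wrong files were paired.
    """
    java_functions = read_functions(java_path)
    cs_functions = read_functions(cs_path)
    if len(java_functions) != len(cs_functions):
        raise ValueError(
            f"line counts differ: {len(java_functions)} Java vs "
            f"{len(cs_functions)} C#"
        )
    return list(zip(java_functions, cs_functions))
```

Reading with `open` in text mode splits only on newline characters, so other separator characters that may occur inside a function's string literals do not break the alignment.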
### Data Statistics
The table below shows the number of examples in each split:

| Split | #Examples |
| ------- | :-------: |
| Train | 10,300 |
| Valid | 500 |
| Test | 1,000 |
Provider:
semeru
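Original information summary
Dataset overview
Source and processing
- The dataset is imported from CodeXGLUE and pre-processed with the script CodeXGLUE provides.
Storage location
- In Semeru, the dataset is located at
/nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans.
Task definition
- Code translation aims to migrate legacy software from one programming language on a platform to another. In CodeXGLUE, the task is to translate a given piece of Java (C#) code into its C# (Java) version.
Evaluation metrics
- Models are evaluated by BLEU score, accuracy (exact match), and CodeBLEU score.
Dataset composition
- The dataset is collected from public repositories including Lucene, POI, JGit, and Antlr. Both the Java and C# versions of the code are collected and the parallel functions are identified.
- After removing duplicates and functions with an empty body, the dataset is split into training, validation, and test sets.
Data format
- The dataset is in the "data" folder. Each line is a function, and the file suffix indicates the programming language.
Data statistics

| Split | #Examples |
|---|---|
| Train | 10,300 |
| Valid | 500 |
| Test | 1,000 |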



