google/code_x_glue_cc_code_to_code_trans
Source: Hugging Face. Updated 2024-01-24; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/google/code_x_glue_cc_code_to_code_trans
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- code
license:
- c-uda
multilinguality:
- other-programming-languages
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- translation
task_ids: []
pretty_name: CodeXGlueCcCodeToCodeTrans
tags:
- code-to-code
dataset_info:
  features:
  - name: id
    dtype: int32
  - name: java
    dtype: string
  - name: cs
    dtype: string
  splits:
  - name: train
    num_bytes: 4372641
    num_examples: 10300
  - name: validation
    num_bytes: 226407
    num_examples: 500
  - name: test
    num_bytes: 418587
    num_examples: 1000
  download_size: 2064764
  dataset_size: 5017635
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
# Dataset Card for "code_x_glue_cc_code_to_code_trans"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans
- **Paper:** https://arxiv.org/abs/2102.04664
### Dataset Summary
CodeXGLUE code-to-code-trans dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans
The dataset is collected from several public repositories, including Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/).
We collect both the Java and C# versions of the code and identify the parallel functions. After removing duplicates and functions with an empty body, we split the whole dataset into training, validation and test sets.
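
The card does not say how duplicates or empty bodies were detected, so the following is only a minimal sketch of one plausible reading of that cleaning step. It assumes exact-match deduplication after trimming whitespace and a brace-based check for empty bodies. The helper names `has_empty_body` and `clean_pairs` are hypothetical and are not part of CodeXGLUE.

```python
def has_empty_body(function_source: str) -> bool:
    """Return True if only whitespace sits between the first '{' and the last '}'."""
    start = function_source.find("{")
    end = function_source.rfind("}")
    if start == -1 or end <= start:
        # Assumption: a declaration without braces counts as having no body.
        return True
    return function_source[start + 1:end].strip() == ""


def clean_pairs(pairs):
    """Drop exact duplicate (java, cs) pairs and pairs where either side is empty-bodied."""
    seen = set()
    kept = []
    for java, cs in pairs:
        key = (java.strip(), cs.strip())
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        if has_empty_body(java) or has_empty_body(cs):
            continue  # nothing to translate
        kept.append((java, cs))
    return kept


# Toy input: one real pair, one exact duplicate, one empty-bodied pair.
example_pairs = [
    ("public int get() {return x;}", "public int Get(){return x;}"),
    ("public int get() {return x;}", "public int Get(){return x;}"),
    ("public void close() {}", "public void Close(){}"),
]
cleaned = clean_pairs(example_pairs)
```

On this toy input only the first pair is kept. The original pipeline may well have used a different notion of "duplicate", for example one computed after tokenization.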
### Supported Tasks and Leaderboards
- `machine-translation`: The dataset can be used to train a model for translating code in Java to C# and vice versa.
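
Because each record carries both languages, the same data can serve either translation direction. The sketch below shows one way to turn a record into a source/target pair. The function name `to_translation_pair` and the `source`/`target` output keys are illustrative choices, not a format required by the dataset. Stripping the trailing newline is likewise an assumption about what a training pipeline would want.

```python
FIELDS = ("java", "cs")


def to_translation_pair(record, source="java", target="cs"):
    """Build a {'source', 'target'} dict from one record for the chosen direction."""
    if source == target or source not in FIELDS or target not in FIELDS:
        raise ValueError("source and target must be 'java' and 'cs', in either order")
    return {
        "source": record[source].strip(),
        "target": record[target].strip(),
    }


# A tiny hand-written record with the same three fields as the dataset.
record = {
    "id": 0,
    "java": "public int size() {return count;}\n",
    "cs": "public int Size(){return count;}\n",
}
java_to_cs = to_translation_pair(record)
cs_to_java = to_translation_pair(record, source="cs", target="java")
```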
### Languages
- Java **programming** language
- C# **programming** language
## Dataset Structure
### Data Instances
An example from the 'validation' split looks as follows.
```
{
"cs": "public DVRecord(RecordInputStream in1){_option_flags = in1.ReadInt();_promptTitle = ReadUnicodeString(in1);_errorTitle = ReadUnicodeString(in1);_promptText = ReadUnicodeString(in1);_errorText = ReadUnicodeString(in1);int field_size_first_formula = in1.ReadUShort();_not_used_1 = in1.ReadShort();_formula1 = NPOI.SS.Formula.Formula.Read(field_size_first_formula, in1);int field_size_sec_formula = in1.ReadUShort();_not_used_2 = in1.ReadShort();_formula2 = NPOI.SS.Formula.Formula.Read(field_size_sec_formula, in1);_regions = new CellRangeAddressList(in1);}\n",
"id": 0,
"java": "public DVRecord(RecordInputStream in) {_option_flags = in.readInt();_promptTitle = readUnicodeString(in);_errorTitle = readUnicodeString(in);_promptText = readUnicodeString(in);_errorText = readUnicodeString(in);int field_size_first_formula = in.readUShort();_not_used_1 = in.readShort();_formula1 = Formula.read(field_size_first_formula, in);int field_size_sec_formula = in.readUShort();_not_used_2 = in.readShort();_formula2 = Formula.read(field_size_sec_formula, in);_regions = new CellRangeAddressList(in);}\n"
}
```
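
A record like the one above can be fetched with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the dataset ID matches the download link of this page. `load_split` and `preview` are hypothetical helper names. The import is kept inside the function so that the helpers can be defined without network access.

```python
DATASET_ID = "google/code_x_glue_cc_code_to_code_trans"


def load_split(split="validation"):
    """Load one split from the Hub (requires the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(DATASET_ID, split=split)


def preview(record, width=60):
    """Return the record with both code strings stripped and truncated to `width` characters."""
    return {
        "id": record["id"],
        "java": record["java"].strip()[:width],
        "cs": record["cs"].strip()[:width],
    }
```

For instance, `preview(load_split("validation")[0])` should return a shortened view of the first validation record, with the trailing `\n` removed from both code strings.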
### Data Fields
Each data field is explained below for each config. The data fields are the same among all splits.
#### default
|field name| type | description |
|----------|------|-----------------------------|
|id |int32 | Index of the sample |
|java |string| The Java version of the code|
|cs |string| The C# version of the code |
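
The schema in the table can be checked against individual records with plain Python. This is a minimal sketch: `validate_record` is a hypothetical helper, and mapping `int32` to a Python `int` within the signed 32-bit range is an assumption about how the values arrive after loading.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
EXPECTED_FIELDS = {"id": int, "java": str, "cs": str}


def validate_record(record):
    """Return a list of schema problems; an empty list means the record matches the table."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
        elif name == "id" and not INT32_MIN <= value <= INT32_MAX:
            problems.append("id: outside the int32 range")
    for name in sorted(set(record) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems


ok = validate_record({"id": 0, "java": "int f() {return 1;}", "cs": "int F(){return 1;}"})
bad = validate_record({"id": "0", "java": "int f() {return 1;}"})
```

Here `ok` is an empty list, while `bad` reports the string-typed `id` and the missing `cs` field.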
### Data Splits
| name |train|validation|test|
|-------|----:|---------:|---:|
|default|10300| 500|1000|
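
As a quick consistency check, the split sizes above can be combined with the `num_bytes` and `dataset_size` values from the card's metadata. The snippet only re-derives totals and proportions from numbers already stated in this card. The variable names are illustrative.

```python
# Values copied from the card's metadata and the table above.
SPLITS = {
    "train": {"num_examples": 10300, "num_bytes": 4372641},
    "validation": {"num_examples": 500, "num_bytes": 226407},
    "test": {"num_examples": 1000, "num_bytes": 418587},
}
DATASET_SIZE = 5017635  # `dataset_size` from the metadata

total_examples = sum(info["num_examples"] for info in SPLITS.values())
total_bytes = sum(info["num_bytes"] for info in SPLITS.values())
fractions = {
    name: info["num_examples"] / total_examples for name, info in SPLITS.items()
}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} of {total_examples} examples")
print("bytes match dataset_size:", total_bytes == DATASET_SIZE)
```

The totals come to 11800 examples, split roughly 87.3% / 4.2% / 8.5% across train, validation and test, and the per-split byte counts add up to the stated `dataset_size`.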
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
https://github.com/microsoft, https://github.com/madlag
### Licensing Information
Computational Use of Data Agreement (C-UDA) License.
### Citation Information
```
@article{DBLP:journals/corr/abs-2102-04664,
  author    = {Shuai Lu and
               Daya Guo and
               Shuo Ren and
               Junjie Huang and
               Alexey Svyatkovskiy and
               Ambrosio Blanco and
               Colin B. Clement and
               Dawn Drain and
               Daxin Jiang and
               Duyu Tang and
               Ge Li and
               Lidong Zhou and
               Linjun Shou and
               Long Zhou and
               Michele Tufano and
               Ming Gong and
               Ming Zhou and
               Nan Duan and
               Neel Sundaresan and
               Shao Kun Deng and
               Shengyu Fu and
               Shujie Liu},
  title     = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code Understanding
               and Generation},
  journal   = {CoRR},
  volume    = {abs/2102.04664},
  year      = {2021}
}
```
### Contributions
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
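Provider:
google
Summary of original information
Dataset overview
Dataset name: CodeXGlueCcCodeToCodeTrans
Dataset description: The dataset covers code translation between two programming languages, Java and C#, and is intended for training models that translate code from Java to C# or from C# to Java.
Languages:
- Java
- C#
Tasks:
- Machine translation (code to code)
Dataset structure
Data instances: each data instance contains the following fields:
- id: index of the sample, type int32
- java: Java version of the code, type string
- cs: C# version of the code, type string
Data splits:
- Training set: 10300 samples
- Validation set: 500 samples
- Test set: 1000 samples
Dataset creation
License:
- C-UDA license
Contributors:
- Microsoft
- @madlag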



