google/code_x_glue_cc_code_to_code_trans

Source: Hugging Face. Updated 2024-01-24; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/google/code_x_glue_cc_code_to_code_trans
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- code
license:
- c-uda
multilinguality:
- other-programming-languages
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- translation
task_ids: []
pretty_name: CodeXGlueCcCodeToCodeTrans
tags:
- code-to-code
dataset_info:
  features:
  - name: id
    dtype: int32
  - name: java
    dtype: string
  - name: cs
    dtype: string
  splits:
  - name: train
    num_bytes: 4372641
    num_examples: 10300
  - name: validation
    num_bytes: 226407
    num_examples: 500
  - name: test
    num_bytes: 418587
    num_examples: 1000
  download_size: 2064764
  dataset_size: 5017635
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Dataset Card for "code_x_glue_cc_code_to_code_trans"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits-sample-size)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans
- **Paper:** https://arxiv.org/abs/2102.04664

### Dataset Summary

CodeXGLUE code-to-code-trans dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans

The dataset is collected from several public repositories, including Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/).

We collect both the Java and C# versions of the code and find the parallel functions. After removing duplicates and functions with an empty body, we split the whole dataset into training, validation and test sets.

### Supported Tasks and Leaderboards

- `machine-translation`: The dataset can be used to train a model for translating code from Java to C# and vice versa.

### Languages

- Java **programming** language
- C# **programming** language

## Dataset Structure

### Data Instances

An example of 'validation' looks as follows.

```
{
    "cs": "public DVRecord(RecordInputStream in1){_option_flags = in1.ReadInt();_promptTitle = ReadUnicodeString(in1);_errorTitle = ReadUnicodeString(in1);_promptText = ReadUnicodeString(in1);_errorText = ReadUnicodeString(in1);int field_size_first_formula = in1.ReadUShort();_not_used_1 = in1.ReadShort();_formula1 = NPOI.SS.Formula.Formula.Read(field_size_first_formula, in1);int field_size_sec_formula = in1.ReadUShort();_not_used_2 = in1.ReadShort();_formula2 = NPOI.SS.Formula.Formula.Read(field_size_sec_formula, in1);_regions = new CellRangeAddressList(in1);}\n",
    "id": 0,
    "java": "public DVRecord(RecordInputStream in) {_option_flags = in.readInt();_promptTitle = readUnicodeString(in);_errorTitle = readUnicodeString(in);_promptText = readUnicodeString(in);_errorText = readUnicodeString(in);int field_size_first_formula = in.readUShort();_not_used_1 = in.readShort();_formula1 = Formula.read(field_size_first_formula, in);int field_size_sec_formula = in.readUShort();_not_used_2 = in.readShort();_formula2 = Formula.read(field_size_sec_formula, in);_regions = new CellRangeAddressList(in);}\n"
}
```

### Data Fields

Each data field is explained below for each config. The data fields are the same among all splits.

#### default

|field name| type |        description          |
|----------|------|-----------------------------|
|id        |int32 | Index of the sample         |
|java      |string| The Java version of the code|
|cs        |string| The C# version of the code  |

### Data Splits

| name  |train|validation|test|
|-------|----:|---------:|---:|
|default|10300|       500|1000|

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@article{DBLP:journals/corr/abs-2102-04664,
  author    = {Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and
               Alexey Svyatkovskiy and Ambrosio Blanco and Colin B. Clement and
               Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and
               Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and
               Ming Gong and Ming Zhou and Nan Duan and Neel Sundaresan and
               Shao Kun Deng and Shengyu Fu and Shujie Liu},
  title     = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code
               Understanding and Generation},
  journal   = {CoRR},
  volume    = {abs/2102.04664},
  year      = {2021}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
Provider:
google
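Summary of original information

Dataset overview

Dataset name: CodeXGlueCcCodeToCodeTrans

Dataset description: This dataset supports code translation between two programming languages, Java and C#. It is intended for training models that translate code from Java to C# or from C# to Java.

Languages:

  • Java
  • C#

Tasks:

  • Machine translation (code-to-code)

Dataset structure

Data instances: each data instance contains the following fields:

  • id: sample index, type int32
  • java: Java version of the code, type string
  • cs: C# version of the code, type string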
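
The three fields above can be modelled as a small typed record. The following is a minimal sketch, not part of the dataset's own tooling: `TranslationRecord`, `parse_record`, `to_pair` and the direction labels `"java-cs"` and `"cs-java"` are hypothetical names introduced here for illustration. It checks that a raw row has the documented fields and types, then returns a (source, target) pair for either translation direction.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TranslationRecord:
    """One parallel function pair, mirroring the three documented fields."""
    id: int    # sample index (int32 in the dataset)
    java: str  # Java version of the function
    cs: str    # C# version of the function


def parse_record(raw: dict) -> TranslationRecord:
    """Validate a raw row and return a typed record (hypothetical helper)."""
    for name, expected in (("id", int), ("java", str), ("cs", str)):
        if name not in raw:
            raise KeyError(f"missing field: {name}")
        value = raw[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            raise TypeError(f"field {name!r} must be {expected.__name__}")
    if not -2**31 <= raw["id"] < 2**31:
        raise ValueError("id does not fit in int32")
    return TranslationRecord(raw["id"], raw["java"], raw["cs"])


def to_pair(record: TranslationRecord, direction: str = "java-cs") -> Tuple[str, str]:
    """Return (source, target); the direction labels are assumptions of this sketch."""
    if direction == "java-cs":
        return record.java, record.cs
    if direction == "cs-java":
        return record.cs, record.java
    raise ValueError(f"unknown direction: {direction}")
```

Because the same row serves both directions, a single pass over the data can yield training pairs for Java to C# and for C# to Java without storing anything twice.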

Data splits:

  • Training set: 10300 samples
  • Validation set: 500 samples
  • Test set: 1000 samples
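
The split sizes above sum to 11800 pairs, so the training set holds roughly 87.3% of the data, validation about 4.2%, and test about 8.5%. The sketch below reproduces that arithmetic and adds a simple check that a local copy has the expected sizes. `split_shares` and `size_mismatches` are hypothetical helper names, not part of any library.

```python
from typing import Dict, List, Optional

# Split sizes as listed above.
SPLIT_SIZES: Dict[str, int] = {"train": 10300, "validation": 500, "test": 1000}


def split_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def size_mismatches(observed: Dict[str, int],
                    expected: Optional[Dict[str, int]] = None) -> List[str]:
    """List every split whose observed size differs from the expected one."""
    expected = SPLIT_SIZES if expected is None else expected
    problems = []
    for name, want in expected.items():
        got = observed.get(name)  # None when the split is missing entirely
        if got != want:
            problems.append(f"{name}: expected {want}, got {got}")
    return problems


if __name__ == "__main__":
    print(f"total examples: {sum(SPLIT_SIZES.values())}")
    for name, share in split_shares(SPLIT_SIZES).items():
        print(f"{name:<10} {SPLIT_SIZES[name]:>6} {share:6.1%}")
```

With the Hugging Face `datasets` library installed and network access available, the observed sizes could be collected as `{name: len(split) for name, split in load_dataset("google/code_x_glue_cc_code_to_code_trans").items()}` and passed to `size_mismatches`.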

Dataset creation

License:

  • C-UDA license

Contributors:

  • Microsoft
  • @madlag