code_x_glue_tc_text_to_code
收藏魔搭社区2025-11-07 更新2025-04-26 收录
下载链接:
https://modelscope.cn/datasets/google/code_x_glue_tc_text_to_code
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for "code_x_glue_tc_text_to_code"
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits-sample-size)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code
### Dataset Summary
CodeXGLUE text-to-code dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code
The dataset we use is crawled and filtered from Microsoft Documentation, whose document located at https://github.com/MicrosoftDocs/.
### Supported Tasks and Leaderboards
- `machine-translation`: The dataset can be used to train a model for generating Java code from an **English** natural language description.
### Languages
- Java **programming** language
## Dataset Structure
### Data Instances
An example of 'train' looks as follows.
```
{
"code": "boolean function ( ) { return isParsed ; }",
"id": 0,
"nl": "check if details are parsed . concode_field_sep Container parent concode_elem_sep boolean isParsed concode_elem_sep long offset concode_elem_sep long contentStartPosition concode_elem_sep ByteBuffer deadBytes concode_elem_sep boolean isRead concode_elem_sep long memMapSize concode_elem_sep Logger LOG concode_elem_sep byte[] userType concode_elem_sep String type concode_elem_sep ByteBuffer content concode_elem_sep FileChannel fileChannel concode_field_sep Container getParent concode_elem_sep byte[] getUserType concode_elem_sep void readContent concode_elem_sep long getOffset concode_elem_sep long getContentSize concode_elem_sep void getContent concode_elem_sep void setDeadBytes concode_elem_sep void parse concode_elem_sep void getHeader concode_elem_sep long getSize concode_elem_sep void parseDetails concode_elem_sep String getType concode_elem_sep void _parseDetails concode_elem_sep String getPath concode_elem_sep boolean verify concode_elem_sep void setParent concode_elem_sep void getBox concode_elem_sep boolean isSmallBox"
}
```
### Data Fields
In the following each data field in go is explained for each config. The data fields are the same among all splits.
#### default
|field name| type | description |
|----------|------|---------------------------------------------|
|id |int32 | Index of the sample |
|nl |string| The natural language description of the task|
|code |string| The programming source code for the task |
### Data Splits
| name |train |validation|test|
|-------|-----:|---------:|---:|
|default|100000| 2000|2000|
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
https://github.com/microsoft, https://github.com/madlag
### Licensing Information
Computational Use of Data Agreement (C-UDA) License.
### Citation Information
```
@article{iyer2018mapping,
title={Mapping language to code in programmatic context},
author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:1808.09588},
year={2018}
}
```
### Contributions
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
# 「code_x_glue_tc_text_to_code」数据集卡片
## 目录
- [数据集描述](#dataset-description)
- [数据集概述](#dataset-summary)
- [支持任务与基准测试榜单](#supported-tasks-and-leaderboards)
- [使用语言](#languages)
- [数据集结构](#dataset-structure)
- [数据实例](#data-instances)
- [数据字段](#data-fields)
- [数据划分(样本量)](#data-splits-sample-size)
- [数据集构建](#dataset-creation)
- [构建依据](#curation-rationale)
- [源数据](#source-data)
- [标注信息](#annotations)
- [个人与敏感信息](#personal-and-sensitive-information)
- [数据集使用注意事项](#considerations-for-using-the-data)
- [数据集的社会影响](#social-impact-of-dataset)
- [偏倚讨论](#discussion-of-biases)
- [其他已知局限性](#other-known-limitations)
- [附加信息](#additional-information)
- [数据集维护者](#dataset-curators)
- [许可信息](#licensing-information)
- [引用信息](#citation-information)
- [贡献致谢](#contributions)
## 数据集描述
- **主页**:https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code
### 数据集概述
CodeXGLUE文本到代码数据集,可从https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code获取。本数据集从微软文档中爬取并筛选得到,其官方文档位于https://github.com/MicrosoftDocs/。
### 支持任务与基准测试榜单
- `机器翻译`:该数据集可用于训练模型,从英文自然语言描述生成Java代码。
### 使用语言
- Java编程语言
## 数据集结构
### 数据实例
训练集(train)的一个示例格式如下:
{
"code": "boolean function ( ) { return isParsed ; }",
"id": 0,
"nl": "check if details are parsed . concode_field_sep Container parent concode_elem_sep boolean isParsed concode_elem_sep long offset concode_elem_sep long contentStartPosition concode_elem_sep ByteBuffer deadBytes concode_elem_sep boolean isRead concode_elem_sep long memMapSize concode_elem_sep Logger LOG concode_elem_sep byte[] userType concode_elem_sep String type concode_elem_sep ByteBuffer content concode_elem_sep FileChannel fileChannel concode_field_sep Container getParent concode_elem_sep byte[] getUserType concode_elem_sep void readContent concode_elem_sep long getOffset concode_elem_sep long getContentSize concode_elem_sep void getContent concode_elem_sep void setDeadBytes concode_elem_sep void parse concode_elem_sep void getHeader concode_elem_sep long getSize concode_elem_sep void parseDetails concode_elem_sep String getType concode_elem_sep void _parseDetails concode_elem_sep String getPath concode_elem_sep boolean verify concode_elem_sep void setParent concode_elem_sep void getBox concode_elem_sep boolean isSmallBox"
}
### 数据字段
下文将针对各配置逐一解释数据字段,所有数据划分下的数据字段均保持一致。
#### 默认配置
|字段名|类型|描述|
|----------|------|---------------------------------------------|
|id |int32 | 样本索引 |
|nl |string| 任务对应的自然语言描述|
|code |string| 任务对应的编程源代码 |
### 数据划分
| 划分名称 | 训练集 | 验证集 | 测试集 |
|-------|-----:|---------:|---:|
|默认配置|100000| 2000|2000|
## 数据集构建
### 构建依据
[需补充更多信息]
### 源数据
#### 初始数据收集与标准化
[需补充更多信息]
#### 源语言生产者是谁?
[需补充更多信息]
### 标注信息
#### 标注流程
[需补充更多信息]
#### 标注者是谁?
[需补充更多信息]
### 个人与敏感信息
[需补充更多信息]
## 数据集使用注意事项
### 数据集的社会影响
[需补充更多信息]
### 偏倚讨论
[需补充更多信息]
### 其他已知局限性
[需补充更多信息]
## 附加信息
### 数据集维护者
https://github.com/microsoft, https://github.com/madlag
### 许可信息
计算数据使用协议(Computational Use of Data Agreement, C-UDA)许可证。
### 引用信息
@article{iyer2018mapping,
title={Mapping language to code in programmatic context},
author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:1808.09588},
year={2018}
}
### 贡献致谢
感谢@madlag(部分贡献来自@ncoop57)添加本数据集。
提供机构:
maas
创建时间:
2025-04-21



