five

code_x_glue_tc_text_to_code

收藏
魔搭社区2025-11-07 更新2025-04-26 收录
下载链接:
https://modelscope.cn/datasets/google/code_x_glue_tc_text_to_code
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for "code_x_glue_tc_text_to_code" ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits-sample-size) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code ### Dataset Summary CodeXGLUE text-to-code dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code The dataset we use is crawled and filtered from Microsoft Documentation, whose document located at https://github.com/MicrosoftDocs/. ### Supported Tasks and Leaderboards - `machine-translation`: The dataset can be used to train a model for generating Java code from an **English** natural language description. ### Languages - Java **programming** language ## Dataset Structure ### Data Instances An example of 'train' looks as follows. ``` { "code": "boolean function ( ) { return isParsed ; }", "id": 0, "nl": "check if details are parsed . concode_field_sep Container parent concode_elem_sep boolean isParsed concode_elem_sep long offset concode_elem_sep long contentStartPosition concode_elem_sep ByteBuffer deadBytes concode_elem_sep boolean isRead concode_elem_sep long memMapSize concode_elem_sep Logger LOG concode_elem_sep byte[] userType concode_elem_sep String type concode_elem_sep ByteBuffer content concode_elem_sep FileChannel fileChannel concode_field_sep Container getParent concode_elem_sep byte[] getUserType concode_elem_sep void readContent concode_elem_sep long getOffset concode_elem_sep long getContentSize concode_elem_sep void getContent concode_elem_sep void setDeadBytes concode_elem_sep void parse concode_elem_sep void getHeader concode_elem_sep long getSize concode_elem_sep void parseDetails concode_elem_sep String getType concode_elem_sep void _parseDetails concode_elem_sep String getPath concode_elem_sep boolean verify concode_elem_sep void setParent concode_elem_sep void getBox concode_elem_sep boolean isSmallBox" } ``` ### Data Fields In the following each data field in go is explained for each config. The data fields are the same among all splits. #### default |field name| type | description | |----------|------|---------------------------------------------| |id |int32 | Index of the sample | |nl |string| The natural language description of the task| |code |string| The programming source code for the task | ### Data Splits | name |train |validation|test| |-------|-----:|---------:|---:| |default|100000| 2000|2000| ## Dataset Creation ### Curation Rationale [More Information Needed] ### Source Data #### Initial Data Collection and Normalization [More Information Needed] #### Who are the source language producers? [More Information Needed] ### Annotations #### Annotation process [More Information Needed] #### Who are the annotators? [More Information Needed] ### Personal and Sensitive Information [More Information Needed] ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed] ### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators https://github.com/microsoft, https://github.com/madlag ### Licensing Information Computational Use of Data Agreement (C-UDA) License. ### Citation Information ``` @article{iyer2018mapping, title={Mapping language to code in programmatic context}, author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke}, journal={arXiv preprint arXiv:1808.09588}, year={2018} } ``` ### Contributions Thanks to @madlag (and partly also @ncoop57) for adding this dataset.

# 「code_x_glue_tc_text_to_code」数据集卡片 ## 目录 - [数据集描述](#dataset-description) - [数据集概述](#dataset-summary) - [支持任务与基准测试榜单](#supported-tasks-and-leaderboards) - [使用语言](#languages) - [数据集结构](#dataset-structure) - [数据实例](#data-instances) - [数据字段](#data-fields) - [数据划分(样本量)](#data-splits-sample-size) - [数据集构建](#dataset-creation) - [构建依据](#curation-rationale) - [源数据](#source-data) - [标注信息](#annotations) - [个人与敏感信息](#personal-and-sensitive-information) - [数据集使用注意事项](#considerations-for-using-the-data) - [数据集的社会影响](#social-impact-of-dataset) - [偏倚讨论](#discussion-of-biases) - [其他已知局限性](#other-known-limitations) - [附加信息](#additional-information) - [数据集维护者](#dataset-curators) - [许可信息](#licensing-information) - [引用信息](#citation-information) - [贡献致谢](#contributions) ## 数据集描述 - **主页**:https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code ### 数据集概述 CodeXGLUE文本到代码数据集,可从https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code获取。本数据集从微软文档中爬取并筛选得到,其官方文档位于https://github.com/MicrosoftDocs/。 ### 支持任务与基准测试榜单 - `机器翻译`:该数据集可用于训练模型,从英文自然语言描述生成Java代码。 ### 使用语言 - Java编程语言 ## 数据集结构 ### 数据实例 训练集(train)的一个示例格式如下: { "code": "boolean function ( ) { return isParsed ; }", "id": 0, "nl": "check if details are parsed . concode_field_sep Container parent concode_elem_sep boolean isParsed concode_elem_sep long offset concode_elem_sep long contentStartPosition concode_elem_sep ByteBuffer deadBytes concode_elem_sep boolean isRead concode_elem_sep long memMapSize concode_elem_sep Logger LOG concode_elem_sep byte[] userType concode_elem_sep String type concode_elem_sep ByteBuffer content concode_elem_sep FileChannel fileChannel concode_field_sep Container getParent concode_elem_sep byte[] getUserType concode_elem_sep void readContent concode_elem_sep long getOffset concode_elem_sep long getContentSize concode_elem_sep void getContent concode_elem_sep void setDeadBytes concode_elem_sep void parse concode_elem_sep void getHeader concode_elem_sep long getSize concode_elem_sep void parseDetails concode_elem_sep String getType concode_elem_sep void _parseDetails concode_elem_sep String getPath concode_elem_sep boolean verify concode_elem_sep void setParent concode_elem_sep void getBox concode_elem_sep boolean isSmallBox" } ### 数据字段 下文将针对各配置逐一解释数据字段,所有数据划分下的数据字段均保持一致。 #### 默认配置 |字段名|类型|描述| |----------|------|---------------------------------------------| |id |int32 | 样本索引 | |nl |string| 任务对应的自然语言描述| |code |string| 任务对应的编程源代码 | ### 数据划分 | 划分名称 | 训练集 | 验证集 | 测试集 | |-------|-----:|---------:|---:| |默认配置|100000| 2000|2000| ## 数据集构建 ### 构建依据 [需补充更多信息] ### 源数据 #### 初始数据收集与标准化 [需补充更多信息] #### 源语言生产者是谁? [需补充更多信息] ### 标注信息 #### 标注流程 [需补充更多信息] #### 标注者是谁? [需补充更多信息] ### 个人与敏感信息 [需补充更多信息] ## 数据集使用注意事项 ### 数据集的社会影响 [需补充更多信息] ### 偏倚讨论 [需补充更多信息] ### 其他已知局限性 [需补充更多信息] ## 附加信息 ### 数据集维护者 https://github.com/microsoft, https://github.com/madlag ### 许可信息 计算数据使用协议(Computational Use of Data Agreement, C-UDA)许可证。 ### 引用信息 @article{iyer2018mapping, title={Mapping language to code in programmatic context}, author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke}, journal={arXiv preprint arXiv:1808.09588}, year={2018} } ### 贡献致谢 感谢@madlag(部分贡献来自@ncoop57)添加本数据集。
提供机构:
maas
创建时间:
2025-04-21
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作