Download link:

https://modelscope.cn/datasets/google/code_x_glue_tt_text_to_text

Resource description:

# Dataset Card for "code_x_glue_tt_text_to_text"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text
- **Paper:** https://arxiv.org/abs/2102.04664

### Dataset Summary

The CodeXGLUE text-to-text dataset is available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text.

The data was crawled and filtered from Microsoft Documentation, whose source documents are hosted at https://github.com/MicrosoftDocs/.

### Supported Tasks and Leaderboards

- `machine-translation`: The dataset can be used to train a model for translating technical documentation between languages.

### Languages

The dataset covers four language pairs: da_en (Danish-English), lv_en (Latvian-English), no_en (Norwegian-English), and zh_en (Chinese-English).

## Dataset Structure

### Data Instances

#### da_en

An example of 'test' looks as follows.

```
{
    "id": 0,
    "source": "4 . K\u00f8r modellen , og udgiv den som en webtjeneste .\n",
    "target": "4 . Run the model , and publish it as a web service .\n"
}
```

#### lv_en

An example of 'train' looks as follows.

```
{
    "id": 0,
    "source": "title : Pakalpojumu objektu izveide\n",
    "target": "title : Create service objects\n"
}
```

#### no_en

An example of 'validation' looks as follows.

```
{
    "id": 0,
    "source": "2 . \u00c5pne servicevaren du vil definere komponenter fra en stykkliste for .\n",
    "target": "2 . Open the service item for which you want to set up components from a BOM .\n"
}
```

#### zh_en

An example of 'validation' looks as follows.

```
{
    "id": 0,
    "source": "& # 124 ; MCDUserNotificationReadStateFilterAny & # 124 ; 0 & # 124 ; \u5305\u62ec \u901a\u77e5 , \u800c \u4e0d \u8003\u8651 \u8bfb\u53d6 \u72b6\u6001 \u3002 & # 124 ;\n",
    "target": "| MCDUserNotificationReadStateFilterAny | 0 | Include notifications regardless of read state . |\n"
}
```

### Data Fields

The data fields are explained below for each config. The fields are the same across all splits.

#### da_en, lv_en, no_en, zh_en

| field name | type   | description                             |
|------------|--------|-----------------------------------------|
| id         | int32  | The index of the sample                 |
| source     | string | The source language version of the text |
| target     | string | The target language version of the text |

### Data Splits

| name  | train | validation | test |
|-------|------:|-----------:|-----:|
| da_en | 42701 |       1000 | 1000 |
| lv_en | 18749 |       1000 | 1000 |
| no_en | 44322 |       1000 | 1000 |
| zh_en | 50154 |       1000 | 1000 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@article{DBLP:journals/corr/abs-2102-04664,
  author  = {Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and
             Alexey Svyatkovskiy and Ambrosio Blanco and Colin B. Clement and
             Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and
             Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and
             Ming Gong and Ming Zhou and Nan Duan and Neel Sundaresan and
             Shao Kun Deng and Shengyu Fu and Shujie Liu},
  title   = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code
             Understanding and Generation},
  journal = {CoRR},
  volume  = {abs/2102.04664},
  year    = {2021}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
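As a rough illustration of the record layout described under Data Fields, the sketch below parses the `da_en` example from Data Instances with Python's standard `json` module and checks it against the three documented fields. The `Pair` class and the `parse_pair` and `total_examples` helpers are hypothetical names introduced for this example; they are not part of CodeXGLUE or any library. The split sizes are copied from the Data Splits table, and treating each record as a single JSON string is an assumption about how the data is stored.

```python
import json
from dataclasses import dataclass

# Split sizes copied from the Data Splits table of this card.
SPLIT_SIZES = {
    "da_en": {"train": 42701, "validation": 1000, "test": 1000},
    "lv_en": {"train": 18749, "validation": 1000, "test": 1000},
    "no_en": {"train": 44322, "validation": 1000, "test": 1000},
    "zh_en": {"train": 50154, "validation": 1000, "test": 1000},
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class Pair:
    """One translation pair (hypothetical helper type, not a library class)."""
    id: int
    source: str
    target: str


def parse_pair(line: str) -> Pair:
    """Parse one JSON record and check it against the documented fields."""
    record = json.loads(line)
    if not isinstance(record, dict) or set(record) != {"id", "source", "target"}:
        raise ValueError("record must have exactly the fields id, source, target")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(record["id"], bool) or not isinstance(record["id"], int):
        raise TypeError("id must be an integer")
    if not INT32_MIN <= record["id"] <= INT32_MAX:
        raise ValueError("id does not fit in int32")
    for key in ("source", "target"):
        if not isinstance(record[key], str):
            raise TypeError(f"{key} must be a string")
    return Pair(record["id"], record["source"], record["target"])


def total_examples(config: str) -> int:
    """Sum the train, validation, and test sizes for one config."""
    return sum(SPLIT_SIZES[config].values())


# The da_en 'test' example from the Data Instances section (raw JSON text).
raw = r'{"id": 0, "source": "4 . K\u00f8r modellen , og udgiv den som en webtjeneste .\n", "target": "4 . Run the model , and publish it as a web service .\n"}'
pair = parse_pair(raw)
print(pair.source.strip())      # source text with the trailing newline removed
print(total_examples("da_en"))  # 42701 + 1000 + 1000
```

Note that both `source` and `target` are pre-tokenized (punctuation is separated by spaces) and end with a newline, so downstream code may want to strip or preserve that whitespace deliberately.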


Application scenarios: