code_x_glue_tt_text_to_text
Source: ModelScope community (updated 2025-11-07, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/google/code_x_glue_tt_text_to_text
# Dataset Card for "code_x_glue_tt_text_to_text"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text
- **Paper:** https://arxiv.org/abs/2102.04664
### Dataset Summary
The CodeXGLUE text-to-text dataset is available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text.
The data is crawled and filtered from Microsoft Documentation, whose source documents are hosted at https://github.com/MicrosoftDocs/.
### Supported Tasks and Leaderboards
- `machine-translation`: The dataset can be used to train a model that translates technical documentation between languages.
### Languages
Language pairs: da_en (Danish-English), lv_en (Latvian-English), no_en (Norwegian-English), zh_en (Chinese-English)
## Dataset Structure
### Data Instances
#### da_en
An example of 'test' looks as follows.
```
{
"id": 0,
"source": "4 . K\u00f8r modellen , og udgiv den som en webtjeneste .\n",
"target": "4 . Run the model , and publish it as a web service .\n"
}
```
#### lv_en
An example of 'train' looks as follows.
```
{
"id": 0,
"source": "title : Pakalpojumu objektu izveide\n",
"target": "title : Create service objects\n"
}
```
#### no_en
An example of 'validation' looks as follows.
```
{
"id": 0,
"source": "2 . \u00c5pne servicevaren du vil definere komponenter fra en stykkliste for .\n",
"target": "2 . Open the service item for which you want to set up components from a BOM .\n"
}
```
#### zh_en
An example of 'validation' looks as follows.
```
{
"id": 0,
"source": "& # 124 ; MCDUserNotificationReadStateFilterAny & # 124 ; 0 & # 124 ; \u5305\u62ec \u901a\u77e5 , \u800c \u4e0d \u8003\u8651 \u8bfb\u53d6 \u72b6\u6001 \u3002 & # 124 ;\n",
"target": "| MCDUserNotificationReadStateFilterAny | 0 | Include notifications regardless of read state . |\n"
}
```
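In the zh_en instance above, the source side carries a tokenized HTML entity (`& # 124 ;`) where the target has a literal `|`, and every string ends with a newline. The following is a minimal sketch of how such text could be cleaned for display, assuming that decoding the entity is what you want. The helper name `normalize_text` is hypothetical and not part of any dataset tooling.

```python
import re

# Matches a numeric HTML entity whose characters were split by whitespace
# during tokenization, e.g. "& # 124 ;" (which stands for "|").
_SPLIT_ENTITY = re.compile(r"&\s*#\s*(\d+)\s*;")


def normalize_text(text: str) -> str:
    """Hypothetical helper: decode split numeric entities, drop the trailing newline."""
    decoded = _SPLIT_ENTITY.sub(lambda m: chr(int(m.group(1))), text)
    return decoded.rstrip("\n")


# The zh_en validation example from this card.
record = {
    "id": 0,
    "source": "& # 124 ; MCDUserNotificationReadStateFilterAny & # 124 ; 0 & # 124 ; "
              "\u5305\u62ec \u901a\u77e5 , \u800c \u4e0d \u8003\u8651 \u8bfb\u53d6 "
              "\u72b6\u6001 \u3002 & # 124 ;\n",
    "target": "| MCDUserNotificationReadStateFilterAny | 0 | "
              "Include notifications regardless of read state . |\n",
}

clean_source = normalize_text(record["source"])
clean_target = normalize_text(record["target"])
print(clean_source)
print(clean_target)
```

Whether to apply this kind of cleanup before training is a design choice: a model evaluated against the raw test split would need to reproduce the raw tokenization.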
### Data Fields
The following table describes each data field for each config. The data fields are the same among all splits.
#### da_en, lv_en, no_en, zh_en
|field name| type | description |
|----------|------|----------------------------------------|
|id |int32 | The index of the sample |
|source |string| The source language version of the text|
|target |string| The target language version of the text|
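The three fields in the table can be mirrored by a small typed record. This is a sketch only: the class `TranslationExample` and the function `from_row` are hypothetical names introduced for illustration, and the int32 range check is an assumption drawn from the `int32` type listed for `id`.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


@dataclass(frozen=True)
class TranslationExample:
    """Hypothetical typed view of one row: id (int32), source (string), target (string)."""

    id: int
    source: str
    target: str

    def __post_init__(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.id, bool) or not isinstance(self.id, int):
            raise TypeError("id must be an integer")
        if not INT32_MIN <= self.id <= INT32_MAX:
            raise ValueError("id does not fit in int32")
        if not isinstance(self.source, str) or not isinstance(self.target, str):
            raise TypeError("source and target must be strings")


def from_row(row: dict) -> TranslationExample:
    """Build a TranslationExample from a dict holding exactly the three documented fields."""
    expected = {"id", "source", "target"}
    if set(row) != expected:
        raise KeyError(f"expected fields {sorted(expected)}, got {sorted(row)}")
    return TranslationExample(row["id"], row["source"], row["target"])


# The lv_en train example from this card.
example = from_row({
    "id": 0,
    "source": "title : Pakalpojumu objektu izveide\n",
    "target": "title : Create service objects\n",
})
print(example)
```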
### Data Splits
|name |train|validation|test|
|-----|----:|---------:|---:|
|da_en|42701| 1000|1000|
|lv_en|18749| 1000|1000|
|no_en|44322| 1000|1000|
|zh_en|50154| 1000|1000|
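The split sizes above can serve as a sanity check after downloading. The sketch below records the table as a dictionary and compares it against observed row counts. The function names are hypothetical, and the default `repo_id` in `load_and_check` is an assumption about where the data is hosted on the Hugging Face Hub; that function needs the `datasets` library and network access, so it is defined here but not called.

```python
# Split sizes copied from the Data Splits table of this card.
SPLIT_SIZES = {
    "da_en": {"train": 42701, "validation": 1000, "test": 1000},
    "lv_en": {"train": 18749, "validation": 1000, "test": 1000},
    "no_en": {"train": 44322, "validation": 1000, "test": 1000},
    "zh_en": {"train": 50154, "validation": 1000, "test": 1000},
}


def total_rows(config: str) -> int:
    """Total number of rows across all splits of one config."""
    return sum(SPLIT_SIZES[config].values())


def check_split_sizes(config: str, actual: dict) -> list:
    """Return the names of splits whose observed row count differs from the table."""
    expected = SPLIT_SIZES[config]
    return [name for name, n in expected.items() if actual.get(name) != n]


def load_and_check(config: str, repo_id: str = "google/code_x_glue_tt_text_to_text") -> list:
    """Load one config and compare its split sizes with the table.

    The repo_id default is an assumption; point it at wherever your copy lives.
    """
    from datasets import load_dataset  # imported lazily so the rest runs without it

    dataset = load_dataset(repo_id, config)
    actual = {name: split.num_rows for name, split in dataset.items()}
    return check_split_sizes(config, actual)


totals = {config: total_rows(config) for config in SPLIT_SIZES}
print(totals)
```

An empty list from `check_split_sizes` means every split matches the documented size.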
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
https://github.com/microsoft, https://github.com/madlag
### Licensing Information
Computational Use of Data Agreement (C-UDA) License.
### Citation Information
```
@article{DBLP:journals/corr/abs-2102-04664,
author = {Shuai Lu and
Daya Guo and
Shuo Ren and
Junjie Huang and
Alexey Svyatkovskiy and
Ambrosio Blanco and
Colin B. Clement and
Dawn Drain and
Daxin Jiang and
Duyu Tang and
Ge Li and
Lidong Zhou and
Linjun Shou and
Long Zhou and
Michele Tufano and
Ming Gong and
Ming Zhou and
Nan Duan and
Neel Sundaresan and
Shao Kun Deng and
Shengyu Fu and
Shujie Liu},
title = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code Understanding
and Generation},
journal = {CoRR},
volume = {abs/2102.04664},
year = {2021}
}
```
### Contributions
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
Provider: maas
Created: 2025-04-21



