
codeparrot/xlcost-text-to-code

Source: Hugging Face · Updated: 2022-10-25 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/codeparrot/xlcost-text-to-code
Resource description:
---
annotations_creators: []
language_creators:
- crowdsourced
- expert-generated
language:
- code
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets: []
task_categories:
- text-generation
task_ids:
- language-modeling
pretty_name: xlcost-text-to-code
---

# XLCost for text-to-code synthesis

## Dataset Description

This is a subset of the [XLCoST benchmark](https://github.com/reddy-lab-code-research/XLCoST), for text-to-code generation at snippet level and program level for **7** programming languages: `Python, C, C#, C++, Java, Javascript and PHP`.

## Languages

The dataset contains text in English and its corresponding code translation. Each program is divided into several code snippets, so the snippet-level subsets contain these code snippets with their corresponding comments; for the program-level subsets, the comments were concatenated into one long description. Moreover, programs in all the languages are aligned at the snippet level, and the comment for a particular snippet is the same across all the languages.

## Dataset Structure

To load the dataset you need to specify a subset among the **14 existing instances**: `LANGUAGE-snippet-level` or `LANGUAGE-program-level` for `LANGUAGE` in `[Python, C, Csharp, C++, Java, Javascript, PHP]`. By default `Python-snippet-level` is loaded.

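The naming scheme described above can be sketched as a small helper that builds the 14 subset names from the language list and the two granularities. The names `LANGUAGES`, `LEVELS`, and `subset_names` are hypothetical and introduced only for illustration; the exact strings are assumed from the `LANGUAGE-snippet-level` / `LANGUAGE-program-level` pattern stated in this card, so it may be worth checking them against the hub (for example with `datasets.get_dataset_config_names`) before relying on them.

```python
# Illustrative helper (not part of the dataset or the `datasets` library):
# builds subset names following the pattern described in this card.
LANGUAGES = ["Python", "C", "Csharp", "C++", "Java", "Javascript", "PHP"]
LEVELS = ["snippet-level", "program-level"]


def subset_names():
    """Return every LANGUAGE-LEVEL combination, assumed to match the hub configs."""
    return [f"{lang}-{level}" for lang in LANGUAGES for level in LEVELS]


if __name__ == "__main__":
    for name in subset_names():
        print(name)
```

Any of these strings could then be passed as the second argument to `load_dataset`, as in the example that follows.
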
```python
from datasets import load_dataset

data = load_dataset("codeparrot/xlcost-text-to-code", "Python-program-level")
print(data)

DatasetDict({
    train: Dataset({
        features: ['text', 'code'],
        num_rows: 9263
    })
    test: Dataset({
        features: ['text', 'code'],
        num_rows: 887
    })
    validation: Dataset({
        features: ['text', 'code'],
        num_rows: 472
    })
})
```

```python
next(iter(data["train"]))
{'text': 'Maximum Prefix Sum possible by merging two given arrays | Python3 implementation of the above approach ; Stores the maximum prefix sum of the array A [ ] ; Traverse the array A [ ] ; Stores the maximum prefix sum of the array B [ ] ; Traverse the array B [ ] ; Driver code',
 'code': 'def maxPresum ( a , b ) : NEW_LINE INDENT X = max ( a [ 0 ] , 0 ) NEW_LINE for i in range ( 1 , len ( a ) ) : NEW_LINE INDENT a [ i ] += a [ i - 1 ] NEW_LINE X = max ( X , a [ i ] ) NEW_LINE DEDENT Y = max ( b [ 0 ] , 0 ) NEW_LINE for i in range ( 1 , len ( b ) ) : NEW_LINE INDENT b [ i ] += b [ i - 1 ] NEW_LINE Y = max ( Y , b [ i ] ) NEW_LINE DEDENT return X + Y NEW_LINE DEDENT A = [ 2 , - 1 , 4 , - 5 ] NEW_LINE B = [ 4 , - 3 , 12 , 4 , - 3 ] NEW_LINE print ( maxPresum ( A , B ) ) NEW_LINE'}
```

Note that the data has undergone some tokenization, hence the additional whitespace and the use of `NEW_LINE` instead of `\n`, `INDENT` instead of `\t`, and `DEDENT` to cancel indentation.

## Data Fields

* text: natural language description/comment
* code: code at snippet/program level

## Data Splits

Each subset has three splits: train, test and validation.

## Citation Information

```
@misc{zhu2022xlcost,
    title = {XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence},
    url = {https://arxiv.org/abs/2206.08474},
    author = {Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K.},
    year = {2022},
    eprint={2206.08474},
    archivePrefix={arXiv}
}
```
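Returning to the tokenization note in the Dataset Structure section: a minimal sketch of how the `NEW_LINE`, `INDENT`, and `DEDENT` markers might be turned back into readable Python source. The function name `detokenize` and the four-space indent are assumptions made for illustration, not something the dataset provides. The sketch only handles these three markers; it leaves the extra spaces around punctuation untouched and would collapse whitespace inside string literals.

```python
def detokenize(code: str, indent: str = "    ") -> str:
    """Rebuild line structure from NEW_LINE / INDENT / DEDENT markers.

    Hypothetical helper: assumes whitespace-separated tokens, as in the
    sample shown in this card.
    """
    lines = []      # finished source lines
    current = []    # tokens of the line being built
    depth = 0       # current indentation level
    for token in code.split():
        if token == "NEW_LINE":
            lines.append(indent * depth + " ".join(current))
            current = []
        elif token == "INDENT":
            depth += 1
        elif token == "DEDENT":
            depth = max(depth - 1, 0)
        else:
            current.append(token)
    if current:  # flush a trailing line that has no NEW_LINE marker
        lines.append(indent * depth + " ".join(current))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "def f ( x ) : NEW_LINE INDENT return x + 1 NEW_LINE "
        "DEDENT print ( f ( 1 ) ) NEW_LINE"
    )
    print(detokenize(sample))
```

Because Python tolerates spaces around brackets and operators, the reconstructed text for samples like the one above should still parse, though running it through a formatter afterwards would give more conventional spacing.
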

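Provider:
codeparrot
Summary of original information

Dataset overview

Basic information

  • Name: xlcost-text-to-code
  • Task types: text-generation, language-modeling
  • License: cc-by-sa-4.0
  • Languages: multilingual, covering English and 7 programming languages (Python, C, C#, C++, Java, Javascript, PHP)
  • Dataset size: unknown
  • Source datasets: none
  • Annotation creators: none
  • Language creators: crowdsourced and expert-generated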

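Dataset description

  • Purpose: text-to-code generation at both snippet level and program level.
  • Supported programming languages: Python, C, C#, C++, Java, Javascript, PHP
  • Data structure: text descriptions paired with the corresponding code. Each program is split into several code snippets; the snippet-level subsets contain these snippets with their corresponding comments, while in the program-level subsets the comments are concatenated into one long description.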

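Dataset structure

  • Loading: load the dataset by specifying one of the 14 existing subsets, named `LANGUAGE-snippet-level` or `LANGUAGE-program-level`. `Python-snippet-level` is loaded by default.

  • Example: `from datasets import load_dataset`, then `load_dataset("codeparrot/xlcost-text-to-code", "Python-program-level")`

  • Data fields:

    • text: natural language description/comment
    • code: snippet-level or program-level code
  • Data splits: each subset contains train, test, and validation splits.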

Citation information

@misc{zhu2022xlcost,
    title = {XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence},
    url = {https://arxiv.org/abs/2206.08474},
    author = {Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K.},
    year = {2022},
    eprint={2206.08474},
    archivePrefix={arXiv}
}
