gabeorlanski/bc-transcoder

Name: gabeorlanski/bc-transcoder
Creator: gabeorlanski
Published: 2023-07-18 16:22:39
License: No description provided

Hugging Face · Updated 2023-07-18 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/gabeorlanski/bc-transcoder


Resource description:

---
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
- translation
language:
- en
tags:
- code
pretty_name: BabelCode Transcoder
size_categories:
- 1K<n<10K
source_datasets:
- original
- extended|transcoder
---

# Dataset Card for BabelCode Transcoder

## Dataset Description

- **Repository:** [GitHub Repository](https://github.com/google-research/babelcode)
- **Paper:** [Measuring The Impact Of Programming Language Distribution](https://arxiv.org/abs/2302.01973)

### How To Use This Dataset

To use this dataset, you can either use the original [BabelCode Repo](https://github.com/google-research/babelcode), or you can use the [`bc_eval` Metric](https://huggingface.co/spaces/gabeorlanski/bc_eval).

### Dataset Summary

The [Transcoder](https://github.com/facebookresearch/CodeGen) dataset in BabelCode format. Currently supports translation from C++ and Python.

### Supported Tasks and Leaderboards

### Languages

BC-Transcoder supports:

* C++
* C#
* Dart
* Go
* Haskell
* Java
* JavaScript
* Julia
* Kotlin
* Lua
* PHP
* Python
* R
* Rust
* Scala
* TypeScript

## Dataset Structure

```python
>>> from datasets import load_dataset
>>> load_dataset("gabeorlanski/bc-transcoder")
DatasetDict({
    test: Dataset({
        features: ['qid', 'title', 'language', 'signature', 'arguments', 'source_py', 'source_cpp', 'question_info'],
        num_rows: 8384
    })
})
```

### Data Fields

- `qid`: The question ID used for running tests.
- `title`: The title of the question.
- `language`: The programming language of the example.
- `signature`: The signature for the problem.
- `arguments`: The arguments of the problem.
- `source_py`: The source solution in Python.
- `source_cpp`: The source solution in C++.
- `question_info`: The dict of information used for executing predictions. It has the keys:
  - `test_code`: The raw testing script used in the language. If you want to use this, replace `PLACEHOLDER_FN_NAME` (and `PLACEHOLDER_CLS_NAME` if needed) with the corresponding entry points. Next, replace `PLACEHOLDER_CODE_BODY` with the postprocessed prediction.
  - `test_list`: The raw JSON line of the list of tests for the problem. To load them, use `json.loads`.
  - `test_case_ids`: The list of test case ids for the problem. These are used to determine if a prediction passes or not.
  - `entry_fn_name`: The function's name to use as an entry point.
  - `entry_cls_name`: The class name to use as an entry point.
  - `commands`: The commands used to execute the prediction. Includes a `__FILENAME__` hole that is replaced with the filename.
  - `timeouts`: The default timeouts for each command.
  - `extension`: The extension for the prediction file.

**NOTE:** If you want to use a different function name (or class name for languages that require class names) for the prediction, you must update the `entry_fn_name` and `entry_cls_name` accordingly. For example, if the original question has an `entry_fn_name` of `removeOcc`, but you want to change it to `f`, you must update the row's `question_info["entry_fn_name"]` to `f`:

```python
>>> from datasets import load_dataset
>>> ds = load_dataset("gabeorlanski/bc-mbpp")['test']
>>> # Indexing returns a fresh dict for the row, so keep a reference to it
>>> example = ds[0]
>>> # The original entry_fn_name
>>> example['question_info']['entry_fn_name']
'removeOcc'
>>> # You MUST update the corresponding entry_fn_name
>>> example['question_info']['entry_fn_name'] = 'f'
>>> example['question_info']['entry_fn_name']
'f'
```

## Dataset Creation

See section 2 of the [BabelCode Paper](https://arxiv.org/abs/2302.01973) to learn more about how the datasets are translated.

For information on the original curation of the Transcoder Dataset, please see [Unsupervised Translation of Programming Languages](https://arxiv.org/pdf/2006.03511.pdf) by Roziere et al.

### Dataset Curators

Google Research

### Licensing Information

CC-BY-4.0

### Citation Information

```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
@article{roziere2020unsupervised,
  title={Unsupervised translation of programming languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}
```
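
The placeholder substitution described for `test_code` and `commands` can be sketched roughly as below. This is a minimal illustration, not the BabelCode or `bc_eval` implementation: the helper names `render_test_program` and `fill_commands` are hypothetical, the toy `question_info` is hand-written rather than taken from the dataset, and the assumption that `commands` is a list of argument lists is not confirmed by the card.

```python
def render_test_program(question_info, prediction):
    """Fill the placeholders in `test_code` in the order the card describes."""
    code = question_info["test_code"]
    # First the entry points ...
    code = code.replace("PLACEHOLDER_FN_NAME", question_info["entry_fn_name"])
    cls_name = question_info.get("entry_cls_name")
    if cls_name:  # only some languages need a class name
        code = code.replace("PLACEHOLDER_CLS_NAME", cls_name)
    # ... then the postprocessed prediction.
    return code.replace("PLACEHOLDER_CODE_BODY", prediction)


def fill_commands(commands, filename):
    """Replace the `__FILENAME__` hole in every command (assumed: list of argv lists)."""
    return [[part.replace("__FILENAME__", filename) for part in cmd] for cmd in commands]


# Hand-written toy record (hypothetical values, not a real dataset row).
toy_info = {
    "test_code": "PLACEHOLDER_CODE_BODY\n\nprint(PLACEHOLDER_FN_NAME(1, 2))\n",
    "entry_fn_name": "add",
    "entry_cls_name": "Solution",
    "commands": [["python", "__FILENAME__"]],
    "extension": "py",
}
prediction = "def add(a, b):\n    return a + b"

program = render_test_program(toy_info, prediction)
filename = "prediction." + toy_info["extension"]
commands = fill_commands(toy_info["commands"], filename)
```

If the prediction uses a different function or class name, `entry_fn_name` and `entry_cls_name` need to be changed before rendering, as the note in the card requires.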


Provider:

gabeorlanski

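Original information summary

Dataset overview

Dataset name

Name: BabelCode Transcoder
Alias: BC-Transcoder

Dataset description

Task categories:
- Text generation
- Text-to-text generation
- Translation
Language:
- English
Tags:
- Code
Size category:
- 1K<n<10K
Source datasets:
- Original
- Extended|Transcoder

Dataset structure

Dataset type: DatasetDict
Test split structure:
- Features:
  - qid
  - title
  - language
  - signature
  - arguments
  - source_py
  - source_cpp
  - question_info
- Rows: 8384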
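
Loading the single test split and narrowing it to one target language might look like the sketch below. `load_dataset` and `Dataset.filter` are real calls from the Hugging Face `datasets` library; the helper names, the toy rows, and the exact strings stored in the `language` column (for example "Python" versus "python") are assumptions to check against the actual data.

```python
from collections import Counter


def load_rows_for_language(language, repo_id="gabeorlanski/bc-transcoder"):
    """Load the test split and keep rows whose `language` column equals `language`."""
    # Imported lazily so the rest of this sketch runs without `datasets` installed.
    from datasets import load_dataset

    test_split = load_dataset(repo_id, split="test")
    return test_split.filter(lambda row: row["language"] == language)


def count_by_language(rows):
    """Count rows per target language; works on any iterable of dict-like rows."""
    return Counter(row["language"] for row in rows)


# Toy rows standing in for dataset records (hypothetical values).
toy_rows = [
    {"qid": "1", "language": "Python"},
    {"qid": "1", "language": "Go"},
    {"qid": "2", "language": "Python"},
]
counts = count_by_language(toy_rows)
```

Calling `count_by_language` on the loaded split is a quick way to see how the 8384 rows are distributed across the supported languages.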

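Data fields

qid: The question ID, used for running tests.
title: The title of the question.
language: The programming language of the example.
signature: The signature for the problem.
arguments: The arguments of the problem.
source_py: The source solution in Python.
source_cpp: The source solution in C++.
question_info: A dict of information used for executing predictions, with keys such as the test code (test_code), the test list (test_list), and the test case IDs (test_case_ids).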
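
As a rough sketch of how the test list and test case IDs inside question_info can be used: the test list is stored as a raw JSON string, so it is decoded with `json.loads`, and a prediction counts as passing only when every listed test case ID passed. The per-test schema (`idx`, `inputs`, `outputs`), the `"PASSED"` status label, and both helper names are assumptions made for illustration, not the documented BabelCode format.

```python
import json


def load_tests(question_info):
    """Decode the raw JSON line stored under `test_list`."""
    return json.loads(question_info["test_list"])


def prediction_passes(question_info, results):
    """Return True only if every id in `test_case_ids` is marked as passed.

    `results` maps a test case id to a status string; the "PASSED" label
    is an assumed convention, and a missing id counts as a failure.
    """
    return all(results.get(tid) == "PASSED" for tid in question_info["test_case_ids"])


# Hand-written toy record (hypothetical schema and values).
toy_info = {
    "test_list": '[{"idx": 0, "inputs": {"a": 1, "b": 2}, "outputs": 3},'
                 ' {"idx": 1, "inputs": {"a": 0, "b": 0}, "outputs": 0}]',
    "test_case_ids": ["0", "1"],
}

tests = load_tests(toy_info)
all_passed = prediction_passes(toy_info, {"0": "PASSED", "1": "PASSED"})
one_failed = prediction_passes(toy_info, {"0": "PASSED", "1": "FAILED"})
one_missing = prediction_passes(toy_info, {"0": "PASSED"})
```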

Supported tasks and languages

Supported tasks: Translation
Supported languages:
- C++
- C#
- Dart
- Go
- Haskell
- Java
- JavaScript
- Julia
- Kotlin
- Lua
- PHP
- Python
- R
- Rust
- Scala
- TypeScript

License

License: Apache-2.0


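Collected summary

Dataset introduction

Construction

BabelCode Transcoder is the Transcoder dataset converted into the BabelCode format. Each record keeps the Python and C++ source solutions of a problem and pairs them with a signature, arguments, and the test information for one target programming language. The published data contains a single test split, so it is evaluation data for code translation rather than a training corpus.

Features

The dataset covers multiple programming languages, including C++ and Python, and is tagged for text generation, text-to-text generation, and translation. Its structure is explicit: every record carries a question ID, title, language, signature, arguments, source code in Python and C++, and the information needed to execute predictions.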

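Usage

The dataset can be loaded directly with the Hugging Face datasets library. It ships only a test split, so it is used to evaluate models rather than to train them. Because multiple programming languages are supported, users must set the entry function name or class name expected for each language and substitute the predicted code body into the corresponding test script. Each record also provides test case IDs, execution commands, and timeout settings, which are used to check whether a prediction passes.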
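
How the execution commands and timeout settings might be applied is sketched below using only the standard library. This is an assumption-heavy illustration, not the `bc_eval` runner: it assumes `commands` is a list of argument lists, that `timeouts` is a parallel list of seconds, and the `run_commands` and `demo` helpers and their status strings are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_commands(commands, timeouts, filename, workdir):
    """Run each command in order; stop at the first failure or timeout."""
    outputs = []
    for command, timeout in zip(commands, timeouts):
        # Fill the `__FILENAME__` hole with the prediction file name.
        argv = [part.replace("__FILENAME__", filename) for part in command]
        try:
            proc = subprocess.run(
                argv, cwd=workdir, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return "TIMED_OUT", outputs
        outputs.append(proc.stdout)
        if proc.returncode != 0:
            return "FAILED", outputs
    return "OK", outputs


def demo():
    """Write a toy prediction file and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "prediction.py").write_text("print('hello')\n")
        return run_commands(
            [[sys.executable, "__FILENAME__"]], [10], "prediction.py", workdir
        )
```

Model-generated code should only be executed in a sandboxed environment.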

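Background and challenges

Background

BabelCode Transcoder, or BC-Transcoder for short, is a dataset for research on programming language translation. It was created by the Google Research team in 2023 as part of work on measuring the impact of programming language distribution. BC-Transcoder supports translation from C++ and Python into a range of other languages, such as Java, which makes it an experimental platform for studying code translation.

Current challenges

Building a dataset like BC-Transcoder involves several challenges. First, programming languages differ widely in syntax, so aligning their semantics accurately is a key problem. Second, construction requires a large amount of high-quality parallel code, which is costly to obtain and process. Third, evaluating and calibrating the data so that translations are accurate and usable is difficult in itself. On the application side, the open problem the dataset addresses is how to raise the level of automation in cross-language programming tasks, which matters for building multi-language programming environments.

Common scenarios

Typical use

In programming language translation, BabelCode Transcoder is used to evaluate models that translate code from one programming language to another. A typical workflow has a model translate the Python or C++ source solution into a target language and then checks the result against the provided tests.

Academic problems addressed

The dataset targets the accuracy, efficiency, and practicality of programming language translation. Its translation examples across language pairs let researchers measure the performance of translation models, which is relevant to software engineering, compiler design, and natural language processing.

Related derivative work

Related work reported as building on BabelCode Transcoder includes accuracy evaluation for programming language translation, cross-language code search systems, and automated prediction of programming language evolution.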

The content above was collected and summarized by 遇见数据集.
