blindsubmissions/M2CRB

Name: blindsubmissions/M2CRB
Creator: blindsubmissions
Published: 2023-08-08 15:06:30
License: No description available

Source: Hugging Face. Updated 2023-08-08; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/blindsubmissions/M2CRB


Description:

---
dataset_info:
  features:
  - name: identifier
    dtype: string
  - name: parameters
    dtype: string
  - name: return_statement
    dtype: string
  - name: docstring
    dtype: string
  - name: docstring_summary
    dtype: string
  - name: function
    dtype: string
  - name: function_tokens
    sequence: string
  - name: start_point
    sequence: int64
  - name: end_point
    sequence: int64
  - name: argument_list
    dtype: 'null'
  - name: language
    dtype: string
  - name: docstring_language
    dtype: string
  - name: docstring_language_predictions
    dtype: string
  - name: is_langid_reliable
    dtype: string
  - name: is_langid_extra_reliable
    dtype: bool
  - name: type
    dtype: string
  splits:
  - name: test
    num_bytes: 15742687
    num_examples: 7743
  download_size: 5530793
  dataset_size: 15742687
license: other
task_categories:
- translation
- summarization
language:
- pt
- de
- fr
- es
tags:
- code
pretty_name: m
size_categories:
- 1K<n<10K
---

# M2CRB

## Dataset Summary

M2CRB contains pairs of text and code covering multiple combinations of natural and programming languages: Spanish, Portuguese, German, and French, each paired with code snippets in Python, Java, and JavaScript.

The data is curated from source files within [The Stack](https://huggingface.co/datasets/bigcode/the-stack) via an automated filtering pipeline, followed by human verification to ensure accurate language classification. That is, humans were asked to filter out examples whose natural language did not correspond to a target language.

## Supported Tasks

M2CRB is a multilingual evaluation dataset for code-to-text and/or text-to-code models, for both information retrieval and conditional generation evaluations.
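
As an illustration of the information retrieval setting mentioned above, the sketch below computes mean reciprocal rank (MRR) over a toy score matrix. MRR is a common retrieval metric, but the card does not say which metrics are used with M2CRB, so the metric choice, the function name `mean_reciprocal_rank`, and the toy scores are all assumptions made for illustration only.

```python
def mean_reciprocal_rank(similarity, gold_indices):
    """Compute MRR for a batch of queries.

    similarity[i][j] is the score of candidate j for query i (higher is better).
    gold_indices[i] is the index of the correct candidate for query i.
    """
    total = 0.0
    for scores, gold in zip(similarity, gold_indices):
        # Rank = 1 + number of candidates scoring strictly higher than the gold one.
        rank = 1 + sum(1 for s in scores if s > scores[gold])
        total += 1.0 / rank
    return total / len(gold_indices)


# Hypothetical scores: 3 docstring queries against 3 candidate functions.
toy_scores = [
    [0.9, 0.1, 0.3],  # gold candidate 0 is ranked 1st
    [0.8, 0.5, 0.2],  # gold candidate 1 is ranked 2nd
    [0.4, 0.6, 0.1],  # gold candidate 2 is ranked 3rd
]
mrr = mean_reciprocal_rank(toy_scores, [0, 1, 2])
```

In a text-to-code evaluation, each row would hold the similarity between one docstring and every candidate function, with the gold index pointing at the function the docstring was extracted from.
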
## Currently Supported Languages

```python
NATURAL_LANGUAGE_SET = {"es", "fr", "pt", "de"}
PROGRAMMING_LANGUAGE_SET = {"python", "java", "javascript"}
```

## How to get the data with a given language combination

```python
from datasets import load_dataset


def get_dataset(prog_lang, nat_lang):
    # Load every available split of M2CRB (the card lists a single "test" split).
    test_data = load_dataset("blindsubmissions/M2CRB")

    # Keep only the examples that match the requested language combination.
    test_data = test_data.filter(
        lambda example: example["docstring_language"] == nat_lang
        and example["language"] == prog_lang
    )
    return test_data
```

## Dataset Structure

### Data Instances

Each data instance corresponds to a function or method occurring in the licensed files that compose The Stack, that is, files with permissive licenses collected from GitHub.

### Relevant Data Fields

- identifier (string): Function/method name.
- parameters (string): Function parameters.
- return_statement (string): Return statement if found during parsing.
- docstring (string): Complete docstring content.
- docstring_summary (string): Summary/processed docstring dropping args and return statements.
- function (string): Actual function/method content.
- argument_list (null): List of arguments.
- language (string): Programming language of the function.
- docstring_language (string): Natural language of the docstring.
- type (string): Return type if found during parsing.

## Summary of data curation pipeline

- Filtering out repositories that appear in [CodeSearchNet](https://huggingface.co/datasets/code_search_net).
- Filtering the files that belong to the programming languages of interest.
- Pre-filtering the files that likely contain text in the natural languages of interest.
- AST parsing with [Tree-sitter](https://tree-sitter.github.io/tree-sitter/).
- Performing language identification of docstrings in the resulting set of functions/methods.
- Performing human verification/validation of the underlying language of docstrings.
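
The filtering steps listed above can be sketched roughly as follows. This is not the authors' pipeline: the record keys `repo`, `language`, and `docstring`, the `curate` function, and the keyword-based `toy_detector` are hypothetical stand-ins, and the Tree-sitter parsing and human verification steps are omitted entirely.

```python
NATURAL_LANGUAGE_SET = {"es", "fr", "pt", "de"}
PROGRAMMING_LANGUAGE_SET = {"python", "java", "javascript"}


def curate(records, codesearchnet_repos, detect_language):
    """Apply the automated filters to an iterable of candidate functions.

    Each record is a dict with the hypothetical keys "repo", "language",
    and "docstring". detect_language is any callable that maps a docstring
    to a language code.
    """
    kept = []
    for record in records:
        # Drop repositories that already appear in CodeSearchNet.
        if record["repo"] in codesearchnet_repos:
            continue
        # Keep only the programming languages of interest.
        if record["language"] not in PROGRAMMING_LANGUAGE_SET:
            continue
        # Identify the docstring language and keep only target languages.
        nat_lang = detect_language(record["docstring"])
        if nat_lang not in NATURAL_LANGUAGE_SET:
            continue
        kept.append({**record, "docstring_language": nat_lang})
    return kept


def toy_detector(text):
    # Keyword lookup standing in for a real language-identification model.
    hints = {"Devuelve": "es", "Retourne": "fr", "Returns": "en"}
    for word, lang in hints.items():
        if word in text:
            return lang
    return "unknown"


candidates = [
    {"repo": "a/b", "language": "python", "docstring": "Devuelve la suma."},
    {"repo": "c/d", "language": "python", "docstring": "Returns the sum."},
    {"repo": "e/f", "language": "go", "docstring": "Retourne la somme."},
    {"repo": "g/h", "language": "java", "docstring": "Retourne la somme."},
]
curated = curate(candidates, codesearchnet_repos={"g/h"}, detect_language=toy_detector)
```

Here only the first candidate survives: the second has an English docstring, the third is written in Go, and the fourth comes from a repository flagged as overlapping with CodeSearchNet.
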
## Social Impact of the dataset

M2CRB is released with the aim of increasing the coverage of the NLP-for-code research community by providing data for scarce combinations of languages. We expect this data to help enable more accurate information retrieval systems and text-to-code or code-to-text summarization in languages other than English.

As a subset of The Stack, this dataset inherits the de-risking efforts carried out when that dataset was built. Risks nonetheless remain: the data could be misused, for instance to aid in the creation of malicious code. This is, however, a risk shared by any code dataset made openly available. Moreover, while unlikely due to human filtering, the data may contain harmful or offensive language, which could be learned by models.

## Discussion of Biases

The data is collected from GitHub and consists of text naturally occurring on that platform. As a consequence, certain language combinations are more or less likely to contain well-documented code, and the resulting data is not uniformly distributed across natural and programming languages.

## Known limitations

While we cover 16 scarce combinations of programming and natural languages, our evaluation dataset can be expanded to further improve its coverage. Moreover, we use text naturally occurring as comments or docstrings rather than text written by human annotators. As such, the resulting data will have high variance in quality, depending on the practices of sub-communities of software developers. However, the task our evaluation dataset defines reflects what searching a real codebase would look like. Finally, some imbalance in the data is observed for the same reason: certain language combinations are more or less likely to contain well-documented code.

## Maintenance plan:

The data will be kept up to date by following The Stack releases.
We plan to rerun our pipeline for every new release and add non-overlapping new content to both the training and testing partitions of our data, so that we carry over opt-out updates and include fresh repositories.

## Update plan:

- Short term:
  - Cover all 6 programming languages from CodeSearchNet.
- Long-term:
  - Add an extra test set containing human-generated text/code pairs so the gap between in-the-wild and controlled performance can be measured.
  - Include extra natural languages.

## Licensing Information

M2CRB is a subset filtered and pre-processed from [The Stack](https://huggingface.co/datasets/bigcode/the-stack), a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in M2CRB must abide by the terms of the original licenses.

Provider:

blindsubmissions

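Summary of the original information

M2CRB dataset overview

Dataset summary

M2CRB contains pairs of text and code covering multiple combinations of natural and programming languages: Spanish, Portuguese, German, and French, paired with code snippets in Python, Java, and JavaScript. The data is filtered from The Stack source files by an automated pipeline and then verified by humans to ensure accurate language classification.

Supported tasks

M2CRB is a multilingual evaluation dataset for code-to-text and/or text-to-code models, for information retrieval or conditional generation evaluations.

Currently supported languages

Natural languages: Spanish, French, Portuguese, German
Programming languages: Python, Java, JavaScript

Dataset structure

Data instances

Each data instance corresponds to a function or method in The Stack, taken from files with permissive licenses collected from GitHub.

Dataset maintenance plan

The data will be kept up to date by following The Stack releases. For every new release, the pipeline will be rerun and non-overlapping new content added to the training and testing partitions of the dataset.

Update plan

Short term:
- Cover all 6 programming languages from CodeSearchNet.
Long term:
- Add an extra test set containing human-generated text/code pairs to measure the gap between in-the-wild and controlled performance.
- Include extra natural languages.

Licensing information

M2CRB is a subset filtered and pre-processed from The Stack, a collection of source code from repositories with various licenses. Any use of all or part of the code in M2CRB must abide by the terms of the original licenses.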
