Name: lt-asset/REPOCOD_Lite
Creator: lt-asset
Published: 2024-12-03 00:54:40
License: No description available

Download link:

https://hf-mirror.com/datasets/lt-asset/REPOCOD_Lite

Description:

---
dataset_info:
  features:
  - name: repository
    dtype: string
  - name: repo_id
    dtype: string
  - name: target_module_path
    dtype: string
  - name: prompt
    dtype: string
  - name: relavent_test_path
    dtype: string
  - name: full_function
    dtype: string
  - name: function_name
    dtype: string
  - name: context-complexity
    dtype: string
  splits:
  - name: test
    num_bytes: 2406711
    num_examples: 200
  download_size: 937112
  dataset_size: 2406711
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---

# Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Large language models (LLMs) have achieved high accuracy, i.e., more than 90 pass@1, in solving Python coding problems in HumanEval and MBPP. A natural question is therefore whether LLMs can match human developers in code completion. Unfortunately, this question cannot be answered with existing manually crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development. In addition, existing benchmarks often use poor code correctness metrics, which leads to misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. REPOCOD also has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) among existing benchmarks. Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation. In our evaluation of ten LLMs, none of the models achieves more than 30 pass@1 on REPOCOD, which shows the need for stronger LLMs that can help developers in real-world software development.

For easier evaluation, we sample 200 of the hardest problems in REPOCOD to create REPOCOD-Lite, using the product of the prompt length and the canonical solution length (both in line count) as an indicator of difficulty. From the three categories of questions (self-contained, file-level, and repo-level), we select 66, 67, and 67 samples respectively, in descending order of score.

* For more details on data collection and evaluation results, please refer to our arXiv [preprint](https://arxiv.org/abs/2410.21647).
* Example code for downloading repositories, preparing repository snapshots, and running test cases for evaluation is provided at [code](https://github.com/lt-asset/REPOCOD).
* Check our [Leaderboard](https://lt-asset.github.io/REPOCOD/) for preliminary results using SOTA LLMs with RAG.

## Usage

```python
from datasets import load_dataset

data = load_dataset('lt-asset/REPOCOD_Lite')
print(data)

# Printed structure (split name and features as listed in the dataset metadata):
# DatasetDict({
#     test: Dataset({
#         features: ['repository', 'repo_id', 'target_module_path', 'prompt', 'relavent_test_path', 'full_function', 'function_name', 'context-complexity'],
#         num_rows: 200
#     })
# })
```

## Data Fields

- repository: the source repository of the current sample
- repo_id: the unique index of the sample in the corresponding source repository
- target_module_path: the path of the file containing the current sample, relative to the root of the source repository
- prompt: the developer-provided function signature and docstring
- relavent_test_path: the path to the relevant test cases
- full_function: the canonical solution of the current sample
- function_name: the name of the target function (current sample)
- context-complexity: the context level of the sample; this field is listed in the dataset metadata and is presumably one of the three categories above (self-contained, file-level, or repo-level)

## Example

```
"repository": "seaborn", # collected from seaborn
"repo_id": "6", # index of this sample within seaborn
"target_module_path": "seaborn/_base.py", # the target function is in this path
"prompt": "    def iter_data(
        self, grouping_vars=None, *,
        reverse=False, from_comp_data=False,
        by_facet=True, allow_empty=False, dropna=True,
    ):
        '''Generator for getting subsets of data defined by semantic variables.

        Also injects "col" and "row" into grouping semantics.

        Parameters
        ----------
        grouping_vars : string or list of strings
            Semantic variables that define the subsets of data.
        reverse : bool
            If True, reverse the order of iteration.
        from_comp_data : bool
            If True, use self.comp_data rather than self.plot_data
        by_facet : bool
            If True, add faceting variables to the set of grouping variables.
        allow_empty : bool
            If True, yield an empty dataframe when no observations exist for
            combinations of grouping variables.
        dropna : bool
            If True, remove rows with missing data.

        Yields
        ------
        sub_vars : dict
            Keys are semantic names, values are the level of that semantic.
        sub_data : :class:`pandas.DataFrame`
            Subset of ``plot_data`` for this combination of semantic values.

        '''", # the function signature and docstring for the target function
"relavent_test_path": "/usr/src/app/target_test_cases/failed_tests_Continuous.label.txt", # path to relevant tests for the function
"full_function": "    def iter_data(
        self, grouping_vars=None, *,
        reverse=False, from_comp_data=False,
        by_facet=True, allow_empty=False, dropna=True,
    ):
        '''Generator for getting subsets of data defined by semantic variables.

        Also injects "col" and "row" into grouping semantics.

        Parameters
        ----------
        grouping_vars : string or list of strings
            Semantic variables that define the subsets of data.
        reverse : bool
            If True, reverse the order of iteration.
        from_comp_data : bool
            If True, use self.comp_data rather than self.plot_data
        by_facet : bool
            If True, add faceting variables to the set of grouping variables.
        allow_empty : bool
            If True, yield an empty dataframe when no observations exist for
            combinations of grouping variables.
        dropna : bool
            If True, remove rows with missing data.

        Yields
        ------
        sub_vars : dict
            Keys are semantic names, values are the level of that semantic.
        sub_data : :class:`pandas.DataFrame`
            Subset of ``plot_data`` for this combination of semantic values.

        '''
        if grouping_vars is None:
            grouping_vars = []
        ...", # the full snippet of the target function, including the function signature and docstring
"function_name": "VectorPlotter.iter_data" # the name of the target function
```

## Citation

```
@misc{liang2024repocod,
  title={Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'},
  author={Shanchao Liang and Yiran Hu and Nan Jiang and Lin Tan},
  year={2024},
  eprint={2410.21647},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2410.21647},
}
```
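In the example record, `full_function` repeats the `prompt` (signature and docstring) and then continues with the function body, and `function_name` is qualified with its class. The following is a minimal sketch of splitting a record along those lines. It assumes that the canonical solution starts with the prompt text verbatim, which holds for the example shown here but is not guaranteed by this card. The helper name `split_sample` and the toy record are hypothetical and not part of the dataset or the REPOCOD code.

```python
def split_sample(sample: dict) -> dict:
    """Split one record into the qualified name parts and the body a model must write."""
    prompt = sample["prompt"]
    full = sample["full_function"]

    # Assumption: the canonical solution begins with the prompt text verbatim.
    # If it does not, report None rather than guessing where the body starts.
    body = full[len(prompt):] if full.startswith(prompt) else None

    # function_name may be qualified, e.g. "VectorPlotter.iter_data".
    owner, _, name = sample["function_name"].rpartition(".")
    return {"owner": owner or None, "name": name, "body": body}


# A tiny hand-made record (hypothetical values, not taken from the dataset).
toy = {
    "prompt": "def add(a, b):\n    '''Return a + b.'''",
    "full_function": "def add(a, b):\n    '''Return a + b.'''\n    return a + b",
    "function_name": "Calculator.add",
}

parts = split_sample(toy)
print(parts["owner"], parts["name"])
print(repr(parts["body"]))
```

Returning `None` for the body when the prefix does not match keeps the sketch from silently producing a wrong split for records laid out differently.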

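The REPOCOD-Lite selection described in the card (difficulty scored as prompt length times canonical solution length, both in lines, then the top 66, 67, and 67 samples taken from the self-contained, file-level, and repo-level categories) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' script: it assumes the category is stored in the `context-complexity` field, that the solution length is measured on `full_function`, and that the category labels are spelled as in the toy data. The function names are hypothetical.

```python
def line_count(text: str) -> int:
    """Number of lines in a string."""
    return len(text.splitlines())


def difficulty(sample: dict) -> int:
    """Difficulty indicator: prompt lines multiplied by canonical solution lines."""
    return line_count(sample["prompt"]) * line_count(sample["full_function"])


def select_hardest(samples: list, quotas: dict) -> list:
    """Pick the highest-scoring samples per category, up to each category's quota."""
    picked = []
    for category, quota in quotas.items():
        # Assumption: the category label lives in the "context-complexity" field.
        pool = [s for s in samples if s["context-complexity"] == category]
        ranked = sorted(pool, key=difficulty, reverse=True)  # descending score
        picked.extend(ranked[:quota])
    return picked


# Toy records with made-up content; only the line counts matter here.
samples = [
    {"id": "a", "prompt": "p\np", "full_function": "x\nx\nx", "context-complexity": "file-level"},
    {"id": "b", "prompt": "p", "full_function": "x", "context-complexity": "file-level"},
    {"id": "c", "prompt": "p\np\np", "full_function": "x\nx\nx\nx", "context-complexity": "repo-level"},
    {"id": "d", "prompt": "p", "full_function": "x\nx", "context-complexity": "repo-level"},
]

hardest = select_hardest(samples, {"file-level": 1, "repo-level": 1})
print([s["id"] for s in hardest])
```

With the full benchmark, the quotas from the card would be `{"self-contained": 66, "file-level": 67, "repo-level": 67}`, again assuming those label spellings.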