documint/DocuMint

Name: documint/DocuMint
Creator: documint
Published: 2024-05-17 21:44:00
License: MIT

Source: Hugging Face. Updated 2024-05-17; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/documint/DocuMint

Description:

---
language:
- en
- py
tags:
- code
- documentation
- python
- docstring
- dataset
license: mit
---

# DocuMint Dataset

The DocuMint Dataset is a collection of 100,000 Python functions and their corresponding docstrings, extracted from popular open-source repositories in the free/libre and open-source software (FLOSS) ecosystem. It was created to train the [DocuMint model](https://huggingface.co/documint/CodeGemma2B-fine-tuned), a fine-tuned variant of Google's CodeGemma-2B that generates high-quality docstrings for Python functions. For more information on the model and its training procedure, please refer to the model card.

## Dataset Description

The dataset consists of JSON-formatted entries, each containing a Python function definition (as the `instruction`) and its associated docstring (as the `response`). The functions were sourced from well-established and actively maintained projects, filtered based on metrics such as the number of contributors (> 50), commits (> 5k), stars (> 35k), and forks (> 10k).

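The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `RepoStats` class, the `meets_thresholds` helper, and the sample repositories are hypothetical, and the sketch assumes all four thresholds must hold at once, which the description does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Hypothetical container for repository metrics (not part of the dataset)."""
    name: str
    contributors: int
    commits: int
    stars: int
    forks: int


# Thresholds quoted in the dataset description.
MIN_CONTRIBUTORS = 50
MIN_COMMITS = 5_000
MIN_STARS = 35_000
MIN_FORKS = 10_000


def meets_thresholds(repo: RepoStats) -> bool:
    """Return True only if every metric strictly exceeds its threshold.

    Combining the criteria with AND is an assumption of this sketch.
    """
    return (
        repo.contributors > MIN_CONTRIBUTORS
        and repo.commits > MIN_COMMITS
        and repo.stars > MIN_STARS
        and repo.forks > MIN_FORKS
    )


# Made-up repositories, for illustration only.
candidates = [
    RepoStats("example/large-project", contributors=400, commits=12_000, stars=60_000, forks=15_000),
    RepoStats("example/small-project", contributors=12, commits=800, stars=900, forks=40),
]
selected = [repo.name for repo in candidates if meets_thresholds(repo)]
print(selected)
```

Keeping the thresholds as named constants makes it easy to experiment with looser or stricter selection criteria.
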
### Data Sources

- **Released by:** [Bibek Poudel](https://huggingface.co/matrix-multiply), [Adam Cook](https://huggingface.co/acook46), [Sekou Traore](https://huggingface.co/Sekou79), [Shelah Ameli](https://huggingface.co/Shelah) (University of Tennessee, Knoxville)
- **Repository:** [GitHub](https://github.com/Docu-Mint/DocuMint)
- **Paper:** [DocuMint: Docstring Generation for Python using Small Language Models](https://arxiv.org/abs/2405.10243)

## Dataset Structure

Each entry in the dataset follows this structure:

```json
{
  "instruction": "def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):\n \"\"\"\n Creates a set of `DataLoader`s for the `glue` dataset,\n using \"bert-base-cased\" as the tokenizer.\n\n Args:\n accelerator (`Accelerator`):\n An `Accelerator` object\n batch_size (`int`, *optional*):\n The batch size for the train and validation DataLoaders.\n \"\"\"\n tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n datasets = load_dataset(\"glue\", \"mrpc\")\n\n def tokenize_function(examples):\n # max_length=None => use the model max length (it's actually the default)\n outputs = tokenizer(examples[\"sentence1\"], examples[\"sentence2\"], truncation=True, max_length=None)\n return outputs\n\n # Apply the method we just defined to all the examples in all the splits of the dataset\n # starting with the main process first:\n with accelerator.main_process_first():\n tokenized_datasets = datasets.map(\n tokenize_function,\n batched=True,\n remove_columns=[\"idx\", \"sentence1\", \"sentence2\"],\n )\n\n # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the\n # transformers library\n tokenized_datasets = tokenized_datasets.rename_column(\"label\", \"labels\")\n\n def collate_fn(examples):\n # For Torchxla, it's best to pad everything to the same length or training will be very slow.\n max_length = 128 if accelerator.distributed_type == DistributedType.XLA else None\n # When using mixed precision we want round multiples of 8/16\n if accelerator.mixed_precision == \"fp8\":\n pad_to_multiple_of = 16\n\t elif accelerator.mixed_precision != \"no\":\n pad_to_multiple_of = 8\n\t\t else:\n pad_to_multiple_of = None\n\n return tokenizer.pad(\n examples,\n padding=\"longest\",\n max_length=max_length,\n pad_to_multiple_of=pad_to_multiple_of,\n return_tensors=\"pt\",\n )\n\n # Instantiate dataloaders.\n train_dataloader = DataLoader(\n tokenized_datasets[\"train\"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size, drop_last=True\n )\n eval_dataloader = DataLoader(\n tokenized_datasets[\"validation\"],\n shuffle=False,\n collate_fn=collate_fn,\n batch_size=EVAL_BATCH_SIZE,\n drop_last=(accelerator.mixed_precision == \"fp8\"),\n )\n\n return train_dataloader, eval_dataloader",
  "response": "Creates a set of `DataLoader`s for the `glue` dataset,\nusing \"bert-base-cased\" as the tokenizer.\n\nArgs:\n accelerator (`Accelerator`):\n An `Accelerator` object\n batch_size (`int`, *optional*):\n The batch size for the train and validation DataLoaders."
}
```

## Dataset Use Cases

The DocuMint dataset can be used for various purposes related to code documentation and natural language processing tasks. Some potential use cases include:

- Training and evaluating models for automatic docstring generation
- Studying the characteristics and patterns of high-quality docstrings
- Analyzing the relationship between code structure and its corresponding documentation
- Developing tools for assisting developers in writing effective docstrings
- Conducting research on the challenges and best practices in code documentation

Researchers, developers, and organizations interested in improving code documentation quality and automating the process of docstring generation can benefit from this dataset.

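For the docstring-generation use case, one plausible preprocessing step is to separate a function's code from its docstring so that a model sees only the code as input. The sketch below uses Python's standard `ast` module (Python 3.9+ for `ast.unparse`). The `split_function` helper and the toy entry are hypothetical and are not taken from the dataset or from the authors' training code.

```python
import ast


def split_function(source: str) -> tuple[str, str]:
    """Split a function's source into (code without docstring, docstring)."""
    tree = ast.parse(source)
    func = tree.body[0]
    if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError("expected the source to start with a function definition")

    # ast.get_docstring returns the cleaned docstring, or None if there is none.
    docstring = ast.get_docstring(func) or ""
    if docstring:
        # A docstring is always the first statement of the body; drop it and
        # fall back to `pass` so the function body never becomes empty.
        func.body = func.body[1:] or [ast.Pass()]
    return ast.unparse(func), docstring


# Toy entry in the same instruction/response shape (made up for illustration).
entry = {
    "instruction": 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b',
    "response": "Return the sum of a and b.",
}

code, doc = split_function(entry["instruction"])
print(code)
print(doc == entry["response"])
```

Note that `ast.unparse` normalizes formatting and drops comments, so a pipeline that needs the original source text byte for byte would have to remove the docstring by line offsets instead.
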
## Citation

**BibTeX:**

```
@article{poudel2024documint,
  title={DocuMint: Docstring Generation for Python using Small Language Models},
  author={Poudel, Bibek and Cook, Adam and Traore, Sekou and Ameli, Shelah},
  journal={arXiv preprint arXiv:2405.10243},
  year={2024}
}
```

## Model Card Contact

- For questions or more information, please contact: `{bpoudel3,acook46,staore1,oameli}@vols.utk.edu`


Provider:

documint

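Summary of Original Information

DocuMint Dataset Overview

Dataset Description

The DocuMint Dataset is a collection of 100,000 Python functions and their corresponding docstrings, extracted from popular open-source repositories in the free/libre and open-source software (FLOSS) ecosystem. It is used to train the DocuMint model, a fine-tuned version of Google's CodeGemma-2B that generates high-quality docstrings for Python functions.

Data Structure

Each entry in the dataset is in JSON format and contains a Python function definition (as instruction) and its associated docstring (as response). The functions come from projects with many contributors, commits, stars, and forks.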
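As a rough illustration of the entry format just described, the sketch below parses one JSON entry and checks that both fields are present and non-empty. The `parse_entry` helper and the sample entry are hypothetical. How the published files are laid out on disk is not specified here, so the example works on a single JSON string rather than on a real file.

```python
import json

# Field names taken from the dataset description.
REQUIRED_FIELDS = ("instruction", "response")


def parse_entry(raw: str) -> dict:
    """Parse one JSON entry and verify it has non-empty string fields."""
    entry = json.loads(raw)
    if not isinstance(entry, dict):
        raise ValueError("entry must be a JSON object")
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field!r}")
    return entry


# Made-up entry serialized to JSON, standing in for one record of the dataset.
raw = json.dumps(
    {
        "instruction": 'def f():\n    """Do nothing."""\n    pass',
        "response": "Do nothing.",
    }
)
entry = parse_entry(raw)
print(sorted(entry))
```

A check like this is cheap to run over every record before training and surfaces malformed entries early.
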

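Dataset Uses

The dataset can be used for:

Training and evaluating models for automatic docstring generation
Studying the characteristics and patterns of high-quality docstrings
Analyzing the relationship between code structure and its corresponding documentation
Developing tools that help developers write effective docstrings
Conducting research on the challenges and best practices of code documentation

Citation

BibTeX: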

@article{poudel2024documint, title={DocuMint: Docstring Generation for Python using Small Language Models}, author={Poudel, Bibek and Cook, Adam and Traore, Sekou and Ameli, Shelah}, journal={arXiv preprint arXiv:2405.10243}, year={2024} }

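Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

The DocuMint dataset contains 100,000 JSON-formatted entries of Python functions and their corresponding docstrings, and is mainly used to train models that automatically generate code documentation. It covers a variety of code structures, is suited to research on natural language processing and code documentation generation, and is released under the MIT license.

The above content was collected and summarized by 遇见数据集.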
