shailja/Verilog_GitHub

Name: shailja/Verilog_GitHub
Creator: shailja
Published: 2023-09-20 17:14:18
License: No description available

Hugging Face | Updated 2023-09-20 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/shailja/Verilog_GitHub


Resource description:

---
license: mit
---
---
pipeline_tag: text-generation
tags:
- code
model-index:
- name: VeriGen
  results:
  - task:
      type: text-generation
    dataset:
      type:
      name:
extra_gated_prompt: >-
  ## Model License Agreement

  Please read the BigCode [OpenRAIL-M license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) agreement before accepting it.
extra_gated_fields:
  I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
---

# VeriGen

## Table of Contents

1. [Dataset Summary](#dataset-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [License](#license)
5. [Citation](#citation)

## Dataset Summary

- The dataset comprises Verilog modules as entries. The entries were retrieved from the GitHub dataset on BigQuery.
- For training [models](https://huggingface.co/shailja/fine-tuned-codegen-2B-Verilog), we filtered out entries longer than 20,000 characters and duplicates (exact duplicates, ignoring whitespace).
- **Paper:** [Benchmarking Large Language Models for Automated Verilog RTL Code Generation](https://arxiv.org/abs/2212.11140)
- **Point of Contact:** [contact@shailja](mailto:shailja.thakur90@gmail.com)
- **Languages:** Verilog (Hardware Description Language)

### Data Splits

The dataset only contains a train split.

### Use

```python
# pip install datasets
from datasets import load_dataset

# Stream the train split so the full dataset is not downloaded up front.
ds = load_dataset("shailja/Verilog_GitHub", streaming=True, split="train")

# Print the first record.
print(next(iter(ds)))

#OUTPUT:
```

### Intended Use

The dataset consists of source code from a range of GitHub repositories. As such, it can potentially include non-compilable, low-quality, and vulnerable code.

### Attribution & Other Requirements

The pretraining dataset of the model was not filtered for permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements that must be respected.

# License

The dataset is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).

# Citation

```
@misc{https://doi.org/10.48550/arxiv.2212.11140,
  doi = {10.48550/ARXIV.2212.11140},
  url = {https://arxiv.org/abs/2212.11140},
  author = {Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth},
  title = {Benchmarking Large Language Models for Automated Verilog RTL Code Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

Provided by:

shailja

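Summary of Original Information

Dataset Overview

Dataset Contents

Type: Verilog modules
Source: retrieved from the GitHub dataset on BigQuery
Filtering: entries longer than 20,000 characters and duplicate entries (exact duplicates, ignoring whitespace) are excluded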
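The filtering rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the names `whitespace_key` and `filter_modules` are hypothetical, and "ignoring whitespace" is interpreted here as removing every whitespace character before comparing, which the source does not spell out.

```python
import re

MAX_CHARS = 20_000  # length threshold stated in the dataset description


def whitespace_key(source: str) -> str:
    """Build a comparison key by removing all whitespace (assumed interpretation)."""
    return re.sub(r"\s+", "", source)


def filter_modules(modules, max_chars=MAX_CHARS):
    """Yield modules of at most max_chars characters, skipping exact duplicates
    that differ only in whitespace. The first occurrence is kept."""
    seen = set()
    for source in modules:
        if len(source) > max_chars:
            continue  # too long
        key = whitespace_key(source)
        if key in seen:
            continue  # duplicate once whitespace is ignored
        seen.add(key)
        yield source


samples = [
    "module a; endmodule",
    "module  a;\n endmodule",  # same as the first once whitespace is ignored
    "module b; endmodule",
    "x" * 20_001,              # exceeds the length limit
]
kept = list(filter_modules(samples))
print(len(kept))
```

Keeping a set of whitespace-free keys makes the pass linear in the total input size; for a very large corpus, hashing each key before storing it would reduce memory use.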

Dataset Use

Model training: used to train specific Verilog language models
Dataset structure: contains a train split only
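Because there is only a train split, a quick way to inspect it is to stream it and look at the first few records. The sketch below is a minimal example: `head` and `load_train_stream` are hypothetical helper names, and the `"text"` field in the offline stand-in is an assumption, since the column names are not stated here. Only the `load_dataset` call itself is taken from the dataset card.

```python
from itertools import islice


def head(records, n=3):
    """Return the first n records of any iterable without reading the rest."""
    return list(islice(records, n))


def load_train_stream():
    """Open the train split in streaming mode.

    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset

    return load_dataset("shailja/Verilog_GitHub", streaming=True, split="train")


# Offline stand-in for the stream; the "text" field name is an assumption.
fake_stream = ({"text": f"module m{i}; endmodule"} for i in range(10))
first = head(fake_stream, 2)
print(first)
```

With network access, `head(load_train_stream(), 2)` would apply the same helper to the real split.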

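Dataset Characteristics

Language: Verilog (hardware description language)
Potential issues: may contain non-compilable, low-quality, and vulnerable code

License

Type: BigCode OpenRAIL-M v1 license
Requirements: generated source code may require attribution and compliance with other specific requirements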

