
shailja/Verilog_GitHub

Source: Hugging Face · Updated: 2023-09-20 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/shailja/Verilog_GitHub
Resource description:
---
license: mit
---
---
pipeline_tag: text-generation
tags:
- code
model-index:
- name: VeriGen
  results:
  - task:
      type: text-generation
    dataset:
      type:
      name:
extra_gated_prompt: >-
  ## Model License Agreement

  Please read the BigCode [OpenRAIL-M license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) agreement before accepting it.
extra_gated_fields:
  I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
---

# VeriGen

## Table of Contents

1. [Dataset Summary](#dataset-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [License](#license)
5. [Citation](#citation)

## Dataset Summary

- The dataset comprises Verilog modules as entries. The entries were retrieved from the GitHub dataset on BigQuery.
- For training [models](https://huggingface.co/shailja/fine-tuned-codegen-2B-Verilog), we filtered out entries exceeding 20,000 characters as well as duplicates (exact duplicates, ignoring whitespace).
- **Paper:** [Benchmarking Large Language Models for Automated Verilog RTL Code Generation](https://arxiv.org/abs/2212.11140)
- **Point of Contact:** [contact@shailja](mailto:shailja.thakur90@gmail.com)
- **Languages:** Verilog (Hardware Description Language)

### Data Splits

The dataset only contains a train split.

### Use

```python
# pip install datasets
from datasets import load_dataset

# Stream the train split so the full dataset is not downloaded up front.
ds = load_dataset("shailja/Verilog_GitHub", streaming=True, split="train")

# Print the first entry.
print(next(iter(ds)))
#OUTPUT:
```

### Intended Use

The dataset consists of source code from a range of GitHub repositories. As such, it can potentially include non-compilable, low-quality, and vulnerable code.

### Attribution & Other Requirements

The pretraining dataset of the model was not filtered for permissive licenses only. Moreover, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements that must be respected.

# License

The dataset is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).

# Citation

```
@misc{https://doi.org/10.48550/arxiv.2212.11140,
  doi = {10.48550/ARXIV.2212.11140},
  url = {https://arxiv.org/abs/2212.11140},
  author = {Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth},
  title = {Benchmarking Large Language Models for Automated Verilog RTL Code Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
Provider:
shailja
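Summary of Original Information

Dataset Overview

Dataset Contents

  • Type: Verilog modules
  • Source: retrieved from the GitHub dataset on BigQuery
  • Filtering: entries longer than 20,000 characters and duplicate entries (exact duplicates, ignoring whitespace) are excluded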
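
The filtering criteria above (a 20,000-character limit and removal of exact duplicates ignoring whitespace) can be illustrated with a minimal sketch. This is not the authors' actual pipeline, which the card does not show; the names `MAX_CHARS`, `normalize`, and `filter_entries` are hypothetical, and it assumes that length is measured on the raw text and that "ignoring whitespace" means stripping all whitespace characters before comparison.

```python
import re

# Threshold stated in the dataset card; entries longer than this are dropped.
MAX_CHARS = 20_000


def normalize(text: str) -> str:
    """Strip all whitespace so copies differing only in spacing compare equal.

    Assumption: this is one possible reading of "ignoring whitespace".
    """
    return re.sub(r"\s+", "", text)


def filter_entries(entries):
    """Yield entries that fit the length limit and are not duplicates.

    The first occurrence of each whitespace-normalized text is kept.
    """
    seen = set()
    for text in entries:
        if len(text) > MAX_CHARS:
            continue  # too long
        key = normalize(text)
        if key in seen:
            continue  # duplicate up to whitespace
        seen.add(key)
        yield text


# Small illustrative inputs (made up for this sketch, not taken from the dataset).
samples = [
    "module a; endmodule",
    "module  a;\n endmodule",  # same as the first entry except for whitespace
    "module b; endmodule",
    "x" * 20_001,              # exceeds the character limit
]
kept = list(filter_entries(samples))
```

Here `kept` retains only the first and third samples. For a large corpus, storing a hash of the normalized text instead of the full string would reduce memory use; the sketch keeps the plain string for readability.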

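Dataset Usage

  • Model training: used to train a Verilog-specific language model
  • Dataset structure: contains a train split only

Dataset Characteristics

  • Language: Verilog (hardware description language)
  • Potential issues: may contain non-compilable, low-quality, and vulnerable code

License

  • Type: BigCode OpenRAIL-M v1 license
  • Requirements: generated source code may require attribution and compliance with other specific requirements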

