shailja/Verilog_GitHub
Hugging Face | Updated 2023-09-20 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/shailja/Verilog_GitHub
Resource description:
---
license: mit
---
---
pipeline_tag: text-generation
tags:
- code
model-index:
- name: VeriGen
  results:
  - task:
      type: text-generation
    dataset:
      type:
      name:
extra_gated_prompt: >-
  ## Model License Agreement
  Please read the BigCode [OpenRAIL-M
  license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)
  agreement before accepting it.
extra_gated_fields:
  I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
---
# VeriGen
## Table of Contents
1. [Dataset Summary](#dataset-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [License](#license)
5. [Citation](#citation)
## Dataset Summary
- The dataset comprises Verilog modules as entries. The entries were retrieved from the GitHub dataset on BigQuery.
- For training [models](https://huggingface.co/shailja/fine-tuned-codegen-2B-Verilog), we filtered out entries longer than 20,000 characters and duplicates (exact duplicates, ignoring whitespace).
- **Paper:** [Benchmarking Large Language Models for Automated Verilog RTL Code Generation](https://arxiv.org/abs/2212.11140)
- **Point of Contact:** [contact@shailja](mailto:shailja.thakur90@gmail.com)
- **Languages:** Verilog (Hardware Description Language)
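The filtering described above (dropping entries over 20,000 characters and exact duplicates ignoring whitespace) could be sketched roughly as follows. This is a minimal illustration and not the authors' actual pipeline: the helper name `filter_entries`, the use of SHA-256 hashing, and the treatment of each entry as a plain string are all assumptions.

```python
import hashlib
import re

MAX_CHARS = 20000  # length threshold stated in the dataset summary


def filter_entries(entries, max_chars=MAX_CHARS):
    """Yield entries that are short enough and not whitespace-insensitive duplicates.

    Hypothetical helper: the original filtering code is not part of this card.
    """
    seen = set()
    for text in entries:
        if len(text) > max_chars:
            continue  # drop entries exceeding the character limit
        # Strip all whitespace so formatting differences do not hide duplicates.
        normalized = re.sub(r"\s+", "", text)
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate once whitespace is ignored
        seen.add(key)
        yield text


# Two entries that differ only in whitespace, plus one over-long entry.
samples = ["module a; endmodule", "module a;\n  endmodule", "x" * 20001]
kept = list(filter_entries(samples))
print(kept)
```

Hashing the normalized text keeps memory use bounded by one digest per unique entry, which matters more for a large corpus than for this toy input.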
### Data Splits
The dataset only contains a train split.
## Use
```python
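# pip install datasets
from datasets import load_dataset

# Stream the train split so the full dataset is not downloaded up front.
ds = load_dataset("shailja/Verilog_GitHub", streaming=True, split="train")

# Print the first record of the stream.
print(next(iter(ds)))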
```
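When exploring a streamed split, it can help to look at a handful of records and their column names before processing anything at scale. The sketch below is one way to do that; `peek` and `column_names` are hypothetical helpers, and the `"text"` field in the stand-in data is an assumption for illustration only, since this card does not document the column names.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n records of any iterable, consuming no more than n."""
    return list(islice(records, n))


def column_names(records):
    """Collect the keys seen across a sample of dict-like records, in first-seen order."""
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    return names


# Stand-in for the streamed dataset; the "text" field name is assumed.
fake_stream = iter([
    {"text": "module a; endmodule"},
    {"text": "module b; endmodule"},
    {"text": "module c; endmodule"},
    {"text": "module d; endmodule"},
])
sample = peek(fake_stream, n=3)
print(len(sample), column_names(sample))
```

With the real dataset, passing `ds` from the snippet above in place of `fake_stream` should work the same way, because a streamed split is iterable.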
### Intended Use
The dataset consists of source code from a range of GitHub repositories. As such, it can include non-compilable, low-quality, and vulnerable code.
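Because entries may not compile, a cheap structural pre-check can help triage them before handing anything to a real toolchain. The sketch below is only a heuristic under stated assumptions: `looks_structurally_complete` is a hypothetical helper that counts `module` and `endmodule` keywords after stripping comments. It does not parse Verilog, can be fooled by keywords inside string literals, and is no substitute for a compiler or linter.

```python
import re

# Line comments (// ...) and block comments (/* ... */).
_COMMENTS = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
_MODULE = re.compile(r"\bmodule\b")        # does not match inside "endmodule"
_ENDMODULE = re.compile(r"\bendmodule\b")


def looks_structurally_complete(source):
    """Heuristic: at least one module, and every module has a matching endmodule."""
    code = _COMMENTS.sub(" ", source)
    opens = len(_MODULE.findall(code))
    closes = len(_ENDMODULE.findall(code))
    return opens > 0 and opens == closes


complete = "module m(input a, output b); assign b = a; endmodule"
truncated = "module m(input a, output b); assign b = a;"
print(looks_structurally_complete(complete), looks_structurally_complete(truncated))
```

A check like this only flags obviously truncated entries; code that passes may still fail synthesis or simulation.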
### Attribution & Other Requirements
The dataset was not filtered for permissive licenses only, and a model trained on it can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements that must be respected.
## License
The dataset is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
## Citation
```
@misc{https://doi.org/10.48550/arxiv.2212.11140,
doi = {10.48550/ARXIV.2212.11140},
url = {https://arxiv.org/abs/2212.11140},
author = {Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth},
title = {Benchmarking Large Language Models for Automated Verilog RTL Code Generation},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
```
Provider:
shailja
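Summary of Original Information
Dataset Overview
Dataset Contents
- Type: Verilog modules
- Source: retrieved from the GitHub dataset on BigQuery
- Filtering: entries longer than 20,000 characters and duplicates (exact duplicates, ignoring whitespace) are excluded
Dataset Usage
- Model training: used to train Verilog-specific language models
- Dataset structure: contains only a train split
Dataset Characteristics
- Language: Verilog (hardware description language)
- Potential issues: may include non-compilable, low-quality, and vulnerable code
License
- Type: BigCode OpenRAIL-M v1 license
- Requirements: generated source code may require attribution and compliance with other specific requirements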



