
msc-smart-contract-auditing/audits-with-reasons

Source: Hugging Face. Updated 2024-06-28. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/msc-smart-contract-auditing/audits-with-reasons
Resource description:
---
language:
- en
license: mit
size_categories:
- 1K<n<10K
task_categories:
- text2text-generation
- text-classification
pretty_name: Smart Contract Audits with Reasons and Recommendations
dataset_info:
  features:
  - name: code
    dtype: string
  - name: description
    dtype: string
  - name: recommendation
    dtype: string
  - name: type
    dtype: string
  - name: functionality
    dtype: string
  splits:
  - name: train
    num_bytes: 10113056.522516329
    num_examples: 2472
  - name: test
    num_bytes: 1787785.4774836714
    num_examples: 437
  download_size: 4969396
  dataset_size: 11900842
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
tags:
- finance
- code
---

This dataset builds on top of the [base dataset](https://huggingface.co/datasets/msc-smart-contract-audition/vulnerable-functions-base) by augmenting it using the quantized [Llama3 8b instruct model by Unsloth](https://huggingface.co/unsloth/llama-3-8b-Instruct-bnb-4bit).

Namely, it:

1. Expands on the level of detail of the description and recommendation.
2. Cleans up the code by fixing formatting and removing out-of-context comments (e.g. external URLs which might confuse a model).
3. Adds two new fields: functionality and type (see the table for more detail).

**The non-vulnerable examples only have values for `code`, `functionality` and `type='no vulnerability'`**

| Field | Description |
|-|-|
| 1. `code` | One or more code blocks which contain the vulnerability |
| 2. `description` | Description and explanation of the vulnerability. Includes a proof-of-concept (PoC) explaining how to take advantage of the vulnerability. |
| 3. `recommendation` | One or more recommended mitigations of the vulnerability |
| 4. `type`\* | Type of vulnerability |
| 5. `functionality`\*\* | Explanation in plain English of what the code does and its general purpose within the contract |

\* - The type is not suitable for classification out of the box, as the classes are not constrained to a finite set. Each value is the description the model deemed most accurate, so the labels need preprocessing to map them onto a specific number of classes (e.g. Front-running, Reentrancy, Algorithmic error, etc.).

\*\* - This is useful for knowledge retrieval, as the embeddings of a plain description of the code are more separable than code embeddings (see Data Analysis below).

# Data Analysis

<img src="https://huggingface.co/datasets/msc-smart-contract-audition/audits-with-reasons/resolve/main/figures/pca-functionalities-bert-large-cased.png">

<img src="https://huggingface.co/datasets/msc-smart-contract-audition/audits-with-reasons/resolve/main/figures/pca-code-bert-large-cased.png">

1. The first plot shows PCA on the embeddings generated by the [Large-Cased Bert Model](https://huggingface.co/google-bert/bert-large-cased) of the description of the code (i.e. functionality).
2. The second plot shows PCA on the embeddings of the **raw code itself**.

Although this is a very constrained space with substantial loss of information, it is notable that the embeddings of the functionality descriptions are more separable. Obtaining a functionality description is not a very complicated task, so smaller LLMs can produce good results.

# Additional Info

- The newline characters are escaped (i.e. `\\n`).
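
The two preprocessing notes above (free-form `type` labels and escaped newlines) can be handled with a small helper before training or retrieval. The following is a minimal sketch, not the authors' pipeline. It assumes each record is a plain `dict` keyed by the field names from the table, and that the escaped newline is stored as the literal two-character sequence backslash + `n`. The keyword table, the `Other` fallback class, and the `type_class` output key are hypothetical choices made for illustration; only the class names Front-running, Reentrancy, and Algorithmic error come from the card.

```python
# Hypothetical keyword table: class names come from the dataset card,
# the keywords themselves are illustrative guesses, not part of the dataset.
CLASS_KEYWORDS = {
    "Reentrancy": ("reentrancy", "re-entrancy", "reentrant"),
    "Front-running": ("front-running", "frontrunning", "front running"),
    "Algorithmic error": ("arithmetic", "rounding", "overflow", "underflow"),
}

TEXT_FIELDS = ("code", "description", "recommendation", "functionality")


def unescape_newlines(text):
    """Replace the literal sequence backslash + n with a real newline."""
    return text.replace("\\n", "\n")


def normalize_type(raw_type):
    """Map a free-form type label onto a small, finite set of classes."""
    label = raw_type.strip().lower()
    if label == "no vulnerability":
        return "no vulnerability"
    for class_name, keywords in CLASS_KEYWORDS.items():
        if any(keyword in label for keyword in keywords):
            return class_name
    return "Other"  # assumed fallback for labels that match no keyword


def preprocess(record):
    """Return a copy of the record with real newlines and a coarse class."""
    cleaned = dict(record)
    for field in TEXT_FIELDS:
        value = cleaned.get(field)
        # Non-vulnerable examples may have no description or recommendation.
        if isinstance(value, str):
            cleaned[field] = unescape_newlines(value)
    cleaned["type_class"] = normalize_type(cleaned.get("type") or "")
    return cleaned


example = preprocess({
    "code": "function f() public {\\n    x += 1;\\n}",
    "type": "Reentrancy Attack",
    "functionality": "Increments x.",
})
print(example["type_class"])
print(example["code"])
```

A keyword lookup is only a starting point: labels that land in `Other` would need manual review or a richer mapping before the field is used as a classification target.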
Provided by:
msc-smart-contract-auditing