
software-ses/raven-dataset

Hugging Face. Updated 2025-11-04; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/software-ses/raven-dataset
Resource description:
---
pretty_name: "Dataset for the paper: ``RAVEN: Analyzing Ethereum’s Reverted Transactions via Semantic Clustering of Failure Invariants``"
license: "cc-by-4.0"
language:
  - "en"
tags:
  - smart-contracts
  - ethereum
  - blockchain
  - transaction-failures
  - invariants
task_categories:
  - tabular-classification  # Changed from anomaly-detection and classification
size_categories:
  - 10K<n<100K
  - 100K<n<1M
source_datasets:
  - ethereum-blockchain-transactions
---

# Dataset Card for **RAVEN: Analyzing Ethereum’s Reverted Transactions via Semantic Clustering of Failure Invariants**

## Dataset Description

This dataset comprises two collections (splits) of failed transactions on the Ethereum blockchain, annotated with extracted *business-logic invariants*. It was created by Mojtaba Eshghie within the research project *HighGuard: Cross-Chain Business Logic Monitoring of Smart Contracts*.

- **Finetuning collection**: ~100,000 failed Ethereum transactions annotated with 1,932 unique invariants.
- **Evaluation collection**: ~20,000 sampled failed transactions annotated with 727 unique invariants, used for clustering and categorization evaluation.

Each record corresponds to one failed transaction and carries metadata such as the transaction hash, block number, sender and receiver, gas used and gas limit, the failure message, and the extracted invariant condition whose violation caused the failure.

### Key features

- Focuses on **business-logic vulnerabilities**: it covers not only low-level errors (e.g., out-of-gas) but also semantic violations captured via invariants.
- Two distinct collections (finetuning + evaluation) for training and benchmarking.
- Designed for anomaly-detection and classification tasks in the smart-contract security domain.

### Recommended uses

- Training supervised or unsupervised models to detect business-logic failures in smart contracts.
- Clustering sampled invariants to categorize common failure types.
- Benchmarking research on smart-contract verification, transaction analysis, and runtime monitoring.

### Out-of-Scope uses

- This dataset is **not** suitable for general cryptocurrency transaction modelling (e.g., normal transfers), since **only failed transactions** are included.
- It is **not** a comprehensive dataset of all Ethereum transactions; it covers only those with business-logic failure annotations.

---

## Dataset Structure

The dataset is provided as a `DatasetDict` with two splits/collections:

| Split        | Description                                                  | Approx. Size |
|--------------|--------------------------------------------------------------|--------------|
| `finetuning` | 100,000 failed transactions annotated with 1,932 invariants  | ~100k rows   |
| `evaluation` | 20,000 failed transactions annotated with 727 invariants     | ~20k rows    |

Each record has the following columns:

- `tx_hash` (string): Transaction hash.
- `block_number` (int64): Block number in which the transaction was included.
- `from_address` (string): Sender Ethereum address.
- `to_address` (string): Receiver Ethereum address.
- `gas_limit` (int64): Gas limit specified for the transaction.
- `gas_used` (int64): Gas used by the transaction before failure.
- `failure_message` (string): The revert or failure message (if available).
- `invariant_condition` (string): A high-level invariant representing the business-logic violation.
- `invariant_id` (int64): An internal identifier for the extracted invariant cluster/category.
- `timestamp` (int64): Unix timestamp of the block (optional).
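To make the column list concrete, here is a minimal standard-library sketch of one record and two simple summaries over it. The names `FailedTx`, `gas_utilization`, and `count_by_invariant` are hypothetical helpers introduced for illustration, and the sample rows are made up rather than taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedTx:
    """One record, mirroring the columns listed above (hypothetical type)."""
    tx_hash: str
    block_number: int
    from_address: str
    to_address: str
    gas_limit: int
    gas_used: int
    failure_message: str
    invariant_condition: str
    invariant_id: int
    timestamp: int


def gas_utilization(tx: FailedTx) -> float:
    """Fraction of the gas limit consumed before the transaction failed."""
    return tx.gas_used / tx.gas_limit if tx.gas_limit else 0.0


def count_by_invariant(rows):
    """Count how many failed transactions map to each invariant_id."""
    return Counter(tx.invariant_id for tx in rows)


# Made-up rows for illustration only; values are not from the dataset.
rows = [
    FailedTx("0xaa", 1, "0x01", "0x02", 100_000, 25_000,
             "Insufficient balance", "balance >= amount", 7, 1_700_000_000),
    FailedTx("0xbb", 2, "0x03", "0x02", 200_000, 50_000,
             "Insufficient balance", "balance >= amount", 7, 1_700_000_012),
    FailedTx("0xcc", 3, "0x04", "0x05", 90_000, 30_000,
             "Not owner", "msg.sender == owner", 12, 1_700_000_024),
]

print(gas_utilization(rows[0]))
print(count_by_invariant(rows).most_common())
```

With the real data, the same summaries could be computed directly on the loaded split; the dataclass is only a way to spell out the schema.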
**File format:** The repository provides one Parquet file per split (`finetuning.parquet`, `evaluation.parquet`), which can be loaded via the `datasets` library:

```python
from datasets import load_dataset

# Load the `finetuning` split; pass split="evaluation" for the evaluation collection.
ds = load_dataset("software-ses/raven-dataset", split="finetuning")
```

---

## Dataset Creation

### Curation Rationale

Business-logic failures in smart contracts are harder to detect than low-level exceptions (e.g., out-of-gas) but are critically important for security. The goal of this dataset is to provide a curated collection of failed transactions with extracted invariants to enable anomaly detection, clustering, and classification research in the smart-contract domain.

### Data Processing

* Filtering of failed transactions with revert/failure messages.
* Extraction of business-logic invariants via the tool SoliDiffy and other analysis pipelines.
* Deduplication of similar invariant texts and clustering of invariants to create `invariant_id`.
* Serialization into Parquet format; conversion to Arrow format by the `datasets` library during upload.

### Who/When/Where

* Curated by: Mojtaba Eshghie
* Affiliation: KTH Royal Institute of Technology, Umeå University.
* Date: Nov 2025

---

## Considerations for Using the Data

### Limitations

* **Bias toward failures only**: The dataset contains only failed transactions, so models trained on it might not generalize to normal transactions.
* **Time cutoff**: Transactions are included only up to a certain block number.

### Ethical and Privacy Considerations

* The data is sourced from a public blockchain (Ethereum), so transaction data is publicly available.
* Addresses are included (sender/receiver), which are pseudonymous but publicly traceable; users should be aware of potential linking to identities through external sources.
* Use responsibly: do not attempt to de-anonymize addresses or misuse user data.
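As a rough illustration of the deduplication step listed under Data Processing, the sketch below assigns one integer ID per distinct invariant text after collapsing whitespace. This is a deliberately simple stand-in and an assumption on our part: it is not the authors' pipeline, which clusters invariants rather than merely normalizing them. The function names `normalize_invariant` and `assign_invariant_ids` are hypothetical.

```python
import re


def normalize_invariant(text: str) -> str:
    """Collapse runs of whitespace so trivially different spellings compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def assign_invariant_ids(conditions):
    """Map each condition to an ID; identical normalized texts share one ID.

    Returns (ids_per_condition, mapping_from_normalized_text_to_id).
    IDs are assigned in order of first appearance, starting at 0.
    """
    mapping = {}
    assigned = []
    for cond in conditions:
        key = normalize_invariant(cond)
        if key not in mapping:
            mapping[key] = len(mapping)
        assigned.append(mapping[key])
    return assigned, mapping


# Made-up invariant strings for illustration only.
conditions = [
    "balance >= amount",
    "  balance  >=   amount ",
    "msg.sender == owner",
]
ids, mapping = assign_invariant_ids(conditions)
print(ids)
```

Exact-match deduplication like this only merges formatting variants; grouping semantically equivalent invariants that are written differently would need a similarity measure or clustering step on top.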
### Recommendations

* If using for supervised classification, consider balancing via sampling or weighting due to potentially unbalanced invariant categories.
* For anomaly detection, consider using the `finetuning` split for training and `evaluation` for benchmarking.
* Always cite the dataset and the associated paper when using it in publications.

---

## Citation

```bibtex
tbd
```
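For the weighting option mentioned under Recommendations, one common choice is inverse-frequency class weights, where each class `c` gets weight `N / (K * count_c)` for `N` samples and `K` classes. The sketch below is one possible approach, not something the dataset card prescribes; `balanced_class_weights` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer invariant IDs receive larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Made-up invariant_id labels for illustration only.
labels = [7, 7, 7, 12]
weights = balanced_class_weights(labels)
print(weights)
```

The resulting dictionary can be passed to a classifier that accepts per-class weights, or used to derive per-sample weights for a loss function.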
Provider:
software-ses