solarchive/solarchive

Name: solarchive/solarchive
Creator: solarchive
Published: 2025-12-25 00:03:56
License: No description provided

Source: Hugging Face. Updated 2025-12-25. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/solarchive/solarchive


Description:

---
license: cc-by-4.0
language:
- en
tags:
- finance
- blockchain
- solana
pretty_name: SolArchive.org Solana Datasets
size_categories:
- n>1T
---

# solarchive.org: Solana Blockchain Datasets

A clean, long-term, public archive of Solana blockchain data.

This dataset contains a complete historical archive of Solana blockchain transactions, accounts, and tokens, sourced from Google BigQuery's public Solana dataset and optimized for analysis.

## 🎯 What is this?

Solarchive is a **free, public archive** of the entire Solana blockchain, designed for:

- 🔬 **Researchers** analyzing blockchain behavior and patterns
- 📊 **Data scientists** building models on blockchain data
- 🏗️ **Developers** building applications that need historical Solana data
- 📈 **Analysts** studying DeFi, NFTs, and token economics

**Key features:**

- ✅ **Complete history** from genesis to present
- ✅ **Daily partitioned** for efficient querying
- ✅ **Vote transactions filtered out** (cleaner dataset)
- ✅ **Free to download** with zero egress fees
- ✅ **Well-documented schemas** with examples
- ✅ **Parquet format** for fast analytics

## 📦 Datasets

This repository contains three main datasets:

### 1. Transactions (`txs/`)

All Solana transactions (excluding validator votes) with complete metadata.

- **Coverage:** October 2020 - Present
- **Partitioning:** Daily (`YYYY-MM-DD`)
- **Format:** Parquet files with checksums
- **Schema:** [schemas/transactions.json](schemas/transactions.json)

**Contains:**

- Transaction signatures and status
- Block information (slot, hash, timestamp)
- Accounts involved (pubkeys, signer/writable flags)
- SOL balance changes
- Token balance changes (pre/post)
- Compute units consumed
- Program log messages
- Fee information

### 2. Accounts (`accounts/`)

Historical account snapshots including token accounts, program accounts, and vote accounts.
- **Coverage:** October 2020 - Present
- **Partitioning:** Daily (`YYYY-MM-DD`)
- **Format:** Parquet files with checksums
- **Schema:** [schemas/accounts.json](schemas/accounts.json)

**Contains:**

- Account public keys and balances
- Owner programs
- Token account information (mint, amount, decimals)
- Vote account data (validators, votes, epoch credits)
- Program account data
- Account state and metadata

### 3. Tokens (`tokens/`)

Token metadata including NFTs and fungible tokens.

- **Coverage:** October 2020 - Present
- **Partitioning:** Daily (`YYYY-MM-DD`)
- **Format:** Parquet files with checksums
- **Schema:** [schemas/tokens.json](schemas/tokens.json)

**Contains:**

- Token mint addresses
- Token names and symbols
- Metadata URIs
- Creator information
- NFT indicators
- Royalty information (seller fees)
- Mutability flags

## 🗂️ Repository Structure

```
solarchive/
├── txs/                          # Transactions dataset
│   ├── index.json                # Dataset-level index
│   ├── 2020-10-24/               # Daily partition
│   │   ├── index.json            # Partition-level index
│   │   ├── 000000000000.parquet  # Data file (preserves original name)
│   │   ├── 000000000000.checksum # SHA256 checksum
│   │   ├── 000000000001.parquet
│   │   ├── 000000000001.checksum
│   │   └── ...
│   ├── 2020-10-25/
│   └── ...
│
├── accounts/                     # Accounts dataset
│   ├── index.json
│   ├── 2020-10-24/
│   │   ├── index.json
│   │   ├── 000000000000.parquet
│   │   └── ...
│   └── ...
│
├── tokens/                       # Tokens dataset
│   ├── index.json
│   ├── 2020-10-24/
│   │   ├── index.json
│   │   ├── 000000000000.parquet
│   │   └── ...
│   └── ...
│
├── schemas/                      # JSON schemas
│   ├── transactions.json         # Transaction schema
│   ├── accounts.json             # Account schema
│   └── tokens.json               # Token schema
│
├── index.json                    # Root-level index
└── README.md                     # This file
```

## 🚀 Quick Start

### Download and Read with Python

```python
from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq

# Download a specific parquet file
file_path = hf_hub_download(
    repo_id="solarchive/solarchive",
    filename="txs/2024-01-01/000000000000.parquet",
    repo_type="dataset"
)

# Read with PyArrow
table = pq.read_table(file_path)
df = table.to_pandas()

print(f"Transactions: {len(df):,}")
print(df.head())
```

### Download Multiple Files

```python
from huggingface_hub import snapshot_download

# Download entire partition (all parquet files for a day)
local_dir = snapshot_download(
    repo_id="solarchive/solarchive",
    repo_type="dataset",
    allow_patterns="txs/2024-01-01/*.parquet"
)

print(f"Downloaded to: {local_dir}")
```

## 📊 Example Analysis

### Analyze with Pandas

```python
from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq
import pandas as pd  # must be installed for to_pandas()

# Download the first file of one day's partition
file_path = hf_hub_download(
    repo_id="solarchive/solarchive",
    filename="txs/2024-01-01/000000000000.parquet",
    repo_type="dataset"
)

# Read and analyze
df = pq.read_table(file_path).to_pandas()

# Basic statistics
print(f"Total transactions: {len(df):,}")
print(f"Successful: {len(df[df['status'] == 'Success']):,}")
print(f"Failed: {len(df[df['status'] == 'Failed']):,}")
print(f"Average fee: {df['fee'].mean():.2f} lamports")
```

### Analyze with DuckDB

```python
import duckdb
from huggingface_hub import snapshot_download

# Download a partition
local_dir = snapshot_download(
    repo_id="solarchive/solarchive",
    repo_type="dataset",
    allow_patterns="txs/2024-01-01/*.parquet"
)

# Query with DuckDB
result = duckdb.sql(f"""
    SELECT
        status,
        COUNT(*) as count,
        AVG(fee) as avg_fee,
        SUM(fee) as total_fees
    FROM read_parquet('{local_dir}/txs/2024-01-01/*.parquet')
    GROUP BY status
""").fetchdf()

print(result)
```

## 📋 Schemas

Full JSON schemas with examples are available in the `schemas/` directory:

- **[schemas/transactions.json](schemas/transactions.json)** - Complete transaction schema with all fields documented
- **[schemas/accounts.json](schemas/accounts.json)** - Account schema including token accounts and vote accounts
- **[schemas/tokens.json](schemas/tokens.json)** - Token metadata schema for NFTs and fungible tokens

### Key Fields Reference

**Transactions:**

- `signature` - Unique transaction identifier
- `block_slot` - Slot number where the transaction was included
- `block_timestamp` - ISO 8601 timestamp
- `fee` - Transaction fee in lamports (1 SOL = 1,000,000,000 lamports)
- `status` - "Success" or "Failed"
- `accounts` - Array of involved accounts with signer/writable flags
- `balance_changes` - SOL balance changes per account
- `pre_token_balances` / `post_token_balances` - Token balances before and after the transaction

**Accounts:**

- `pubkey` - Account public key
- `lamports` - Account balance in lamports
- `owner` - Program that owns this account
- `mint` - For token accounts, the token mint address
- `token_amount` - For token accounts, the token balance

**Tokens:**

- `mint` - Token mint address
- `name` / `symbol` - Token name and symbol
- `is_nft` - Whether this is an NFT
- `creators` - Array of creator addresses with verification status
- `uri` - Metadata URI

## 🔗 Links

- **Website:** [solarchive.org](https://solarchive.org)
- **Data API:** [data.solarchive.org](https://data.solarchive.org)

## 💾 Data Format

All data is stored in **Apache Parquet** format. Each parquet file includes a corresponding checksum file:

- **Data file** - `NNNNNNNNNNNN.parquet`
- **Checksum** - `NNNNNNNNNNNN.checksum` (SHA256 hash for verification)

## 📜 License

**CC BY 4.0 (Creative Commons Attribution 4.0 International)**

This dataset is licensed under CC BY 4.0.
You can:

- ✅ Use commercially
- ✅ Modify and redistribute
- ✅ Use for any purpose
- ℹ️ Attribution required: "Data from SolArchive.org"

The underlying Solana blockchain data is public by nature.
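
## ✅ Verifying Checksums

Each parquet file ships with a `.checksum` file described as a SHA256 hash. The README does not state the exact layout of that file, so the sketch below assumes it is plain text whose first whitespace-separated token is the hex digest. The helper names `sha256_of_file` and `matches_checksum` are illustrative and not part of any SolArchive tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(data_path, checksum_path):
    """Compare a data file against its checksum file.

    Assumption: the checksum file is text and its first
    whitespace-separated token is the hex digest. Adjust the parsing
    if the real files use a different layout.
    """
    tokens = Path(checksum_path).read_text().split()
    expected = tokens[0].lower() if tokens else ""
    return sha256_of_file(data_path) == expected
```

After downloading a partition, calling `matches_checksum("000000000000.parquet", "000000000000.checksum")` on a file pair from the same directory would report whether the download is intact under that assumption. Note that the `allow_patterns` values in the examples above fetch only `*.parquet`, so the `.checksum` files would need to be included in the pattern as well.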

Provided by:

solarchive
