solarchive/solarchive
Source: Hugging Face. Updated 2025-12-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/solarchive/solarchive
Resource description:
---
license: cc-by-4.0
language:
- en
tags:
- finance
- blockchain
- solana
pretty_name: SolArchive.org Solana Datasets
size_categories:
- n>1T
---
# solarchive.org: Solana Blockchain Datasets
A clean, long-term, public archive of Solana blockchain data.
This dataset contains a complete historical archive of Solana blockchain transactions, accounts, and tokens, sourced from Google BigQuery's public Solana dataset and optimized for analysis.
## 🎯 What is this?
Solarchive is a **free, public archive** of the entire Solana blockchain, designed for:
- 🔬 **Researchers** analyzing blockchain behavior and patterns
- 📊 **Data scientists** building models on blockchain data
- 🏗️ **Developers** building applications that need historical Solana data
- 📈 **Analysts** studying DeFi, NFTs, and token economics
**Key features:**
- ✅ **Complete history** from genesis to present
- ✅ **Daily partitioned** for efficient querying
- ✅ **Vote transactions filtered out** (cleaner dataset)
- ✅ **Free to download** with zero egress fees
- ✅ **Well-documented schemas** with examples
- ✅ **Parquet format** for fast analytics
## 📦 Datasets
This repository contains three main datasets:
### 1. Transactions (`txs/`)
All Solana transactions (excluding validator votes) with complete metadata.
**Coverage:** October 2020 - Present
**Partitioning:** Daily (`YYYY-MM-DD`)
**Format:** Parquet files with checksums
**Schema:** [schemas/transactions.json](schemas/transactions.json)
**Contains:**
- Transaction signatures and status
- Block information (slot, hash, timestamp)
- Accounts involved (pubkeys, signer/writable flags)
- SOL balance changes
- Token balance changes (pre/post)
- Compute units consumed
- Program log messages
- Fee information
### 2. Accounts (`accounts/`)
Historical account snapshots including token accounts, program accounts, and vote accounts.
**Coverage:** October 2020 - Present
**Partitioning:** Daily (`YYYY-MM-DD`)
**Format:** Parquet files with checksums
**Schema:** [schemas/accounts.json](schemas/accounts.json)
**Contains:**
- Account public keys and balances
- Owner programs
- Token account information (mint, amount, decimals)
- Vote account data (validators, votes, epoch credits)
- Program account data
- Account state and metadata
### 3. Tokens (`tokens/`)
Token metadata including NFTs and fungible tokens.
**Coverage:** October 2020 - Present
**Partitioning:** Daily (`YYYY-MM-DD`)
**Format:** Parquet files with checksums
**Schema:** [schemas/tokens.json](schemas/tokens.json)
**Contains:**
- Token mint addresses
- Token names and symbols
- Metadata URIs
- Creator information
- NFT indicators
- Royalty information (seller fees)
- Mutability flags
## 🗂️ Repository Structure
```
solarchive/
├── txs/                           # Transactions dataset
│   ├── index.json                 # Dataset-level index
│   ├── 2020-10-24/                # Daily partition
│   │   ├── index.json             # Partition-level index
│   │   ├── 000000000000.parquet   # Data file (preserves original name)
│   │   ├── 000000000000.checksum  # SHA256 checksum
│   │   ├── 000000000001.parquet
│   │   ├── 000000000001.checksum
│   │   └── ...
│   ├── 2020-10-25/
│   └── ...
│
├── accounts/                      # Accounts dataset
│   ├── index.json
│   ├── 2020-10-24/
│   │   ├── index.json
│   │   ├── 000000000000.parquet
│   │   └── ...
│   └── ...
│
├── tokens/                        # Tokens dataset
│   ├── index.json
│   ├── 2020-10-24/
│   │   ├── index.json
│   │   ├── 000000000000.parquet
│   │   └── ...
│   └── ...
│
├── schemas/                       # JSON schemas
│   ├── transactions.json          # Transaction schema
│   ├── accounts.json              # Account schema
│   └── tokens.json                # Token schema
│
├── index.json                     # Root-level index
└── README.md                      # This file
```
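The layout above implies a regular naming scheme for data files. The following is a minimal sketch of building and parsing those paths, assuming every shard follows `<dataset>/<YYYY-MM-DD>/<12 digits>.parquet` as shown in the tree. The helper names `shard_path` and `parse_shard_path` are hypothetical and are not part of any SolArchive tooling.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumed layout, taken from the tree above:
#   <dataset>/<YYYY-MM-DD>/<12 digits>.parquet
_SHARD_RE = re.compile(
    r"^(txs|accounts|tokens)/(\d{4})-(\d{2})-(\d{2})/(\d{12})\.parquet$"
)


class Shard(NamedTuple):
    dataset: str
    day: date
    index: int


def shard_path(dataset: str, day: date, index: int) -> str:
    """Build the repository-relative path of one Parquet shard."""
    return f"{dataset}/{day.isoformat()}/{index:012d}.parquet"


def parse_shard_path(path: str) -> Optional[Shard]:
    """Return the parsed shard, or None for index.json, checksums, etc."""
    m = _SHARD_RE.match(path)
    if m is None:
        return None
    dataset, year, month, day, index = m.groups()
    return Shard(dataset, date(int(year), int(month), int(day)), int(index))


print(shard_path("txs", date(2020, 10, 24), 1))
print(parse_shard_path("txs/2020-10-24/index.json"))
```

A path built this way can be passed as the `filename` argument of `hf_hub_download`, and the parser is handy for filtering a file listing down to data shards.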
## 🚀 Quick Start
### Download and Read with Python
```python
from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq

# Download a specific parquet file
file_path = hf_hub_download(
    repo_id="solarchive/solarchive",
    filename="txs/2024-01-01/000000000000.parquet",
    repo_type="dataset",
)

# Read with PyArrow
table = pq.read_table(file_path)
df = table.to_pandas()

print(f"Transactions: {len(df):,}")
print(df.head())
```
### Download Multiple Files
```python
from huggingface_hub import snapshot_download

# Download entire partition (all parquet files for a day)
local_dir = snapshot_download(
    repo_id="solarchive/solarchive",
    repo_type="dataset",
    allow_patterns="txs/2024-01-01/*.parquet",
)

print(f"Downloaded to: {local_dir}")
```
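Because partitions are named by day, a date range maps directly to a list of glob patterns. The sketch below generates one pattern per day; `partition_patterns` is a hypothetical helper, and it assumes every day in the range exists as a partition in the repository.

```python
from datetime import date, timedelta
from typing import List


def partition_patterns(dataset: str, start: date, end: date) -> List[str]:
    """Glob patterns for every daily partition from start to end, inclusive."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    days = (end - start).days + 1
    return [
        f"{dataset}/{(start + timedelta(days=i)).isoformat()}/*.parquet"
        for i in range(days)
    ]


patterns = partition_patterns("txs", date(2024, 1, 1), date(2024, 1, 3))
print(patterns)

# Pass the list to snapshot_download to fetch the whole range
# (requires network access, so it is left commented out here):
# from huggingface_hub import snapshot_download
# local_dir = snapshot_download(
#     repo_id="solarchive/solarchive",
#     repo_type="dataset",
#     allow_patterns=patterns,
# )
```

`allow_patterns` accepts either a single string or a list of strings, so the generated list can be used as-is.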
## 📊 Example Analysis
### Analyze with Pandas
```python
from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq
import pandas as pd

# Download a day's data
file_path = hf_hub_download(
    repo_id="solarchive/solarchive",
    filename="txs/2024-01-01/000000000000.parquet",
    repo_type="dataset",
)

# Read and analyze
df = pq.read_table(file_path).to_pandas()

# Basic statistics
print(f"Total transactions: {len(df):,}")
print(f"Successful: {len(df[df['status'] == 'Success']):,}")
print(f"Failed: {len(df[df['status'] == 'Failed']):,}")
print(f"Average fee: {df['fee'].mean():.2f} lamports")
```
### Analyze with DuckDB
```python
import duckdb
from huggingface_hub import snapshot_download

# Download a partition
local_dir = snapshot_download(
    repo_id="solarchive/solarchive",
    repo_type="dataset",
    allow_patterns="txs/2024-01-01/*.parquet",
)

# Query with DuckDB
result = duckdb.sql(f"""
    SELECT
        status,
        COUNT(*) AS count,
        AVG(fee) AS avg_fee,
        SUM(fee) AS total_fees
    FROM read_parquet('{local_dir}/txs/2024-01-01/*.parquet')
    GROUP BY status
""").fetchdf()

print(result)
```
## 📋 Schemas
Full JSON schemas with examples are available in the `schemas/` directory:
- **[schemas/transactions.json](schemas/transactions.json)** - Complete transaction schema with all fields documented
- **[schemas/accounts.json](schemas/accounts.json)** - Account schema including token accounts and vote accounts
- **[schemas/tokens.json](schemas/tokens.json)** - Token metadata schema for NFTs and fungible tokens
### Key Fields Reference
**Transactions:**
- `signature` - Unique transaction identifier
- `block_slot` - Slot number where transaction was included
- `block_timestamp` - ISO 8601 timestamp
- `fee` - Transaction fee in lamports (1 SOL = 1,000,000,000 lamports)
- `status` - "Success" or "Failed"
- `accounts` - Array of involved accounts with signer/writable flags
- `balance_changes` - SOL balance changes per account
- `pre_token_balances` / `post_token_balances` - Token balances before and after the transaction
**Accounts:**
- `pubkey` - Account public key
- `lamports` - Account balance in lamports
- `owner` - Program that owns this account
- `mint` - For token accounts, the token mint address
- `token_amount` - For token accounts, the token balance
**Tokens:**
- `mint` - Token mint address
- `name` / `symbol` - Token name and symbol
- `is_nft` - Whether this is an NFT
- `creators` - Array of creator addresses with verification status
- `uri` - Metadata URI
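Since `fee` and `lamports` are expressed in lamports, a conversion to SOL is often the first step in an analysis. This is a minimal sketch using only the standard library; `lamports_to_sol` is a hypothetical helper name, and it assumes the values are read as plain integers.

```python
from decimal import Decimal

LAMPORTS_PER_SOL = 1_000_000_000  # 1 SOL = 10^9 lamports


def lamports_to_sol(lamports: int) -> Decimal:
    """Convert an integer lamport amount to SOL without float rounding."""
    return Decimal(lamports) / Decimal(LAMPORTS_PER_SOL)


print(lamports_to_sol(5_000))
print(lamports_to_sol(2_500_000_000))
```

`Decimal` keeps the result exact. On a pandas column, `df["fee"] / LAMPORTS_PER_SOL` performs the same conversion with floating-point values, which is usually sufficient for aggregate statistics.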
## 🔗 Links
- **Website:** [solarchive.org](https://solarchive.org)
- **Data API:** [data.solarchive.org](https://data.solarchive.org)
## 💾 Data Format
All data is stored in **Apache Parquet** format.
Each parquet file includes a corresponding checksum file:
- **Data file** - `NNNNNNNNNNNN.parquet`
- **Checksum** - `NNNNNNNNNNNN.checksum` (SHA256 hash for verification)
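
A downloaded file can be checked against its checksum with the standard library. The sketch below assumes that the first whitespace-separated token in the `.checksum` file is the hex-encoded SHA256 digest; the exact file layout is not documented here, so adjust the parsing if it differs. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(data_path: Path, checksum_path: Path) -> bool:
    """Compare a data file against its .checksum companion.

    Assumption: the first whitespace-separated token of the checksum file
    is the hex digest (this also covers a "<hex>  <filename>" layout).
    """
    tokens = Path(checksum_path).read_text().split()
    if not tokens:
        return False  # empty checksum file: nothing to compare against
    return sha256_of_file(data_path) == tokens[0].lower()
```

For example, after downloading both `000000000000.parquet` and `000000000000.checksum` from the same partition, `verify_checksum(parquet_path, checksum_path)` returns `True` when the digests match.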
## 📜 License
**CC BY 4.0 (Creative Commons Attribution 4.0 International)**
This dataset is licensed under CC BY 4.0. You can:
- ✅ Use commercially
- ✅ Modify and redistribute
- ✅ Use for any purpose
- ℹ️ Attribution required: "Data from SolArchive.org"
The underlying Solana blockchain data is public by nature.
Provider:
solarchive



