
drum998/Polymarket_data

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/drum998/Polymarket_data
Resource description:
<div align="center">
<h1>Polymarket Data</h1>
<h3>Complete Data Infrastructure for Polymarket: Fetch, Process, Analyze</h3>
<p style="max-width: 750px; margin: 0 auto;">
A comprehensive dataset of 1.9 billion trading records from Polymarket, processed into multiple analysis-ready formats. Features cleaned data, unified token perspectives, and user-level transformations, ready for market research, behavioral studies, and quantitative analysis.
</p>
<p>
<b>Zhengjie Wang</b><sup>1,2</sup>, <b>Leiyu Chao</b><sup>1,3</sup>, <b>Yu Bao</b><sup>1,4</sup>, <b>Lian Cheng</b><sup>1,3</sup>, <b>Jianhan Liao</b><sup>1,5</sup>, <b>Yikang Li</b><sup>1,†</sup>
</p>
<p>
<sup>1</sup>Shanghai Innovation Institute &nbsp;&nbsp; <sup>2</sup>Westlake University &nbsp;&nbsp; <sup>3</sup>Shanghai Jiao Tong University <br>
<sup>4</sup>Harbin Institute of Technology &nbsp;&nbsp; <sup>5</sup>Fudan University
</p>
<p>
<sup>†</sup>Corresponding author
</p>
</div>

<p align="center">
<a href="https://huggingface.co/datasets/SII-WANGZJ/Polymarket_data"> <img src="https://img.shields.io/badge/Hugging%20Face-Dataset-yellow.svg" alt="HuggingFace Dataset"/> </a>
<a href="https://github.com/SII-WANGZJ/Polymarket_data"> <img src="https://img.shields.io/badge/GitHub-Code-black.svg?logo=github" alt="GitHub Repository"/> </a>
<a href="https://github.com/SII-WANGZJ/Polymarket_data/blob/main/LICENSE"> <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License"/> </a>
<a href="#data-quality"> <img src="https://img.shields.io/badge/Data-Verified-green.svg" alt="Data Quality"/> </a>
</p>

---

## TL;DR

We provide **163GB of historical on-chain trading data** from Polymarket, containing **1.9 billion records** across 538K+ markets. The dataset is fetched directly from the Polygon blockchain, fully verified, and ready for analysis. It is suited to market research, behavioral studies, data science projects, and academic research.

## Highlights

- **Complete Blockchain History**: All OrderFilled events from Polymarket's two exchange contracts, with no missing blocks or gaps. Every single trade from the platform's inception is included.
- **Multiple Analysis Perspectives**: 5 structured datasets at different abstraction levels (raw blockchain events, processed trades with market linkage, market metadata, and derived quantitative views), serving diverse research needs.
- **Production Ready**: Clean, validated data with proper schema documentation. All trades are verified against blockchain RPC, with market metadata linked and ready to use.
- **Open Source Pipeline**: Fully reproducible data collection process. Our open-source tools allow you to verify, update, or extend the dataset independently.

## Dataset Overview

| File | Size | Records | Description |
|------|------|---------|-------------|
| `trades.parquet` | 28GB | 418.3M | **Recommended.** Processed trades with market metadata linkage |
| `orderfilled.parquet` | 84GB | 689.0M | Raw blockchain events from OrderFilled logs |
| `markets.parquet` | 85MB | 538,587 | Market information and metadata |
| `quant.parquet` | 28GB | 418.2M | Derived: unified YES perspective (for quant research) |
| `users.parquet` | 23GB | 340.6M | Derived: user-level split by maker/taker (for quant research) |

**Total**: 163GB, 1.9 billion records

## Use Cases

### Market Research & Analysis

- Study prediction market dynamics and price discovery mechanisms
- Analyze market efficiency and information aggregation
- Research crowd wisdom and forecasting accuracy

### Behavioral Studies

- Track individual user trading patterns and decision-making
- Study market participant behavior under different conditions
- Analyze risk preferences and trading strategies

### Data Science & Machine Learning

- Train models for price prediction and market forecasting
- Feature engineering for time-series analysis
- Develop algorithms for market analysis

### Academic Research

- Economics and finance research on prediction markets
- Social science studies on collective intelligence
- Computer science research on blockchain data analysis

## Quick Start

### Installation

```bash
# Using pip
pip install pandas pyarrow

# Optional: alternative parquet engine
pip install fastparquet
```

### Load Data with Pandas

```python
import pandas as pd

# Load trades (recommended for most users).
# The file is about 28GB on disk; pass columns=[...] to load only what you need.
df = pd.read_parquet('trades.parquet')
print(f"Total trades: {len(df):,}")

# Load market metadata
markets = pd.read_parquet('markets.parquet')
print(f"Total markets: {len(markets):,}")
```

### Load from HuggingFace Datasets

```python
from datasets import load_dataset

# Load trades
dataset = load_dataset(
    "SII-WANGZJ/Polymarket_data",
    data_files="trades.parquet"
)

# Load multiple files
dataset = load_dataset(
    "SII-WANGZJ/Polymarket_data",
    data_files=["trades.parquet", "markets.parquet"]
)
```

### Download Specific Files

```bash
# Install the HuggingFace CLI
pip install huggingface_hub

# Download a specific file
hf download SII-WANGZJ/Polymarket_data quant.parquet --repo-type dataset

# Download all files
hf download SII-WANGZJ/Polymarket_data --repo-type dataset
```

## File Selection Guide

> **We recommend `trades.parquet` as the primary dataset for most use cases.** It preserves all original trade semantics with market metadata linked, requiring no assumptions about token normalization.

`quant.parquet` and `users.parquet` are derived datasets designed for our internal quantitative research. They apply specific transformations (normalizing all trades to the YES (token1) perspective), which may not be suitable for every analysis scenario. Detailed transformation logic is documented below.

## Data Structure

### trades.parquet - Processed Trades (Recommended)

Complete trade records with market metadata linkage. Preserves all original blockchain semantics: no normalization or filtering applied.

**Best for:** General-purpose analysis, custom research, building your own pipelines.

**Schema:**

| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | uint64 | Unix timestamp (seconds) |
| `block_number` | uint64 | Polygon block number |
| `transaction_hash` | string | Blockchain transaction hash |
| `log_index` | uint32 | Log index within the transaction |
| `contract` | string | Exchange contract address |
| `market_id` | string | Polymarket market identifier |
| `condition_id` | string | CTF condition ID |
| `event_id` | string | Event group identifier |
| `maker` | string | Maker wallet address |
| `taker` | string | Taker wallet address |
| `price` | float64 | Trade price (0–1) |
| `usd_amount` | float64 | USD (USDC) value of the trade |
| `token_amount` | float64 | Number of outcome tokens traded |
| `maker_direction` | string | Maker's direction: `BUY` or `SELL` |
| `taker_direction` | string | Taker's direction: `BUY` or `SELL` |
| `nonusdc_side` | string | Which outcome token was traded: `token1` (YES) or `token2` (NO) |
| `asset_id` | string | The non-USDC token's asset ID |

### orderfilled.parquet - Raw Blockchain Events

Unprocessed `OrderFilled` events directly from Polygon blockchain logs. No decoding, no market linkage: pure on-chain data.

**Best for:** Blockchain research, data verification, building custom processing pipelines from scratch.

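As a starting point for a custom pipeline, the sketch below decodes one raw event into a price and a maker direction, using the `maker_asset_id`, `taker_asset_id`, `maker_amount_filled`, and `taker_amount_filled` columns of this file. It is not the authors' processing code. It rests on two assumptions that this README does not state: that an asset ID of `"0"` marks the USDC side of a fill, and that both USDC and outcome tokens use 6 decimal places. The helper name `decode_fill` and the sample row are hypothetical; verify the assumptions against the exchange contracts before relying on the output.

```python
from decimal import Decimal

USDC_ASSET_ID = "0"            # assumption: asset ID "0" marks the USDC side
SCALE = Decimal(10) ** 6       # assumption: 6 decimals for USDC and outcome tokens


def decode_fill(row: dict) -> dict:
    """Decode one raw OrderFilled row into readable amounts (hypothetical helper)."""
    # Amounts are decimal strings; Python ints hold uint256 values without overflow.
    maker_amount = Decimal(int(row["maker_amount_filled"])) / SCALE
    taker_amount = Decimal(int(row["taker_amount_filled"])) / SCALE

    if row["maker_asset_id"] == USDC_ASSET_ID:
        # Maker pays USDC and receives outcome tokens, so the maker is buying.
        usd_amount, token_amount = maker_amount, taker_amount
        maker_direction, asset_id = "BUY", row["taker_asset_id"]
    elif row["taker_asset_id"] == USDC_ASSET_ID:
        # Maker gives outcome tokens and receives USDC, so the maker is selling.
        usd_amount, token_amount = taker_amount, maker_amount
        maker_direction, asset_id = "SELL", row["maker_asset_id"]
    else:
        raise ValueError("neither side is USDC under the asset-ID assumption")

    # Price is USDC paid per outcome token; undefined for a zero-token fill.
    price = usd_amount / token_amount if token_amount else None
    return {
        "asset_id": asset_id,
        "maker_direction": maker_direction,
        "usd_amount": usd_amount,
        "token_amount": token_amount,
        "price": price,
    }


# Made-up row for illustration only.
sample_row = {
    "maker_asset_id": "0",
    "taker_asset_id": "1234567890",
    "maker_amount_filled": "650000",
    "taker_amount_filled": "1000000",
}
fill = decode_fill(sample_row)
print(fill["maker_direction"], fill["price"])
```

`Decimal` is used instead of `float` so that large raw values are scaled without rounding; convert to `float64` only once the values are known to be in a safe range.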
**Schema:**

| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | uint64 | Unix timestamp (seconds) |
| `block_number` | uint64 | Polygon block number |
| `transaction_hash` | string | Blockchain transaction hash |
| `log_index` | uint32 | Log index within the transaction |
| `contract` | string | Exchange contract address |
| `order_hash` | string | Unique order hash |
| `maker` | string | Maker wallet address |
| `taker` | string | Taker wallet address |
| `maker_asset_id` | string | Asset ID of maker's token |
| `taker_asset_id` | string | Asset ID of taker's token |
| `maker_amount_filled` | string | Amount filled for maker (wei, uint256 as string) |
| `taker_amount_filled` | string | Amount filled for taker (wei, uint256 as string) |
| `maker_fee` | string | Maker fee (wei, uint256 as string) |
| `taker_fee` | string | Taker fee (wei, uint256 as string) |
| `protocol_fee` | string | Protocol fee (wei, uint256 as string) |

> Note: Amount and fee fields are stored as strings because they are uint256 values from the blockchain that exceed standard integer range.

### markets.parquet - Market Metadata

Market information, outcome token details, and event grouping.

**Best for:** Linking trades to market context, filtering by market attributes, understanding market outcomes.

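For reading market outcomes, note that the `outcome_prices` column of this file is a JSON string rather than a list, so it has to be parsed before use. The sketch below shows one way to do that on a single row. The helper name `winning_answer`, the "larger final price wins" rule, and the sample row are illustrative assumptions; the README only states that `["0.99", "0.01"]` means answer1 won, so the function returns `None` for unsettled or ambiguous markets.

```python
import json
from typing import Optional


def winning_answer(market: dict) -> Optional[str]:
    """Return the label of the winning outcome, or None if it cannot be determined."""
    # Only settled markets (closed == 1) have final prices.
    if not market.get("closed"):
        return None
    try:
        prices = [float(p) for p in json.loads(market["outcome_prices"])]
    except (KeyError, TypeError, ValueError):
        # Missing column, null value, or malformed JSON.
        return None
    if len(prices) != 2 or prices[0] == prices[1]:
        return None
    # Assumption: the outcome with the larger final price is the winner.
    return market["answer1"] if prices[0] > prices[1] else market["answer2"]


# Made-up row for illustration only.
sample_market = {
    "question": "Will it rain tomorrow?",
    "answer1": "Yes",
    "answer2": "No",
    "closed": 1,
    "outcome_prices": '["0.99", "0.01"]',
}
print(winning_answer(sample_market))
```

With pandas, the same function can be applied row by row, for example `markets.apply(lambda r: winning_answer(r.to_dict()), axis=1)`.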
**Schema:**

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Market identifier (join key with `market_id` in other tables) |
| `question` | string | Market question text |
| `slug` | string | URL slug |
| `condition_id` | string | CTF condition ID |
| `token1` | string | Asset ID of outcome token 1 (YES) |
| `token2` | string | Asset ID of outcome token 2 (NO) |
| `answer1` | string | Label for token1 outcome (e.g., "Yes") |
| `answer2` | string | Label for token2 outcome (e.g., "No") |
| `closed` | uint8 | 0 = active, 1 = settled |
| `active` | uint8 | Whether the market is currently active |
| `archived` | uint8 | Whether the market is archived |
| `outcome_prices` | string | JSON array of final prices, e.g. `["0.99", "0.01"]` means answer1 won |
| `volume` | float64 | Total traded volume (USD) |
| `event_id` | string | Parent event identifier |
| `event_slug` | string | Parent event URL slug |
| `event_title` | string | Parent event title |
| `created_at` | datetime | Market creation time |
| `end_date` | datetime | Market end / resolution time |
| `updated_at` | datetime | Last metadata update time |

### quant.parquet - Unified YES Perspective (For Quantitative Research)

> **Note:** This is a derived dataset built for our own quantitative research. It normalizes all trades to the YES (token1) perspective: for trades originally on token2 (NO), the price is converted to `1 - price`, and the buy/sell direction is flipped. Contract-address trades are filtered out, keeping only real user trades.

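The normalization rule in the note can be sketched on a single `trades.parquet`-style record. This illustrates the stated rule only (`1 - price` and a flipped direction for token2 trades) and is not the authors' pipeline code. The function name `to_yes_perspective`, the use of the taker's direction as the trade side, and the sample record are assumptions. Amount columns are left out because the README does not say how they are treated.

```python
from typing import Tuple

FLIP = {"BUY": "SELL", "SELL": "BUY"}


def to_yes_perspective(trade: dict) -> Tuple[float, str]:
    """Return (price, side) of a trade as seen from the YES (token1) token."""
    price = trade["price"]
    side = trade["taker_direction"]  # assumption: the taker's direction defines the side
    if trade["nonusdc_side"] == "token2":
        # A NO trade at price p is the opposite trade on YES at price 1 - p.
        return 1.0 - price, FLIP[side]
    # token1 trades are already in the YES perspective.
    return price, side


# Made-up record for illustration only: buying NO at 0.25.
sample_trade = {"price": 0.25, "taker_direction": "BUY", "nonusdc_side": "token2"}
yes_price, yes_side = to_yes_perspective(sample_trade)
print(yes_price, yes_side)
```

Buying NO at 0.25 is treated as selling YES at 0.75, which is why both the price and the direction change together.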
**If you need the original trade semantics, use `trades.parquet` instead.**

**Schema:**

| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | uint64 | Unix timestamp (seconds) |
| `block_number` | uint64 | Polygon block number |
| `transaction_hash` | string | Blockchain transaction hash |
| `log_index` | uint32 | Log index within the transaction |
| `market_id` | string | Market identifier |
| `condition_id` | string | CTF condition ID |
| `event_id` | string | Event group identifier |
| `price` | float64 | YES token price (0–1). For original token2 trades: `1 - original_price` |
| `usd_amount` | float64 | USD value |
| `token_amount` | float64 | Token amount |
| `side` | string | `BUY` or `SELL` (from YES token perspective). For original token2 trades: direction is flipped |
| `maker` | string | Maker wallet address |
| `taker` | string | Taker wallet address |

### users.parquet - User-Level Behavior Data (For Quantitative Research)

> **Note:** This is a derived dataset built for our own research. Each trade is split into two records (one for maker, one for taker), with the same token1 normalization as `quant.parquet`. All records are converted to a unified BUY direction; negative `token_amount` indicates selling.

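The split described in the note can be sketched from a single `trades.parquet`-style record, which carries an explicit direction for both maker and taker. This is an illustration of the stated rules (one record per party, token1 normalization, signed `token_amount`), not the authors' pipeline code. The function name `split_by_user`, the placeholder addresses, and the sample record are hypothetical, and `usd_amount` is omitted because the README does not say how it is treated.

```python
from typing import Dict, List

FLIP = {"BUY": "SELL", "SELL": "BUY"}


def split_by_user(trade: dict) -> List[Dict]:
    """Split one trade into a maker record and a taker record in YES terms."""
    is_no_trade = trade["nonusdc_side"] == "token2"
    # Same normalization as the YES perspective: NO price p becomes 1 - p.
    yes_price = 1.0 - trade["price"] if is_no_trade else trade["price"]

    records = []
    for role in ("maker", "taker"):
        direction = trade[f"{role}_direction"]
        if is_no_trade:
            direction = FLIP[direction]
        # Unified BUY direction: buys are positive, sells are negative.
        sign = 1.0 if direction == "BUY" else -1.0
        records.append({
            "user": trade[role],
            "role": role,
            "price": yes_price,
            "token_amount": sign * trade["token_amount"],
        })
    return records


# Made-up record for illustration only: the maker buys 100 NO tokens at 0.25.
sample_trade = {
    "maker": "0xMAKER",  # placeholder, not a real address
    "taker": "0xTAKER",  # placeholder, not a real address
    "price": 0.25,
    "token_amount": 100.0,
    "maker_direction": "BUY",
    "taker_direction": "SELL",
    "nonusdc_side": "token2",
}
user_records = split_by_user(sample_trade)
for record in user_records:
    print(record)
```

In this sketch the two signed amounts of a trade always sum to zero, which gives a quick sanity check when aggregating positions per user.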
**If you need the original trade semantics, use `trades.parquet` instead.**

**Schema:**

| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | uint64 | Unix timestamp (seconds) |
| `block_number` | uint64 | Polygon block number |
| `transaction_hash` | string | Blockchain transaction hash |
| `log_index` | uint32 | Log index within the transaction |
| `market_id` | string | Market identifier |
| `condition_id` | string | CTF condition ID |
| `event_id` | string | Event group identifier |
| `user` | string | User wallet address |
| `role` | string | `maker` or `taker` |
| `price` | float64 | YES token price (normalized, same as quant) |
| `usd_amount` | float64 | USD value |
| `token_amount` | float64 | Signed amount: positive = buy, negative = sell |

## Example Analysis

### 1. Calculate Market Statistics

```python
import pandas as pd

df = pd.read_parquet('trades.parquet')

# Market-level statistics
market_stats = df.groupby('market_id').agg({
    'usd_amount': ['sum', 'mean'],           # Total volume and average trade size
    'price': ['mean', 'std', 'min', 'max'],  # Price statistics
    'transaction_hash': 'count'              # Number of trades
}).round(4)

print(market_stats.head())
```

### 2. Track Price Evolution

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet('trades.parquet')
df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')

# Select a specific market
market_id = 'your-market-id'
market_data = df[df['market_id'] == market_id].sort_values('timestamp')

# Plot price over time
plt.figure(figsize=(12, 6))
plt.plot(market_data['datetime'], market_data['price'])
plt.title(f'Price Evolution - Market {market_id}')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
```

### 3. Market Volume Analysis

```python
import pandas as pd

df = pd.read_parquet('trades.parquet')
markets = pd.read_parquet('markets.parquet')

# Join with market metadata (markets uses 'id', trades uses 'market_id')
df = df.merge(markets[['id', 'question']], left_on='market_id', right_on='id', how='left')

# Top markets by volume
top_markets = df.groupby(['market_id', 'question']).agg({
    'usd_amount': 'sum'
}).sort_values('usd_amount', ascending=False).head(20)

print(top_markets)
```

### 4. Analyze by Token Side

```python
import pandas as pd

df = pd.read_parquet('trades.parquet')

# Compare YES vs NO token trading activity
side_stats = df.groupby('nonusdc_side').agg({
    'usd_amount': ['sum', 'mean'],
    'transaction_hash': 'count'
})
print(side_stats)

# Filter for only YES token trades on a specific market
market_id = 'your-market-id'
yes_trades = df[(df['market_id'] == market_id) & (df['nonusdc_side'] == 'token1')]
print(f"YES trades: {len(yes_trades):,}")
```

## Data Processing Pipeline

```
Polygon Blockchain (RPC)
    ↓
orderfilled.parquet (Raw events)
    ↓
trades.parquet (+ Market linkage)
    ↓
    ├─→ quant.parquet (Trade-level, unified YES perspective)
    │       └─→ Filter contracts + Normalize tokens
    │
    └─→ users.parquet (User-level, split maker/taker)
            └─→ Split records + Unified BUY direction
```

**Key Transformations:**

1. **quant.parquet**:
   - Filter out contract trades (keep only user trades)
   - Normalize all trades to YES token perspective
   - Preserve maker/taker information
   - Result: 418.2M records (from 418.3M trades)

2. **users.parquet**:
   - Split each trade into 2 records (maker + taker)
   - Convert all to BUY direction (signed amounts)
   - Sort by user for easy querying
   - Result: 340.6M records

## Documentation

- **[DATA_DESCRIPTION.md](DATA_DESCRIPTION.md)** - Comprehensive documentation
  - Detailed schema for all 5 files
  - Data cleaning and transformation process
  - Usage examples and best practices
  - Comparison between different files

## Data Quality

- **Complete History**: No missing blocks or gaps in blockchain data
- **Verified Sources**: All OrderFilled events from 2 official exchange contracts
- **Blockchain Verified**: Cross-checked against Polygon RPC nodes
- **Regular Updates**: Automated daily pipeline for fresh data
- **Open Source**: Fully reproducible collection process

**Contracts Tracked:**

- Exchange Contract 1: `0x4bFb41d5B3570DeFd03C39a9A4D8dE6Bd8B8982E`
- Exchange Contract 2: `0xC5d563A36AE78145C45a50134d48A1215220f80a`

## Collection Tools

Data collected using our open-source toolkit: [polymarket-data](https://github.com/SII-WANGZJ/Polymarket_data)

**Features:**

- Direct blockchain RPC integration
- Efficient batch processing
- Automatic retry and error handling
- Data validation and verification

## Dataset Statistics

**Last Updated**: 2026-03-05

**Coverage**:

- Time Range: Polymarket inception to 2026-03-04
- Total Markets: 538,587
- Total Trades: 418.3 million (processed), 689.0 million (raw OrderFilled)
- Unique Users: [To be calculated]

**Data Freshness**: Updated periodically via automated pipeline

## Contributing

We welcome contributions to improve the dataset and tools:

1. **Report Issues**: Found data quality issues? [Open an issue](https://github.com/SII-WANGZJ/Polymarket_data/issues)
2. **Suggest Features**: Ideas for new data transformations? Let us know!
3. **Contribute Code**: Improve our collection pipeline via pull requests

## License

MIT License - Free for commercial and research use. See [LICENSE](LICENSE) file for details.

## Contact & Support

- **Email**: [wangzhengjie@sii.edu.cn](mailto:wangzhengjie@sii.edu.cn)
- **Issues**: [GitHub Issues](https://github.com/SII-WANGZJ/Polymarket_data/issues)
- **Dataset**: [HuggingFace](https://huggingface.co/datasets/SII-WANGZJ/Polymarket_data)
- **Code**: [GitHub Repository](https://github.com/SII-WANGZJ/Polymarket_data)

## Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{polymarket_data_2026,
  title={Polymarket Data: Complete Data Infrastructure for Polymarket},
  author={Wang, Zhengjie and Chao, Leiyu and Bao, Yu and Cheng, Lian and Liao, Jianhan and Li, Yikang},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/SII-WANGZJ/Polymarket_data}},
  note={A comprehensive dataset and toolkit for Polymarket prediction markets}
}
```

## Acknowledgments

- **Polymarket** for building the leading prediction market platform
- **Polygon** for providing reliable blockchain infrastructure
- **HuggingFace** for hosting and distributing large datasets
- The open-source community for tools and libraries

---

<div align="center">

**Built for the research and data science community**

[HuggingFace](https://huggingface.co/datasets/SII-WANGZJ/Polymarket_data) • [GitHub](https://github.com/SII-WANGZJ/Polymarket_data) • [Documentation](DATA_DESCRIPTION.md)

</div>
Provider:
drum998