danjsiegel/whaledoxer-stream
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danjsiegel/whaledoxer-stream
Description:
---
license: mit
pretty_name: Whale-Stream Telemetry
size_categories:
- 10K<n<100K
tags:
- finance
- prediction-markets
- polymarket
- kalshi
- anomaly-detection
- telemetry
configs:
- config_name: default
  data_files:
  - split: train
    path: data/whale_stream_*.parquet
---
# Whale-Stream Telemetry
This dataset contains rolling 6-hour telemetry exports from the WhaleDoxer pipeline. For background on the project, see the Substack article I wrote: https://danjsiegel.substack.com/p/building-a-sub-second-middle-finger
Each file is a time-sliced Parquet export of recent `tripwire` events, joined to paper-trade scoring data when a paper trade was actually written. It is intended to answer operational questions such as:
- Is the pipeline alive?
- How large were the detected anomalies?
- How long did the hot path and downstream scoring take?
- When a paper trade was written, what composite score and scorer breakdown were attached?
Files are appended every 6 hours as separate Parquet objects:
- `data/whale_stream_YYYY-MM-DDTHH00Z.parquet`
Example:
- `data/whale_stream_2026-03-06T1200Z.parquet`
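
The naming pattern above can be built and parsed with the standard library. This is a minimal sketch, not the pipeline's own code: the helper names `export_path` and `parse_export_time` are hypothetical, and it assumes the timestamp in the file name is the export hour in UTC.

```python
from datetime import datetime, timezone

PREFIX = "data/whale_stream_"
SUFFIX = ".parquet"


def export_path(export_time: datetime) -> str:
    """Build the object path for an export cut at the given hour.

    Expects a timezone-aware datetime; it is converted to UTC first.
    """
    utc = export_time.astimezone(timezone.utc)
    # The pattern always writes the minutes as "00".
    return f"{PREFIX}{utc.strftime('%Y-%m-%dT%H')}00Z{SUFFIX}"


def parse_export_time(path: str) -> datetime:
    """Recover the UTC export hour from a path shaped like the pattern."""
    stamp = path[len(PREFIX):-len(SUFFIX)]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H00Z")
    return parsed.replace(tzinfo=timezone.utc)
```

Parsing the timestamp back out of the path is useful for ordering files or selecting a date range before downloading anything.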
## What This Dataset Contains
This export is built from the `tripwires` table for the last 6 hours, filtered to rows where `z_score > 0`, with optional columns pulled from `paper_trades` when a paper trade exists for the same trigger trade.
This is not the full raw trade stream. It is a rolling export of tripwire telemetry plus any attached paper-trade scoring data.
## Data Source and Export Semantics
- Source window: last 6 hours at export time
- Export cadence: every 6 hours
- Storage format: Parquet with Zstandard compression
- Join shape:
  - base table: `tripwires`
  - optional enrichment: `paper_trades`
  - optional market titles: Kalshi and Polymarket market catalogs
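
The export semantics above amount to a filtered left join. The sketch below illustrates that shape in plain Python and is not the pipeline's actual query. The join keys `trade_id` and `trigger_trade_id` are hypothetical names, and it assumes the 6-hour window is applied to `event_timestamp`.

```python
from datetime import datetime, timedelta, timezone


def build_export(tripwires, paper_trades, now, window_hours=6):
    """Left-join recent tripwires to paper trades (hypothetical key names)."""
    cutoff = now - timedelta(hours=window_hours)
    # Index paper trades by the trade that triggered them.
    by_trigger = {pt["trigger_trade_id"]: pt for pt in paper_trades}
    rows = []
    for tw in tripwires:
        # Keep only rows inside the window with a positive z-score.
        if tw["event_timestamp"] < cutoff or tw["z_score"] <= 0:
            continue
        pt = by_trigger.get(tw["trade_id"])  # None when no paper trade exists
        rows.append({
            "market_id": tw["market_id"],
            "z_score": tw["z_score"],
            # Enrichment columns stay null for tripwires without a paper trade.
            "p_insider": pt["p_insider"] if pt else None,
            "scorer_breakdown_json": pt["scorer_breakdown"] if pt else None,
        })
    return rows
```

Because the join is a left join, every qualifying tripwire appears in the output whether or not a paper trade was written for it.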
## Privacy and Anonymization
Wallet addresses are anonymized in Python before export:
- `wallet_hash = sha256(lower(wallet) + salt)`
The following values are emitted as `null` instead of being hashed:
- empty strings
- `anonymous_lookup_required`
- `AMM_Liquidity_Pool`
- values starting with `kalshi::`
- any non-`0x` identifier
This preserves continuity for real on-chain addresses without inventing fake identities for anonymous or synthetic actors.
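
The rules above can be sketched as a single function. This is an illustration, not the exporter's code: the name `anonymize_wallet` is hypothetical, and the UTF-8 encoding, hex digest output, and case-insensitive `0x` prefix check are assumptions the card does not state.

```python
import hashlib
from typing import Optional

# Sentinel values that are emitted as null rather than hashed.
SENTINELS = {"", "anonymous_lookup_required", "AMM_Liquidity_Pool"}


def anonymize_wallet(wallet: Optional[str], salt: str) -> Optional[str]:
    """Return sha256(lower(wallet) + salt) as hex, or None for non-wallets."""
    if wallet is None or wallet in SENTINELS:
        return None
    lowered = wallet.lower()
    # Covers `kalshi::` identities and any other non-0x identifier.
    if not lowered.startswith("0x"):
        return None
    return hashlib.sha256((lowered + salt).encode("utf-8")).hexdigest()
```

Lowercasing before hashing means the same address maps to the same `wallet_hash` regardless of how the address was cased upstream.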
## Schema
| Column | Type | Description |
| :--- | :--- | :--- |
| `wallet_hash` | `string \| null` | SHA-256 hash of a lowercased on-chain wallet address. Null for sentinels and non-`0x` identities. |
| `event_timestamp` | `timestamp` | Exchange-reported timestamp for the triggering trade event. |
| `source` | `string` | Exchange source: `kalshi` or `polymarket`. |
| `market_id` | `string` | Exchange market identifier: Kalshi ticker or Polymarket condition ID. |
| `market_title` | `string \| null` | Human-readable title from the Kalshi or Polymarket market catalog when available. |
| `z_score` | `double` | Volume anomaly z-score at tripwire fire time. |
| `sniper_latency_ms` | `double \| null` | Milliseconds from WebSocket receipt to tripwire persistence on the sniper side. |
| `p_insider` | `double \| null` | Composite suspicion score from the downstream forensics scorer. Null when no paper trade was written. |
| `trigger_latency_ms` | `double \| null` | End-to-end latency from the triggering trade timestamp to paper-trade write time. Null when no paper trade was written. |
| `scorer_breakdown_json` | `string \| null` | JSON text copied from `paper_trades.scorer_breakdown` when a paper trade exists. |
| `sniper_received_at_ms` | `int64 \| null` | Unix milliseconds at trade handler entry on the sniper side. |
| `triage_completed_at_ms` | `int64 \| null` | Unix milliseconds immediately after sniper-side anomaly processing completed. |
| `tripwire_enqueued_at_ms` | `int64 \| null` | Unix milliseconds immediately before the tripwire was dispatched downstream. |
| `triage_latency_ms` | `double \| null` | Derived latency: `triage_completed_at_ms - sniper_received_at_ms`. |
| `enqueue_latency_ms` | `double \| null` | Derived latency: `tripwire_enqueued_at_ms - triage_completed_at_ms`. |
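
The two derived latency columns can be recomputed from the raw millisecond stamps as a consistency check. The sketch below uses hypothetical helper names and assumes a derived value is null whenever either input stamp is null.

```python
from typing import Optional


def diff_ms(later: Optional[int], earlier: Optional[int]) -> Optional[float]:
    """Difference of two Unix-millisecond stamps, or None if either is missing."""
    if later is None or earlier is None:
        return None
    return float(later - earlier)


def derive_latencies(row: dict) -> dict:
    """Recompute the derived latency columns for one exported row."""
    return {
        "triage_latency_ms": diff_ms(
            row.get("triage_completed_at_ms"), row.get("sniper_received_at_ms")
        ),
        "enqueue_latency_ms": diff_ms(
            row.get("tripwire_enqueued_at_ms"), row.get("triage_completed_at_ms")
        ),
    }
```
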
## Notes
- `p_insider`, `trigger_latency_ms`, and `scorer_breakdown_json` are nullable because not every tripwire becomes a paper trade.
- `scorer_breakdown_json` is exported as JSON text in Parquet, not as a nested Arrow or DuckDB `VARIANT` type.
- This dataset is best understood as operational and forensic telemetry, not as the full universe of exchange trades.
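
Because `scorer_breakdown_json` is stored as text, it must be decoded before use. A minimal sketch follows, with a hypothetical helper name. The card does not document the structure of the breakdown, so the code returns whatever the JSON decodes to.

```python
import json
from typing import Any, Optional


def parse_breakdown(raw: Optional[str]) -> Any:
    """Decode one scorer_breakdown_json cell; nulls pass through as None."""
    if raw is None:
        return None
    return json.loads(raw)


# Illustrative payload only; the real scorer names are not documented here.
example = parse_breakdown('{"scorer_a": 0.4, "scorer_b": 0.7}')
```
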
## Loading Example
```python
from datasets import load_dataset

# Load the default config; its single `train` split spans all Parquet exports
ds = load_dataset("danjsiegel/whaledoxer-stream")
df = ds["train"].to_pandas()

# Rows that actually produced a paper-trade scoring decision
scored = df[df["p_insider"].notna()]
```
Provider:
danjsiegel



