
danjsiegel/whaledoxer-stream

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danjsiegel/whaledoxer-stream
Resource description:
---
license: mit
pretty_name: Whale-Stream Telemetry
size_categories:
- 10K<n<100K
tags:
- finance
- prediction-markets
- polymarket
- kalshi
- anomaly-detection
- telemetry
configs:
- config_name: default
  data_files:
  - split: train
    path: data/whale_stream_*.parquet
---

# Whale-Stream Telemetry

This dataset contains 6-hour rolling telemetry exports from the WhaleDoxer pipeline.

For more background on this project, see the Substack article I wrote: https://danjsiegel.substack.com/p/building-a-sub-second-middle-finger

Each file is a time-sliced Parquet export of recent `tripwire` events, joined to paper-trade scoring data when a paper trade was actually written.

It is intended to answer operational questions such as:

- is the pipeline alive
- how large were the detected anomalies
- how long did the hot path and downstream scoring take
- when a paper trade was written, what composite score and scorer breakdown were attached

Files are appended every 6 hours as separate Parquet objects:

- `data/whale_stream_YYYY-MM-DDTHH00Z.parquet`

Example:

- `data/whale_stream_2026-03-06T1200Z.parquet`

## What This Dataset Contains

This export is built from the `tripwires` table for the last 6 hours, filtered to rows where `z_score > 0`, with optional columns pulled from `paper_trades` when a paper trade exists for the same trigger trade.

This is not the full raw trade stream. It is a rolling export of tripwire telemetry plus any attached paper-trade scoring data.
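The file naming pattern described above can be turned into a timestamp with a small helper. This is a minimal sketch, not part of the WhaleDoxer pipeline: `export_time` is a hypothetical name, and it assumes the timestamp embedded in the filename is the export time in UTC (the `Z` suffix suggests this, but the card does not state it explicitly).

```python
import re
from datetime import datetime, timezone

# Matches names such as data/whale_stream_2026-03-06T1200Z.parquet.
# The minutes are always "00" in the documented pattern.
_NAME = re.compile(r"whale_stream_(\d{4}-\d{2}-\d{2}T\d{2})00Z\.parquet$")


def export_time(path: str) -> datetime:
    """Return the timestamp encoded in an export filename (assumed UTC)."""
    match = _NAME.search(path)
    if match is None:
        raise ValueError(f"not a whale_stream export: {path!r}")
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H")
    return parsed.replace(tzinfo=timezone.utc)


print(export_time("data/whale_stream_2026-03-06T1200Z.parquet"))
# prints 2026-03-06 12:00:00+00:00
```

Sorting file paths by `export_time` gives the exports in chronological order, which is handy when only the most recent slices are needed.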

## Data Source and Export Semantics

- Source window: last 6 hours at export time
- Export cadence: every 6 hours
- Storage format: Parquet with Zstandard compression
- Join shape:
  - base table: `tripwires`
  - optional enrichment: `paper_trades`
  - optional market titles: Kalshi and Polymarket market catalogs

## Privacy and Anonymization

Wallet addresses are anonymized in Python before export:

- `wallet_hash = sha256(lower(wallet) + salt)`

The following values are emitted as `null` instead of being hashed:

- empty strings
- `anonymous_lookup_required`
- `AMM_Liquidity_Pool`
- values starting with `kalshi::`
- any non-`0x` identifier

This preserves continuity for real on-chain addresses without inventing fake identities for anonymous or synthetic actors.

## Schema

| Column | Type | Description |
| :--- | :--- | :--- |
| `wallet_hash` | `string \| null` | SHA-256 hash of a lowercased on-chain wallet address. Null for sentinels and non-`0x` identities. |
| `event_timestamp` | `timestamp` | Exchange-reported timestamp for the triggering trade event. |
| `source` | `string` | Exchange source: `kalshi` or `polymarket`. |
| `market_id` | `string` | Exchange market identifier: Kalshi ticker or Polymarket condition ID. |
| `market_title` | `string \| null` | Human-readable title from the Kalshi or Polymarket market catalog when available. |
| `z_score` | `double` | Volume anomaly z-score at tripwire fire time. |
| `sniper_latency_ms` | `double \| null` | Milliseconds from WebSocket receipt to tripwire persistence on the sniper side. |
| `p_insider` | `double \| null` | Composite suspicion score from the downstream forensics scorer. Null when no paper trade was written. |
| `trigger_latency_ms` | `double \| null` | End-to-end latency from the triggering trade timestamp to paper-trade write time. Null when no paper trade was written. |
| `scorer_breakdown_json` | `string \| null` | JSON text copied from `paper_trades.scorer_breakdown` when a paper trade exists. |
| `sniper_received_at_ms` | `int64 \| null` | Unix milliseconds at trade handler entry on the sniper side. |
| `triage_completed_at_ms` | `int64 \| null` | Unix milliseconds immediately after sniper-side anomaly processing completed. |
| `tripwire_enqueued_at_ms` | `int64 \| null` | Unix milliseconds immediately before the tripwire was dispatched downstream. |
| `triage_latency_ms` | `double \| null` | Derived latency: `triage_completed_at_ms - sniper_received_at_ms`. |
| `enqueue_latency_ms` | `double \| null` | Derived latency: `tripwire_enqueued_at_ms - triage_completed_at_ms`. |

## Notes

- `p_insider`, `trigger_latency_ms`, and `scorer_breakdown_json` are nullable because not every tripwire becomes a paper trade.
- `scorer_breakdown_json` is exported as JSON text in Parquet, not as a nested Arrow or DuckDB `VARIANT` type.
- This dataset is best understood as operational and forensic telemetry, not as the full universe of exchange trades.

## Loading Example

```python
from datasets import load_dataset

# Repository ID as listed at the top of this page
ds = load_dataset("danjsiegel/whaledoxer-stream")
df = ds["train"].to_pandas()

# Rows that actually produced a paper-trade scoring decision
scored = df[df["p_insider"].notna()]
```
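
The anonymization rule given under Privacy and Anonymization can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: `anonymize_wallet` and `NULL_SENTINELS` are hypothetical names, and the hex digest output, UTF-8 encoding, and case-insensitive `0x` prefix check are guesses the card does not confirm. The real salt is not published, so hashes produced here will not match the dataset's `wallet_hash` values.

```python
import hashlib
from typing import Optional

# Sentinel values the card lists as exported null instead of hashed.
NULL_SENTINELS = {"", "anonymous_lookup_required", "AMM_Liquidity_Pool"}


def anonymize_wallet(wallet: Optional[str], salt: str) -> Optional[str]:
    """Sketch of wallet_hash = sha256(lower(wallet) + salt) with null rules."""
    if wallet is None or wallet in NULL_SENTINELS:
        return None
    if wallet.startswith("kalshi::"):
        return None
    lowered = wallet.lower()
    # Assumption: the "0x" prefix check is case-insensitive.
    if not lowered.startswith("0x"):
        return None
    # Assumption: UTF-8 input bytes and hex digest output.
    return hashlib.sha256((lowered + salt).encode("utf-8")).hexdigest()


# Mixed-case and lowercase forms of one address map to the same hash.
same = anonymize_wallet("0xABCDEF", "demo-salt") == anonymize_wallet("0xabcdef", "demo-salt")
print(same)
```

Because the address is lowercased before hashing, the same wallet keeps one `wallet_hash` across exports regardless of how its address was cased upstream, which is the continuity property the card describes.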
Provider:
danjsiegel