prediction-market-analysis

This dataset was collected for and supports the analysis in The Microstructure of Wealth Transfer in Prediction Markets.

A framework for analyzing Kalshi prediction market data. It includes tools for collecting and storing data and for running the analysis scripts that generate figures and statistics.

The dataset was acquired from Kalshi's public REST API and spans from 16:09 ET on 2021-06-30 to 17:00 ET on 2025-11-25. All market and trade data from this period is included.

Setup

Requires Python 3.9+. Install dependencies with uv:

```sh
uv sync
```

Running Analyses

The data is stored as compressed chunks (data.zip.*). The analysis framework handles extraction and cleanup automatically.

Run all analyses

```sh
make analysis
```

This will:

  1. Reassemble and extract the data archive
  2. Run all scripts in research/analysis/ in parallel
  3. Clean up the extracted data when complete

Run a single analysis

```sh
make analyze <script_name>
```

For example:

```sh
make analyze mispricing_by_price
make analyze total_volume_by_price.py  # .py extension is optional
```

Manual commands

You can also run the CLI directly:

```sh
uv run main.py setup                          # Extract data
uv run main.py analysis                       # Run all analyses
uv run main.py analysis mispricing_by_price   # Run a single analysis
uv run main.py teardown                       # Clean up data
```

Data Schemas

Data is stored as Parquet files. When extracted, the directory structure is:

```text
data/
  markets/
    markets_0_10000.parquet
    markets_10000_20000.parquet
    ...
  trades/
    <TICKER>_trades.parquet
    ...
```
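If you want to inspect the extracted files outside the analysis framework, a minimal sketch (assuming the archive has already been extracted to data/ as shown above) is to point DuckDB at the glob patterns directly:

```python
import duckdb

# Assumes `uv run main.py setup` (or `make analysis`) has already extracted the archive to data/.
con = duckdb.connect()

# Show the markets schema DuckDB infers from the Parquet chunks
print(con.execute("DESCRIBE SELECT * FROM 'data/markets/*.parquet'").df())

# Peek at a few trades across all per-ticker files
print(con.execute("SELECT * FROM 'data/trades/*.parquet' LIMIT 5").df())
```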

Markets Schema

Each row represents a prediction market contract.

| Column | Type | Description |
| --- | --- | --- |
| ticker | string | Unique market identifier (e.g., PRES-2024-DJT) |
| event_ticker | string | Parent event identifier, used for categorization |
| market_type | string | Market type (typically binary) |
| title | string | Human-readable market title |
| yes_sub_title | string | Label for the "Yes" outcome |
| no_sub_title | string | Label for the "No" outcome |
| status | string | Market status: open, closed, finalized |
| yes_bid | int (nullable) | Best bid price for Yes contracts (cents, 1-99) |
| yes_ask | int (nullable) | Best ask price for Yes contracts (cents, 1-99) |
| no_bid | int (nullable) | Best bid price for No contracts (cents, 1-99) |
| no_ask | int (nullable) | Best ask price for No contracts (cents, 1-99) |
| last_price | int (nullable) | Last traded price (cents, 1-99) |
| volume | int | Total contracts traded |
| volume_24h | int | Contracts traded in the last 24 hours |
| open_interest | int | Outstanding contracts |
| result | string | Market outcome: yes, no, or empty if unresolved |
| created_time | datetime | When the market was created |
| open_time | datetime (nullable) | When trading opened |
| close_time | datetime (nullable) | When trading closed |
| _fetched_at | datetime | When this record was fetched |
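As a quick illustration of how these columns fit together, here is a hedged sketch (paths and extracted data assumed, not part of the framework) that tallies markets by status and resolved outcome:

```python
import duckdb

# Count markets by status and outcome; assumes the archive has been extracted to data/.
con = duckdb.connect()
summary = con.execute(
    """
    SELECT status, result, COUNT(*) AS n_markets, SUM(volume) AS total_contracts
    FROM 'data/markets/*.parquet'
    GROUP BY status, result
    ORDER BY n_markets DESC
    """
).df()
print(summary)
```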

Trades Schema

Each row represents a single trade execution.

| Column | Type | Description |
| --- | --- | --- |
| trade_id | string | Unique trade identifier |
| ticker | string | Market ticker this trade belongs to |
| count | int | Number of contracts traded |
| yes_price | int | Yes contract price (cents, 1-99) |
| no_price | int | No contract price (cents, 1-99); always 100 - yes_price |
| taker_side | string | Which side the taker bought: yes or no |
| created_time | datetime | When the trade occurred |
| _fetched_at | datetime | When this record was fetched |

Note on prices: Prices are in cents. A yes_price of 65 means the contract costs $0.65 and pays $1.00 if the outcome is "Yes" (implied probability: 65%). The no_price is always 100 - yes_price.
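As a concrete check of this convention, a small arithmetic sketch (plain Python, no repository helpers involved) converting a trade price into dollar cost and implied probability:

```python
# yes_price is in cents: a Yes contract costs yes_price/100 dollars and pays $1.00 if it resolves "Yes".
def implied_probability(price_cents: int) -> float:
    return price_cents / 100.0

yes_price = 65
no_price = 100 - yes_price  # invariant in the trades schema

print(f"Yes costs ${yes_price / 100:.2f}, implied P(yes) = {implied_probability(yes_price):.0%}")
print(f"No  costs ${no_price / 100:.2f}, implied P(no)  = {implied_probability(no_price):.0%}")
```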

Writing Analysis Scripts

Analysis scripts live in research/analysis/ and output to research/fig/.

Basic template

```python
#!/usr/bin/env python3
"""Brief description of what this analysis does."""

from pathlib import Path

import duckdb
import matplotlib.pyplot as plt


def main():
    # Standard path setup
    base_dir = Path(__file__).parent.parent.parent
    trades_dir = base_dir / "data" / "trades"
    markets_dir = base_dir / "data" / "markets"
    fig_dir = base_dir / "research" / "fig"
    fig_dir.mkdir(parents=True, exist_ok=True)

    # Connect to DuckDB (in-memory)
    con = duckdb.connect()

    # Query parquet files directly with glob patterns
    df = con.execute(
        f"""
        SELECT
            yes_price,
            count,
            taker_side
        FROM '{trades_dir}/*.parquet'
        WHERE yes_price BETWEEN 1 AND 99
        LIMIT 1000
        """
    ).df()

    # Save data output
    df.to_csv(fig_dir / "my_analysis.csv", index=False)

    # Create visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(df["yes_price"], df["count"])
    ax.set_xlabel("Price (cents)")
    ax.set_ylabel("Count")
    ax.set_title("My Analysis")

    plt.tight_layout()
    fig.savefig(fig_dir / "my_analysis.png", dpi=300, bbox_inches="tight")
    fig.savefig(fig_dir / "my_analysis.pdf", bbox_inches="tight")
    plt.close(fig)

    print(f"Outputs saved to {fig_dir}")


if __name__ == "__main__":
    main()
```

Common query patterns

Join trades with market outcomes:

```sql
WITH resolved_markets AS (
    SELECT ticker, result
    FROM '{markets_dir}/*.parquet'
    WHERE status = 'finalized'
      AND result IN ('yes', 'no')
)
SELECT
    t.yes_price,
    t.count,
    t.taker_side,
    m.result,
    CASE WHEN t.taker_side = m.result THEN 1 ELSE 0 END AS taker_won
FROM '{trades_dir}/*.parquet' t
INNER JOIN resolved_markets m ON t.ticker = m.ticker
```
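One way to use this pattern in a script is sketched below, reusing the path setup from the template above; the contract-weighted win-rate aggregation is an illustration, not a repository helper. Grouping by the price the taker paid lets you compare the taker's realized win rate against the implied probability taker_price / 100:

```python
from pathlib import Path

import duckdb

base_dir = Path(__file__).parent.parent.parent
trades_dir = base_dir / "data" / "trades"
markets_dir = base_dir / "data" / "markets"

con = duckdb.connect()
win_rates = con.execute(
    f"""
    WITH resolved_markets AS (
        SELECT ticker, result
        FROM '{markets_dir}/*.parquet'
        WHERE status = 'finalized' AND result IN ('yes', 'no')
    )
    SELECT
        CASE WHEN t.taker_side = 'yes' THEN t.yes_price ELSE t.no_price END AS taker_price,
        SUM(CASE WHEN t.taker_side = m.result THEN t.count ELSE 0 END) * 1.0
            / SUM(t.count) AS taker_win_rate
    FROM '{trades_dir}/*.parquet' t
    INNER JOIN resolved_markets m ON t.ticker = m.ticker
    GROUP BY taker_price
    ORDER BY taker_price
    """
).df()

# In a well-calibrated market, taker_win_rate should track taker_price / 100
print(win_rates.head(20))
```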

Analyze both taker and maker positions:

```sql
WITH all_positions AS (
    -- Taker positions
    SELECT
        CASE WHEN taker_side = 'yes' THEN yes_price ELSE no_price END AS price,
        count,
        'taker' AS role
    FROM '{trades_dir}/*.parquet'

    UNION ALL

    -- Maker positions (counterparty)
    SELECT
        CASE WHEN taker_side = 'yes' THEN no_price ELSE yes_price END AS price,
        count,
        'maker' AS role
    FROM '{trades_dir}/*.parquet'
)
SELECT price, role, SUM(count) AS total_contracts
FROM all_positions
GROUP BY price, role
ORDER BY price
```

Extract category from event_ticker:

```sql
SELECT
    CASE
        WHEN event_ticker IS NULL OR event_ticker = '' THEN 'independent'
        ELSE regexp_extract(event_ticker, '^([A-Z0-9]+)', 1)
    END AS category,
    COUNT(*) AS market_count
FROM '{markets_dir}/*.parquet'
GROUP BY category
```

Using the categories utility

For grouping markets into high-level categories (Sports, Politics, Crypto, etc.):

```python
from research.analysis.util.categories import get_group, get_hierarchy, GROUP_COLORS

# Get high-level group
group = get_group("NFLGAME")  # Returns "Sports"

# Get full hierarchy (group, category, subcategory)
hierarchy = get_hierarchy("NFLGAME")  # Returns ("Sports", "NFL", "Games")

# Use predefined colors for consistent visualizations
color = GROUP_COLORS["Sports"]  # Returns "#1f77b4"
```
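A hedged sketch of combining this utility with the category-extraction query above, rolling market counts up to high-level groups. It assumes the data is extracted to data/, that the script runs where research.analysis.util.categories is importable, and that unknown ticker prefixes should fall back to a catch-all bucket:

```python
import duckdb

from research.analysis.util.categories import get_group

con = duckdb.connect()
categories = con.execute(
    """
    SELECT
        CASE
            WHEN event_ticker IS NULL OR event_ticker = '' THEN 'independent'
            ELSE regexp_extract(event_ticker, '^([A-Z0-9]+)', 1)
        END AS category,
        COUNT(*) AS market_count
    FROM 'data/markets/*.parquet'
    GROUP BY category
    """
).df()


def safe_group(prefix: str) -> str:
    # Assumption: get_group may not recognize every prefix; fall back to "Other" if it raises.
    try:
        return get_group(prefix)
    except Exception:
        return "Other"


categories["group"] = categories["category"].map(safe_group)
print(categories.groupby("group")["market_count"].sum().sort_values(ascending=False))
```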

Output conventions

  • Save CSV/JSON for raw data: fig_dir / "analysis_name.csv"
  • Save PNG at 300 DPI for presentations: fig_dir / "analysis_name.png"
  • Save PDF for papers: fig_dir / "analysis_name.pdf"
  • Print a completion message: print(f"Outputs saved to {fig_dir}")

Dependencies available

Scripts have access to these libraries (see pyproject.toml):

  • duckdb - SQL queries on Parquet files
  • pandas - DataFrames
  • matplotlib - Plotting
  • scipy - Statistical functions
  • brokenaxes - Plots with broken axes
  • squarify - Treemap visualizations