prediction-market-analysis

This dataset was collected for and supports the analysis in The Microstructure of Wealth Transfer in Prediction Markets.

A framework for analyzing Kalshi prediction market data. It includes tools for collecting and storing data and for running the analysis scripts that generate figures and statistics.

The dataset was acquired from Kalshi's public REST API and spans from 16:09 ET on 2021-06-30 to 17:00 ET on 2025-11-25. All market and trade data from this period is included.

Setup

Requires Python 3.9+. Install dependencies with uv:

```sh
uv sync
```

Running Analyses

The data is stored as compressed chunks (data.zip.*). The analysis framework handles extraction and cleanup automatically.

Run all analyses

```sh
make analysis
```

This will:

  1. Reassemble and extract the data archive
  2. Run all scripts in research/analysis/ in parallel
  3. Clean up the extracted data when complete

Run a single analysis

```sh
make analyze <script_name>
```

For example:

```sh
make analyze mispricing_by_price
make analyze total_volume_by_price.py  # .py extension is optional
```

Manual commands

You can also run the CLI directly:

```sh
uv run main.py setup                          # Extract data
uv run main.py analysis                       # Run all analyses
uv run main.py analysis mispricing_by_price   # Run a single analysis
uv run main.py teardown                       # Clean up data
```

Data Schemas

Data is stored as Parquet files. When extracted, the directory structure is:

```text
data/
  markets/
    markets_0_10000.parquet
    markets_10000_20000.parquet
    ...
  trades/
    <TICKER>_trades.parquet
    ...
```
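If you want to inspect the extracted files outside the analysis framework, a minimal sketch (assuming the archive has already been extracted to data/ as shown above) is to point DuckDB at the glob patterns directly:

```python
import duckdb

# Assumes `uv run main.py setup` (or `make analysis`) has already extracted the archive to data/.
con = duckdb.connect()

# Show the markets schema DuckDB infers from the Parquet chunks
print(con.execute("DESCRIBE SELECT * FROM 'data/markets/*.parquet'").df())

# Peek at a few trades across all per-ticker files
print(con.execute("SELECT * FROM 'data/trades/*.parquet' LIMIT 5").df())
```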

Markets Schema

Each row represents a prediction market contract.

| Column | Type | Description |
| --- | --- | --- |
| ticker | string | Unique market identifier (e.g., PRES-2024-DJT) |
| event_ticker | string | Parent event identifier, used for categorization |
| market_type | string | Market type (typically binary) |
| title | string | Human-readable market title |
| yes_sub_title | string | Label for the "Yes" outcome |
| no_sub_title | string | Label for the "No" outcome |
| status | string | Market status: open, closed, finalized |
| yes_bid | int (nullable) | Best bid price for Yes contracts (cents, 1-99) |
| yes_ask | int (nullable) | Best ask price for Yes contracts (cents, 1-99) |
| no_bid | int (nullable) | Best bid price for No contracts (cents, 1-99) |
| no_ask | int (nullable) | Best ask price for No contracts (cents, 1-99) |
| last_price | int (nullable) | Last traded price (cents, 1-99) |
| volume | int | Total contracts traded |
| volume_24h | int | Contracts traded in the last 24 hours |
| open_interest | int | Outstanding contracts |
| result | string | Market outcome: yes, no, or empty if unresolved |
| created_time | datetime | When the market was created |
| open_time | datetime (nullable) | When trading opened |
| close_time | datetime (nullable) | When trading closed |
| _fetched_at | datetime | When this record was fetched |
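As a quick illustration of how these columns fit together, here is a hedged sketch (paths and extracted data assumed, not part of the framework) that tallies markets by status and resolved outcome:

```python
import duckdb

# Count markets by status and outcome; assumes the archive has been extracted to data/.
con = duckdb.connect()
summary = con.execute(
    """
    SELECT status, result, COUNT(*) AS n_markets, SUM(volume) AS total_contracts
    FROM 'data/markets/*.parquet'
    GROUP BY status, result
    ORDER BY n_markets DESC
    """
).df()
print(summary)
```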

Trades Schema

Each row represents a single trade execution.

| Column | Type | Description |
| --- | --- | --- |
| trade_id | string | Unique trade identifier |
| ticker | string | Market ticker this trade belongs to |
| count | int | Number of contracts traded |
| yes_price | int | Yes contract price (cents, 1-99) |
| no_price | int | No contract price (cents, 1-99); always 100 - yes_price |
| taker_side | string | Which side the taker bought: yes or no |
| created_time | datetime | When the trade occurred |
| _fetched_at | datetime | When this record was fetched |

Note on prices: Prices are in cents. A yes_price of 65 means the contract costs $0.65 and pays $1.00 if the outcome is "Yes" (implied probability: 65%). The no_price is always 100 - yes_price.
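As a concrete check of this convention, a small arithmetic sketch (plain Python, no repository helpers involved) converting a trade price into dollar cost and implied probability:

```python
# yes_price is in cents: a Yes contract costs yes_price/100 dollars and pays $1.00 if it resolves "Yes".
def implied_probability(price_cents: int) -> float:
    return price_cents / 100.0

yes_price = 65
no_price = 100 - yes_price  # invariant in the trades schema

print(f"Yes costs ${yes_price / 100:.2f}, implied P(yes) = {implied_probability(yes_price):.0%}")
print(f"No  costs ${no_price / 100:.2f}, implied P(no)  = {implied_probability(no_price):.0%}")
```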

Writing Analysis Scripts

Analysis scripts live in research/analysis/ and output to research/fig/.

Basic template

```python
#!/usr/bin/env python3
"""Brief description of what this analysis does."""

from pathlib import Path

import duckdb
import matplotlib.pyplot as plt


def main():
    # Standard path setup
    base_dir = Path(__file__).parent.parent.parent
    trades_dir = base_dir / "data" / "trades"
    markets_dir = base_dir / "data" / "markets"
    fig_dir = base_dir / "research" / "fig"
    fig_dir.mkdir(parents=True, exist_ok=True)

    # Connect to DuckDB (in-memory)
    con = duckdb.connect()

    # Query parquet files directly with glob patterns
    df = con.execute(
        f"""
        SELECT
            yes_price,
            count,
            taker_side
        FROM '{trades_dir}/*.parquet'
        WHERE yes_price BETWEEN 1 AND 99
        LIMIT 1000
        """
    ).df()

    # Save data output
    df.to_csv(fig_dir / "my_analysis.csv", index=False)

    # Create visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(df["yes_price"], df["count"])
    ax.set_xlabel("Price (cents)")
    ax.set_ylabel("Count")
    ax.set_title("My Analysis")

    plt.tight_layout()
    fig.savefig(fig_dir / "my_analysis.png", dpi=300, bbox_inches="tight")
    fig.savefig(fig_dir / "my_analysis.pdf", bbox_inches="tight")
    plt.close(fig)

    print(f"Outputs saved to {fig_dir}")


if __name__ == "__main__":
    main()
```

Common query patterns

Join trades with market outcomes:

```sql
WITH resolved_markets AS (
    SELECT ticker, result
    FROM '{markets_dir}/*.parquet'
    WHERE status = 'finalized'
      AND result IN ('yes', 'no')
)
SELECT
    t.yes_price,
    t.count,
    t.taker_side,
    m.result,
    CASE WHEN t.taker_side = m.result THEN 1 ELSE 0 END AS taker_won
FROM '{trades_dir}/*.parquet' t
INNER JOIN resolved_markets m ON t.ticker = m.ticker
```
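One way to use this pattern in a script is sketched below, reusing the path setup from the template above; the contract-weighted win-rate aggregation is an illustration, not a repository helper. Grouping by the price the taker paid lets you compare the taker's realized win rate against the implied probability taker_price / 100:

```python
from pathlib import Path

import duckdb

base_dir = Path(__file__).parent.parent.parent
trades_dir = base_dir / "data" / "trades"
markets_dir = base_dir / "data" / "markets"

con = duckdb.connect()
win_rates = con.execute(
    f"""
    WITH resolved_markets AS (
        SELECT ticker, result
        FROM '{markets_dir}/*.parquet'
        WHERE status = 'finalized' AND result IN ('yes', 'no')
    )
    SELECT
        CASE WHEN t.taker_side = 'yes' THEN t.yes_price ELSE t.no_price END AS taker_price,
        SUM(CASE WHEN t.taker_side = m.result THEN t.count ELSE 0 END) * 1.0
            / SUM(t.count) AS taker_win_rate
    FROM '{trades_dir}/*.parquet' t
    INNER JOIN resolved_markets m ON t.ticker = m.ticker
    GROUP BY taker_price
    ORDER BY taker_price
    """
).df()

# In a well-calibrated market, taker_win_rate should track taker_price / 100
print(win_rates.head(20))
```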

Analyze both taker and maker positions:

```sql
WITH all_positions AS (
    -- Taker positions
    SELECT
        CASE WHEN taker_side = 'yes' THEN yes_price ELSE no_price END AS price,
        count,
        'taker' AS role
    FROM '{trades_dir}/*.parquet'

    UNION ALL

    -- Maker positions (counterparty)
    SELECT
        CASE WHEN taker_side = 'yes' THEN no_price ELSE yes_price END AS price,
        count,
        'maker' AS role
    FROM '{trades_dir}/*.parquet'
)
SELECT price, role, SUM(count) AS total_contracts
FROM all_positions
GROUP BY price, role
ORDER BY price
```

Extract category from event_ticker:

```sql
SELECT
    CASE
        WHEN event_ticker IS NULL OR event_ticker = '' THEN 'independent'
        ELSE regexp_extract(event_ticker, '^([A-Z0-9]+)', 1)
    END AS category,
    COUNT(*) AS market_count
FROM '{markets_dir}/*.parquet'
GROUP BY category
```

Using the categories utility

For grouping markets into high-level categories (Sports, Politics, Crypto, etc.):

```python
from research.analysis.util.categories import get_group, get_hierarchy, GROUP_COLORS

# Get high-level group
group = get_group("NFLGAME")  # Returns "Sports"

# Get full hierarchy (group, category, subcategory)
hierarchy = get_hierarchy("NFLGAME")  # Returns ("Sports", "NFL", "Games")

# Use predefined colors for consistent visualizations
color = GROUP_COLORS["Sports"]  # Returns "#1f77b4"
```
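A hedged sketch of combining this utility with the category-extraction query above, rolling market counts up to high-level groups. It assumes the data is extracted to data/, that the script runs where research.analysis.util.categories is importable, and that unknown ticker prefixes should fall back to a catch-all bucket:

```python
import duckdb

from research.analysis.util.categories import get_group

con = duckdb.connect()
categories = con.execute(
    """
    SELECT
        CASE
            WHEN event_ticker IS NULL OR event_ticker = '' THEN 'independent'
            ELSE regexp_extract(event_ticker, '^([A-Z0-9]+)', 1)
        END AS category,
        COUNT(*) AS market_count
    FROM 'data/markets/*.parquet'
    GROUP BY category
    """
).df()


def safe_group(prefix: str) -> str:
    # Assumption: get_group may not recognize every prefix; fall back to "Other" if it raises.
    try:
        return get_group(prefix)
    except Exception:
        return "Other"


categories["group"] = categories["category"].map(safe_group)
print(categories.groupby("group")["market_count"].sum().sort_values(ascending=False))
```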

Output conventions

  • Save CSV/JSON for raw data: fig_dir / "analysis_name.csv"
  • Save PNG at 300 DPI for presentations: fig_dir / "analysis_name.png"
  • Save PDF for papers: fig_dir / "analysis_name.pdf"
  • Print a completion message: print(f"Outputs saved to {fig_dir}")

Dependencies available

Scripts have access to these libraries (see pyproject.toml):

  • duckdb - SQL queries on Parquet files
  • pandas - DataFrames
  • matplotlib - Plotting
  • scipy - Statistical functions
  • brokenaxes - Plots with broken axes
  • squarify - Treemap visualizations