Performance Benchmarks

Across the 17 rules we benchmark, Basilisk’s cold full-file check is a median 35× faster than Pyright (up to 58×): a median of 17 ms against Pyright’s median of 609 ms. These numbers are self-measured and reproducible: every release runs make bench and commits the raw CSV, so the table below is exactly what was last measured, never a hand-typed figure.

| Rule | basilisk | basilisk-warm | pyright | mypy | mypy-warm | ty | pyrefly | zuban |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E0001 Missing param | 12 ms | 4 ms ✓ | 531 ms | 635 ms | 168 ms | 31 ms | 111 ms | 32 ms |
| E0002 Missing return | 16 ms | 4 ms ✓ | 580 ms | 633 ms | 166 ms | 36 ms | 113 ms | 38 ms |
| E0010 Unresolved import | 50 ms | 8 ms ✓ | 477 ms | 668 ms | 166 ms | 246 ms | 783 ms | 244 ms |
| E0011 Explicit any | 15 ms | 4 ms ✓ | 530 ms | 622 ms | 164 ms | 30 ms | 108 ms | 30 ms |
| E0012 Argument type mismatch | 17 ms | 6 ms ✓ | 668 ms | 624 ms | 166 ms | 54 ms | 114 ms | 48 ms |
| E0014 Assignment incompatibility | 12 ms | 8 ms ✓ | 609 ms | 592 ms | 170 ms | 51 ms | 114 ms | 31 ms |
| E0016 Incompatible override | 52 ms | 4 ms ✓ | 655 ms | 624 ms | 167 ms | 54 ms | 118 ms | 30 ms |
| E0018 Undefined name | 19 ms | 8 ms ✓ | 493 ms | 641 ms | 169 ms | 49 ms | 549 ms | 35 ms |
| E0022 Unhashable dict key | 16 ms | 8 ms ✓ | 532 ms | 627 ms | 164 ms | 37 ms | 104 ms | 31 ms |
| E0023 Nonexhaustive match | 13 ms | 6 ms ✓ | 541 ms | 613 ms | 165 ms | 23 ms | 109 ms | 27 ms |
| E0026 Typevar single constraint | 21 ms | 8 ms ✓ | 737 ms | 596 ms | 166 ms | 39 ms | 109 ms | 33 ms |
| E0036 Classvar misuse | 19 ms | 8 ms ✓ | 616 ms | 624 ms | 166 ms | 56 ms | 132 ms | 32 ms |
| E0038 Typeddict readonly inheritance | 92 ms | 5 ms ✓ | 680 ms | 585 ms | 165 ms | 38 ms | 117 ms | 26 ms |
| E0050 Newtype name mismatch | 14 ms | 8 ms ✓ | 727 ms | 633 ms | 166 ms | 44 ms | 121 ms | 35 ms |
| E0054 Final reassignment | 8 ms | 5 ms ✓ | 474 ms | 579 ms | 166 ms | 27 ms | 99 ms | 24 ms |
| E0056 Readonly typeddict mutation | 39 ms | 5 ms ✓ | 645 ms | 591 ms | 166 ms | 41 ms | 107 ms | 25 ms |
| E0093 Typeddict invalid key | 38 ms | 5 ms ✓ | 640 ms | 588 ms | 166 ms | 37 ms | 110 ms | 26 ms |

Versions benchmarked
basilisk dev (c7d11042)
pyright 1.1.408
mypy 1.19.1
ty 0.0.19
pyrefly 0.54.0
zuban 0.9.0

Mean wall-clock over 10 hyperfine runs; lower is better, ✓ = fastest. Measured on Apple M4 Max (Darwin 25.5.0, 14 cores) on 2026-07-03T17:39:13+1000. Source: benchmarks/status/darwin-arm64-apple-m4-max.csv — regenerated by make bench.
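As a sanity check, the headline figures can be re-derived from the cold basilisk and pyright columns of the table. The sketch below uses the whole-millisecond values as displayed, not the raw CSV, so results can differ slightly from figures computed on unrounded data: the median ratio comes out at about 35×, while the maximum lands a little above the quoted 58×, presumably a rounding effect. The variable names are illustrative only.

```python
from statistics import median

# (basilisk cold ms, pyright cold ms) per rule, copied from the table on this page.
# These are the rounded, displayed values, not the raw CSV.
COLD_MS = {
    "E0001": (12, 531), "E0002": (16, 580), "E0010": (50, 477),
    "E0011": (15, 530), "E0012": (17, 668), "E0014": (12, 609),
    "E0016": (52, 655), "E0018": (19, 493), "E0022": (16, 532),
    "E0023": (13, 541), "E0026": (21, 737), "E0036": (19, 616),
    "E0038": (92, 680), "E0050": (14, 727), "E0054": (8, 474),
    "E0056": (39, 645), "E0093": (38, 640),
}

# Per-rule speedup: how many times longer pyright takes than basilisk.
ratios = {rule: pyright / basilisk for rule, (basilisk, pyright) in COLD_MS.items()}

median_basilisk = median(b for b, _ in COLD_MS.values())
median_pyright = median(p for _, p in COLD_MS.values())
median_speedup = median(ratios.values())
best_rule = max(ratios, key=ratios.get)

print(f"rules benchmarked: {len(COLD_MS)}")
print(f"median cold time: basilisk {median_basilisk} ms, pyright {median_pyright} ms")
print(f"median speedup: {median_speedup:.1f}x")
print(f"largest speedup: {ratios[best_rule]:.1f}x on {best_rule}")
```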

Methodology

Every number is produced by hyperfine, which runs each tool’s real command-line checker repeatedly (10 timed runs per measurement here) and reports the mean wall-clock time. We run the same harness against every tool on identical single-rule fixtures, large Python files that each stress one diagnostic, so the comparison is like-for-like.
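To make the measurement concrete, here is a minimal sketch of how one such timing could be set up and read back. It is not the project's actual harness: the function names, the fixture name and the export path are assumptions, and the sample JSON values are invented for illustration. The hyperfine options used (--runs, --ignore-failure, --export-json) are real; --ignore-failure is included on the assumption that a checker exits non-zero when it reports diagnostics.

```python
import json
import shlex


def hyperfine_argv(checker_cmd: str, runs: int = 10,
                   export_path: str = "result.json") -> list[str]:
    """Build a hyperfine invocation that times one checker command."""
    return [
        "hyperfine",
        "--runs", str(runs),           # fixed number of timed runs
        "--ignore-failure",            # tolerate non-zero exits from the checker
        "--export-json", export_path,  # machine-readable results
        checker_cmd,
    ]


def mean_ms(hyperfine_json: str) -> float:
    """Return the mean wall-clock time in milliseconds from hyperfine's JSON export."""
    results = json.loads(hyperfine_json)["results"]
    return results[0]["mean"] * 1000.0  # hyperfine reports times in seconds


# A stand-in for an export file; the numbers here are illustrative, not measured.
sample = json.dumps({"results": [{"command": "pyright fixture.py", "mean": 0.609}]})

print(shlex.join(hyperfine_argv("pyright fixture.py")))
print(f"{mean_ms(sample):.0f} ms")
```

Building the argument list separately from running it keeps the same invocation shape for every tool, which is the like-for-like property the methodology depends on.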

Cold vs. warm

The main <tool> column is a cold full-file check from scratch: the time to start the tool and check the file with no prior state. Only Basilisk and mypy also have a -warm column, because they are the only two that keep a real cross-run cache: basilisk-warm is a --cache result-cache hit, and mypy-warm is an incremental .mypy_cache hit, with cold mypy measured under --no-incremental. Pyright, ty and Pyrefly keep no cross-run result cache, so a repeat run is a cold run and they are measured cold-only. zuban is cold-only too, but its mypy mode reuses a ./.mypy_cache when present, so we delete that cache before each timed run to keep the measurement genuinely cold.
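One way to express the cold and warm setups with hyperfine is sketched below. This is an assumption about how such runs could be arranged, not the project's harness: --prepare runs a command before every timed run, which suits deleting a stale cache, and --warmup performs untimed runs first, which is one way to populate a cache before a warm measurement. The helper names and the fixture name are hypothetical.

```python
from typing import Optional

# Shared options: 10 timed runs, tolerate non-zero checker exit codes.
BASE = ["hyperfine", "--runs", "10", "--ignore-failure"]


def cold_argv(checker_cmd: str, stale_cache: Optional[str] = None) -> list[str]:
    """Time a cold check, optionally deleting a cache directory before every timed run."""
    argv = list(BASE)
    if stale_cache is not None:
        # --prepare executes before each timing run, so no run can reuse the cache.
        argv += ["--prepare", f"rm -rf {stale_cache}"]
    return argv + [checker_cmd]


def warm_argv(checker_cmd: str) -> list[str]:
    """Time a warm check: one untimed warmup run fills the cache first."""
    return BASE + ["--warmup", "1", checker_cmd]


# zuban's mypy mode would otherwise pick up ./.mypy_cache, so remove it each time.
zuban_cold = cold_argv("zuban mypy --strict fixture.py", stale_cache=".mypy_cache")
# Cold mypy disables the incremental cache; warm mypy leaves it enabled.
mypy_cold = cold_argv("mypy --strict --no-incremental fixture.py")
mypy_warm = warm_argv("mypy --strict fixture.py")

for argv in (zuban_cold, mypy_cold, mypy_warm):
    print(argv)
```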

Fair strictness

mypy runs with --strict and zuban runs as zuban mypy --strict, because the fixtures exercise strict-mode analysis — without it those tools report “no issues” on the strictness fixtures and would not be doing comparable work. Basilisk, Pyright, ty and Pyrefly are run in their default checking mode.
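The strictness settings above amount to a small per-tool command table, sketched here. The mypy and zuban entries follow the text; the ty and pyrefly entries use their check subcommands. The basilisk entry is a placeholder, since this page does not state its exact invocation. The names CHECKER_ARGV and checker_argv are hypothetical.

```python
# Base command for each checker, before the fixture path is appended.
CHECKER_ARGV = {
    "mypy": ["mypy", "--strict"],            # strict mode, as stated above
    "zuban": ["zuban", "mypy", "--strict"],  # zuban's mypy-compatible mode, strict
    "pyright": ["pyright"],                  # default checking mode
    "ty": ["ty", "check"],                   # default checking mode
    "pyrefly": ["pyrefly", "check"],         # default checking mode
    "basilisk": ["basilisk"],                # assumption: real subcommand/flags not given here
}


def checker_argv(tool: str, fixture: str) -> list[str]:
    """Return the full command line for checking one fixture with one tool."""
    return CHECKER_ARGV[tool] + [fixture]


for tool in CHECKER_ARGV:
    print(tool, "->", " ".join(checker_argv(tool, "fixture.py")))
```

Keeping the flags in one table makes it easy to audit that only mypy and zuban receive --strict while the other tools run in their defaults.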

What this does and doesn’t show

These are cold, full-file CLI runs. They reflect batch/CI checking, not warm incremental editor latency: inside the LSP, an edit re-checks only the file you touched and its importers, so an interactive re-check does strictly less work than any run measured here. Throughput on a single-rule fixture is also not the same as throughput on a large real codebase.

Reproduce any of this yourself with make bench. The raw results are committed to the repository under benchmarks/status/<machine>.csv; this page renders the darwin-arm64-apple-m4-max machine. Competitor PEP-conformance context lives in the type-checker comparison, and Pyrefly’s own throughput figure (1.85M LOC/sec on 166-core Meta infrastructure) is published at pyrefly.org.