The Lichess database contains over 7 billion chess games played on Lichess. To make it easier to query these games, I have released the open-source Aix extension for DuckDB and the accompanying Aix-compatible Lichess database.

Introduction

Every month, the Lichess database grows by about 100 million games. Altogether, the compressed PGN files add up to over 2 TB; uncompressed, they would be over 15 TB. This is not trivial to query.

Several tools exist that can handle queries on large chess databases, such as Scoutfish, pgn-extract or CQL. These tools are very useful, but also have their limitations:

  • They require uncompressed PGN files, which can take up a lot of space.
  • They only implement filtering; other query operations (e.g., aggregations) are not supported.
  • They only execute queries on the game’s moves (and perhaps PGN tags), while clock times and engine evaluations are out of scope.

Any processing beyond the above items would need custom code with the help of libraries such as python-chess or shakmaty.

To make it easier to query such massive databases, I developed Aix. My goals were to support:

  1. Space-efficient game storage
  2. Querying on game metadata (PGN tags)
  3. Querying on moves/positions
  4. Querying on clocks and engine evaluations
  5. Query operations beyond filtering
  6. Parallelized processing

With the Aix extension for DuckDB and the Aix-compatible Lichess database, all of this is now possible: you can execute SQL queries over chess games. For example, to generate a heatmap of king move destinations, this query does the job:

with king_destinations as (
    select
        move_details(movedata)
            .list_filter(lambda m: m.role = 'k')
            .apply(lambda m: m.to)
        as destinations
    from 'aix_lichess_2025-12_low.parquet'
),
unnested as (
    select unnest(destinations) as destination from king_destinations
),
aggregated as (
    select destination, count() from unnested group by 1 order by 2 desc
)

from aggregated;

Which results in:

┌─────────────┬──────────────┐
│ destination │ count_star() │
│   varchar   │    int64     │
├─────────────┼──────────────┤
│ g1          │     74020594 │
│ g8          │     71579360 │
│ g7          │     23388424 │
...

The move_details function from the Aix extension makes this possible; the other functions (such as list_filter) are core DuckDB functions, which are very powerful in combination with the Aix functions.

The Aix-compatible Lichess database offers three compression levels for the encoded chess games: Low, Medium, and High. A lower compression level provides faster decoding, so the choice between these levels enables a trade-off between speed and space usage. On the December 2025 file with Low compression (15.5 GB), the heatmap query takes about 92 seconds (AMD Ryzen Threadripper 3960X, with DuckDB thread count limited to 24). On the Medium (13.5 GB) and High (11.8 GB) compression files, this takes 104 and 241 seconds respectively.

Getting started with Aix

Start by installing DuckDB 1.4.4, then install the Aix extension. The extension is now available in the Community Extensions as aixchess, so you can install it with the following DuckDB command:

INSTALL aixchess FROM community;

(If you installed the extension earlier from the files in the GitHub release, do FORCE INSTALL aixchess FROM community;.)

Then load the extension (also do this for every new DuckDB session):

LOAD aixchess;

Now all of the extension functions are available. Download one of the Aix-compatible Lichess database files on Hugging Face and try out the above heatmap query! To experiment on a smaller file, download one of the older months or derive a smaller file from a large one using SQL’s LIMIT clause.
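
For example, a reduced file can be derived with DuckDB itself. The sketch below copies the first million rows of a database file into a new Parquet file; the input filename is the one used earlier in this post, while the output filename and the row count are just examples:

copy (
    select * from 'aix_lichess_2025-12_low.parquet' limit 1000000
) to 'aix_lichess_2025-12_low_sample.parquet' (format parquet);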

(At the time of writing, the dataset on Hugging Face is still very incomplete, as I’m still generating and uploading Aix files. If you want to try Aix on a month that’s not available yet, you can use pgn-to-aix to generate it yourself.)

The functions documentation provides further guidance, and the unit tests also provide some simple usage examples.

If you want to process the Aix database files in a manner too complex for an SQL query, you can decode Aix-encoded games using the Rust crate aix-chess-compression. Read the rows from a database using a crate such as duckdb or parquet, then use aix-chess-compression to decode the moves/positions.

The rest of this blog post provides more background and details on how Aix works.

Space-efficient storage

PGN uses storage inefficiently. An easy way to save space is to use the Parquet format, a column-oriented data file format for large-scale tabular data. Each row in the file can represent a game. Lichess already offers their database as Parquet files on Hugging Face, but the moves are still represented in the PGN format.

Each move in PGN format uses almost 6 bytes on average. This is very inefficient, considering that even a simple, naive encoding of the origin square, destination square, and possible promotion requires less than 2 bytes. Aix implements three compression levels for compact binary game encoding:

  • Low: 2 bytes per move. 6 bits for “from” square, 6 bits for “to” square, 1 bit to indicate capture, 1 bit to indicate promotion, 2 bits for promotion piece.
  • Medium: variable number of bits per move (7.1 on average), see Details of Medium compression.
  • High: variable number of bits per move (4.6 on average), based on Lichess’s game compression algorithm, implemented using the chess-huffman Rust crate.

The encoded game also needs 2 bits to indicate which compression level has been used.
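
As a rough illustration of why two bytes suffice, the query below unpacks such a 16-bit value with DuckDB’s bitwise operators. The field order is an assumption made for this sketch; the actual bit layout used by Aix may differ.

select
    (m >> 10) & 63 as from_square,     -- 6 bits
    (m >> 4) & 63  as to_square,       -- 6 bits
    (m >> 3) & 1   as is_capture,      -- 1 bit
    (m >> 2) & 1   as is_promotion,    -- 1 bit
    m & 3          as promotion_piece  -- 2 bits
from (values ((12 << 10) | (28 << 4))) as t(m);  -- a quiet move from square 12 to square 28 under this assumed layout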

Not only the moves need to be stored efficiently; the clock times and engine evaluations do too. Nothing too fancy is used here: they are simply represented as integer lists, which can be stored directly in Parquet. Only the evaluations have one tricky aspect: an engine evaluation can be either an advantage in centipawns or a forced mate in N moves. An evaluation is stored as a 16-bit integer, where the highest and lowest 512 values represent mate and all other values represent an advantage in centipawns. In other words: -32,767 to -32,257 represents mate in 1 to mate in 512 for black; 32,256 to 32,767 represents mate in 512 to mate in 1 for white; everything else represents centipawns.
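
As a sketch of how these ranges can be interpreted in plain SQL (the example values and column name are made up; this is not an Aix function), a CASE expression classifies a raw evaluation integer as follows:

select raw,
    case
        when raw >= 32256  then 'mate for white'
        when raw <= -32257 then 'mate for black'
        else concat('advantage of ', raw, ' centipawns')
    end as meaning
from (values (32767), (-32767), (150)) as t(raw);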

Metadata from the PGN tags is stored in other columns, each using the most appropriate data type.

Querying chess games

There are many query engines for Parquet, and while those can be useful for queries involving metadata (and in some cases even clock times or evaluations), they do not know the chess game encoding and do not support queries over moves or positions. This can be overcome: the analytical database system DuckDB can process Parquet, and provides an extension mechanism to add your own functionality (such as scalar SQL functions).

Adding custom functions through a DuckDB extension enables querying moves and positions of a game, or transformations of clock times and evaluations, while at the same time retaining all functionality that SQL offers – recall the heatmap query in the introduction of this post, where moves are filtered by piece and the destination squares are then aggregated. An overview of all available functions can be found in the Aix documentation.
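
For instance, an aggregation (something the filtering-only tools mentioned earlier cannot do) can compute the average game length. This sketch assumes that move_details returns one list entry per ply:

select avg(len(move_details(movedata))) as avg_plies
from 'aix_lichess_2025-12_low.parquet';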

With the scoutfish_query[_plies] functions, Aix even has (approximate) feature parity with Scoutfish, although Scoutfish is still noticeably faster because of its custom index. In theory, a similar function could be implemented to achieve feature parity with CQL, but because of CQL’s higher complexity, this will remain theoretical for at least a while.

The extension is implemented in C++, while all chess-related logic is in Rust. I used Diplomat to generate the FFI definitions, which was a smooth experience. Diplomat requires C++ 17 or above, and although DuckDB is written for C++ 11, it compiles fine with C++ 17 on all operating systems.

Details of Medium compression

The Medium compression level encodes each move in a variable number of bits. The first bits represent the piece that is being moved, the last bits represent how it moves. This idea is similar to an encoding scheme proposed by triplehappy, but several aspects of the execution are different.

Each side has at most 16 pieces on the board, so the moved piece can be encoded in 4 bits or fewer. If the side to move has N pieces, then ceil(log2(N)) bits are used to encode the moving piece. If only the king is left (N = 1), no bits need to be wasted on the moving piece.
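
This bit count can be tabulated with a throwaway DuckDB query (plain SQL, no Aix functions involved):

select n as pieces, cast(ceil(log2(n)) as int) as piece_bits
from range(1, 17) as t(n)
order by n;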

The number of bits for the move depends on the piece and is determined by the maximum number of legal moves that the piece could have in any position. King and knight moves take 3 bits, rook and bishop moves 4 bits, and queen moves 5 bits. These numbers are fixed: while they may be more than needed (e.g. if a queen only has 4 moves, 2 bits would be enough in theory), reducing the number of bits requires the computationally expensive operation of calculating the legal moves.

The lowest number of bits needed is 3 (a lone king: no piece bits plus 3 move bits); the highest is 9 (a queen move when the moving side still has 9 to 16 pieces: 4 piece bits plus 5 move bits).

Try it out!

Give it a shot: install the Aix extension and try running queries on the database files. If you run into any problems, feel free to open an issue on the GitHub repository or contact me on Bluesky. Did something cool using Aix? Feel free to let me know as well!