cronokirby

(2026-05) Zero-Knowledge Proofs for Gradient Boosted Decision Trees

2026-05-08

Abstract

Gradient boosted decision trees (GBDTs) are among the most effective models for tabular data and are widely used in domains such as finance, healthcare, and risk assessment. As these models are increasingly trained and served by external providers, clients need a way to check that a prediction or a model update was produced by the claimed training pipeline. At the same time, the provider may need to keep the training data and model parameters private. This makes zero-knowledge proofs for GBDT training and inference a natural tool for accountable machine learning.

Existing constructions for proving GBDT training typically rely on generic ZKP compilers. They build a certification circuit that checks the forest against the training data, and then prove the circuit execution. This leads to high prover cost. On the other hand, a more direct approach to decompose proof of training into algebraic constraints inevitably introduces many auxiliary witnesses to assist proving. Proving these constraints separately could result in a huge amount of independent auxiliary commitments, whose committing and opening could dominate both proof size and prover time. Batching these constraints is also difficult because they come from different stages of training and have potentially different witness shapes and sizes, which are committed over different domains.

We present \textsc{Terrae}, a zero-knowledge proof system for quantized GBDT training and inference based on KZG polynomial commitments. \textsc{Terrae} avoids both dependency on proving circuit computation and proving each constraint separately by leveraging the structure of GBDT training and novelly batching the constraints. We introduce two batching techniques: domain-lifting batching for linear constraints and interleaving batching for non-linear constraints. Both techniques work over differently-sized domains and reduce many constraints to a single claim without introducing extra polynomial commitments. We also design a histogram proof that proves the correctness of converting sample-wise data into its frequency representation, which may be of independent interest. Our evaluation shows that, compared with prior approaches, \textsc{Terrae} significantly reduces proof-generation time while adding only a small proof-size overhead.