1970-01-01

(2026-03) A Resource-Efficient Hardware Accelerator for Large-Size NTT via Algorithm–Architecture Co-Design

[[Kaixuan Wang]]
[[Yifan Yanggong]]
[[Xiaoyu Yang]]
[[Chenti Baixiao]]
[[Lei Wang]]

2026-03-05

Abstract

Large-size Number Theoretic Transforms (NTTs) are key operations in modern Zero-Knowledge Proofs (ZKPs), where the NTT size often reaches millions of points and the arithmetic is over wide prime fields. To handle such NTTs on hardware, prior designs commonly rely on the decomposition algorithm, which makes large-size NTTs feasible by streaming sub-NTTs through limited on-chip buffers. However, in practical implementations, decomposition alone is insufficient to ensure high efficiency. Since coefficients and twiddle factors remain off-chip, performance tends to depend on data movement across the memory hierarchy and delivery to the processing elements (PEs). To address these remaining issues, we present RENTT, a resource-efficient accelerator for large-size NTTs with decomposition-oriented data movement schemes and memory hierarchy design.

First, we design a precomputation-based twiddle factor management scheme that feeds multiple PEs without conflicts, both reducing on-chip twiddle factor storage and avoiding on-the-fly twiddle factor generation. Second, we develop a burst-optimized transpose method that fuses coefficient reordering into the element-wise twiddle-multiplication pass, reducing the latency of off-chip accesses for the subsequent NTTs. Third, we design a decoupled, multi-banked on-chip memory hierarchy that sustains high PE utilization under off-chip streaming, while remaining configurable across PE counts and maximum supported NTT sizes. We implement RENTT on an FPGA and experimentally verify that RENTT supports NTT sizes up to $N=2^{28}$ over 256-bit fields and completes a $2^{28}$ -point NTT in 1.52 seconds with 16 processing elements. Compared with the state-of-the-art FPGA baseline SAM, RENTT provides $2.64\times$ speedup (1.52s vs. 4.02s), reduces 46.4% DSP usage (3629 vs. 6776 DSPs), and achieves a $3.58\times$ lower area-time product.