CUDA-Optimized Inference Engine for Large-Scale Language Models: Design, Kernels, and Latency Improvements

Authors

  • Mark Ouyang, The Chinese University of Hong Kong, HK
  • Fengrui Zhang, Worcester Polytechnic Institute, USA

Keywords

DeBERTa v3, Machine learning, Chatbots

Abstract

In recent years, Large Language Model (LLM)-driven chatbots have become a focal point of artificial intelligence research. Built on state-of-the-art neural network architectures, these systems represent a significant advance in natural language processing. Their construction begins with data exploration, in which statistical summaries and distribution visualizations uncover patterns in the dataset; the text then passes through a preprocessing pipeline of tokenization, stop-word removal, and normalization to ensure data quality for model training. This paper presents a GPU-accelerated inference engine for large-scale transformer language models, implemented entirely in CUDA. The latency-critical stages (context-stage KV-cache construction, token-stage incremental decoding, attention with rotary position embedding, and residual/feed-forward layer fusion) are offloaded to the GPU through a hierarchy of custom kernels. We detail the design of latency-critical kernels such as Flash-Decoding for attention, paged KV-cache management, and dynamic tensor-parallelism scheduling, together with micro-optimizations (shared-memory tiling, warp-specialized pipelines, and FP16/BF16 mixed precision) that yield near-peak FLOP/s and memory-bandwidth utilization. Comprehensive benchmarks against a state-of-the-art CPU-only baseline (FP32, OpenMP-parallel) demonstrate an order-of-magnitude reduction in per-token latency and a 5–7× improvement in energy-delay product across models ranging from 7B to 70B parameters.
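To make the decode-stage attention and paged KV-cache ideas mentioned above concrete, the sketch below shows a minimal single-query (decode-step) attention kernel that reads FP16 keys and values through a block table and accumulates in FP32. It is an illustrative sketch only, not the authors' Flash-Decoding kernel; the head dimension, page size, cache layout, kernel name, and the serial softmax reduction are all assumptions made for brevity.

// Minimal sketch (assumed, not the paper's kernel): one decode-step attention
// query per head against a paged FP16 KV cache, with FP32 accumulation.
#include <cuda_fp16.h>
#include <math.h>

constexpr int HEAD_DIM    = 64;    // assumed head dimension; blockDim.x must equal this
constexpr int PAGE_SIZE   = 16;    // assumed tokens per KV-cache page
constexpr int MAX_SEQ_LEN = 2048;  // sketch limit so scores fit in shared memory

// Assumed cache layout: [physical_page][PAGE_SIZE][num_heads][HEAD_DIM], fp16.
// block_table maps each logical page of the sequence to a physical page.
__global__ void paged_decode_attention(const half*  __restrict__ k_cache,
                                       const half*  __restrict__ v_cache,
                                       const int*   __restrict__ block_table,
                                       const float* __restrict__ q,    // [num_heads][HEAD_DIM]
                                       float*       __restrict__ out,  // [num_heads][HEAD_DIM]
                                       int num_heads, int seq_len, float scale)
{
    const int head = blockIdx.x;   // one thread block per attention head
    const int d    = threadIdx.x;  // this thread owns output dimension d

    __shared__ float scores[MAX_SEQ_LEN];
    __shared__ float q_s[HEAD_DIM];
    __shared__ float s_max, s_sum;

    q_s[d] = q[head * HEAD_DIM + d];
    __syncthreads();

    // Pass 1: scaled dot-product score for each cached token, via paged lookup.
    for (int t = d; t < seq_len; t += blockDim.x) {
        const size_t base = ((size_t)block_table[t / PAGE_SIZE] * PAGE_SIZE
                             + t % PAGE_SIZE) * num_heads * HEAD_DIM
                             + (size_t)head * HEAD_DIM;
        float dot = 0.f;
        for (int i = 0; i < HEAD_DIM; ++i)
            dot += q_s[i] * __half2float(k_cache[base + i]);
        scores[t] = dot * scale;
    }
    __syncthreads();

    // Pass 2: numerically stable softmax statistics (serial here for clarity).
    if (d == 0) {
        float m = -INFINITY;
        for (int t = 0; t < seq_len; ++t) m = fmaxf(m, scores[t]);
        float s = 0.f;
        for (int t = 0; t < seq_len; ++t) s += __expf(scores[t] - m);
        s_max = m;
        s_sum = s;
    }
    __syncthreads();

    // Pass 3: probability-weighted sum of V; each thread produces one output dim.
    float acc = 0.f;
    for (int t = 0; t < seq_len; ++t) {
        const float p = __expf(scores[t] - s_max) / s_sum;
        const size_t base = ((size_t)block_table[t / PAGE_SIZE] * PAGE_SIZE
                             + t % PAGE_SIZE) * num_heads * HEAD_DIM
                             + (size_t)head * HEAD_DIM;
        acc += p * __half2float(v_cache[base + d]);
    }
    out[head * HEAD_DIM + d] = acc;
}
// Assumed launch: paged_decode_attention<<<num_heads, HEAD_DIM>>>(...),
// with scale typically 1.0f / sqrtf((float)HEAD_DIM).

A production kernel of the kind the abstract describes would replace the serial reduction with a warp-level online softmax and split the sequence across blocks in the Flash-Decoding style, but the paged indexing pattern is the same.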

Published

2025-09-04

How to Cite

Ouyang, M., & Zhang, F. (2025). CUDA-Optimized Inference Engine for Large-Scale Language Models: Design, Kernels, and Latency Improvements. Journal of Theory and Practice in Engineering and Technology, 2(5), 1–9. Retrieved from https://woodyinternational.com/index.php/jtpet/article/view/291