CUDA-Optimized Inference Engine for Large-Scale Language Models: Design, Kernels, and Latency Improvements
Keywords:
DeBERTa v3, Machine learning, Chatbots

Abstract
Large Language Model (LLM)-driven chatbots have recently become a focal point of artificial intelligence research. These systems, built on state-of-the-art neural network architectures, represent a significant advance in natural language processing. Constructing such a chatbot begins with data exploration, where statistical summaries and distribution visualizations uncover patterns in the dataset; the text then passes through an intensive preprocessing pipeline of tokenization, stop-word removal, and normalization to ensure data quality for model training.

This paper presents a GPU-accelerated inference engine for large-scale transformer language models, implemented entirely in CUDA. The latency-critical stages (context-stage KV-cache construction, token-stage incremental decoding, attention with rotary position embedding, and fused residual/feed-forward layers) are offloaded to the GPU through a hierarchy of custom kernels. We detail the design of kernels such as Flash-Decoding for attention, paged KV-cache management, and dynamic tensor-parallelism scheduling, together with micro-optimizations (shared-memory tiling, warp-specialized pipelines, and FP16/BF16 mixed precision) that sustain near-peak FLOP/s and memory bandwidth. Comprehensive benchmarks against a state-of-the-art CPU-only baseline (FP32, OpenMP-parallel) demonstrate an order-of-magnitude reduction in per-token latency and a 5–7× improvement in energy-delay product across models ranging from 7B to 70B parameters.
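To make the token-stage decoding path more concrete, the sketch below shows a minimal single-query attention kernel over an FP16 KV cache with FP32 accumulation, one thread block per head. This is an illustrative assumption rather than the engine's actual code: the kernel name decode_attention_kernel, the constants HEAD_DIM and BLOCK, and the flat [head, position, dim] cache layout are invented for the example, and it omits the Flash-Decoding split-sequence reduction, paged cache indirection, rotary embedding, and tensor-parallel sharding described in the paper.

// Minimal sketch of token-stage (decode) attention: one query token attends over
// the cached keys/values of a single head. FP16 storage, FP32 accumulation.
// All names and the flat [head][position][dim] cache layout are assumptions
// made for this example; they are not taken from the paper.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

constexpr int HEAD_DIM = 128;   // assumed head dimension
constexpr int BLOCK    = 128;   // one thread per output dimension

// q:       [num_heads, HEAD_DIM]           query of the current token
// k_cache: [num_heads, seq_len, HEAD_DIM]  cached keys
// v_cache: [num_heads, seq_len, HEAD_DIM]  cached values
// out:     [num_heads, HEAD_DIM]           attention output
__global__ void decode_attention_kernel(const half* __restrict__ q,
                                        const half* __restrict__ k_cache,
                                        const half* __restrict__ v_cache,
                                        half* __restrict__ out,
                                        int seq_len, float scale) {
    const int head = blockIdx.x;
    const int d    = threadIdx.x;                 // this thread's output dim
    extern __shared__ float scores[];             // seq_len attention logits

    const half* qh = q       + (size_t)head * HEAD_DIM;
    const half* kh = k_cache + (size_t)head * seq_len * HEAD_DIM;
    const half* vh = v_cache + (size_t)head * seq_len * HEAD_DIM;

    // 1) Scaled dot products q . k_t for every cached position, strided over threads.
    for (int t = d; t < seq_len; t += BLOCK) {
        float dot = 0.0f;
        for (int i = 0; i < HEAD_DIM; ++i)
            dot += __half2float(qh[i]) * __half2float(kh[(size_t)t * HEAD_DIM + i]);
        scores[t] = dot * scale;
    }
    __syncthreads();

    // 2) Numerically stable softmax normalizers, computed redundantly per thread
    //    for brevity (a real kernel would use warp/block reductions).
    float m = -1e30f;
    for (int t = 0; t < seq_len; ++t) m = fmaxf(m, scores[t]);
    float z = 0.0f;
    for (int t = 0; t < seq_len; ++t) z += __expf(scores[t] - m);

    // 3) Probability-weighted sum of values for this thread's output dimension.
    float acc = 0.0f;
    for (int t = 0; t < seq_len; ++t)
        acc += (__expf(scores[t] - m) / z) * __half2float(vh[(size_t)t * HEAD_DIM + d]);
    out[head * HEAD_DIM + d] = __float2half(acc);
}

int main() {
    const int num_heads = 8, seq_len = 256;       // small illustrative sizes
    const size_t qo = (size_t)num_heads * HEAD_DIM;
    const size_t kv = (size_t)num_heads * seq_len * HEAD_DIM;

    half *q, *k, *v, *o;
    cudaMallocManaged(&q, qo * sizeof(half));
    cudaMallocManaged(&k, kv * sizeof(half));
    cudaMallocManaged(&v, kv * sizeof(half));
    cudaMallocManaged(&o, qo * sizeof(half));
    for (size_t i = 0; i < qo; ++i) q[i] = __float2half(0.01f);
    for (size_t i = 0; i < kv; ++i) { k[i] = __float2half(0.01f); v[i] = __float2half(1.0f); }

    const float scale = 1.0f / sqrtf((float)HEAD_DIM);
    decode_attention_kernel<<<num_heads, BLOCK, seq_len * sizeof(float)>>>(
        q, k, v, o, seq_len, scale);
    cudaDeviceSynchronize();

    // With uniform keys and values the softmax is uniform, so every output is ~1.0.
    printf("out[0] = %f\n", __half2float(o[0]));
    cudaFree(q); cudaFree(k); cudaFree(v); cudaFree(o);
    return 0;
}

Compiled with nvcc, the launch uses one block per attention head and seq_len * sizeof(float) bytes of dynamic shared memory for the score buffer; a production Flash-Decoding kernel would instead partition the cached sequence across blocks and merge their partial softmax statistics.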
License
Copyright (c) 2025 Mark Ouyang, Fengrui Zhang

This work is licensed under a Creative Commons Attribution 4.0 International License.