Serving LLMs on an RTX4090 with Sequoia

| GPU | Bandwidth (GB/s) | Target Model | Draft Model | TBT (s) | Baseline (s) |
|---|---|---|---|---|---|
| 4090 | 31.5 | Llama2-70B | Llama2-7B | 0.57 | 4.54 |
| 4090 | 31.5 | Vicuna-33B | TinyVicuna-1B | 0.35 | 1.78 |
| 4090 | 31.5 | Llama2-22B | TinyLlama-1.1B | 0.17 | 0.95 |
| 4090 | 31.5 | InternLM-20B | InternLM-7B | 0.17 | 0.77 |
| 4090 | 31.5 | Llama2-13B | TinyLlama-1.1B | 0.09 | 0.27 |
| 2080Ti | 15.8 | Vicuna-33B | TinyVicuna-1B | 0.87 | 4.81 |
| 2080Ti | 15.8 | Llama2-22B | TinyLlama-1.1B | 0.53 | 3.04 |
| 2080Ti | 15.8 | Llama2-13B | TinyLlama-1.1B | 0.34 | 1.53 |

Sequoia speeds up LLM inference across a range of model sizes and hardware. We evaluate Sequoia with LLMs of various sizes (Llama2-70B-chat, Vicuna-33B, Llama2-22B, InternLM-20B, and Llama2-13B-chat) on RTX 4090 and 2080Ti GPUs, using prompts from MT-Bench. The two hardware platforms differ in GPU, CPU RAM, and CPU-GPU bandwidth. The evaluation results are listed in the table above.
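To make the table easier to compare across rows, the two timing columns can be turned into a speedup ratio. The sketch below assumes that "Baseline (s)" is the per-token latency of the same target model on the same hardware without Sequoia, so that speedup = Baseline / TBT. The names `ROWS`, `speedup`, and `speedups` are hypothetical helpers for illustration and are not part of Sequoia.

```python
# Rows copied from the results table: (GPU, target model, TBT in s, baseline in s).
ROWS = [
    ("4090", "Llama2-70B", 0.57, 4.54),
    ("4090", "Vicuna-33B", 0.35, 1.78),
    ("4090", "Llama2-22B", 0.17, 0.95),
    ("4090", "InternLM-20B", 0.17, 0.77),
    ("4090", "Llama2-13B", 0.09, 0.27),
    ("2080Ti", "Vicuna-33B", 0.87, 4.81),
    ("2080Ti", "Llama2-22B", 0.53, 3.04),
    ("2080Ti", "Llama2-13B", 0.34, 1.53),
]


def speedup(tbt_s, baseline_s):
    """Ratio of baseline latency to Sequoia latency (assumed both per token)."""
    return baseline_s / tbt_s


def speedups(rows=ROWS):
    """Return (GPU, target model, speedup) for every row of the table."""
    return [(gpu, model, speedup(tbt, base)) for gpu, model, tbt, base in rows]


if __name__ == "__main__":
    for gpu, model, ratio in speedups():
        print(f"{gpu:>6}  {model:<13} {ratio:.2f}x")
```

Under that assumption, the ratio is largest for Llama2-70B on the 4090 (4.54 / 0.57, a little under 8x) and smallest for Llama2-13B on the 4090 (0.27 / 0.09, about 3x).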

Here we show a demo of Llama2-70B inference on a single RTX 4090, with and without Sequoia (video plays at 4X speed).