
How to Deploy LLM Locally


Summary

The article examines the profound shift in the scale of AI models, contrasting the pre-LLM era, when small convolutional neural networks (CNNs) with fewer than 1000 parameters were easy to deploy locally and capable of tasks such as digit recognition, with the emerging LLM era. The latter, defined by massive models such as GPT-3, which reaches 175 billion parameters and requires 350 GB of storage, poses a huge challenge for local deployment. The text nevertheless stresses the importance of deploying open-weight LLMs locally, arguing that this ensures transparency, lowers costs, prevents monopolies, sidesteps content censorship, and allows fine-tuning or customization of the model. To make this feasible, the guide describes a three-step process: choosing suitable hardware, downloading the model weights, and using an optimized inference framework, highlighting VRAM as the deciding factor and quantization as the key technique for viability.

Key Points

  • Drastic evolution of AI models: the piece underscores the huge gap between the simplicity of earlier models, such as sub-1000-parameter CNNs that ran in seconds, and today's LLMs such as GPT-3, which scale to 175 billion parameters and require hundreds of gigabytes of storage.

Content

This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.

Content for a club presentation, archived here. LLMs evolve rapidly—this might become outdated quickly, so please use your search skills flexibly to get the latest information.


Introduction

A few years ago, before LLMs burst onto the scene, many small models could easily run on all sorts of devices.

For example, you could casually whip up a CNN with fewer than 1000 parameters, train it on MNIST in under a minute, achieve 94% accuracy on handwritten digit recognition, and—on modern GPUs—process over 10,000 images per second without any optimization effort.

A super tiny CNN with just 786 parameters
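For reference, a network in this spirit takes only a few lines of PyTorch. The layer sizes below are my own illustration and land near, not exactly at, 786 parameters; the exact architecture is the one in the figure above.

```python
import torch
import torch.nn as nn

# A deliberately tiny CNN for 28x28 MNIST digits.
# Layer sizes are illustrative and give roughly the same parameter budget
# as the 786-parameter model shown above (not the exact architecture).
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3),   # 1*8*9 + 8 = 80 params
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 8, kernel_size=3),   # 8*8*9 + 8 = 584 params
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global average pool, no params
        )
        self.classifier = nn.Linear(8, 10)    # 8*10 + 10 = 90 params

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = TinyCNN()
print(sum(p.numel() for p in model.parameters()))  # 754 parameters
```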

Now that you’ve learned CNNs, just make it a bit deeper and wider like this, scale it up roughly 16,000x to 13.5M parameters, and you can boost its recognition power even further. For instance, you could use it to crack CAPTCHAs.

Purely for demonstration: a CAPTCHA

This CAPTCHA solver is from my Tampermonkey script lyc8503/ddddocr_web, powered by ONNX Wasm running directly in the browser. Even on a regular laptop CPU, it solves CAPTCHAs in under 0.2 seconds with over 95% accuracy.
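The browser script drives the model through onnxruntime-web (WASM); the same kind of exported model can be run from Python with onnxruntime. The file name, input shape, and output handling below are placeholders for illustration, not the actual ddddocr interface.

```python
import numpy as np
import onnxruntime as ort

# Load an exported ONNX model and run one CPU inference.
# "captcha.onnx" and the input shape are placeholders, not the real ddddocr model.
sess = ort.InferenceSession("captcha.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

dummy = np.random.rand(1, 1, 64, 160).astype(np.float32)  # batch, channel, H, W
outputs = sess.run(None, {input_name: dummy})
print(outputs[0].shape)  # raw logits; decoding to characters is model-specific
```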

Back in the pre-LLM era, state-of-the-art models like BERT or ResNet typically stayed in the range of tens to a few hundred million parameters.

The LLM Era

In November 2022, OpenAI launched the ChatGPT series. GPT-3 scaled up another 1000x, reaching a staggering 175B parameters—its weights alone take up 350 GB of space. This poses a massive challenge for local deployment.
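The 350 GB figure follows directly from the parameter count: at FP16, each parameter occupies 2 bytes.

```python
params = 175e9          # GPT-3 parameter count
bytes_per_param = 2     # FP16/BF16 weights
print(params * bytes_per_param / 1e9, "GB")  # 350.0 GB for the weights alone
```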

Image from bbycroft.net/llm

Why We Need Locally Deployable Open-Weight LLMs

  1. Transparency and consistency
  2. Lower costs and prevention of monopolies
  3. Bypassing online content censorship (“data security”)
  4. Ability to fine-tune or hack the model

How to Run an LLM in Three Steps (like stuffing an elephant into a fridge)

  1. Get suitable hardware
  2. Download model weights
  3. Install a proper inference framework and run it

Qwen3-Next-80B-A3B running locally, 46 tokens/s
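For step 2, weights usually come from HuggingFace. A minimal sketch with huggingface_hub is below; the repo id is only an example, so substitute whichever model (or quantized GGUF repo) you actually pick.

```python
from huggingface_hub import snapshot_download

# Download all files of a model repository into a local folder.
# The repo id below is an example; use the model you actually chose.
snapshot_download(
    repo_id="Qwen/Qwen3-Next-80B-A3B-Instruct",
    local_dir="./models/qwen3-next-80b-a3b",
)
```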

Model Selection

Every model claims to be SOTA when released—so which one should you pick? I usually check the cyber-cricket-fighting leaderboard: https://lmarena.ai/leaderboard

For niche models, you can also search HuggingFace or forums like r/LocalLLaMA. There’s a new model popping up every day.

Models can be categorized by how many parameters are active during inference:

  • Dense models: every parameter is used for every token
  • MoE (Mixture of Experts) models: only a few experts are active per token; for example, Qwen3-Next-80B-A3B has 80B total parameters but only about 3B active per token

They can also be classified by whether they have a “thinking” process before answering:

  • Thinking: produces an explicit reasoning trace before the final answer
  • Instruct: answers directly, without a visible reasoning phase
  • Hybrid: can switch thinking on or off

Hardware Selection

Here’s a rough ranking of hardware components by importance:

  1. VRAM (Video Memory Size)
    The deciding factor. Insufficient VRAM forces offloading to system RAM or even disk, which can drastically slow things down.

    • NVIDIA H100: 80GB
    • NVIDIA A100: 40GB / 80GB
    • RTX 4090: 24GB
  2. Memory Bandwidth
    During generation, every new token requires reading roughly all of the active parameters from memory, so bandwidth caps throughput (a rough estimate is sketched after this list).

    • H100: ~2 TB/s
    • A100: ~1.5 TB/s
    • RTX 4090: ~1 TB/s
    • CPU DDR5: ~100 GB/s (PCIe 4.0 x16 ~32 GB/s)
  3. Compute Power (BF16/FP16)

    • H100 (BF16 Tensor): ~200 TFLOPS
    • A100 (BF16 Tensor): ~78 TFLOPS
    • RTX 4090 (FP16): ~83 TFLOPS
    • EPYC 9654: theoretical ~4 TFLOPS
  4. Multi-GPU Interconnect Bandwidth (if applicable)

    • NVLink: 56 GB/s
    • PCIe: 32 GB/s, higher latency
  5. CPU RAM Bandwidth / Compute Power (when offloading to CPU)
    DDR5 >> DDR4, multi-channel >> single-channel

  6. SSD Read Speed
    PCIe 4.0 x4: 7 GB/s
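For single-stream generation, decoding is usually memory-bound: each new token streams roughly all active weights through the memory bus once, so bandwidth divided by active-weight size gives a crude upper bound on tokens per second. A back-of-the-envelope sketch (my own illustration, not a benchmark):

```python
def max_tokens_per_second(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Crude memory-bound upper bound: one full read of the active
    weights per generated token. Ignores KV cache, overlap, and overhead."""
    active_weight_gb = active_params_billion * bytes_per_param
    return bandwidth_gb_s / active_weight_gb

# 8B dense model at Q4 (~0.5 bytes/param) on an RTX 4090 (~1 TB/s)
print(max_tokens_per_second(8, 0.5, 1000))   # ~250 tok/s upper bound

# Same model served from dual-channel DDR5 (~100 GB/s)
print(max_tokens_per_second(8, 0.5, 100))    # ~25 tok/s upper bound
```

This is also why MoE models with few active parameters, and aggressive quantization, help so much on consumer hardware.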

If you’re building your own rig, consider “scavenging” older cards like the 2080 Ti, MI50, or V100.

Framework Selection and Common Optimization Techniques

Quantization

Models are typically trained in 16-bit precision (FP16 or BF16), and published weights are usually in that format. At inference time, reducing the precision of the stored weights can significantly cut memory usage.

Rule of thumb: quantization at Q6 (6-bit) and above, such as Q8 or FP8, has negligible impact on model quality. Q5 causes slight degradation, Q4 is noticeably worse, and Q3 or below might as well be a different model entirely.

Unsloth publishes many quantized versions of popular models.
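To see what this buys you in memory terms, here is a back-of-the-envelope size table; it treats a Qn quant as exactly n bits per weight and ignores GGUF block overhead and the KV cache, so real files and runtime footprints are somewhat larger.

```python
# Approximate weight size at different quantization levels.
# Treats "Qn" as n bits per weight; real GGUF quants also store scales,
# so actual files are a bit larger than this estimate.
def weight_size_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8  # billions of bytes ~= GB

for label, bits in [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q5", 5), ("Q4", 4)]:
    print(f"70B model at {label}: ~{weight_size_gb(70, bits):.0f} GB")
# FP16 ~140 GB, Q8 ~70 GB, Q6 ~52 GB, Q5 ~44 GB, Q4 ~35 GB
```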

CPU Offload

Move some model layers to the CPU, allowing both CPU and GPU to compute together.

Works especially well for MoE models with sparse activation.

https://github.com/ggml-org/llama.cpp

Pros: Optimized for CPU/GPU hybrid inference, runs even on mobile devices, extremely popular, broad model support
Cons: No tensor parallelism support, poor multi-GPU utilization
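As a concrete illustration of CPU offload with llama.cpp, here is a minimal sketch using its llama-cpp-python bindings; the GGUF path and the number of GPU layers are placeholders to adjust for your own model and VRAM.

```python
from llama_cpp import Llama

# Keep 20 layers on the GPU and the rest on the CPU; tune n_gpu_layers
# to whatever fits in your VRAM (-1 means "as many layers as possible").
llm = Llama(
    model_path="./models/model-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain MoE models in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```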

https://github.com/ztxz16/fastllm

Pros: Made by Chinese developers, supports diverse hardware, good CPU/GPU hybrid inference, includes unique performance optimizations
Cons: Fewer supported models, limited documentation, occasional bugs

Source: Lyc8503's blog