This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.
This is content from a club presentation, archived here. LLMs evolve rapidly, so this post may become outdated quickly; please search around for the latest information.
Introduction
A few years ago, before LLMs burst onto the scene, many small models could easily run on all sorts of devices.
For example, you could casually whip up a CNN with fewer than 1000 parameters, train it on MNIST in under a minute, achieve 94% accuracy on handwritten digit recognition, and—on modern GPUs—process over 10,000 images per second without any optimization effort.
A super tiny CNN with just 786 parameters
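For reference, here is a minimal sketch of what such a tiny network might look like in PyTorch. The layer sizes are illustrative and land at roughly 850 parameters; this is not the exact 786-parameter model in the figure.

```python
import torch.nn as nn

# A hypothetical sub-1000-parameter CNN for 28x28 MNIST digits
# (illustrative only; not the exact 786-parameter model shown above).
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(6, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),                       # 12 features
            nn.Linear(12, 10),                                           # 10 digit classes
        )

    def forward(self, x):
        return self.net(x)

model = TinyCNN()
print(sum(p.numel() for p in model.parameters()))  # ~850 trainable parameters
```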
Now that you've learned CNNs, just make the network a bit deeper and wider, scale it up about 16,000x to 13.5M parameters, and its recognition power improves even further. For instance, you could use it to crack CAPTCHAs.
Purely for demonstration: a CAPTCHA
This CAPTCHA solver is from my Tampermonkey script lyc8503/ddddocr_web, powered by ONNX Wasm running directly in the browser. Even on a regular laptop CPU, it solves CAPTCHAs in under 0.2 seconds with over 95% accuracy.
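If you would rather run the same kind of model outside the browser, the underlying ddddocr project also ships as a Python package. A quick sketch, assuming a local captcha.png (the file name is just an example):

```python
import ddddocr  # pip install ddddocr; runs the bundled ONNX model on the CPU

ocr = ddddocr.DdddOcr()                   # loads the recognition model
with open("captcha.png", "rb") as f:      # hypothetical local CAPTCHA image
    print(ocr.classification(f.read()))   # prints the recognized text, e.g. "w7x2"
```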
Back in the pre-LLM era, state-of-the-art models like BERT or ResNet typically had on the order of 100M parameters or fewer.
The LLM Era
In November 2022, OpenAI launched ChatGPT. The GPT-3 series behind it had scaled up roughly another 1000x, reaching a staggering 175B parameters; in FP16 the weights alone take up 350 GB. This poses a massive challenge for local deployment.
Image from bbycroft.net/llm
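Where does 350 GB come from? Each FP16 parameter takes 2 bytes, so a quick back-of-the-envelope calculation gives the weight size at various precisions:

```python
# Back-of-the-envelope weight sizes for a 175B-parameter model.
params = 175e9
for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# FP16/BF16 -> 350 GB, far beyond the VRAM of any single GPU
```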
Why We Need Locally Deployable Open-Weight LLMs
- Transparency and consistency
- Lower costs and prevention of monopolies
- Bypassing online content censorship (“data security”)
- Ability to fine-tune or hack the model
How to Run an LLM in Three Steps (like stuffing an elephant into a fridge)
- Get suitable hardware
- Download model weights
- Install a proper inference framework and run it
Qwen3-Next-80B-A3B running locally, 46 tokens/s
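As a concrete sketch of steps 2 and 3 (the repo and file names below are placeholders; substitute the GGUF model you actually want), downloading weights and running them with the llama-cpp-python bindings looks roughly like this:

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub llama-cpp-python
from llama_cpp import Llama

# Step 2: download quantized weights (placeholder repo/file names).
model_path = hf_hub_download(
    repo_id="some-org/Some-Model-GGUF",
    filename="some-model-Q4_K_M.gguf",
)

# Step 3: load the model with an inference framework and generate.
llm = Llama(model_path=model_path, n_gpu_layers=-1, n_ctx=4096)  # -1: put every layer on the GPU
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```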
Model Selection
Every model claims to be SOTA when released, so which one should you pick? I usually check the "cyber cricket-fighting" arena leaderboard (head-to-head model battles): https://lmarena.ai/leaderboard
For niche models, you can also search HuggingFace or forums like r/LocalLLaMA. There’s a new model popping up every day.
Models can be categorized by how many parameters are activated during inference:
- Dense models
- MoE (Mixture of Experts) models
They can also be classified by whether they have a “thinking” process:
- Thinking
- Instruct
- Hybrid (thinking can be toggled per request; see the sketch below)
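Hybrid models let you switch the thinking process on or off per request. As one example (check the model card of whatever you run, since the exact switch differs between models), Qwen3's chat template exposes an enable_thinking flag:

```python
from transformers import AutoTokenizer

# Hybrid example: Qwen3 lets the chat template toggle thinking per request.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "How many primes are below 20?"}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # True: model thinks in a <think> block first; False: answers directly
)
print(prompt)
```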
Hardware Selection
Here’s a rough ranking of hardware components by importance:
VRAM (Video Memory Size)
The deciding factor. Insufficient VRAM forces offloading to system RAM or even disk, which can drastically slow things down.
- NVIDIA H100: 80GB
- NVIDIA A100: 40GB / 80GB
- RTX 4090: 24GB
Memory Bandwidth
Large models need high-speed access to all of their active parameters for every generated token, so memory bandwidth largely sets decoding speed (a rough estimate follows the list below).
- H100: ~2 TB/s
- A100: ~1.5 TB/s
- RTX 4090: ~1 TB/s
- CPU DDR5: ~100 GB/s (PCIe 4.0 x16 ~32 GB/s)
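During decoding, every generated token has to stream (roughly) all active weights through memory once, so a handy ceiling estimate is tokens/s ≈ bandwidth ÷ bytes of active parameters. A rough sketch with illustrative numbers:

```python
# Rough decode-speed ceiling: each new token streams all *active* weights once,
# so tokens/s is bounded by bandwidth / active weight size.
def max_decode_tps(active_params_billions, bytes_per_param, bandwidth_gb_s):
    active_gb = active_params_billions * bytes_per_param  # GB read per generated token
    return bandwidth_gb_s / active_gb

print(max_decode_tps(32, 0.5, 1000))  # 32B dense, 4-bit, ~1 TB/s VRAM: ~62 tok/s ceiling
print(max_decode_tps(3, 0.5, 100))    # MoE with 3B active, 4-bit, ~100 GB/s DDR5: ~67 tok/s ceiling
```

This is also why an 80B-A3B MoE can still decode at a usable speed from ordinary system RAM.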
Compute Power (BF16/FP16)
- H100 (BF16 Tensor): ~200 TFLOPS
- A100 (BF16 Tensor): ~78 TFLOPS
- RTX 4090 (FP16): ~83 TFLOPS
- EPYC 9654: theoretical ~4 TFLOPS
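Compute mostly limits prompt processing (prefill). A forward pass costs roughly 2 FLOPs per active parameter per token, so a very rough prefill ceiling looks like this (real-world numbers are lower):

```python
# Very rough prefill ceiling: ~2 FLOPs per active parameter per token.
def max_prefill_tps(active_params_billions, peak_tflops):
    return peak_tflops * 1e12 / (2 * active_params_billions * 1e9)

print(max_prefill_tps(32, 83))  # 32B dense on ~83 TFLOPS FP16: ~1300 prompt tok/s ceiling
print(max_prefill_tps(32, 4))   # the same model on a ~4 TFLOPS CPU: ~60 prompt tok/s ceiling
```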
Multi-GPU Interconnect Bandwidth (if applicable)
- NVLink: 56 GB/s
- PCIe: 32 GB/s, higher latency
CPU RAM Bandwidth / Compute Power (when offloading to CPU)
DDR5 >> DDR4, multi-channel >> single-channel
SSD Read Speed
PCIe 4.0 x4: 7 GB/s
If you’re building your own rig, consider “scavenging” older cards like the 2080 Ti, MI50, or V100.
Framework Selection and Common Optimization Techniques
Quantization
Models are typically trained in 16-bit floating point (FP16/BF16), and published weights are usually released in that format. During inference, storing the weights at lower precision can significantly cut memory usage.
Rule of thumb: quantization levels of Q6 (6-bit) and above, such as Q8 or FP8, have negligible impact on model performance. Q5 causes slight degradation, Q4 is noticeably worse, and Q3 or below might as well be a different model entirely.
Unsloth publishes many quantized versions of popular models.
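To estimate whether a given quantization fits your VRAM, you can use approximate bits-per-weight figures for the common llama.cpp quant levels (the values below are rough averages, not exact):

```python
# Rough in-memory size of a model at common llama.cpp quantization levels.
# Bits-per-weight values are approximate averages.
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

params_billions = 32  # e.g. a hypothetical 32B dense model
for quant, bpw in BITS_PER_WEIGHT.items():
    print(f"{quant:7s} ~{params_billions * bpw / 8:5.1f} GB")
# Q4_K_M of a 32B model is ~19 GB, which just fits a 24 GB RTX 4090 with room left for context.
```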
CPU Offload
Move some model layers to the CPU, allowing both CPU and GPU to compute together.
Works especially well for MoE models with sparse activation.
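With llama.cpp (or its Python bindings), the simplest offload knob is how many transformer layers you put on the GPU. A minimal sketch, assuming a local GGUF file (the path is a placeholder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with GPU support)

# n_gpu_layers controls how many layers live in VRAM; the rest stay in system
# RAM and run on the CPU. Tune it until VRAM is nearly full.
llm = Llama(
    model_path="./models/some-moe-model-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=24,   # e.g. 24 layers on the GPU, the remainder on the CPU
    n_ctx=8192,
)
print(llm("Q: What is CPU offload?\nA:", max_tokens=64)["choices"][0]["text"])
```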
Recommended Framework: llama.cpp
https://github.com/ggml-org/llama.cpp
Pros: Optimized for CPU/GPU hybrid inference, runs even on mobile devices, extremely popular, and supports a very wide range of models
Cons: No tensor parallelism support, poor multi-GPU utilization
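llama.cpp also ships an OpenAI-compatible HTTP server (llama-server), so existing clients work against a local model. A quick sketch, assuming the server is already running on localhost:8080:

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at a locally running `llama-server`.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="local-model",  # the server serves whatever model it loaded, regardless of this name
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(resp.choices[0].message.content)
```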
Recommended Framework: fastllm
https://github.com/ztxz16/fastllm
Pros: Made by Chinese developers, supports diverse hardware, good CPU/GPU hybrid inference, includes unique performance optimizations
Cons: Fewer supported models, limited documentation, occasional bugs