Here is everything we’re going to run
brew install cmake
curl -LsSf https://hf.co/cli/install.sh | bash
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_METAL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
--local-dir models/Qwen3.5-35B-A3B-GGUF \
--include "*UD-Q4_K_XL*"
./llama.cpp/llama-server \
--model models/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
--alias "Qwen3.5-35B-A3B" \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--port 8001 \
--kv-unified \
--n-gpu-layers 0 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on --fit on
Step by step
brew install cmake
Install CMake.
CMake is the de-facto standard tool for building C++ projects.
Imagine you’re building a big LEGO robot (llama.cpp). cmake is the instruction sheet that explains how to put it together.
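If you want to double-check that Homebrew actually put CMake on your PATH, asking it for its version is enough; llama.cpp only needs a reasonably recent release:

# prints the installed CMake version
cmake --version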
curl -LsSf https://hf.co/cli/install.sh | bash
Install the Hugging Face CLI.
This tool lets you interact with the Hugging Face Hub directly from the terminal.
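To sanity-check the install, list the CLI's subcommands. You only need to log in if you later want gated or private models; the public GGUF repo used below downloads without a token. (Subcommand names here assume the current hf CLI layout.)

hf --help
# only needed for gated/private repos
hf auth login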
git clone https://github.com/ggml-org/llama.cpp
Clone the llama.cpp repository. All of the later commands are run from the folder that contains it, not from inside it, which is why the build and model paths below start with llama.cpp/ and models/.
LLM inference in C/C++
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_METAL=ON
-DBUILD_SHARED_LIBS=OFF controls how the code is packaged
There are two ways to build code libraries:
Shared Libraries (-DBUILD_SHARED_LIBS=ON)
Think of this like:
“The robot uses LEGO pieces stored in a shared box somewhere else.”
- Multiple programs can share the same library file.
- Smaller executable file.
- But you must have those shared files installed on the system.
Static Libraries (-DBUILD_SHARED_LIBS=OFF)
This means:
“Glue all the LEGO pieces directly into the robot.”
- Everything gets bundled inside the final binary.
- No external library files needed at runtime.
- Bigger file size.
- More portable (easy to move to another machine).
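A quick way to see the difference on macOS: otool -L prints the shared libraries a binary expects to find at runtime. A static llama.cpp build should only list system libraries and frameworks, not a separate libllama or libggml (on Linux you'd use ldd instead):

# list runtime library dependencies of the built server binary
otool -L llama.cpp/build/bin/llama-server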
-DGGML_CUDA=OFF controls whether the program uses your GPU (CUDA).
When ON:
“Use NVIDIA GPU acceleration.”
- Much faster on supported GPUs.
- Requires CUDA toolkit installed.
- Only works with NVIDIA GPUs.
When OFF:
“Use only the CPU.”
- Slower than GPU.
- But works on any machine.
- No CUDA dependencies needed.
Our build says:
- Make one big self-contained program
- Don't use CUDA (there's no NVIDIA GPU here)
- Keep Apple's Metal backend available (-DGGML_METAL=ON), even though we'll pin inference to the CPU later with --n-gpu-layers 0
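For comparison, if you were on a Linux box with an NVIDIA card and the CUDA toolkit installed, the same configure step with the flag flipped would look like this (a sketch; everything else in the tutorial stays the same):

cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON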
cmake --build llama.cpp/build --config Release -j --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split
“Okay, actually build the robot — and build these specific versions of it.”
Let’s break it down simply.
cmake --build llama.cpp/build
“Go to the build folder and build whatever was configured there.”
--config Release
There are different “modes” you can build in:
- Debug (--config Debug)
  - Slower
  - Extra debugging info
  - Easier to troubleshoot
- Release (--config Release)
  - Optimized for speed
  - No debug overhead
  - What you want for real use
We chose --config Release
“Build the fast, optimized version.”
-j
“Use multiple CPU cores at once.”
Instead of building one file at a time, it builds many in parallel.
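Bare -j lets the build use as many parallel jobs as it likes; you can also pass an explicit number if you want to leave some cores free for other work:

# same build, capped at 8 parallel jobs
cmake --build llama.cpp/build --config Release -j 8 --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split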
--clean-first
“Delete old compiled pieces before building again.”
Sometimes old build artifacts cause weird bugs. This ensures a fresh rebuild.
--target
“Only build these specific programs.”
We want:
- llama-cli: simple command-line interface for chatting with a model.
- llama-mtmd-cli: multimodal CLI (text + images, for supported models).
- llama-server: runs an OpenAI-compatible HTTP server.
- llama-gguf-split: tool to split large .gguf model files into smaller chunks.
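Once the build finishes, you can check that all four binaries actually landed in the build's bin folder (these are the files the cp line in the overview copies up into llama.cpp/):

ls -lh llama.cpp/build/bin/llama-*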
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
--local-dir models/Qwen3.5-35B-A3B-GGUF \
--include "*UD-Q4_K_XL*"
Download the model from Hugging Face. --local-dir controls where the files land on disk, and the --include pattern pulls only the UD-Q4_K_XL quantization instead of every quant in the repo.
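Before starting the server, it's worth confirming the .gguf file is sitting where the --model path in the full command at the top expects it:

ls -lh models/Qwen3.5-35B-A3B-GGUF/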
./llama.cpp/llama-server \
--model path-to-model/model.gguf \
--alias "model" \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--port 8001 \
--kv-unified \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on --fit on
Imagine the AI is a super smart parrot that tries to guess the next word in a sentence. These settings tell the parrot how to guess and how fast to think.
The “How Creative Should I Be?” Settings
--temp 0.6 Temperature = how creative the parrot feels.
- Low (like 0.2) → very serious, boring, predictable.
- Medium (like 0.7–1.0) → balanced.
- High (like 1.5+) → silly, creative, sometimes weird.
--top-p 0.95 Top-p = how many good guesses the parrot is allowed to consider.
Imagine the parrot has 100 possible next words ranked from best to worst.
0.95 means: “Only consider the top words that together make up 95% of the most likely answers.” So it ignores the really bad guesses, but still allows variety.
--min-p 0.00 Min-p = don't use super unlikely words.
If a word's chance is below this threshold (measured relative to the most likely word), the parrot throws it away.
With 0.00 the filter is effectively switched off here, so top-p and top-k do the pruning instead. It's the knob you'd raise if you wanted to say: "Don't say anything THAT random."
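For the curious, here's the standard math behind those knobs (llama.cpp's sampler chain has more stages, but this is the core idea): temperature divides the model's raw scores before they're turned into probabilities, and top-p then keeps only the smallest set of words whose probabilities add up to at least 0.95.

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad \text{top-}p:\ \text{keep the smallest set } S \text{ with } \sum_{i \in S} p_i \ge 0.95$$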
The “How Fast and Efficient?” Settings
Now we tell the parrot how to use its memory and brain efficiently.
--kv-unified
The AI remembers previous words using little memory boxes called K (key) and V (value).
Normally they’re separate.
--kv-unified says: “Put them together in one organized box.”
--cache-type-k q8_0
--cache-type-v q8_0
“Compress memory a bit to save space, but don’t make it too fuzzy.”
- More compression → uses less memory (RAM, or VRAM if layers are on the GPU)
- Less compression → better quality but heavier
--flash-attn on Flash Attention = turbo mode
It’s a smarter math trick that makes long-context thinking much faster.
--fit on “Try your best to squeeze everything into my memory.”
We told the AI:
- Be moderately creative
- Don’t say super weird words
- Use turbo math
- Remember up to a giant amount of text
- Compress memory smartly
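Once the server is up, you can talk to it from any other terminal, or point any OpenAI-compatible client at it. A minimal sketch, assuming --port 8001 and the alias from the full command at the top:

# send one chat message to the OpenAI-compatible endpoint llama-server exposes
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Qwen3.5-35B-A3B",
  "messages": [{"role": "user", "content": "Explain what a GGUF file is in one sentence."}]
}'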