
Build llama.cpp from source

Here is everything we’re going to run:

brew install cmake
curl -LsSf https://hf.co/cli/install.sh | bash
git clone https://github.com/ggml-org/llama.cpp

cmake -S llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_METAL=ON

cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

cp llama.cpp/build/bin/llama-* llama.cpp

hf download unsloth/Qwen3.5-35B-A3B-GGUF \
    --local-dir models/Qwen3.5-35B-A3B-GGUF \
    --include "*UD-Q4_K_XL*"
./llama.cpp/llama-server \
    --model models/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    --alias "Qwen3.5-35B-A3B" \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001 \
    --kv-unified \
    --n-gpu-layers 0 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on --fit on

Step by step

brew install cmake

install cmake

CMake is the de facto standard build tool for C/C++ projects.

Imagine you’re building a big LEGO robot (llama.cpp). cmake is the instruction sheet that explains how to put it together.
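
To check that the install worked, ask CMake for its version:

cmake --version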


curl -LsSf https://hf.co/cli/install.sh | bash

install huggingface cli

This tool allows you to interact with the Hugging Face Hub directly from a terminal
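
To confirm the CLI landed on your PATH once the install script finishes:

hf --help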


git clone https://github.com/ggml-org/llama.cpp

Clone the llama.cpp repository.

LLM inference in C/C++


cmake -S llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_METAL=ON

-S llama.cpp -B llama.cpp/build tells CMake where the source code lives and where to put the build files.

-DBUILD_SHARED_LIBS=OFF controls how the code is packaged

There are two ways to build code libraries:

Shared Libraries (-DBUILD_SHARED_LIBS=ON)

Think of this like:

“The robot uses LEGO pieces stored in a shared box somewhere else.”

Static Libraries (-DBUILD_SHARED_LIBS=OFF)

This means:

“Glue all the LEGO pieces directly into the robot.”
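
One way to see the difference on macOS (which this brew-based guide assumes): otool -L lists the shared libraries a binary expects to find at runtime, so a statically linked build shows only system libraries and frameworks, not a separate libllama.dylib.

# run this after the build step below
otool -L llama.cpp/build/bin/llama-server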


-DGGML_CUDA=OFF controls whether the program uses your GPU (CUDA).

When ON:

“Use NVIDIA GPU acceleration.”

When OFF:

“Use only the CPU.”

Our build says OFF: there’s no NVIDIA GPU here. Since we’re on a Mac (hence the brew install), we pass -DGGML_METAL=ON instead, which enables Apple’s Metal GPU backend.


cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

“Okay, actually build the robot — and build these specific versions of it.”

Let’s break it down simply.

cmake --build llama.cpp/build

“Go to the build folder and build whatever was configured there.”

--config Release

There are different “modes” you can build in: Debug (slow, with extra checks and debugging info) and Release (fast, optimized).

We chose --config Release

“Build the fast, optimized version.”

-j

“Use multiple CPU cores at once.”

Instead of building one file at a time, it builds many in parallel.
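
You can also pass an explicit number of jobs; on macOS, sysctl reports how many CPU cores you have:

# see how many cores are available
sysctl -n hw.ncpu

# or cap the parallelism explicitly, e.g. at 8 jobs
cmake --build llama.cpp/build --config Release -j 8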

--clean-first

“Delete old compiled pieces before building again.”

Sometimes old build artifacts cause weird bugs. This ensures a fresh rebuild.

--target

“Only build these specific programs.”

We want:

llama-cli: the basic command-line chat and inference tool
llama-mtmd-cli: the multimodal (text + image) version of the CLI
llama-server: an HTTP server with an OpenAI-compatible API
llama-gguf-split: a utility for splitting and merging large GGUF files
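
Once the build finishes, the binaries land in llama.cpp/build/bin (the cp command in the summary just copies them into the repo root for convenience). A quick sanity check:

ls llama.cpp/build/bin/llama-*
./llama.cpp/build/bin/llama-cli --version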


hf download unsloth/Qwen3.5-35B-A3B-GGUF \
    --local-dir models/Qwen3.5-35B-A3B-GGUF \
    --include "*UD-Q4_K_XL*"

Download the model from Hugging Face. The --include "*UD-Q4_K_XL*" filter downloads only the files for that particular quantization rather than the entire repo.
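
You can check what actually landed on disk:

ls -lh models/Qwen3.5-35B-A3B-GGUF/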


./llama.cpp/llama-server \
    --model path-to-model/model.gguf \
    --alias "model" \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001 \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on --fit on

Imagine the AI is a super smart parrot that tries to guess the next word in a sentence. These settings tell the parrot how to guess and how fast to think.

The “How Creative Should I Be?” Settings

--temp 0.6 Temperature = how creative the parrot feels. Lower values make it stick to safe, predictable words; higher values make it take risks. 0.6 keeps it focused but not robotic.

--top-p 0.95 Top-p = how many good guesses the parrot is allowed to consider.

Imagine the parrot has 100 possible next words ranked from best to worst.

0.95 means: “Only consider the top words that together make up 95% of the most likely answers.” So it ignores the really bad guesses, but still allows variety.

--min-p 0.00 Min-p = don’t use super unlikely words.

If a word’s probability is too low compared to the best guess, the parrot throws it away. It’s like saying: “Don’t say anything THAT random.” Here it’s set to 0.00, which effectively disables this filter and leaves the pruning to top-p and top-k.

The “How Fast and Efficient?” Settings

Now we tell the parrot how to use its memory and brain efficiently.

--kv-unified

The AI remembers previous words using little memory boxes called K (key) and V (value). When the server handles several conversations in parallel, each one normally gets its own separate set of boxes.

--kv-unified says: “Put them all in one shared, organized box,” letting the server divide that memory flexibly between requests.

--cache-type-k q8_0 --cache-type-v q8_0

“Compress memory a bit to save space, but don’t make it too fuzzy.” q8_0 stores those K and V boxes as 8-bit quantized values instead of 16-bit floats, roughly halving the KV cache’s memory use.

--flash-attn on Flash Attention = turbo mode

It’s a smarter math trick that makes long-context thinking much faster.

--fit on “Try your best to squeeze everything into my memory.”

We told the AI: be moderately creative, skip the garbage guesses, and pack your memory as tightly as it will go.
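
Once the server is up, it speaks an OpenAI-compatible HTTP API, so you can test it from another terminal with plain curl. A minimal sketch: the model field matches the --alias we set, and sampling options like temperature can also be overridden per request.

curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen3.5-35B-A3B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
    }'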

