The table shows TPS (tokens per second) / TTFT (time to first token) for different context sizes. TTFT excludes the initial model load, and TPS is the generation rate after the first token is produced. A rough measurement sketch follows these table notes.
Tests were run with llama-cpp-python on Ubuntu Server 24.04 using instruction-tuned GGUF models downloaded from Hugging Face (mostly from Bartowski). The figures are approximations extrapolated from runs at context sizes of varying length.
"Token factor" is the difference in context size for the same input. For a context of 2k Phi's tokenizer parsed 1.24x as many values, effectively making it slower.
"Subjective quality" is my own assessment of model quality and does not take performance into consideration. A total of 18 prompts were used, and most were for creative outputs like stories and poetry. See details about this in "Observations about model output" below.
Notes
By default, llama-cpp-python used only 2 CPU cores; with n_threads set to 4 it used all cores, which improved performance by about 30%. Default values were used for the other model parameters.
RAM usage stayed at or below roughly 2GB for every model, even with large contexts, which was surprising.
ARM processor optimizations significantly improved performance: TTFT was about 3x faster and TPS about 1.3x faster.
TTFT scaled linearly and predictably. TPS degraded slightly as context grew, but for context sizes under 2k the difference was small.
Different models sometimes showed very different context sizes in tokens for the same inputs (Gemma's was around double Llama's for contexts under 300, which is why the table shows N/A at a context size of 50).
On my M1 MacBook Pro, generation was about 10x faster and TTFT more than 20x faster for the non-"ARM optimized" models. (The ARM-optimized models threw a segfault on my MacBook for some reason.)
Llama 1B was the best fit for my project, where model output needs to be fast enough to use with PiperTTS (which itself uses a fair amount of processing power); a rough integration sketch follows these notes.
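As a hedged illustration of that pipeline, the sketch below generates a reply and pipes it to Piper's command-line interface. The voice file, GGUF filename, and prompt are placeholders, and the flags follow Piper's documented stdin-to-WAV usage; adjust for your installation.

```python
# Sketch: feed llama-cpp-python output to Piper's CLI for speech.
# Model path, voice file, and prompt are placeholders.
import subprocess
from llama_cpp import Llama

llm = Llama(model_path="Llama-3.2-1B-Instruct-Q4_K_M.gguf", n_threads=4, verbose=False)

result = llm.create_completion(
    "You are a French chef. Greet your dinner guests in two sentences.",
    max_tokens=96,
)
text = result["choices"][0]["text"].strip()

# Piper reads text from stdin and writes a WAV file.
subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "reply.wav"],
    input=text.encode("utf-8"),
    check=True,
)
```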
Observations about model output
Llama was a notch above the rest. In particular, it was the best at adopting personas and characters. For prompts like "you are a brash businessman" or "you are a French chef" it did a fantastic job at playing the part. Phi was also decent at this. Other models usually had a more conventional tone, or didn't attempt to adopt the character. (Nemotron in particular was incapable of this and always responded like a "helpful assistant".)
I generally preferred output from 4-bit quantized models over 8-bit. The only explanation I can think of is that lower-bit quantization introduces more randomness, and this may have been better for creativity. Phi was the only model where I preferred the 8-bit quantized output.
Two prompts were large portions of the Frost-Nixon interview, with the model playing Nixon. None of the models continued in a way that resembled the transcript (either literally or conceptually), which was somewhat surprising.
One prompt was mostly gibberish but contained rhymes. Smaller models did better with this prompt, continuing in a similar way and often with the rhyme scheme. Larger models tried to explain the input (which was intentionally nonsensical), or sort of half went along.
One prompt asked for a "long elaborate joke" and both Phi and Llama made jokes about Schrödinger's cat.
For some reason the larger Qwen and OpenELM models were worse than the smaller ones. OpenELM 3B would very frequently just repeat tokens. Qwen 3B occasionally repeated output too, but overall it was inconsistent: sometimes it gave very good responses.