The First Run

Bottom-Up, Top-Down

Now that our setup was complete and we could both SSH into the machine, it was time to explore. We’d already spent a good amount of time studying the fundamentals from the bottom up — following series like 3Blue1Brown’s Neural Network videos and Andrej Karpathy’s Zero to Hero playlist — but now it was finally time to see how things actually work in practice: how people run these models day to day, and how to hook into the existing ecosystem.

At the core of all this are the weights and the runners. The weights are the trained parameters — the compressed representation of everything the model has learned. The runner is the engine that handles inference — the actual process of turning a prompt and those weights into output. It turned out that the two main runners worth caring about are Llama.cpp and vLLM. Both serve the same purpose — running large language models locally — but take very different approaches.

Llama.cpp is great when you’re working with limited hardware, like a laptop or mid-range GPU. It leans on quantization and a variety of other low-level tricks to squeeze performance out of constrained setups. The trade-off is that heavily compressed models tend to produce shorter, less detailed outputs, and sometimes lose a bit of coherence on longer reasoning tasks.

vLLM, on the other hand, really shines when you have heavier hardware. Most importantly for us, it supports Mixture-of-Experts (MoE) models — architectures where only a subset of specialized “expert” networks activate for each query instead of the entire model firing at once. This selective activation drastically reduces computation and memory load, making it possible to run massive models like GPT-OSS-120B even on a single GPU — which is exactly why we chose vLLM as our main runner.

Even though the RTX 5090 isn’t a data-center-grade GPU, it’s powerful enough to make vLLM viable — a middle ground between research-scale hardware and real-world accessibility.
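
To make this concrete, here is a minimal sketch of running a single prompt through vLLM’s offline Python API. The model ID and memory setting are assumptions for illustration; the values you actually need depend on your vLLM version and how much of the model fits in VRAM.

    # Minimal vLLM sketch (illustrative): load the model and generate one completion.
    # "openai/gpt-oss-120b" is the Hugging Face model ID we assume here; tune
    # gpu_memory_utilization (and any offloading options) to your own hardware.
    from vllm import LLM, SamplingParams

    llm = LLM(model="openai/gpt-oss-120b", gpu_memory_utilization=0.90)

    params = SamplingParams(temperature=0.7, max_tokens=512)
    outputs = llm.generate(["Explain why the sky is blue."], params)

    for out in outputs:
        print(out.outputs[0].text)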

Lots of Options

Our main goal was to find a model that actually made use of our hardware, produced solid answers across different use cases, and didn’t choke on longer prompts. It didn’t have to be lightning fast, just usable — something that could serve as a reliable foundation for both experiments and creative work.

We also wanted a model that wasn’t overly constrained — one that could answer directly, without excessive hedging, moralizing, or built-in ego-stroking. Something that could be guided toward genuine objectivity through fine-tuning if needed, rather than constantly fighting overly cautious built-in guardrails.

Now, let’s take a look at the models we’ve checked out:

  • Mixtral-8x22B: When it came out, Mixtral made a big splash — an open Mixture-of-Experts model that brought near-frontier performance into the public domain. It was a natural first stop for us, but after a few rounds of inference a noticeable number of the answers felt off — inconsistent or slightly detached — so we decided to ditch it for the time being.

  • Hermes models: These models are built on a general-purpose base, but fine-tuned to add more objectivity and friction — aiming for responses that feel more grounded and less performative. They piqued our interest because their main appeal lies in being a bit more direct and blunt than most.

  • Qwen2.5-72B: A tricky one, coming from China. It has some alternative views on history, but it’s very capable. Even though it has biases, they’re a different set of biases than those of its Western counterparts. So, handled with care, it can surface insights the others simply won’t show you.

  • MythoMax-13B: This one is optimized for NSFW content. We’ve tried it out — it does what it advertises — but since our aim for the time being isn’t writing erotica for shady subreddits, we didn’t investigate it further. Still, if someone really wants no filters, this is the way to go.

  • GPT-OSS-120B: A Mixture-of-Experts model that makes full use of our resources. It performs well across different tasks and feels like the right fit for what we’re trying to do — so this is the one we’re rolling with for now.

Benchmarks

Now that we’ve settled on GPT-OSS-120B, it’s time to see how well our setup actually handles real-world inference.

After a certain point, balancing all the factors that go into choosing a model becomes more of an art than a science. You’re always juggling three dials:

  1. Model intelligence (size / precision) — how “smart” and nuanced the answers are.

  2. Speed — how quickly you get your output.

  3. Context length — how much conversation history fits before the model starts forgetting or truncating.

And your VRAM budget and FLOP capacity define the constraints these dials live in. (Reality, of course, is a bit more complex — these factors overlap in subtle ways, though they can often be tuned somewhat independently.)
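
To make the trade-off a bit more tangible, here is a back-of-envelope estimate in Python. All the numbers are illustrative assumptions rather than measured values for GPT-OSS-120B; the point is only that weight precision and context length draw from the same memory pool, and whatever doesn’t fit in VRAM has to be offloaded.

    # Back-of-envelope memory estimate (illustrative numbers, not measurements).
    def estimate_memory_gb(n_params_b, bits_per_weight, n_layers, n_kv_heads,
                           head_dim, context_len, kv_bytes=2):
        # Weight memory: parameter count times bits per weight.
        weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
        # KV cache: 2 (key + value) * layers * KV heads * head dim * context * bytes.
        kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9
        return weights_gb, kv_gb

    # Example: a 4-bit, 120B-parameter model with a 16k-token context window.
    weights_gb, kv_gb = estimate_memory_gb(
        n_params_b=120, bits_per_weight=4, n_layers=36,
        n_kv_heads=8, head_dim=64, context_len=16_384)
    print(f"weights ~ {weights_gb:.0f} GB, KV cache ~ {kv_gb:.1f} GB")
    # Anything beyond the GPU's VRAM (32 GB on an RTX 5090) spills into system RAM.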

We went with a middle ground: high response quality and medium speed, at the cost of very large context windows. That means you can ask longer, detailed questions, but you should aim for only a few conversational turns before things start to fade.

Designing a truly science-grade benchmark would take weeks, but for our purposes, we just rolled with a few parameter variations and the built-in vLLM benchmarking utility.
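
We won’t reproduce the benchmark scripts here, but the core measurement is simple enough to sketch: time a batch of generations and divide the number of output tokens by the elapsed time. The snippet below is a simplified stand-in for the real utility, with the model ID and batch size as assumptions.

    # Simplified throughput check (a stand-in, not the built-in benchmark utility).
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="openai/gpt-oss-120b")      # model ID assumed, as before
    params = SamplingParams(temperature=0.7, max_tokens=256)
    prompts = ["Explain why the sky is blue."] * 32   # one batch of identical prompts

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{n_tokens / elapsed:.1f} output tokens/s for a batch of {len(prompts)}")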

First, we looked at how batch size affects throughput. This shows how much work the model can handle at once — larger batches improve parallelization and overall efficiency up to a point, but gains quickly plateau. In our case, performance stabilized around a batch size of 2048, with only marginal improvements beyond that:

Next, we tested how performance scales with the number of CPU threads — with no GPU involvement at all. Throughput increased steadily up to around 16 threads, where each physical core was fully utilized. Beyond that point, performance flattened and then dipped slightly as simultaneous multithreading introduced additional communication and scheduling overhead. Prompt processing barely changed across thread counts — it quickly hit a RAM bandwidth limit, where adding more threads no longer helped because the CPU was already pulling data from memory as fast as possible:

Finally, we looked at how performance changes as more layers are offloaded to the GPU. Throughput improved consistently across the board — both prompt processing and token generation scaled almost linearly with the number of GPU layers. This isn’t surprising, since each additional layer moved more computation onto the much faster GPU. We could offload up to 18 layers before hitting VRAM limits — the maximum our setup could handle:

Qualitative Measures

Benchmarking the semantic quality of responses is a bit trickier, but the answers we got were surprisingly good — and absolutely usable. This shows that you can have a fully self-owned, end-to-end setup with roughly the same level of investment we made. While there are standardized tests for measuring response quality (such as MMLU or MT-Bench), we took a shortcut and used three different types of our own prompts to see how our system responds.

Question 1

Explain why the sky is blue.

A short and deterministic question — essentially a factual recall task with minimal reasoning. It’s a good test of consistency, since the model should converge on roughly the same phrasing and structure every time if it’s stable.

Check out the actual prompt and five sample answers here. The temperature — which controls how random or creative the model’s output is — was set to 0.7, a moderate value that keeps responses focused while still allowing some variation. Note that GPT-OSS also provides a brief summary of its internal reasoning before the final answer. An answer like this is generally computed in a few seconds.
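
For reference, this is roughly how such a prompt is sent to the running server. The sketch assumes vLLM’s OpenAI-compatible endpoint on its default local port; the port, model ID, and dummy API key are assumptions to adjust for your own setup.

    # Query the local OpenAI-compatible endpoint with temperature 0.7.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Explain why the sky is blue."}],
        temperature=0.7,          # moderate randomness, as described above
        max_tokens=512,
    )
    print(response.choices[0].message.content)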

Question 2

A train leaves Alpha City at 3:00 PM traveling toward Beta City at 90 km/h. Another leaves Beta City at 4:00 PM traveling toward Alpha City at 120 km/h. The distance between them is 450 km, and the route is straight with no stops. At what time do they meet, and how far are they from each city when that happens? Explain your reasoning clearly and simply. Do not use equations, explain everything in plain English. Do not use any Markdown, bullet points, asterisks, LaTeX, or special formatting.

This one introduces multi-step reasoning and light arithmetic. It requires the model to correctly interpret relative motion, set up a simple equation, and perform basic calculations while maintaining internal logic. It’s a good measure of reasoning depth, coherence, and precision — especially since a single math slip or misinterpretation of direction can break the whole chain.

This question isn’t rocket science, but it’s not entirely trivial either. Still, the model returned correct results with a clear reasoning chain in all five runs. Note that one earlier run wasn’t included here, as it ran out of context after producing excessively verbose reasoning.
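
For the record, the arithmetic the model has to reproduce is small enough to check by hand; the sketch below just mirrors that reasoning in plain Python.

    # Sanity check for the train question: head start, closing speed, meeting point.
    head_start_km = 90 * 1.0              # the first train travels alone from 3:00 to 4:00 PM
    remaining_km = 450 - head_start_km    # 360 km left when the second train departs
    closing_speed = 90 + 120              # km/h once both trains are moving
    hours_after_4pm = remaining_km / closing_speed    # 360 / 210, about 1.71 h

    dist_from_alpha = head_start_km + 90 * hours_after_4pm    # about 244.3 km
    dist_from_beta = 120 * hours_after_4pm                    # about 205.7 km

    minutes = hours_after_4pm * 60
    print(f"They meet about {minutes:.0f} minutes after 4:00 PM (roughly 5:43 PM),")
    print(f"about {dist_from_alpha:.1f} km from Alpha City and {dist_from_beta:.1f} km from Beta City.")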

Question 3

Write a short essay arguing for both sides of the statement: “Technology makes us more human.” Do not use formatting or bullet points.

This is a free-form, creative task that tests the model’s stylistic range and argument-building. It’s subjective, so we don’t expect identical responses, but we can measure things like response length, structural consistency, and how well the reasoning chain holds together without looping or collapsing.

The answers were pretty creative and cohesive. We did have to restart one run after it exceeded the context window.

Summary

All in all, the results were surprisingly good. Our self-hosted solution can handle various types of prompts and consistently produces usable answers. The main limitation remains the context size — if you want to maintain high result quality in long-running conversations, you’ll still need to rely on a proprietary service. Still, there’s something extremely neat about owning the entire stack end-to-end and knowing you have full control over privacy and hyperparameters.

Of course, prompting directly from the terminal isn’t ideal, and it would be useful to have features like web browsing and other built-in tools. So next time, we’ll move on to setting up a user interface that combines these capabilities.