Our End-to-End LLM Setup


Let me walk through the evolution of our setup.

First off, people often just run everything directly from the command line in a terminal. We found that using Docker containers instead has several advantages: each tool gets a convenient, self-documenting way to build and run it, the network between containers is defined explicitly, and the processes around them are easier to automate.
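To make this concrete, here is a minimal sketch of what such a compose file can look like. The image tags, model path, and port mappings are assumptions for illustration, not our exact configuration; the llama.cpp server image takes the server's flags as its command, and Open WebUI can be pointed at any OpenAI-compatible endpoint via `OPENAI_API_BASE_URL`.

```yaml
# docker-compose.yml (sketch): one shared network, two services.
services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda   # assumed tag
    command: >-
      -m /models/gpt-oss-120b.gguf --host 0.0.0.0 --port 8080
    volumes:
      - ./models:/models
    networks: [llm]

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # Point the UI at the llama.cpp server's OpenAI-compatible API.
      - OPENAI_API_BASE_URL=http://llama:8080/v1
    ports:
      - "3000:8080"   # UI reachable on the host at port 3000
    networks: [llm]

networks:
  llm: {}
```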

We used llama.cpp for hosting the model, which was initially gpt-oss-120B. We tested lots of different models, and many fell short of gpt-oss-120B on subjective quality, even models released significantly later.

For the UI, we used the web-based Open WebUI, which is very flexible but not especially intuitive. It is the right tool if you want to play around with all sorts of bells and whistles around LLMs, but the learning curve is non-trivial. If you want an easy plug-and-play experience, LM Studio provides it: it is remarkably simple to use, and you can run it even on a Windows machine or a MacBook Pro.

By the way, a MacBook Pro with a larger amount of memory is surprisingly good at running LLMs. It is, of course, not as powerful as a strong GPU, but the large pool of unified memory is very useful for running relatively large models.

Meanwhile, using Open WebUI together with Tailscale as a VPN, it was relatively straightforward to make the model running on the big desktop at home accessible from our laptops and phones, which is quite cool.
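The remote-access part boils down to a couple of commands; this is a sketch assuming Tailscale is already installed on the desktop, and the tailnet IP and port are placeholders for whatever your setup reports.

```shell
# On the desktop that hosts the model: join the tailnet.
sudo tailscale up

# Look up the machine's tailnet IPv4 address.
tailscale ip -4

# From any laptop or phone on the same tailnet, the Open WebUI
# instance is then reachable at that address, e.g.:
#   http://100.x.y.z:3000
```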

A basic limitation you quickly run into when hosting your own LLM is that the training cutoff date cuts a lot deeper than usual, since the model has no access to the Internet. For example, it refused to believe me that it was not GPT-4 hosted by OpenAI, but gpt-oss-120B hosted on my own machine.

So we wanted to give it web access. For that we set up a container with SearXNG, which provides Internet search capability in a relatively straightforward, convenient manner. The key to the setup is to go to Workspace -> Tools in Open WebUI and add a new tool that exposes web search and web fetch functionality. There are even multiple ready-made solutions you can download.
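The SearXNG container itself is one more service in the same compose file; this is a sketch with an assumed port mapping. One tweak worth knowing about: to let tools query SearXNG programmatically, its `settings.yml` needs `json` added to the enabled result formats, which is why the settings directory is mounted.

```yaml
  # Additional service for the same compose file (sketch).
  searxng:
    image: searxng/searxng:latest
    volumes:
      # Mount settings so `json` can be enabled under search formats.
      - ./searxng:/etc/searxng
    ports:
      - "8888:8080"   # assumed host port
    networks: [llm]
```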

We took such a solution and modified it slightly. The key point is that some of these tools perform only a single web search, add the results to the context, and then call the LLM. This is suboptimal. It is significantly more powerful to enable native function calling, which gives the LLM the freedom to decide when, and how many times, to call those tools. To enable a tool with native function calling, go to Admin Panel -> Settings -> Models -> <specific model> and do the following: 1) enable the tool itself by ticking the checkbox beside its name under Tools; and 2) under Advanced Params, set Function Calling to Native.
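For reference, an Open WebUI tool is a Python file containing a `Tools` class whose methods (with type hints and docstrings) are exposed to the model. Here is a minimal sketch of a web-search tool along those lines; the SearXNG hostname and port, and the helper name `format_results`, are my own assumptions, and the instance must have JSON output enabled.

```python
"""
title: Web Search (SearXNG)
description: Minimal web search tool backed by a local SearXNG container.
"""
import json
import urllib.parse
import urllib.request

# Assumed container hostname/port on the shared Docker network.
SEARXNG_URL = "http://searxng:8080/search"


def format_results(results, limit=5):
    """Render SearXNG JSON results as compact text for the model context."""
    lines = []
    for r in results[:limit]:
        lines.append(f"{r.get('title', '')} ({r.get('url', '')}): {r.get('content', '')}")
    return "\n".join(lines)


class Tools:
    def web_search(self, query: str) -> str:
        """
        Search the web via the local SearXNG instance.
        :param query: the search query string
        """
        params = urllib.parse.urlencode({"q": query, "format": "json"})
        with urllib.request.urlopen(f"{SEARXNG_URL}?{params}") as resp:
            data = json.load(resp)
        return format_results(data.get("results", []))
```

With native function calling enabled as described above, the model decides on its own when to invoke `web_search` and can chain several calls before answering.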

Once this is done, you can test the web search capability with a hard analysis problem that definitely requires multiple web accesses for a good result. My go-to question for this kind of test is whether the frequency of "quick bathroom breaks" has increased in Lex Fridman's podcasts recently. Some models get into infinite loops researching this, making dozens of tool calls and never arriving at a result. gpt-oss-120B does a relatively good job if we set a good system prompt in Open WebUI.

My current personal favorite is the Qwen3.5-27B model, specifically an Opus-distilled variant of it, which adds some reasoning goodness distilled from Claude Opus answers. Unlike gpt-oss-120B, it fits entirely on the GPU, and it produces a convincing analysis of the situation very quickly, even without a system prompt.

Qwen3.5 is a great model, and several of its variants are worth attention. The 35B MoE (Mixture of Experts) version also fits entirely on our GPU and delivers lightning-fast results. The 27B dense model is slightly slower, but with the Opus-distilled variant we got some very nice results in Claude Code, so it is our current main workhorse model, and so far it is really impressive.

Next time we will take a look at image and video generation.