Update on on-premise LLM service

In a previous news article we announced the availability of a local LLM with no connection to big tech. We haven’t really advertised it further, but we can say it has proven popular.

Part of the reason to launch was to discover bottlenecks. Well, we’ve discovered bottlenecks.

Currently there are ~200 registrations and ~10 people online concurrently. Heavy users may have noticed that the service can be slow at times.

Plans

In order to make more efficient use of our server, we will:

Move to vLLM, which should be 10 to 20 times faster than the sequential handling that Ollama does, although the setup seems slightly more complicated.
Keep Open WebUI as that seems quite capable.
Virtually split each of our GPUs in two, giving us 4 virtual GPUs in total, which should hopefully increase throughput.

Other stuff we found:

The open source models are already quite powerful.
Things are hard to monitor, e.g. something as simple as exporting a queue length is not implemented.
GPU usage isn’t monitored by the tooling we currently have.
A lot of this “AI” tooling is “vibe coded” and full of interesting bugs. As an example, Open WebUI claims to support OpenID Connect, but it’s not compliant with the standard.
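
To give an idea of the kind of monitoring that is missing, here is a minimal sketch of a GPU and queue-length exporter. It is an illustration under assumptions, not what we run: the nvidia-smi query flags are real, but the metric names (llm_queue_length and friends) and the idea of passing in a queue length from elsewhere are made up for this example. The output follows the Prometheus text format.

```python
import subprocess

# Real nvidia-smi flags; everything named "llm_*" below is hypothetical.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def _to_float(field):
    """Return a float, or None when nvidia-smi reports e.g. '[N/A]'."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse the CSV rows printed by NVIDIA_SMI_CMD into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "utilization_percent": _to_float(util),
            "memory_used_mib": _to_float(mem_used),
            "memory_total_mib": _to_float(mem_total),
        })
    return gpus


def to_prometheus(gpus, queue_length):
    """Render the readings as Prometheus text-format gauges."""
    lines = [
        "# TYPE llm_queue_length gauge",
        f"llm_queue_length {queue_length}",
    ]
    for key in ("utilization_percent", "memory_used_mib", "memory_total_mib"):
        name = f"llm_gpu_{key}"
        lines.append(f"# TYPE {name} gauge")
        for gpu in gpus:
            value = gpu[key]
            if value is None:
                continue  # skip fields the driver could not report
            lines.append(f'{name}{{gpu="{gpu["index"]}"}} {value}')
    return "\n".join(lines) + "\n"


def read_gpus():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver)."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    # The queue length is a placeholder; a real exporter would have to
    # obtain it from the inference server.
    print(to_prometheus(read_gpus(), queue_length=0), end="")
```

Something like this could be scraped periodically, so that slow periods can be matched against GPU load and the number of waiting requests.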

Next Steps

At some point (to be announced in advance with a CPK) we will bring down the current setup and deploy the vLLM variant.

With luck we can keep using Open WebUI’s database, meaning everything should keep working as it did. However, we cannot guarantee that. If things fail, we’ll start with a clean slate and delete all data on the machine (chat history, registrations and API keys).
