u/Aviation2025

The mess of using a local LLM on an Android app (Kotlin)

Taking the trip down to productivity apps, I started with a simple goal: make an app that uses voice-to-text (or also just text) to help me send notes.

The idea is that this can expand into multiple things, but as a demo the first milestone was to have it use a local LLM and extract the relationships between the people mentioned in my notes, e.g. "my grandfather's father's name was Bob".

The road is full of holes...

AICore

My device is a Pixel 8, which is the minimum device that has AICore enabled, so we can leverage Gemini Nano via ML Kit.

The coding of it was not that complex: you take advantage of `com.google.mlkit:genai-prompt`, and it communicates with the system's AICore service, where this capability is labeled as feature 636.

Unfortunately, regardless of how simple it seems, the feature is still heavily gated. The user of the application needs to enable AICore via their system preferences. That alone is not a big hurdle, and quite understandable after all these years of working with experimental features, but there is more: it also requires Google Group membership and specific Play Store AICore versions, which is in no way acceptable to expect every single user to set up.

The error message is good enough, however: it mentions from the start that feature 636 is not available, so it wasn't that tough to find out what was happening.
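Since the gating can change under you, I ended up hiding the Gemini Nano call behind a small wrapper and falling back to a bundled model when feature 636 is reported as unavailable. Below is a minimal sketch of that flow; all names are mine, not the genai-prompt API, and the actual availability check is whatever the ML Kit client exposes.

```kotlin
import android.util.Log

// Sketch of the gating/fallback flow. The actual ML Kit genai-prompt call sits
// behind checkGeminiNano, so the rest of the app never cares which engine answered.
sealed interface NanoStatus {
    object Available : NanoStatus
    data class Unavailable(val reason: String) : NanoStatus // e.g. "feature 636 not available"
}

// One interface for every engine: Gemini Nano via AICore, LiteRT-LM, llama.cpp.
interface NoteExtractor {
    suspend fun extract(note: String): String
}

class ExtractorSelector(
    private val checkGeminiNano: suspend () -> NanoStatus, // wraps the ML Kit availability check
    private val nanoExtractor: NoteExtractor,              // backed by com.google.mlkit:genai-prompt
    private val localExtractor: NoteExtractor,             // backed by a bundled on-device model
) {
    suspend fun pick(): NoteExtractor = when (val status = checkGeminiNano()) {
        is NanoStatus.Available -> nanoExtractor
        is NanoStatus.Unavailable -> {
            Log.w("ExtractorSelector", "Gemini Nano gated (${status.reason}), using bundled model")
            localExtractor
        }
    }
}
```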

LiteRT-LM

The next approach was to use the LiteRT-LM runtime (litertlm-android:0.11.0) and run inference directly, bypassing AICore. This of course required downloading the model and storing it on the device. I downloaded the model from a CDN as a .litertlm file (Gemma 4 E2B, 2.59 GB), but others would work as well, as long as they are .litertlm.

CPU

It is fairly simple to run the LLM on the phone's CPU. LiteRT is built towards the GPU, but that proved not to be possible at the moment (more below).

Therefore, using Backend.CPU() on the Pixel 8, I tested 2 models:

| Model | Size | tok/s |
| --- | --- | --- |
| Gemma 4 E2B (gemma-4-E2B-it.litertlm) | 2.59 GB | 4–5 |
| Gemma 3 1B int4 (gemma3-1b-it-int4.litertlm) | 584 MB | 3 |
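For completeness, this is roughly how the LiteRT-LM path plugs into the same NoteExtractor interface from the sketch above. The engine call itself is hidden behind a lambda because the Kotlin surface of litertlm-android is still moving; only Backend.CPU() and the .litertlm files are from my actual setup, everything else is illustrative.

```kotlin
import java.io.File

// Illustrative wrapper around litertlm-android 0.11.0. runPrompt hides the real
// engine/session calls (set up with Backend.CPU() and a .litertlm model file);
// the names and the prompt below are mine, not the library's API.
class LiteRtExtractor(
    private val modelFile: File,                        // e.g. filesDir/gemma3-1b-it-int4.litertlm
    private val runPrompt: suspend (String) -> String,  // wraps the LiteRT-LM generate call
) : NoteExtractor {

    init {
        require(modelFile.exists()) { "Model not downloaded yet: $modelFile" }
    }

    override suspend fun extract(note: String): String {
        // Keep prompts and outputs short: the CPU backend only does ~3-5 tok/s here.
        val prompt = "List the people in this note and how they are related.\nNote: $note"
        return runPrompt(prompt)
    }
}
```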

GPU

Unfortunately I could not get Backend.GPU() to work. This is related to driver availability on the Tensor G3 chip.

Failure chain:

  1. Runtime tries to load libLiteRtGpuAccelerator.so (Vulkan-based) → not found in any public AAR. It does not exist in the litertlm-android, litert, or litert-gpu artifacts.
  2. Falls back to libLiteRtClGlAccelerator.so (OpenCL/GL).
  3. OpenCL not supported on Tensor G3 → falls back to OpenGL.
  4. OpenGL fails: CreateSharedMemoryManager is not implemented — the EGL context is missing on the init thread.
  5. CPU fallback triggered silently.

libLiteRtGpuAccelerator.so (the Vulkan path) exists only in Google's internal builds. It is not shipped in any Maven artifact as of May 2026.

Llama.cpp

The approach here: integrate llama.cpp as a git submodule alongside whisper.cpp, compile both into the same sanctuary-jni.so, and use a GGUF-format model (gemma-3-1b-it-q4_0.gguf, 1 GB) from Google's official QAT release.

Here again I initially got low tokens per second, but by switching it to use all 8 cores I reached 6.
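For the curious, the Kotlin side of the JNI bridge is tiny. Below is a trimmed sketch: the external functions map to small C++ wrappers I keep around llama.cpp, so the names are mine and not llama.cpp's C API, and the thread-count knob is the setting that got me to 6 tok/s.

```kotlin
import java.io.File

// Thin Kotlin side of sanctuary-jni.so. Each external fun maps to a small C++
// wrapper around llama.cpp; the names are mine, not llama.cpp's C API.
object LlamaBridge {
    init { System.loadLibrary("sanctuary-jni") }

    // Load the GGUF model; returns an opaque native handle (0 on failure).
    external fun loadModel(modelPath: String, nThreads: Int): Long

    // Run the prompt to completion and return the generated text.
    external fun generate(handle: Long, prompt: String, maxTokens: Int): String

    external fun free(handle: Long)
}

// Usage (simplified; the real app caches the handle instead of reloading per call).
// nThreads = 8 uses all Tensor G3 cores.
fun extractRelations(modelDir: File, note: String): String {
    val handle = LlamaBridge.loadModel(File(modelDir, "gemma-3-1b-it-q4_0.gguf").absolutePath, nThreads = 8)
    return try {
        LlamaBridge.generate(handle, "List the people in this note and how they are related: $note", maxTokens = 32)
    } finally {
        LlamaBridge.free(handle)
    }
}
```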

As another approach I tried to use Vulkan drivers to enable the GPU, but the performance was the worst of all, at 1 token per second.

Comparison with LiteRT-LM CPU: Identical — both top out at 5 tok/s on Tensor G3 for a 1B-parameter model. The theoretical advantage of llama.cpp's hand-tuned GGML ARM NEON kernels did not materialise with the q4_0 quantization format on this chip.

Verdict: No performance advantage over LiteRT-LM. The ceiling for 1B models on Tensor G3 CPU is ~5 tok/s regardless of inference engine. For entity extraction (~18 tokens output), this is ~3.5 seconds.

Summary

I am sure newer phones with dedicated cores etc. will perform much better, so I am not too worried about this; however, I was quite annoyed by how gated the whole technology still is on mobile phones.

I am not sure if I missed something, but LiteRT is probably the most reasonable approach at the moment.

u/Aviation2025 — 5 days ago

So I have been using different productivity apps and none of them click for me, partly because I am not sure about the speech features.
Some force it, others make it a side thing - what is your take on this?

Do you prefer to speak to your phone to do things or to type/use actions?

u/Aviation2025 — 10 days ago

Hello all,

This is not the first analytics platform ever made, but I wanted to share my experience and explain how I architected it, as I personally would have loved to read this before I started.

- There is one ingestion endpoint on a dedicated service; this way we always get reliable performance, and processing is deferred to another service, which honestly can have far worse reliability, even a laptop at home. As long as the event reaches our ingestion endpoint, the rest can be replayed or delayed by a few minutes during peak traffic (rough sketch after this list).
- The ingestion endpoint will probably be moved to something on-edge in the future to reduce latency worldwide.

- After the event is in a queue, it gets picked up by our Medallion pipeline.
- A Medallion pipeline splits the processing into Bronze, Silver and Gold. Bronze is more or less raw data, Silver is where we have done the majority of the transformations, and Gold is the final layer that gets presented.
- Querying Gold for things like P metrics (percentiles) or geo info is not ideal, so we offload this to HLL/sketch continuous aggregate tables powered by TimescaleDB.
This gives us a few features:

  1. Reduced in-code query complexity, as the CA query already merges the CA table with the "hot" data that has not yet been materialised into the CA
  2. Table compression, which comes built in with TimescaleDB
  3. We no longer need the cronjobs that previously calculated the aggregates

- As everything is self-hosted on Hetzner, there was a debate over whether to use ClickHouse or not. Since I personally have no experience with it, I preferred not to add a whole new tool that I would have to maintain, and beefed up the existing PGSQL instance instead.
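To make the "accept, enqueue, return" idea concrete, here is a minimal sketch of what the ingestion endpoint boils down to. This is illustrative only: the Ktor setup and the EventQueue interface are placeholders for whatever server framework and broker you actually run, not my production code.

```kotlin
import io.ktor.http.*
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.request.*
import io.ktor.server.response.*
import io.ktor.server.routing.*

// Placeholder for whatever broker the processing service reads from.
interface EventQueue {
    suspend fun publish(rawEvent: String)
}

fun startIngestion(queue: EventQueue) {
    embeddedServer(Netty, port = 8080) {
        routing {
            // No parsing, no enrichment: accept, enqueue, return. Everything
            // heavier happens later in the Bronze/Silver/Gold pipeline.
            post("/ingest") {
                queue.publish(call.receiveText())
                call.respond(HttpStatusCode.Accepted)
            }
        }
    }.start(wait = true)
}
```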

That is the gist of it. I hope it helps, and it would be great to hear your comments or whether you are interested in hearing more!
