A security-hardened fork of Google AI Edge Gallery with on-device image generation, voice mode (speech-to-speech AI chat), voice input, document analysis, vision AI, biometric lock, encrypted chat history, llama.cpp support, and GGUF model import.
Box is an independent community fork of Google AI Edge Gallery and is not affiliated with or endorsed by Google LLC. Google branding has been replaced throughout. All credit for the underlying platform goes to Google and the original contributors; this fork simply builds on top of their work.
Repository: github.com/jegly/box
OfflineLLM came first: a privacy-first Android chat app with a llama.cpp backend.
This project (Box) forks Google's AI Edge Gallery to create a hybrid LiteRT / llama.cpp experience. Integrating llama.cpp here was easier than adding LiteRT to OfflineLLM.
→ Try the OfflineLLM app for pure llama.cpp on-device chat.
Box is an Android app for running AI entirely on-device: chat, voice mode, image generation, speech-to-text, document analysis, and vision, all without a network connection. It inherits the full feature set of the upstream Google AI Edge Gallery and layers on top: encrypted conversations, biometric lock, hard offline mode, and three additional native inference engines (llama.cpp, stable-diffusion.cpp, whisper.cpp) alongside LiteRT.
What makes Box unique? You can sit at your desk, tap two buttons, and have a real flowing voice conversation with an AI: no wake word, no account, no server, no subscription. It listens, thinks, and speaks back sentence by sentence before it's even finished generating. Point the camera at something and ask about it out loud. The AI sees it and answers. All of it runs on the phone in your hand, completely offline, faster than you'd expect.
A separate branch and release, custom-rom-support, is available for users on third-party Android operating systems, including but not limited to LineageOS, GrapheneOS, and CalyxOS. If you are on a custom ROM, use that branch/release instead of main; expect broken features if you run the main branch or its releases. The custom-rom-support branch supports TPU/NPU acceleration on Tensor devices, but Snapdragon acceleration remains untested. The primary reason for these limitations is that third-party operating systems typically lack AICore and system-level Text-to-Speech (TTS) components, so features such as voice-to-voice mode and NPU/GPU acceleration are either unavailable or significantly impaired on these ROMs.
Box is a fork of Google AI Edge Gallery. The upstream project is excellent; Box just layers on additional capabilities:
| Area | What Box adds |
|---|---|
| Inference engines | llama.cpp (GGUF LLMs), stable-diffusion.cpp (image gen), whisper.cpp (STT) alongside LiteRT |
| Model import | Import any local GGUF file, not limited to the curated download list |
| NPU / TPU | All Snapdragon / Tensor / MediaTek variants bundled in one APK (upstream ships per-SoC) |
| Voice mode / Vision mode | Free talk (continuous hands-free loop) and Vision talk (live camera + voice) |
| Image generation | On-device Stable Diffusion via GGUF |
| Speech-to-text | On-device Whisper STT |
| Document analysis | Attach text files directly in chat |
| Chat history | Persisted to a SQLCipher-encrypted Room database, resumable across sessions |
| Security | Biometric app lock, hard offline mode, prompt sanitisation, audit log |
| Agent skills | 20 built-in skills (upstream has 9) |
| Math rendering | LaTeX expressions rendered as Unicode in chat |
Multi-turn conversations with on-device LLMs. Import any GGUF model or download LiteRT models from the built-in list. Supports Thinking Mode on compatible models. Full markdown rendering with LaTeX math support: Greek letters, operators, fractions, and notation are rendered as Unicode symbols. Conversations are persisted and resumable.
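To give a rough idea of how LaTeX-to-Unicode substitution can work, here is a minimal Kotlin sketch; the symbol table and names are illustrative, not Box's actual renderer:

```kotlin
// Illustrative LaTeX-to-Unicode substitution; not Box's real renderer.
object LatexUnicode {
    private val symbols = mapOf(
        "\\alpha" to "α", "\\beta" to "β", "\\pi" to "π",
        "\\times" to "×", "\\leq" to "≤", "\\geq" to "≥",
        "\\sum" to "∑", "\\infty" to "∞", "\\sqrt" to "√"
    )

    // Replace longest commands first so shorter prefixes never clobber longer ones.
    fun render(text: String): String =
        symbols.entries
            .sortedByDescending { it.key.length }
            .fold(text) { acc, (cmd, sym) -> acc.replace(cmd, sym) }
}
```

For example, `LatexUnicode.render("\\alpha \\leq \\pi")` yields `α ≤ π`.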
Recommended models: we highly recommend Gemma 4 E2B or Gemma 4 E4B (LiteRT) as your primary models. They are the best-tested, support vision, voice, and documents, and run efficiently with GPU/NPU acceleration. Both are available to download directly in the app.
With Gemma 4 E2B / E4B selected, the chat input expands to a full multimodal interface:
- 📎 Attach documents (`.txt`, `.md`, `.csv`, `.json`, `.py`, `.kt`, and more); content is injected into context automatically, as sketched after this list
- 🎙 Record an audio clip or pick a WAV file to speak your question
- 📷 Take a photo or pick from album for visual Q&A
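Document injection is conceptually simple: read the attached file and prepend its text to the question before inference. A hedged sketch, where `buildPrompt` and the size cap are hypothetical rather than Box's real API:

```kotlin
import java.io.File

// Hypothetical helper: prepend an attached document to the user's question.
fun buildPrompt(userQuestion: String, attachment: File?): String {
    if (attachment == null) return userQuestion
    if (attachment.length() > 256 * 1024) return userQuestion // skip huge files (assumed cap)
    val doc = attachment.readText()
    return "Attached document (${attachment.name}):\n$doc\n\nQuestion: $userQuestion"
}
```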
On-device image generation powered by stable-diffusion.cpp. Runs Stable Diffusion 1.5 in GGUF format fully offline: no API key, no cloud. Configurable steps, CFG scale, seed, and image size presets. Save generated images directly to your gallery. Import your own GGUF diffusion models.
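The configurable options map naturally onto a parameter object handed to the native layer. A sketch of the shape (names and defaults are illustrative; the actual stable-diffusion.cpp binding may differ):

```kotlin
// Hypothetical parameter holder mirroring the options exposed in the UI.
data class DiffusionParams(
    val prompt: String,
    val steps: Int = 20,        // denoising steps: more is slower but sharper
    val cfgScale: Float = 7.0f, // how strongly the prompt guides the image
    val seed: Long = -1L,       // -1 = pick a random seed
    val width: Int = 512,       // SD 1.5 is trained at 512x512
    val height: Int = 512
)
```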
On-device speech-to-text using whisper.cpp. Tap to record, tap to transcribe. Copy or clear results. Supports Whisper Tiny through Small models in multiple languages. Audio never leaves the device.
Tap the mic and the speaker. That's it. Box listens to you, sends your words to the AI, and speaks the reply back, then immediately starts listening again. No tapping between turns. No waiting for a full response before it starts speaking. Just sit there and talk to it like a person.
On Gemma 4 E2B it keeps up in real time. The first sentence of the reply is already being spoken while the model is still generating the rest.
- "Explain quantum entanglement like I'm five" β speaks the answer, listens for your follow-up
- "Actually, go deeper on that last point" β multi-turn, completely hands-free
- "Help me think through a problem I'm having at work" β back and forth, no typing ever
- "What should I cook for dinner tonight? I've got chicken and not much else" β practical daily use
It feels like having an AI sitting across from you. Entirely offline. Nothing leaves the device.
Three toggles in AI Chat control it:
- 🎤 Mic – tap once to enter free talk mode, tap again to stop
- 🔊 Speaker – AI replies spoken aloud, sentence by sentence as they generate
- 📹 Camera – live vision mode (see below)
Enable Real-time voice reply in Settings for sentence-by-sentence speech as the model generates. Works out of the box with Android's built-in speech recognition and TTS; load a Whisper or Piper model for higher quality.
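Sentence-by-sentence playback can be built on Android's platform `TextToSpeech` API by queueing each completed sentence as it streams out of the model. A minimal sketch, where `SentenceSpeaker` is illustrative and not Box's actual class:

```kotlin
import android.content.Context
import android.speech.tts.TextToSpeech

class SentenceSpeaker(context: Context) {
    private var ready = false
    private val tts = TextToSpeech(context) { status ->
        ready = status == TextToSpeech.SUCCESS
    }

    // Call once per completed sentence while the LLM is still generating.
    // QUEUE_ADD appends to the playback queue, so speech overlaps generation.
    fun speak(sentence: String) {
        if (ready) {
            tts.speak(sentence, TextToSpeech.QUEUE_ADD, null, sentence.hashCode().toString())
        }
    }

    fun shutdown() = tts.shutdown()
}
```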
De-Googled ROMs (GrapheneOS, CalyxOS, LineageOS without GApps): Google TTS is not pre-installed on these devices. Install a TTS engine from F-Droid (e.g. RHVoice or eSpeak NG) and set it as your default in Android Settings → Accessibility → Text-to-speech. The app will use it automatically.
Tap the camera toggle to stream your back camera directly to the AI. Point it at anything and ask; the AI sees the current frame alongside your question and speaks its answer back. All offline, no cloud.
Things you can do:
- Point at a plant → "What species is this and how do I care for it?"
- Point at food in your fridge → "What can I cook with what's here?"
- Point at a label or sign in another language → "What does this say?"
- Point at a circuit board → "What component is this and what does it do?"
- Point at your code on a laptop screen → "What's wrong with this function?"
- Point at a meal → "Roughly how many calories is this?"
- Point at a maths problem → "Walk me through how to solve this"
Combine with mic + speaker for a fully hands-free vision conversation: speak your question, the AI sees the scene, speaks the answer, and listens for the next question. Requires a vision-capable model (Gemma 4 E2B or E4B).
When mic is off, camera mode automatically sends a frame every 3 seconds with "What do you see?", which is useful for passive scene description.
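The passive mode boils down to a timed capture loop. A rough coroutine sketch, in which `captureFrame` and `askModel` are stand-ins for the real camera and inference calls:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch

// Stand-ins for Box's actual camera capture and model invocation.
suspend fun captureFrame(): ByteArray = TODO("grab the current camera frame")
suspend fun askModel(frame: ByteArray, prompt: String): String = TODO("run inference")

fun startPassiveDescription(scope: CoroutineScope) = scope.launch {
    while (isActive) {                        // stops when the camera toggle is turned off
        val frame = captureFrame()
        askModel(frame, "What do you see?")   // fixed prompt from the feature description
        delay(3_000)                          // one frame every 3 seconds
    }
}
```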
Ask questions about images using on-device vision models. Powered by LiteRT with Gemma 4 E2B / E4B: GPU-accelerated, with up to 32K context.
Enable an optional biometric lock from Settings. The app re-locks automatically every time it is backgrounded. Unlock via fingerprint or face authentication before any content is shown.
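The relevant platform API is AndroidX `BiometricPrompt`. A hedged sketch of the unlock gate; the wiring is illustrative, and in practice it would be triggered from the activity's `onStart()` whenever the app returns to the foreground locked:

```kotlin
import androidx.biometric.BiometricManager
import androidx.biometric.BiometricPrompt
import androidx.core.content.ContextCompat
import androidx.fragment.app.FragmentActivity

fun showUnlockPrompt(activity: FragmentActivity, onUnlocked: () -> Unit) {
    val prompt = BiometricPrompt(
        activity,
        ContextCompat.getMainExecutor(activity),
        object : BiometricPrompt.AuthenticationCallback() {
            override fun onAuthenticationSucceeded(result: BiometricPrompt.AuthenticationResult) {
                onUnlocked() // reveal app content only after a successful unlock
            }
        }
    )
    val info = BiometricPrompt.PromptInfo.Builder()
        .setTitle("Unlock Box")
        .setSubtitle("Authenticate to view your conversations")
        .setAllowedAuthenticators(BiometricManager.Authenticators.BIOMETRIC_STRONG)
        .setNegativeButtonText("Cancel")
        .build()
    prompt.authenticate(info)
}
```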
All conversations are stored in a SQLCipher-encrypted Room database. History persists across sessions and is resumable from the Chat History screen. Swipe to delete individual conversations, or wipe all at once.
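Wiring SQLCipher into Room typically comes down to supplying a `SupportFactory` when building the database. A minimal sketch assuming the `android-database-sqlcipher` artifact; `ChatDatabase`, the entity, and the passphrase source are illustrative, not Box's actual code:

```kotlin
import android.content.Context
import androidx.room.Database
import androidx.room.Entity
import androidx.room.PrimaryKey
import androidx.room.Room
import androidx.room.RoomDatabase
import net.sqlcipher.database.SupportFactory

@Entity
data class Message(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val text: String
)

@Database(entities = [Message::class], version = 1)
abstract class ChatDatabase : RoomDatabase()

// The passphrase would come from the Android Keystore in a real app.
fun openEncryptedDb(context: Context, passphrase: ByteArray): ChatDatabase =
    Room.databaseBuilder(context, ChatDatabase::class.java, "chat.db")
        .openHelperFactory(SupportFactory(passphrase)) // encrypts the file at rest
        .build()
```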
All Qualcomm Hexagon NPU variants (Snapdragon 8 Gen 2 / 8 Gen 3 / 8 Elite / newer), Google Tensor TPU (Pixel 8–10), and MediaTek NPU are bundled in a single APK; no separate builds per device. Select NPU/TPU in the model's accelerator dropdown and Box auto-detects the chip and loads the right runtime (a sketch of this detection follows the hardware list below). Uses LiteRT JIT compilation on-device, so no pre-compiled model files are needed.
Supported hardware:
- Snapdragon 8 Gen 2 (SM8550, Hexagon V69)
- Snapdragon 8 Gen 3 (SM8650, Hexagon V73)
- Snapdragon 8 Elite (SM8750, Hexagon V75)
- Snapdragon 8 Elite for Galaxy (SM8850, Hexagon V79)
- Snapdragon next-gen (Hexagon V81)
- Google Tensor G3 / G4 / G5 (Pixel 8 / 9 / 10)
- MediaTek Dimensity (MT6989, MT6991)
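A sketch of how that runtime detection can look, using `Build.SOC_MODEL` (available since API 31). The mapping mirrors the list above, but the function and the exact `SOC_MODEL` strings are assumptions, not Box's actual code:

```kotlin
import android.os.Build

fun detectNpuRuntime(): String? {
    if (Build.VERSION.SDK_INT < 31) return null // SOC_MODEL needs Android 12+
    return when (val soc = Build.SOC_MODEL) {
        "SM8550" -> "Hexagon V69 (Snapdragon 8 Gen 2)"
        "SM8650" -> "Hexagon V73 (Snapdragon 8 Gen 3)"
        "SM8750" -> "Hexagon V75 (Snapdragon 8 Elite)"
        "SM8850" -> "Hexagon V79 (Snapdragon 8 Elite for Galaxy)"
        "MT6989", "MT6991" -> "MediaTek NPU"
        else -> if (soc.startsWith("Tensor")) "Google Tensor TPU" else null // fall back to GPU/CPU
    }
}
```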
Import any GGUF model file from local storage. At import time set the display name and choose the accelerator (CPU, GPU via OpenCL/Vulkan, or NPU via QNN delegate). Stable Diffusion GGUF models can also be imported for image generation.
A toggle in Settings forces the app into a fully air-gapped state: all download attempts throw an exception and no network calls are made.
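Conceptually this is a process-wide gate consulted before any network operation. A minimal sketch of the shape (the real `OfflineMode` singleton in Box may differ):

```kotlin
import java.io.IOException

object OfflineMode {
    @Volatile var enabled: Boolean = false // toggled from Settings

    // Called before any network operation; in Box the DownloadWorker
    // and other network paths would fail fast here instead of connecting.
    fun checkNetworkAllowed() {
        if (enabled) throw IOException("Offline mode: network access is blocked")
    }
}
```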
- Android 16+
- ~4 GB of free storage for a typical quantised LLM
```bash
git clone --recurse-submodules https://github.com/jegly/box
cd box/Android
./gradlew :app:assembleDebug
```

The `--recurse-submodules` flag is required to pull the llama.cpp, stable-diffusion.cpp, and whisper.cpp submodules. The first build compiles all three native libraries from source; expect 15–25 minutes. Subsequent builds are fast.
Open Android/ in Android Studio (Ladybug or newer) and run on a physical device for best performance.
- Copy a `.gguf` file to your device (Downloads, USB, etc.)
- Open the app → Model Manager in the drawer
- Tap Import and pick your file
- Set a display name and choose CPU / GPU / NPU
- The model appears in AI Chat
| Mechanism | Details |
|---|---|
| Database encryption | SQLCipher via androidx.room, AES-256 at rest |
| Biometric gate | BiometricPrompt API, re-prompts on each foreground |
| Offline mode | OfflineMode singleton blocks DownloadWorker and network calls |
| Prompt sanitisation | SecurityUtils.sanitizePrompt() strips control characters before inference and persistence (sketched below this table) |
| Tapjacking protection | filterTouchesWhenObscured set on the chat scaffold |
| Audit log | SecurityAuditLog writes security events to a local append-only log |
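For illustration, control-character stripping along the lines of `SecurityUtils.sanitizePrompt()` can be as simple as the following; the length cap is a hypothetical addition, not a value taken from Box:

```kotlin
// Hypothetical cap to bound prompt size; not a value from Box.
const val MAX_PROMPT_LENGTH = 32_768

// Drop control characters (keeping newlines and tabs) before the text
// reaches inference or the encrypted database.
fun sanitizePrompt(raw: String): String =
    raw.filter { ch -> !ch.isISOControl() || ch == '\n' || ch == '\t' }
        .take(MAX_PROMPT_LENGTH)
```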
- Kotlin + Jetpack Compose – UI
- Hilt – dependency injection
- Room + SQLCipher – encrypted persistence
- LiteRT-LM – LiteRT inference runtime for LLMs (GPU + NPU/TPU)
- Qualcomm QNN / QAIRT 2.41 – Hexagon NPU runtime (V69–V81, bundled)
- LiteRT NPU dispatch – auto-selects Qualcomm / Google Tensor / MediaTek at runtime
- llama.cpp – GGUF LLM inference (git submodule)
- stable-diffusion.cpp – GGUF image generation (git submodule)
- whisper.cpp – on-device speech-to-text (git submodule)
- Firebase Analytics – anonymous usage stats (disabled in Offline Mode)
Box would not exist without the work of the teams and individuals behind the projects it builds on.
Google AI Edge Gallery – the upstream project this fork is based on. The Google AI Edge team built an exceptionally well-structured, open-source Android app and made it available under the Apache 2.0 licence. Everything in Box starts from their foundation. Upstream changes are merged periodically, and any improvements that belong upstream will be contributed back.
llama.cpp – Georgi Gerganov and the llama.cpp contributors for making high-performance on-device LLM inference accessible to everyone.
stable-diffusion.cpp – leejet and contributors for the C++ Stable Diffusion implementation that powers on-device image generation.
whisper.cpp – Georgi Gerganov and contributors for the Whisper speech-to-text port.
LiteRT / TensorFlow Lite – the Google teams behind LiteRT (formerly TFLite) and the NPU/GPU delegate infrastructure.
Thank you to everyone who has opened issues, tested builds, or contributed to any of these projects. On-device AI is a community effort.
Licensed under the Apache License, Version 2.0. See the LICENSE file for details.