What Is NIJI?
NIJI (日本語で "rainbow") is a cross-platform chat application where every message is processed locally on the device. There is no server, no subscription, and no network request made when you hit send. The model runs in-process — on your CPU or NPU — and the response is generated entirely within the app sandbox.
The idea originated from a simple frustration: every AI assistant available demanded either a cloud account, a paid API key, or both. For users who handle sensitive conversations — medical, legal, personal — that model is a non-starter. NIJI was built to prove the alternative is not only possible, but polished.
The Problem with Cloud AI
Most AI chat applications send your prompts to a remote server — typically OpenAI, Anthropic, or Google. This means every message you type is:
- Transmitted over the network (interception risk)
- Stored on a third-party server (retention policy risk)
- Potentially used for model training (privacy policy risk)
- Subject to outages, rate limits, and paid usage caps
For enterprise users, healthcare workers, or anyone handling confidential data, cloud AI is simply not an option. The privacy tradeoff is too large.
Architecture: Flutter + Gemma
NIJI is built with Flutter for the UI and cross-platform targeting, and Google's Gemma as the underlying language model. Gemma's smaller variants (2B parameters) are specifically designed for on-device inference — optimized for mobile CPU and NPU execution with quantization support.
- Flutter (Dart) — cross-platform UI
- Custom chat interface components
- Streaming token output UI
- Conversation history management
- Google Gemma 2B (INT4 quantized)
- MediaPipe LLM Inference API
- On-device model loading
- Native CPU/NPU acceleration
- Xcode — iOS packaging & signing
- TestFlight — beta distribution
- App Store Connect — production
- Core ML acceleration support
- Android Studio — APK/AAB builds
- Play Console — internal testing
- GPU/NPU delegate config
- ProGuard & R8 optimization
The key integration point is MediaPipe's LLM Inference API, which handles the heavy lifting of model loading, tokenization, and inference scheduling. Flutter communicates with the native runtime via platform channels, keeping the Dart layer focused on UI and conversation logic while the native layer manages the model lifecycle.
Cross-Platform Deployment
One of Flutter's core promises is "write once, run anywhere" — but on-device AI adds real complexity to that equation. Each platform has its own inference backend, hardware acceleration model, and performance constraints.
iOS Deployment
On iOS, the model runs via Core ML and the Apple Neural Engine (ANE) on devices with A12 Bionic or newer. The quantized Gemma 2B model fits within the memory budget for most modern iPhones and processes tokens at a comfortable rate for real-time chat. The app was distributed through TestFlight for beta testing before submission to the App Store.
Android Deployment
Android inference is handled through MediaPipe's GPU delegate on capable devices, falling back to CPU inference on older hardware. The APK packaging required careful ProGuard configuration to ensure the native inference libraries were not stripped during optimization. Performance varies significantly across the Android device landscape — a known challenge with on-device AI.
Privacy Benefits
- Zero network dependency: The app functions identically with no internet connection — on a plane, in a secure facility, or in airplane mode.
- No telemetry or logging: No analytics SDKs, no crash reporting that transmits data, no session tokens sent to external services.
- Air-gap capable: Once the model is loaded, the app can run on a device that has never connected to the internet — suitable for high-security environments.
- Conversation isolation: All history is stored in local app storage, encrypted by the OS sandbox. Deleting the app removes all data completely.
- No account required: No sign-up, no email, no identity attached to usage.
Performance Considerations
On-device LLMs are not without trade-offs. The Gemma 2B model, while compact by LLM standards, still requires:
- ~1.5 GB of RAM for model weights in INT4 quantized form
- 3–6 seconds for initial model load on mid-range hardware
- ~15–30 tokens/sec generation speed on modern iPhones with ANE
- Thermal throttling during extended sessions on thinner devices
These constraints are real but manageable. For the conversational use case NIJI targets, the response latency is acceptable — and it is a latency that comes with complete privacy rather than exposing data to a cloud provider.