Private AI · On-device · Zero Cloud

JARVIS

Your Private AI. On Your Hardware.

Chat, voice, vision, RAG, agents — running entirely on your NPU+GPU+CPU. 94 tok/s on an AMD Strix Halo laptop. No cloud. No subscriptions. No data leaves your machine.

Install JARVIS → See capabilities
NPU: XDNA2 · 94 tok/s GPU: Radeon 8060S · 312 tok/s Voice: Whisper STT + Piper TTS Vision: Qwen3-VL · 3.2B RAG: Open Knowledge Format
JARVIS — NPU+GPU+CPU Fused live · 94 tok/s
$ curl -X POST http://jarvis.local:8080/api/chat \ -F "message=What can you do?" { "response": "I'm JARVIS — your private AI assistant. I can chat, see images, hear your voice, search your documents, write code, and control your system. Everything runs locally on your NPU+GPU+CPU.", "model": "qwen3:0.6b (NPU)", "latency": "94 tok/s", "power": "~15W" } # decode=94 tok/s ttft=513ms (FLM proxy · XDNA2 NPU) # no cloud — no data leaves your machine
Capabilities

Full AI stack. One machine. Zero cloud.

Everything you expect from a modern AI assistant — all running locally. No API keys. No subscriptions. No data leaving your network.

VOICEWhisper-v3 + Piper

Speech In & Out

Talk to JARVIS naturally. Whisper-v3 on NPU converts speech to text. Piper TTS reads responses aloud. Push-to-talk from the web UI.

NPU-powered STT
VISIONQwen3-VL-4B

See & Understand

Upload images, screenshots, or photos. JARVIS describes, analyzes, and answers questions about what it sees. Runs on NPU at 11 tok/s.

11 tok/s on NPU
RAGOpen Knowledge

Document Intelligence

Upload PDFs, text files, or notes. JARVIS indexes them locally and answers questions from your knowledge base. All files are human-readable markdown.

Transparent format
AGENTSTool Calling

Tool-Using Agent

Calculator, Python execution, file operations, system control. JARVIS decides when to use tools and explains what it found.

Python · calc · files
WEBFastAPI + WebSocket

Web UI & API

Full chat interface with streaming markdown, voice recording, image upload, and drag-drop file upload. Works in any browser. OpenAI-compatible API.

Any browser
MOBILEFlutter · iOS · Android

JARVIS on Your Phone

Native Flutter app with the same JARVIS theme. Connect via ngrok tunnel with QR code pairing. JARVIS in your pocket.

App Store + Play Store
The Hardware

Three processors. One unified stack.

JARVIS dynamically dispatches work across NPU, GPU, and CPU — whichever is fastest for each operation.

NPUXDNA 2 · 32 AIE2P tiles
94tok/s · 50 TOPS INT8

The NPU AMD shipped disabled on consumer silicon. We drive it through FLM proxy at 94 tok/s. INT8 GEMM via XRT xclbin kernels. ~15W power envelope.

Qwen3-0.6B: 10.6 ms/tok · 94 tok/s Qwen3-VL-4B: 93 ms/tok · 11 tok/s Llama-3.1-8B: 100 ms/tok · 10 tok/s Gemma4-E2B: 62 ms/tok · 16 tok/s Qwen3-8B: 127 ms/tok · 8 tok/s
GPURadeon 8060S · 32 CUs · Vulkan
312tok/s · 1-bit quantized

llama.cpp on Vulkan. IQ1_S and Q1_0 quantized models at 1.06-1.25 bpw. 381 tok/s on 0.5B, 312 tok/s on 0.8B, 122 tok/s on 4B. ~45W power envelope.

Qwen2 0.5B IQ1_S: 381 tok/s · 296 MB Qwen3.5-0.8B Q1_0: 312 tok/s · 268 MB gemma3 4B IQ1_S: 122 tok/s · 1.05 GB Qwen3.5-9B Q1_0: 70 tok/s · 1.82 GB Nemo 8B IQ1_S: 79 tok/s · 1.97 GB
Benchmarks

Measured on-device. Verified.

NPU Inference94 tok/sQwen3-0.6B · FLM
GPU 1-bit381 tok/s0.5B IQ1_S · 296 MB
NPU Vision11 tok/sQwen3-VL-4B · 3.2 GB
TTS Latency50 msPiper · ~50ms first word
STT Latency1.5 sWhisper-v3 · real-time
Context Length32Ktokens · RadixAttention
Multi-Context7.9×8 HW contexts · 64 req/s
Power (NPU)15 WEntire inference stack
Power (GPU)45 WFull GPU decode
Architecture

How JARVIS works — end to end.

Every component runs locally. The fused engine dispatches per-operation to NPU, GPU, or CPU.

System Architecture
┌──────────────────────────────────────────────────────────────────────┐ │ JARVIS Web UI (any browser) │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │ │ │ Chat │ │ Voice │ │ Vision │ │ RAG │ │ Status │ │ │ │ (stream) │ │ (mic/tts)│ │ (upload) │ │ (search) │ │ (sys) │ │ │ └────┬─────┘ └─────┬────┘ └────┬─────┘ └─────┬────┘ └───┬────┘ │ └───────┼──────────────┼────────────┼───────────────┼───────────┼────────┘ │ │ │ │ │ ┌───────▼──────────────▼────────────▼───────────────▼───────────▼────────┐ │ JARVIS Orchestrator (Python, :8080) │ │ ┌──────────┐ ┌──────────────┐ ┌───────┐ ┌──────────┐ ┌────────┐ │ │ │ Agent │ │ Open │ │ TTS │ │ Tool │ │ Conv │ │ │ │ (LLM) │ │ Knowledge │ │ (Piper)│ │ Executor │ │ Memory │ │ │ └────┬─────┘ └──────┬───────┘ └───┬───┘ └────┬─────┘ └────┬───┘ │ └───────┼────────────────┼──────────────┼────────────┼──────────────┼──────┘ │ │ │ │ │ ┌───────▼────────────────▼──────────────▼────────────▼──────────────▼──────┐ │ Unified API Layer (port 9090) │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ LLM │ │ Whisper │ │ Embed │ │ Vision │ │ │ │ (FLM) │ │ (FLM) │ │ (FLM) │ │ (FLM) │ │ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ └───────┼──────────────┼────────────┼──────────────┼──────────────────────┘ │ │ │ │ ┌───────▼──────────────▼────────────▼──────────────▼──────────────────────┐ │ Fused Engine (NPU+GPU+CPU) │ │ XDNA2 NPU (94 tok/s) ←→ Radeon 8060S (381 tok/s) ←→ Zen 5 CPU │ │ H2O KV Cache · RadixAttention · 8 Dispatch Policies │ └─────────────────────────────────────────────────────────────────────────┘
Open Knowledge Format

Your knowledge. Your format. No lock-in.

Every fact JARVIS learns is a human-readable .md file with YAML frontmatter. You can read, edit, add, or delete with any text editor. No proprietary databases. No vendor lock-in.

# 📁 /home/bcloud/jarvis/data/knowledge/ ├── index.json # Auto-generated search index ├── facts/ # Structured facts JARVIS learned │ ├── npu_benchmark.md │ ├── model_config_qwen3.md │ └── hardware_specs.md ├── documents/ # Your uploaded docs (RAG) │ ├── project_notes.md │ └── research_paper.txt ├── conversations/ # Chat history logs │ └── 2026-07-05_session.md └── tools/ # Tool output snapshots └── python_result.py # Example entry → npu_benchmark.md --- type: fact created: 2026-07-05T18:30:00Z tags: [npu, benchmark, qwen3] source: measurement confidence: 0.95 --- # NPU Inference Speed Qwen3-0.6B on XDNA2 NPU: - FLM proxy: 94 tok/s (10.6 ms/tok) - C++ v12: 97 tok/s (10.3 ms/tok) - 32 AIE2P tiles · INT8 GEMM · 50 TOPS
Human-readable — Plain markdown. Open with any text editor.
Git-friendly — Version your knowledge. Diff. Rollback. Branch.
No lock-in — Move your knowledge anywhere. No export needed.
Structured metadata — YAML frontmatter for type, tags, confidence.
Full-text search — Built-in keyword search across all entries.
Confidence scoring — JARVIS tracks how certain it is about each fact.
Multimodal

Talk, see, search. All offline.

JARVIS processes speech, images, and documents — all on-device, all private.

┌─────────────────────┐ │ 🎤 Voice Input │ └────────┬────────────┘ ┌────────▼────────────┐ │ Whisper-v3 (NPU) │ │ Speech → Text │ │ ~1.5s real-time │ └────────┬────────────┘ ┌────────▼────────────┐ │ JARVIS responds │ └────────┬────────────┘ ┌────────▼────────────┐ │ Piper TTS (CPU) │ │ Text → Speech │ │ ~50ms first word │ └─────────────────────┘

Push-to-talk from web UI. Natural voice conversation. Auto-TTS on response.

┌─────────────────────┐ │ 🖼 Image Upload │ └────────┬────────────┘ ┌────────▼────────────┐ │ Qwen3-VL-4B (NPU) │ │ Vision Encoder │ │ → Image features │ └────────┬────────────┘ ┌────────▼────────────┐ │ LLM decodes │ │ description │ │ → 11 tok/s │ └─────────────────────┘

Upload images, screenshots, or photos. JARVIS describes what it sees. Works in the chat.

┌─────────────────────┐ │ 📄 Upload Doc │ └────────┬────────────┘ ┌────────▼────────────┐ │ Open Knowledge │ │ → Index into .md │ │ → Full-text search │ └────────┬────────────┘ ┌────────▼────────────┐ │ RAG: Search + │ │ Context injection │ │ → Grounded answers │ └─────────────────────┘

Upload files, notes, or entire directories. JARVIS indexes and answers from your knowledge.

Get Started

Run JARVIS in 3 commands.

Prerequisites: AMD Strix Halo (Ryzen AI Max+ 395) with NPU drivers and FLM installed.

1. Start the NPU backend

sudo flm serve qwen3:0.6b \ --port 52625 --pmode turbo → NPU running at 94 tok/s

FLM (FastFlowLM) runs the model on the XDNA2 NPU. One command, zero config.

2. Start the JARVIS server

source jarvis-env/bin/activate cd jarvis && python3 server.py → JARVIS running on :8080

Python FastAPI server. Orchestrator, agent, voice I/O, RAG, knowledge base — all in one process.

3. Open the web UI

open http://localhost:8080/chat → JARVIS interface loads

Full chat UI with streaming, voice, vision, and file upload. Works in Chrome, Firefox, Safari.

Mobile access

curl -fsSL 1bit.systems/mobile.sh | sh → ngrok tunnel + QR code

Expose JARVIS via ngrok. Scan the QR code with JARVIS Mobile (iOS/Android) to chat from anywhere.

curl -fsSL https://1bit.systems/jarvis/install.sh | sh # Coming soon: one-command JARVIS install

Your private JARVIS. On your machine.

94 tok/s. Zero cloud. Zero subscriptions. Full voice, vision, and RAG. Open source.

Strix Halo required Open source · MIT Zero Python in inference