Private AI · On-device · Zero Cloud

JARVIS

Your Private AI. On Your Hardware.

Chat, voice, vision, RAG, agents — running entirely on your NPU+GPU+CPU. 94 tok/s on an AMD Strix Halo laptop. No cloud. No subscriptions. No data leaves your machine.

Install JARVIS → See capabilities

NPU: XDNA2 · 94 tok/s GPU: Radeon 8060S · 312 tok/s Voice: Whisper STT + Piper TTS Vision: Qwen3-VL · 3.2B RAG: Open Knowledge Format

JARVIS — NPU+GPU+CPU Fused live · 94 tok/s

$ curl -X POST http://jarvis.local:8080/api/chat \ -F "message=What can you do?" { "response": "I'm JARVIS — your private AI assistant. I can chat, see images, hear your voice, search your documents, write code, and control your system. Everything runs locally on your NPU+GPU+CPU.", "model": "qwen3:0.6b (NPU)", "latency": "94 tok/s", "power": "~15W" } # decode=94 tok/s ttft=513ms (FLM proxy · XDNA2 NPU) # no cloud — no data leaves your machine

Capabilities

Full AI stack. One machine. Zero cloud.

Everything you expect from a modern AI assistant — all running locally. No API keys. No subscriptions. No data leaving your network.

VOICEWhisper-v3 + Piper

Speech In & Out

Talk to JARVIS naturally. Whisper-v3 on NPU converts speech to text. Piper TTS reads responses aloud. Push-to-talk from the web UI.

NPU-powered STT

VISIONQwen3-VL-4B

See & Understand

Upload images, screenshots, or photos. JARVIS describes, analyzes, and answers questions about what it sees. Runs on NPU at 11 tok/s.

11 tok/s on NPU

RAGOpen Knowledge

Document Intelligence

Upload PDFs, text files, or notes. JARVIS indexes them locally and answers questions from your knowledge base. All files are human-readable markdown.

Transparent format

AGENTSTool Calling

Tool-Using Agent

Calculator, Python execution, file operations, system control. JARVIS decides when to use tools and explains what it found.

Python · calc · files

WEBFastAPI + WebSocket

Web UI & API

Full chat interface with streaming markdown, voice recording, image upload, and drag-drop file upload. Works in any browser. OpenAI-compatible API.

Any browser

MOBILEFlutter · iOS · Android

JARVIS on Your Phone

Native Flutter app with the same JARVIS theme. Connect via ngrok tunnel with QR code pairing. JARVIS in your pocket.

App Store + Play Store

The Hardware

Three processors. One unified stack.

JARVIS dynamically dispatches work across NPU, GPU, and CPU — whichever is fastest for each operation.

NPUXDNA 2 · 32 AIE2P tiles

94tok/s · 50 TOPS INT8

The NPU AMD shipped disabled on consumer silicon. We drive it through FLM proxy at 94 tok/s. INT8 GEMM via XRT xclbin kernels. ~15W power envelope.

Qwen3-0.6B: 10.6 ms/tok · 94 tok/s Qwen3-VL-4B: 93 ms/tok · 11 tok/s Llama-3.1-8B: 100 ms/tok · 10 tok/s Gemma4-E2B: 62 ms/tok · 16 tok/s Qwen3-8B: 127 ms/tok · 8 tok/s

GPURadeon 8060S · 32 CUs · Vulkan

312tok/s · 1-bit quantized

llama.cpp on Vulkan. IQ1_S and Q1_0 quantized models at 1.06-1.25 bpw. 381 tok/s on 0.5B, 312 tok/s on 0.8B, 122 tok/s on 4B. ~45W power envelope.

Qwen2 0.5B IQ1_S: 381 tok/s · 296 MB Qwen3.5-0.8B Q1_0: 312 tok/s · 268 MB gemma3 4B IQ1_S: 122 tok/s · 1.05 GB Qwen3.5-9B Q1_0: 70 tok/s · 1.82 GB Nemo 8B IQ1_S: 79 tok/s · 1.97 GB

Benchmarks

Measured on-device. Verified.

NPU Inference94 tok/sQwen3-0.6B · FLM

GPU 1-bit381 tok/s0.5B IQ1_S · 296 MB

NPU Vision11 tok/sQwen3-VL-4B · 3.2 GB

TTS Latency50 msPiper · ~50ms first word

STT Latency1.5 sWhisper-v3 · real-time

Context Length32Ktokens · RadixAttention

Multi-Context7.9×8 HW contexts · 64 req/s

Power (NPU)15 WEntire inference stack

Power (GPU)45 WFull GPU decode

Architecture

How JARVIS works — end to end.

Every component runs locally. The fused engine dispatches per-operation to NPU, GPU, or CPU.

System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                       JARVIS Web UI (any browser)                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────┐  │
│  │ Chat     │  │ Voice    │  │ Vision   │  │ RAG      │  │ Status │  │
│  │ (stream) │  │ (mic/tts)│  │ (upload) │  │ (search) │  │ (sys)  │  │
│  └────┬─────┘  └─────┬────┘  └────┬─────┘  └─────┬────┘  └───┬────┘  │
└───────┼──────────────┼────────────┼───────────────┼───────────┼────────┘
        │              │            │               │           │
┌───────▼──────────────▼────────────▼───────────────▼───────────▼────────┐
│                    JARVIS Orchestrator (Python, :8080)                   │
│  ┌──────────┐  ┌──────────────┐  ┌───────┐  ┌──────────┐  ┌────────┐  │
│  │ Agent    │  │ Open         │  │ TTS   │  │ Tool     │  │ Conv   │  │
│  │ (LLM)    │  │ Knowledge    │  │ (Piper)│  │ Executor │  │ Memory │  │
│  └────┬─────┘  └──────┬───────┘  └───┬───┘  └────┬─────┘  └────┬───┘  │
└───────┼────────────────┼──────────────┼────────────┼──────────────┼──────┘
        │                │              │            │              │
┌───────▼────────────────▼──────────────▼────────────▼──────────────▼──────┐
│                    Unified API Layer (port 9090)                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ LLM      │  │ Whisper  │  │ Embed    │  │ Vision   │               │
│  │ (FLM)    │  │ (FLM)    │  │ (FLM)    │  │ (FLM)    │               │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘               │
└───────┼──────────────┼────────────┼──────────────┼──────────────────────┘
        │              │            │              │
┌───────▼──────────────▼────────────▼──────────────▼──────────────────────┐
│                     Fused Engine (NPU+GPU+CPU)                          │
│  XDNA2 NPU (94 tok/s)  ←→  Radeon 8060S (381 tok/s)  ←→  Zen 5 CPU   │
│  H2O KV Cache · RadixAttention · 8 Dispatch Policies                   │
└─────────────────────────────────────────────────────────────────────────┘

Open Knowledge Format

Your knowledge. Your format. No lock-in.

Every fact JARVIS learns is a human-readable .md file with YAML frontmatter. You can read, edit, add, or delete with any text editor. No proprietary databases. No vendor lock-in.

# 📁 /home/bcloud/jarvis/data/knowledge/ ├── index.json # Auto-generated search index ├── facts/ # Structured facts JARVIS learned │ ├── npu_benchmark.md │ ├── model_config_qwen3.md │ └── hardware_specs.md ├── documents/ # Your uploaded docs (RAG) │ ├── project_notes.md │ └── research_paper.txt ├── conversations/ # Chat history logs │ └── 2026-07-05_session.md └── tools/ # Tool output snapshots └── python_result.py # Example entry → npu_benchmark.md --- type: fact created: 2026-07-05T18:30:00Z tags: [npu, benchmark, qwen3] source: measurement confidence: 0.95 --- # NPU Inference Speed Qwen3-0.6B on XDNA2 NPU: - FLM proxy: 94 tok/s (10.6 ms/tok) - C++ v12: 97 tok/s (10.3 ms/tok) - 32 AIE2P tiles · INT8 GEMM · 50 TOPS

✓Human-readable — Plain markdown. Open with any text editor.

✓Git-friendly — Version your knowledge. Diff. Rollback. Branch.

✓No lock-in — Move your knowledge anywhere. No export needed.

✓Structured metadata — YAML frontmatter for type, tags, confidence.

✓Full-text search — Built-in keyword search across all entries.

✓Confidence scoring — JARVIS tracks how certain it is about each fact.

Multimodal

Talk, see, search. All offline.

JARVIS processes speech, images, and documents — all on-device, all private.

┌─────────────────────┐ │ 🎤 Voice Input │ └────────┬────────────┘ │ ┌────────▼────────────┐ │ Whisper-v3 (NPU) │ │ Speech → Text │ │ ~1.5s real-time │ └────────┬────────────┘ │ ┌────────▼────────────┐ │ JARVIS responds │ └────────┬────────────┘ │ ┌────────▼────────────┐ │ Piper TTS (CPU) │ │ Text → Speech │ │ ~50ms first word │ └─────────────────────┘

Push-to-talk from web UI. Natural voice conversation. Auto-TTS on response.

┌─────────────────────┐ │ 🖼 Image Upload │ └────────┬────────────┘ │ ┌────────▼────────────┐ │ Qwen3-VL-4B (NPU) │ │ Vision Encoder │ │ → Image features │ └────────┬────────────┘ │ ┌────────▼────────────┐ │ LLM decodes │ │ description │ │ → 11 tok/s │ └─────────────────────┘

Upload images, screenshots, or photos. JARVIS describes what it sees. Works in the chat.

┌─────────────────────┐ │ 📄 Upload Doc │ └────────┬────────────┘ │ ┌────────▼────────────┐ │ Open Knowledge │ │ → Index into .md │ │ → Full-text search │ └────────┬────────────┘ │ ┌────────▼────────────┐ │ RAG: Search + │ │ Context injection │ │ → Grounded answers │ └─────────────────────┘

Upload files, notes, or entire directories. JARVIS indexes and answers from your knowledge.

Get Started

Run JARVIS in 3 commands.

Prerequisites: AMD Strix Halo (Ryzen AI Max+ 395) with NPU drivers and FLM installed.

1. Start the NPU backend

sudo flm serve qwen3:0.6b \ --port 52625 --pmode turbo → NPU running at 94 tok/s

FLM (FastFlowLM) runs the model on the XDNA2 NPU. One command, zero config.

2. Start the JARVIS server

source jarvis-env/bin/activate cd jarvis && python3 server.py → JARVIS running on :8080

Python FastAPI server. Orchestrator, agent, voice I/O, RAG, knowledge base — all in one process.

3. Open the web UI

open http://localhost:8080/chat → JARVIS interface loads

Full chat UI with streaming, voice, vision, and file upload. Works in Chrome, Firefox, Safari.

Mobile access

curl -fsSL 1bit.systems/mobile.sh | sh → ngrok tunnel + QR code

Expose JARVIS via ngrok. Scan the QR code with JARVIS Mobile (iOS/Android) to chat from anywhere.

curl -fsSL https://1bit.systems/jarvis/install.sh | sh # Coming soon: one-command JARVIS install