Gemma Chat: offline vibe coding with Gemma 4 and MLX on Mac

Electron app runs Gemma 4 on Apple Silicon with MLX-LM: build + chat modes, model sizes, setup, when offline helps vs when you still need the network. MIT: github.com/ammaarreshi/gemma-chat

15 min read · Yash Thakker
Tags: Gemma, Local LLM, Apple Silicon, MLX, Open source, Vibe coding

Per its README, Gemma Chat is a local-first desktop app: Electron + Vite + React 19 + TypeScript + Tailwind on the surface, MLX-LM underneath for Gemma 4 on Apple Silicon, with optional Ollama compatibility called out in the repo description. The project bills itself as “vibe code without the internet” after the initial model pull—no API keys in the local narrative, MIT license.

This article is an ExplainX field guide: stack, model sizing, how the agent loop is described upstream, and what to validate if you fork it for your team.

TL;DR

| Question | Short answer |
| --- | --- |
| What is it? | Desktop chat + coding agent for Gemma 4, running via MLX on Mac (Apple Silicon). |
| Why care? | A concrete open-source reference for offline-capable assistant UX tied to Google’s open Gemma line and Apple’s MLX runtime. |
| Primary source | github.com/ammaarreshi/gemma-chat |
| Creator signal | Ammaar Reshi: public launch thread and Google Gemma account amplification (April 2026). Star and fork counts change, so check the repo badge row. |
| License | MIT (per repository LICENSE). |

What shipped

The README frames two modes:

  1. Build mode: A coding agent with a live preview. The model writes multi-file HTML/CSS/JS-style trees into a sandboxed workspace while the UI streams updates.
  2. Chat mode: Conversational use with tools (the upstream feature list mentions web search, URL fetch, calculator, and bash).

Supporting pieces called out there include model switching across several Gemma 4 variants, voice input via Whisper (transformers.js / WASM in-browser path per stack table), and first-run automation: Python venv + MLX provisioning.

How the agent loop is described

The README’s architecture section is worth reading directly. In Build mode the story is:

  • Stream tokens from a local MLX server.
  • Parse XML <action> blocks from the stream (upstream notes small models behaving more reliably with XML than JSON tool calls).
  • Execute actions (file writes, bash, etc.) and feed results back—up to ~40 rounds per user message in the documented design.
  • Flush partial file writes on a timer so the preview iframe can reload while generation is in flight.

That pattern—stream → parse imperative actions → mutate workspace → loop—is the same family of “local Codex-style” loops teams are standardizing on in 2026; here it is bound to Gemma + MLX instead of a hosted API.
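The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the `<action type="..." path="...">` attribute layout, the function names, and the `generate` / `execute` callables are assumptions made for the example. Only the XML `<action>` idea and the roughly 40-round cap come from the README description.

```python
import re
from collections.abc import Callable, Iterable

# Hypothetical tag layout: <action type="write_file" path="index.html">...</action>
ACTION_RE = re.compile(
    r'<action\s+type="(?P<type>[^"]+)"(?:\s+path="(?P<path>[^"]*)")?\s*>'
    r"(?P<body>.*?)</action>",
    re.DOTALL,
)

MAX_ROUNDS = 40  # approximate cap per user message in the documented design


def extract_actions(buffer: str) -> tuple[list[dict], str]:
    """Return every completed action in the buffer plus the unconsumed tail."""
    actions = []
    last_end = 0
    for match in ACTION_RE.finditer(buffer):
        actions.append(
            {"type": match["type"], "path": match["path"], "body": match["body"]}
        )
        last_end = match.end()
    # Anything after the last closed action may be a partial block; keep it.
    return actions, buffer[last_end:]


def run_agent_turn(
    generate: Callable[[list[str]], Iterable[str]],
    execute: Callable[[dict], str],
    user_message: str,
    max_rounds: int = MAX_ROUNDS,
) -> int:
    """Stream, parse actions, execute them, feed results back. Returns rounds used."""
    transcript = [user_message]
    for round_no in range(1, max_rounds + 1):
        buffer = ""
        results = []
        for chunk in generate(transcript):  # tokens from the local model server
            buffer += chunk
            actions, buffer = extract_actions(buffer)
            results.extend(execute(action) for action in actions)
        if not results:  # the model emitted no actions, so the turn is finished
            return round_no
        transcript.extend(results)
    return max_rounds
```

Parsing on every chunk is what lets actions run while generation is still in flight; the round cap keeps a confused model from looping indefinitely.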

Models and memory (from upstream table)

The project’s README publishes a simple matrix. Paraphrased here—re-verify on the repo before you buy hardware:

| Variant (as labeled upstream) | Approximate size | Notes |
| --- | --- | --- |
| Gemma 4 E2B | ~1.5 GB | Faster, lighter tasks |
| Gemma 4 E4B | ~3 GB | Recommended balance in README |
| Gemma 4 27B MoE | ~8 GB | Stronger reasoning; 16 GB+ RAM class machine |
| Gemma 4 31B | ~18 GB | Heaviest; 32 GB+ RAM class machine |

Community replies on X have asked the same question your laptop will ask: which row is “enough” for acceptable latency on your thermal budget—there is no substitute for local profiling on the exact chip and cooling you ship with.

Getting started (upstream commands)

From the README's Getting Started block:

git clone https://github.com/ammaarreshi/gemma-chat.git
cd gemma-chat
npm install
npm run dev

Note: Some README snapshots on the web have referenced alternate clone URLs; use the repository you intend to fork and verify default branch and package scripts in package.json before documenting runbooks internally.

Packaging:

npm run dist

Upstream states this yields a .dmg for drag-to-Applications installs.
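Before copying these commands into an internal runbook, it can help to confirm that the scripts actually exist in the checkout you forked. A small sketch follows; the helper names are made up for illustration, and the only thing taken from the article is that `dev` and `dist` are the two scripts quoted from the README.

```python
import json
from pathlib import Path

REQUIRED_SCRIPTS = ("dev", "dist")  # the two npm scripts quoted from the README


def missing_scripts(package_json_text: str, required=REQUIRED_SCRIPTS) -> list[str]:
    """Return the required npm script names that package.json does not define."""
    scripts = json.loads(package_json_text).get("scripts", {})
    return [name for name in required if name not in scripts]


def check_repo(repo_dir: str) -> list[str]:
    """Read <repo_dir>/package.json and report any missing scripts."""
    text = Path(repo_dir, "package.json").read_text(encoding="utf-8")
    return missing_scripts(text)
```

Calling `check_repo("gemma-chat")` from the parent directory of the clone returns an empty list when both scripts are present.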

First-run experience and what to expect

The README frames a progressive setup flow on first launch: Python environment provisioning, MLX dependency installation, and model weight download. That last step is where most "my install is frozen" reports trace back to—~3 GB for the recommended E4B variant means several minutes (or longer on slower connections) with minimal progress feedback in early releases.

Practical timeline from community threads (not a warranty):

  • npm install: 1–3 minutes depending on Node cache
  • First npm run dev: Python venv creation and MLX install can add 5–10 minutes
  • Model download (E4B): 3–15 minutes depending on network
  • Subsequent launches: near-instant once weights are cached

If the app appears stuck during model pull, check disk space (models land in a .models or similar cache directory per upstream code) and network logs before assuming a crash. The Issues tracker has pinned guidance for common first-run blockers—read those before filing duplicates.
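The disk-space check is easy to script. The sketch below assumes you already know which directory the weights land in (the article only says "a .models or similar cache directory"), and the 2 GB headroom default is an arbitrary illustrative margin, not an upstream recommendation.

```python
import shutil

GIB = 1024 ** 3  # README sizes are approximate, so GB and GiB are treated alike here


def enough_space(free_bytes: int, model_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if free space covers the model plus a safety margin."""
    return free_bytes >= (model_gb + headroom_gb) * GIB


def has_room_for_model(cache_dir: str, model_gb: float, headroom_gb: float = 2.0) -> bool:
    """Check the volume that holds cache_dir (the directory must exist)."""
    return enough_space(shutil.disk_usage(cache_dir).free, model_gb, headroom_gb)
```

For the recommended E4B variant, `has_room_for_model(".", 3.0)` checks the current volume for roughly 5 GB free.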

Hardware reality check

The repository's own table is honest about memory footprint, but sustained performance depends on thermal design and concurrent workload. Here's what early adopters report (paraphrased from GitHub discussions and social threads):

| Mac model class | Recommended variant | Notes |
| --- | --- | --- |
| M1 / M2 (8 GB unified) | E2B (1.5 GB) | E4B can run but may swap under load; avoid 27B/31B |
| M1 Pro / M2 Pro (16 GB) | E4B (3 GB) or 27B MoE (8 GB) | Comfortable for typical sessions; 27B is usable |
| M1 Max / M2 Max (32 GB+) | 27B MoE or 31B | Full capability; watch for thermal throttling on long runs |
| M3 / M3 Pro / M3 Max | Same as M1/M2 equivalent | Improved efficiency may help sustained throughput |

The upstream README is clear: if you want the 31B variant to feel fast, 32 GB+ RAM is not optional. Teams evaluating this for multiple developers should budget per machine, because each developer's Mac downloads and loads its own copy of the weights. Three people running E4B means roughly 3 GB of weights resident on each of three machines (about 9 GB of downloads in total), before OS and application overhead.
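The hardware guidance above can be encoded as a small lookup. This sketch uses only the approximate sizes and RAM classes paraphrased in this article; the function names are illustrative, and the thresholds are a starting point for local profiling rather than a guarantee.

```python
# Approximate weight sizes (GB) as paraphrased from the upstream table.
VARIANT_SIZE_GB = {"E2B": 1.5, "E4B": 3.0, "27B MoE": 8.0, "31B": 18.0}


def recommend_variants(unified_memory_gb: int) -> list[str]:
    """Map a machine's unified memory to the variants suggested for its class."""
    if unified_memory_gb >= 32:
        return ["27B MoE", "31B"]
    if unified_memory_gb >= 16:
        return ["E4B", "27B MoE"]
    return ["E2B"]


def weights_fraction(variant: str, unified_memory_gb: int) -> float:
    """Share of unified memory taken by the weights alone."""
    return VARIANT_SIZE_GB[variant] / unified_memory_gb
```

For example, `weights_fraction("E4B", 8)` is 0.375, which is why E4B on an 8 GB machine leaves little room once the OS, Electron, and a browser are also resident.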

Real-world use: when offline vibe coding actually helps

The "vibe code without the internet" tagline is catchy, but operational reality is more nuanced. Here's where the offline model story is strongest—and where it still needs a network.

Where local-first wins

  1. Privacy-sensitive prototyping — If you're sketching an internal tool, debugging sensitive logic, or iterating on code with confidential business rules, keeping the model weights and inference entirely local removes one exfiltration vector. Your editor, dependencies, and CI may still phone home, but the LLM layer does not.

  2. Travel and unreliable connectivity — Flights, trains, cafes with flaky Wi-Fi, or remote job sites where bandwidth is metered—cached models mean you can still generate boilerplate, refactor helpers, or iterate on UI mockups without stalling on API timeouts.

  3. Cost control for experimentation — If you're doing rapid prompt iteration or UI rewrites, local inference has zero per-token cost once weights are downloaded. That's meaningful for hobbyists, students, or small teams exploring ideas before committing to a hosted plan.

  4. Latency-sensitive workflows — On a well-specced Mac, local MLX inference can deliver lower time-to-first-token than round-tripping to a cloud API—especially for short prompts where network overhead dominates. That responsiveness matters in interactive build loops where every 200ms saved compounds over dozens of iterations.

Where you still need the network

  • Package installs and updates: npm install, pip install, system updates, and framework docs all assume connectivity.
  • External tools and APIs: The README lists web search, URL fetch, calculator, and bash in the tool list; web search and URL fetch need live network access unless you mock them.
  • Deployment and CI: Pushing to Git, running tests in GitHub Actions, and deploying previews are all network-dependent in standard engineering workflows.
  • Model updates and new weights: Gemma 4 variants evolve; pulling new checkpoints when Google or the community ships improvements still means a download.

The honest framing: Gemma Chat gives you offline inference; it does not give you a hermetically sealed development environment. Treat it as one layer of your stack running locally, not a claim that every dependency is air-gapped.

Comparing Gemma Chat to other local LLM stacks

If you're deciding between Gemma Chat, Ollama, LM Studio, or custom MLX setups, here's how they differ in practice:

| Dimension | Gemma Chat | Ollama | LM Studio | DIY MLX-LM |
| --- | --- | --- | --- | --- |
| UI paradigm | Electron app, build + chat modes | CLI-first, server-oriented | Desktop GUI, model library | Script or notebook |
| Model scope | Gemma 4 family (opinionated) | Broad model zoo | Broad model library | Any MLX-compatible weights |
| Setup complexity | Medium (npm + Python venv + download) | Low (single binary) | Low (installer) | High (manual dependencies) |
| Tool/workflow integration | Built-in build mode, live preview | MCP and external tool friendly | Plugin ecosystem | Fully custom |
| Update cadence | Depends on maintainer activity | Frequent, vendor-backed | Frequent, commercial support | You own it |

When to pick Gemma Chat: You want a batteries-included coding assistant specifically for Gemma 4 with workspace sandboxing and a UI optimized for iterative builds—and you're willing to debug Electron + MLX integration quirks.

When to pick Ollama: You value broad model support, server-style deployment, and MCP ecosystem compatibility over a bespoke UI. Ollama's model library lets you swap between Gemma, Llama, Mistral, and others with one CLI command.

When to pick LM Studio: You want a polished desktop experience with minimal command-line friction and a curated model library backed by a team shipping frequent updates and support channels.

When to go DIY with MLX-LM: You need full control over inference parameters, custom quantization, or research-oriented experiments—and you're comfortable maintaining Python environments and debugging Metal shaders.

Tradeoffs practitioners are already naming

  • Offline inference ≠ offline everything. Installing npm dependencies, reading live API docs, and shipping CI/CD still want a network—even when the model weights never leave the machine. That distinction matters for security reviews ("data never hits OpenAI") vs program reality ("the loop still phones home for packages").
  • First-run downloads are the fragile step: public replies mention crashes during model download—triage via Issues and pinned guidance rather than assumptions.
  • Ecosystem routing: Comments ask for tighter integration with existing local weight stores (for example pointing at Ollama or LM Studio). The repo description already mentions Ollama; whether that satisfies "use my existing cache" is an integration detail to confirm in code and docs.
  • Speech-to-text: A reply thread references MLX-VLM-style server paths for STT—interesting for forks, not something to assert without matching commit and IPC in this repo.
  • Electron overhead: Some users note the app footprint (memory, battery) is higher than a pure CLI tool—expected for an Electron shell, but worth accounting for if you're on an older machine or running many concurrent processes.
  • Limited model family: If you want to experiment with non-Gemma models (Llama 3, Mistral, Qwen), you'll need to fork the codebase or switch tools. The project is opinionated about Google's Gemma line.

Extending and forking: what teams should know

The MIT license means you can fork, modify, and redistribute—subject to license terms. Here are patterns early adopters discuss:

Custom model variants

The repository's model-loading logic lives in Python backend scripts that call MLX-LM. If you want to add a different Gemma checkpoint (e.g., a fine-tuned version for your domain), you'll modify:

  • Model selection UI (Electron frontend)
  • Backend model registry (Python service)
  • Download/cache paths

Check the src/ and server/ directories (names vary by repo structure) for where model IDs map to Hugging Face or local paths.

Tool and MCP integration

The README mentions tools (web search, URL fetch, calculator, bash). If you want to wire MCP servers for internal APIs or databases:

  • Identify the tool-calling protocol the app uses (the README's architecture notes describe XML <action> blocks parsed from the stream rather than JSON tool calls)
  • Add MCP client stubs in the backend
  • Surface new tools in the UI's capability list

This is not plug-and-play today; treat it as a fork-and-extend project if MCP is a hard requirement.

Workspace sandboxing

The build mode writes files into a sandboxed directory. For production use cases where code must persist across sessions or integrate with Git:

  • Review the sandbox implementation for filesystem isolation
  • Add export/import flows if you need to version generated code
  • Consider security boundaries: the model can write arbitrary files in the sandbox—safe for prototyping, risky if untrusted prompts can inject paths
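The path-injection risk in the last bullet is usually handled by resolving every requested path and refusing anything that lands outside the sandbox root. The sketch below shows that general technique; it is not taken from the repository, and `resolve_in_sandbox`, `is_safe`, and `SandboxEscape` are hypothetical names.

```python
from pathlib import Path


class SandboxEscape(ValueError):
    """Raised when a requested path would land outside the sandbox root."""


def resolve_in_sandbox(sandbox_root: str, requested: str) -> Path:
    """Resolve a model-supplied path and reject anything outside the root."""
    root = Path(sandbox_root).resolve()
    # Joining an absolute path discards the root, so "/etc/passwd" is caught too.
    target = (root / requested).resolve()
    if target != root and root not in target.parents:
        raise SandboxEscape(f"{requested!r} escapes {root}")
    return target


def is_safe(sandbox_root: str, requested: str) -> bool:
    """Boolean wrapper for callers that prefer a check over an exception."""
    try:
        resolve_in_sandbox(sandbox_root, requested)
    except SandboxEscape:
        return False
    return True
```

Resolving before comparing matters: a plain string-prefix check would accept `site/../../secrets`, while the resolved path makes the escape visible.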

Deployment to teams

If you want to distribute Gemma Chat internally:

  • Use npm run dist to build the installer (upstream describes a macOS .dmg)
  • Document the model download step and consider pre-seeding weights in an internal mirror
  • Set up update channels if you modify the codebase and want to push fixes
  • Review Electron's code signing and notarization requirements for macOS distribution at scale

Performance benchmarks and real-world speed

While the upstream README focuses on capabilities, community threads discuss practical latency. Here's what users report (not official benchmarks—treat as anecdotal):

Tokens per second (approximate, E4B variant)

| Mac configuration | Reported TPS range | Notes |
| --- | --- | --- |
| M1, 8 GB | 8–15 TPS | Acceptable for chat; slower for long code generation |
| M2, 16 GB | 12–20 TPS | Comfortable for most workflows |
| M1 Pro, 16 GB | 15–25 TPS | Good balance of speed and responsiveness |
| M3 Max, 32 GB+ (27B model) | 10–18 TPS | Larger model trades throughput for quality |

Context: 20 TPS means a 200-token code snippet generates in ~10 seconds. For interactive build mode, that's fast enough to see incremental progress; for long rewrites, you may step away while it runs.
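The arithmetic is simply output tokens divided by throughput. A tiny helper makes it easy to plug in your own measured numbers; the function names are illustrative, and the estimate ignores prompt processing and time-to-first-token.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock time to emit a given number of output tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def generation_range(tokens: int, tps_low: float, tps_high: float) -> tuple[float, float]:
    """Return (fastest, slowest) estimates for a reported TPS range."""
    return generation_seconds(tokens, tps_high), generation_seconds(tokens, tps_low)
```

At the low end of the anecdotal M1 8 GB range, `generation_seconds(200, 8)` gives 25 seconds for the same 200-token snippet.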

Build mode preview lag

The live preview iframe reloads when the model flushes partial writes. Early community feedback notes occasional flicker or stale state if the model generates HTML faster than the preview can reconcile—similar UX to hot-reload in Vite or webpack. Not a blocker, but something to expect if you're generating complex multi-file UIs.

Security and privacy: what the local story means

One reason teams explore Gemma Chat is the pitch that inference stays local. Here's what that does and does not guarantee:

What you get

  • Model weights never leave your machine once cached—no API keys, no per-request logging by a third-party vendor.
  • Prompts and generated code stay in RAM and on disk; they don't transit a cloud service for inference.
  • Filesystem sandboxing in build mode isolates generated code from your main workspace (subject to your review of the sandbox implementation).

What you don't automatically get

  • Network isolation: The app can still make HTTP requests for tools (web search, URL fetch). If you run untrusted prompts, a sophisticated injection could exfiltrate data via those tools.
  • Dependency provenance: npm install and pip install pull from public registries; supply-chain risk is orthogonal to where inference happens.
  • Audit logs: By default, the app doesn't generate compliance-ready logs of who prompted what, when, and with which data. If you need that for governance, you'll add logging yourself.
  • Secret leakage protection: If your code contains API keys or credentials, the model can read and echo them unless you scrub inputs. Local inference does not mean automatic PII/secret detection.

For regulated industries or sensitive projects, local weights are a useful data residency primitive—but you still need standard secret scanning, code review, and access control around the tool.
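As one example of scrubbing inputs, a fork could redact obvious credentials before text reaches the model. The sketch below is illustrative only: the two patterns (keyword-style assignments and the `AKIA` prefix used by AWS access key IDs) are far from exhaustive, `redact_secrets` is a hypothetical helper, and none of this replaces a dedicated secret scanner.

```python
import re

# Assignments such as API_KEY = "...", db_password: ..., AUTH_TOKEN=...
ASSIGNMENT_RE = re.compile(
    r"""(?i)\b(?P<name>[A-Z0-9_]*(?:API_?KEY|SECRET|TOKEN|PASSWORD))"""
    r"""(?P<sep>[ \t]*[:=][ \t]*)"""
    r"""(?P<value>"[^"]*"|'[^']*'|\S+)"""
)

# AWS access key IDs start with AKIA followed by 16 uppercase letters or digits.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

PLACEHOLDER = "[REDACTED]"


def redact_secrets(text: str) -> str:
    """Replace likely credential values with a placeholder, keeping names intact."""
    text = ASSIGNMENT_RE.sub(
        lambda m: f"{m['name']}{m['sep']}{PLACEHOLDER}", text
    )
    return AWS_KEY_RE.sub(PLACEHOLDER, text)
```

Keeping the variable name and replacing only the value means the model still sees that a credential exists at that spot, which tends to preserve the usefulness of the surrounding code.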

Future directions and community roadmap

As of this writing, active GitHub Issues and Discussions suggest a few evolving priorities:

  1. Ollama integration — The description mentions Ollama; users want tighter coupling so they can point Gemma Chat at existing Ollama-managed weights instead of dual-downloading.
  2. Model library expansion — Requests for Llama 3, Mistral, and other MLX-compatible families—currently the project is Gemma-first.
  3. Voice input stability — The README lists Whisper (transformers.js) for speech-to-text; early adopters report it works but is experimental.
  4. Build mode robustness — File writes, partial flushes, and preview sync are the trickiest parts of the UX; ongoing PRs aim to reduce flicker and improve error recovery.
  5. Windows and Linux support — The README targets macOS on Apple Silicon; community interest in Windows (x86 + Arm) and Linux exists, but cross-platform MLX is still maturing.

Check the Issues and Milestones on the repo for current status—open-source roadmaps shift as contributors prioritize.

Why ExplainX readers should care

ExplainX indexes skills, tools, agents, and MCP servers for teams that ship with assistants. Gemma Chat is a reference for one slice of that map: desktop shell + local weights + tool protocol + workspace sandbox. Whether you adopt it directly or borrow patterns, the artifact is inspectable in MIT-licensed source.

The broader lesson: local-first LLMs are no longer a research curiosity. With projects like Gemma Chat, Ollama, LM Studio, and raw MLX setups, teams can run capable models on commodity hardware—trading some convenience for privacy, cost control, and offline resilience. That shift matters for regulated industries, cost-conscious startups, and developers in low-connectivity regions.

Sources

  • Repository: github.com/ammaarreshi/gemma-chat
  • Gemma (Google DeepMind open models): positioning and ecosystem context via Google Gemma on X and official Gemma documentation—use those for model policy and license nuance beyond this app.
  • MLX: Apple's machine learning research materials on MLX / MLX-LM for runtime semantics.

Star counts, default models, and README clone URLs drift quickly after a viral launch. Reconcile any numbers in this post with the live GitHub page and Issues before budgeting hardware or support.
