From PyPI
pip install drydock-cli
drydock
Local-first • Terminal-native • Open-source
____ ____ _
| _ \ _ __ _ _| _ \ ___ ___| | __
| | | | '__| | | | | | |/ _ \ / __| |/ /
| |_| | | | |_| | |_| | (_) | (__| <
|____/|_| \__, |____/ \___/ \___|_|\_\
|___/
Chart your course. Execute with precision. DryDock is a focused TUI for exploring, modifying, building, and testing code with your own local LLM — primary target Gemma 4 26B on llama.cpp. No accounts, no telemetry, no cloud.
$ drydock
Detected local model: gemma4 @ :8000
Tools online: 6
Ready.
What it is
DryDock is a TUI coding assistant designed to work with local LLMs. It gives you a conversational interface to your codebase — explore, modify, build, and test projects through natural language and a focused set of tools.
No data leaves your machine. No API keys. No per-token billing. Just your laptop, your model, and your code.
Install
DryDock v3 is an original, clean-room, Apache-2.0 codebase owned end to end — no upstream fork, no telemetry, no cloud. It ships on PyPI as drydock-cli.
pip install drydock-cli
drydock
git clone https://github.com/fbobe321/drydock.git
cd drydock
pip install -e .
drydock
Recommended serving stack
DryDock is tested and optimized for Gemma 4 26B-A4B served by llama.cpp with --jinja, the chat-template fix that prevents tool-call loops. Other OpenAI-compatible providers such as Ollama, LM Studio, Mistral, OpenAI, and Anthropic can work, but are not as thoroughly tested.
One command from a fresh box with Docker + an NVIDIA GPU. Downloads the GGUF (~13 GB, one-time, resumable) into ~/.cache/drydock/models/, starts the official ghcr.io/ggml-org/llama.cpp:server-cuda container on port 8000 with all the load-bearing flags baked in.
# Docker + nvidia-container-toolkit required.
curl -sSL https://drydock.pages.dev/setup-llm.sh | bash
# Then point drydock at it:
export DRYDOCK_LOCAL_URL=http://localhost:8000/v1
export DRYDOCK_LOCAL_MODEL=gemma4
drydock
Knobs: QUANT=Q4_K_M for higher quality, VISION=1 for image input, PORT=8001, CONTEXT=16384 to shrink the KV cache. Default context is 64k with q8_0 KV-cache quantization, and the Q3_K_M model fits on a single 16 GB card.
Tuned for small local models
DryDock's loop is built around the realities of a 26B local model: non-streaming tool turns so tool-call JSON stays intact, thinking-token stripping, advisory loop nudges instead of hard circuit breakers, adaptive reasoning effort per turn, and two-tier context compaction. Reliable tool calling without a cloud model behind it.
What's in the box
Textual-powered terminal UI with slash commands, plan/edit modes, and session history.
Read, Write, Edit, Bash, Glob, and Grep — a focused, predictable tool set the model can drive reliably.
OpenAI-compatible endpoint support for llama.cpp, Ollama, and LM Studio. No cloud required.
Non-streaming tool turns, thinking-token stripping, loop nudges, and two-tier compaction for stable local runs.
Per-turn reasoning effort: HIGH for planning, OFF for routine writes, LOW for recovery.
Nothing phones home. No accounts, no API keys, no per-token billing — just your machine, model, and code.
Architecture
Tested hardware
GPU: 2× NVIDIA RTX 4060 Ti 16GB — the Q3_K_M model fits on a single 16GB card, so each GPU runs a full independent instance (two cards = two parallel instances for throughput, not tensor-split)
RAM: 64GB recommended, 32GB minimum
Model: Gemma-4-26B-A4B-it (Unsloth Q3_K_M GGUF) via ghcr.io/ggml-org/llama.cpp:server-cuda — 26B MoE, 4B active params per token
Context: 64k (65536) with q8_0 KV-cache quantization
Performance: ~64 tok/s decode (~94 tok/s prompt) with llama.cpp Q3_K_M
OS: Ubuntu 22.04 / 24.04, kernel 6.8+
Minimum: a single 16GB+ VRAM card runs it