I've worked on production systems for years. Most of the time I'm not starting from zero — I'm refactoring legacy backend code, fixing performance bottlenecks, or modernizing a UI that's starting to feel dated. That's a fundamentally different problem than greenfield development, and it demands a different approach.
Raw Claude Code is powerful, but pointed at an existing codebase without guardrails, it can introduce subtle bugs or make inconsistent changes that compound in ways that are hard to untangle later. The model doesn't know your implicit conventions, your undocumented architectural decisions, or the landmines left by that rushed sprint three years ago.
That's why I've settled on a combination of two tools that sit on top of Claude Code: gstack — Garry Tan's virtual engineering team of specialist roles — and Superpowers — Jesse Vincent's structured seven-phase agentic workflow. They complement each other without overlap. gstack handles the "thinking" layer: expert audits, adversarial challenges, cross-model second opinions. Superpowers handles the "doing" layer: TDD discipline, git worktrees, and phased execution that prevents sloppy refactors on code that's already in production.
I keep everything coordinated through a single TODO.md file that acts as my living, prioritized backlog. No fancy project setup. Claude Code is pointed at my repo root, CLAUDE.md is in place, and the session starts from there.
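For reference, here is roughly the shape that file settles into over a cycle. The two section headers are the ones the prompts below write to; everything under them is illustrative:

```markdown
## Backend Issues
- [ ] P0: ...   (filled by /plan-eng-review, /cso, /codex)
- [ ] P1: ...
- [ ] P2: ...

## UX Improvements
- [ ] P0: ...   (filled by /design-review, /design-consultation)
- [ ] P1: ...
```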
Here is exactly how the flow works in practice.
understand and prioritize backend issues
Every improvement cycle starts with targeted discovery using gstack's specialist commands. These replace vague brainstorming with actionable analysis rooted in the actual code.
The first command is /plan-eng-review — a full engineering review that reads the codebase and surfaces performance bottlenecks, scalability problems, tech debt, and unreliable patterns. The prompt I give it is direct:
"This is an existing production backend. Run a full engineering review. Identify performance bottlenecks, scalability issues, tech debt, and unreliable patterns. Read the codebase and update TODO.md under ## Backend Issues with P0/P1/P2 priorities, severity, reproduction steps, and suggested fix categories."
From there I run /cso, gstack's Chief Security Officer role, for a security and compliance pass using OWASP and STRIDE. The output goes into the same TODO.md, appended as prioritized findings. Then I add a third voice with /codex, gstack's independent OpenAI Codex CLI reviewer running in adversarial mode — its job is to try to break the assumptions the earlier analysis made, challenge edge cases, and add anything the first pass missed.
"Run adversarial review mode on the current backend findings. Try to break assumptions, challenge edge cases, and add any unique risks to TODO.md. Compare with gstack's earlier analysis for cross-model insights."
The combination of three distinct perspectives — engineering manager, security officer, independent adversary — surfaces things that any single pass would miss. Once the list is in TODO.md, I optionally run /plan-ceo-review if I need business-impact ranking layered on top of the technical priorities.
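To make the format concrete: a single finding, with the layered annotations from the three passes, might land in TODO.md looking something like this. The issue itself is invented for illustration; the fields match what the prompt above asks for.

```markdown
## Backend Issues
- [ ] P0 (severity: high): N+1 query on the orders list endpoint (hypothetical example)
  - Repro: GET /orders with 100+ rows; latency grows linearly with row count
  - Suggested fix category: query batching / eager loading
  - /cso: no direct security impact, but the endpoint lacks rate limiting (new P1 filed)
  - /codex: the CSV export job reuses the same query path; fix both or the bottleneck moves
```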
audit and improve UX and UI
With the backend issues catalogued, I switch focus to the frontend. The same specialist approach applies.
/design-review or /plan-design-review runs a structured review against the current UI for accessibility, visual hierarchy, mobile responsiveness, and patterns that have aged poorly. The key constraint in the prompt is respecting the existing design system — I don't want a blank-slate redesign, I want prioritized gaps with clear before/after descriptions.
"Review the current UI/UX for accessibility, visual hierarchy, mobile responsiveness, and outdated patterns. Respect my existing design system. Summarize gaps and add prioritized items to TODO.md under ## UX Improvements with clear before/after descriptions."
After that, /design-consultation or /design-shotgun brings in modern, production-grade suggestions at the component level. These go into TODO.md as specific, actionable changes — not aspirational notes.
At this point TODO.md has two clean sections: backend issues and UX improvements, both fully prioritized. That's when execution begins.
safe execution, one issue at a time
This is where Superpowers takes over, while gstack stays in the loop as a quality gate. The handoff feels natural: gstack audits and identifies what needs doing; Superpowers figures out how to do it safely without touching anything it shouldn't.
For each P0 or high-priority item in TODO.md, the cycle is four moves.
First, structured planning. I run /brainstorm followed by /write-plan with a prompt that focuses Superpowers on minimal safe changes to the existing codebase — not a rewrite, not an opportunity to clean up everything adjacent. Just the target issue, planned in TDD style. I always review the plan before approving it. That review step is not optional; it's where I catch misunderstandings before they become commits.
"Based on the top item in TODO.md, brainstorm options then write a detailed TDD-style plan. Focus on minimal safe changes to the existing codebase."
Second, isolated execution. Superpowers automatically spins up a git worktree so my main branch stays completely untouched. It runs the full TDD loop — red, green, refactor — and keeps changes atomic. The worktree isolation is the thing I trust most about this setup. Sloppy agentic changes on main are how you get a bad afternoon.
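Superpowers drives this automatically, but the mechanism is plain git. A rough manual equivalent, with a hypothetical branch name:

```bash
# Isolated working copy on a fresh branch; main is never checked out here
git worktree add ../fix-orders-n-plus-one -b fix/orders-n-plus-one

# ...red, green, refactor happens inside that directory, as atomic commits...

# After the quality gates pass: merge from the main worktree, then clean up
git merge fix/orders-n-plus-one
git worktree remove ../fix-orders-n-plus-one
```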
Third, quality gates. Before anything merges, it runs through: a specialist check via /plan-eng-review or /design-review depending on what changed; a standard code review via /review; adversarial mode via /codex specifically targeting the new changes; and a test or browser check via /qa. The adversarial pass on completed changes is the gate I've found most valuable — it's much easier for the model to attack a specific diff than to exhaustively audit a full codebase.
Fourth, close the loop. When Superpowers finishes, the session ends with a prompt to mark the task complete in TODO.md, note any new issues discovered during the fix, and run /retro so Claude learns my preferences over time. The retro isn't ceremonial — it's how the system gets better at my codebase specifically.
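Condensed, one full fix cycle is a short command sequence. The worktree and TDD loop run inside Superpowers on their own; the rest I invoke explicitly:

```
/brainstorm          # options for the top TODO.md item
/write-plan          # TDD-style plan; reviewed by me before approval
                     # (Superpowers executes in a worktree: red, green, refactor)
/plan-eng-review     # or /design-review, depending on what changed
/review              # standard code review
/codex               # adversarial pass against the new diff
/qa                  # tests / browser check
/retro               # plus marking the item done in TODO.md
```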
the weekly rhythm that actually sticks
The workflow only works if it has a cadence. The one that's stuck for me: one discovery day, two fix days, hard stops at the end of each.
Discovery day is a quick pass with /plan-eng-review, /cso, /design-review, and /codex to refresh TODO.md. This takes less than an hour. The whole point is to walk into fix days with a prioritized list rather than making judgment calls in the moment about what to tackle.
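In command form, the whole discovery pass is:

```
/plan-eng-review   # refresh backend findings
/cso               # security and compliance pass
/design-review     # UX and accessibility gaps
/codex             # adversarial cross-check of everything above
```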
Fix days cover at most two issues, typically one backend item and one UX item. Never more. The limit is not about time; it's about quality. Every additional issue in a session adds coordination overhead and increases the chance that the quality gates get rushed. Two well-executed fixes with clean tests and a proper retro are worth more than five half-finished ones.
Before any merge or deploy: /qa and /ship, gstack's release checklist. It's become automatic.
when to skip pieces of this
Not every change needs the full stack. There are two patterns I reach for when the full workflow is more than the problem warrants.
For small, low-risk changes — a single function, a CSS adjustment, a config tweak — I skip Superpowers and just use gstack's /autoplan plus /review plus /codex. The adversarial review still happens because even small changes in production code can have unexpected surface area, but I don't need a git worktree and a TDD cycle for a three-line fix.
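That lightweight path, in full:

```
/autoplan   # quick plan, no worktree, no TDD ceremony
/review     # standard code review
/codex      # adversarial pass; small diffs still get attacked
```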
For sessions that are purely backend refactoring with no UI changes, I lean heavier on Superpowers' built-in planning without pulling in as many gstack roles. The specialist perspectives add the most value at the audit and review stages; for pure execution on a well-understood problem, Superpowers' own structure is enough.
The rule I would give anyone thinking about this: do not add the full workflow prematurely. For early-stage or exploratory code, one model and one env var is the right answer. Add the layers when you can name the specific failure mode you're trying to prevent — and when you're working on code that already has users depending on it.
why this combination holds together
The reason gstack and Superpowers work together without friction is that they were designed for different problems. gstack's roles — engineering manager, designer, CSO, reviewer, Codex — exist to provide expert perspective and adversarial challenge. They're good at the things that one model misses when it's both the author and the reviewer of its own work. Superpowers exists to enforce process discipline on execution: TDD, git safety, phased delivery. These are orthogonal concerns. The tools don't compete; they hand off cleanly.
What the system actually delivers is reliable forward progress on production code without regression surprises. Backend performance improvements that hold under load. UX changes that users notice for the right reasons. And none of what I've started calling "AI slop" — plausible-looking changes that pass a surface read but introduce subtle incorrectness at the edges.
That last failure mode is the hardest to defend against with a single-model, single-pass approach. The adversarial layers exist precisely because a model reviewing its own output inherits the blind spots that produced it. Different models, different perspectives, structured quality gates: that's the actual defense.
If you're working on a real production codebase, the place to start is /plan-eng-review and /design-review in your next session. See what surfaces. Drop your stack or a specific pain point in the comments — I'm happy to share the exact prompt sequence I would run for it.