mcp-graph: why I built an open-source tool to kill vibe coding
How execution graphs, local RAG, and multi-agent coordination transform AI-augmented development - and why discipline matters more than prompts.
Vibe coding is killing software quality. And I say this as someone who uses AI to code every single day.
There’s a huge difference between using AI with discipline and simply throwing a prompt and hoping the code works. Over the past few months, I’ve noticed a worrying pattern: developers pasting entire PRDs into an AI agent’s chat, receiving hundreds of lines of code with no tests, no traceability, no persistent context. Next session? The agent forgets everything. Back to square one. More tokens burned, more rework, more frustration.
I decided to fix this. I built mcp-graph - an open-source tool that transforms how AI agents work on real software projects. It’s not another prompt wrapper. It’s infrastructure for structured AI-augmented development.
The real problem with vibe coding
If you work with AI agents on a daily basis, you probably recognize these scenarios:
PRDs become disconnected tasks
You import a requirements document and tasks land on a board with no connection between them. Dependencies between features? Priorities calculated based on impact and risk? Acceptance criteria traceable to the original requirement? None of that exists. The agent treats each task as an island, without understanding how it connects to the whole.
The result is predictable: tasks implemented out of order, broken dependencies discovered too late, and a backlog that doesn’t reflect the project’s reality.
Every session starts from scratch
The agent doesn’t remember what it did yesterday. You spend tokens re-explaining context, re-reading files, re-describing the architecture. It’s like training an intern who has amnesia every morning - except this intern costs tokens per message.
In practice, between 30% and 50% of tokens spent in agent sessions are repeated context. Information the agent already processed but didn’t retain. This isn’t just inefficient - it’s expensive. In large projects, we’re talking hundreds of thousands of wasted tokens per week.
Every AI tool is an island
Your code assistant doesn’t talk to your impact analysis tool. Your documentation search doesn’t feed context into the task planner. The agent that writes code doesn’t know what the dependency analysis agent discovered. Zero coordination between tools.
This isolation means each tool operates with a partial view of the project. The code assistant suggests a pattern that conflicts with an architectural decision documented in another tool. The task planner doesn’t know that a critical dependency has an imminent breaking change. Information exists, but it’s fragmented.
No discipline is enforced
Perhaps the most insidious problem: the agent generates code without tests, without strict typing, without following the project’s patterns. And the developer accepts it because “it works” - until the day it breaks in production, and nobody knows why, because there’s no traceability between the original requirement, the implementation decision, and the generated code.
Vibe coding isn’t using AI to program. Vibe coding is using AI without discipline, without structure, without traceability. It’s the modern equivalent of cowboy coding, just with a language model instead of the cowboy.
What is mcp-graph
mcp-graph is a local-first TypeScript CLI that converts PRDs (Markdown, TXT, PDF, HTML) into persistent execution graphs stored in SQLite.
In practice, you import a requirements document and the system automatically generates a hierarchical task tree with 9 node types: epics, tasks, subtasks, requirements, constraints, milestones, acceptance criteria, risks, and decisions. The system infers dependencies between nodes, estimates priorities based on impact and complexity, and creates a navigable graph that any AI agent can query.
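To make the graph idea concrete, here is a minimal sketch of what a node model like this could look like. The field names and the two-node example are illustrative assumptions, not the actual mcp-graph schema:

```typescript
// Hypothetical sketch of an execution-graph node model (names are
// illustrative, not the real mcp-graph schema).
type NodeType =
  | "epic" | "task" | "subtask" | "requirement" | "constraint"
  | "milestone" | "acceptance_criterion" | "risk" | "decision";

interface GraphNode {
  id: string;
  type: NodeType;
  title: string;
  status: "pending" | "in_progress" | "done";
  dependsOn: string[];   // inferred dependency edges
  priority: number;      // estimated from impact and complexity
}

// A tiny two-node slice: a task blocked by a requirement.
const nodes: GraphNode[] = [
  { id: "req-1", type: "requirement", title: "Users can reset passwords",
    status: "done", dependsOn: [], priority: 0.9 },
  { id: "task-1", type: "task", title: "Implement reset endpoint",
    status: "pending", dependsOn: ["req-1"], priority: 0.8 },
];

// "Navigable" in practice: find tasks whose dependencies are all resolved.
const doneIds = new Set(nodes.filter(n => n.status === "done").map(n => n.id));
const ready = nodes.filter(
  n => n.status === "pending" && n.dependsOn.every(d => doneIds.has(d)),
);
console.log(ready.map(n => n.id)); // → ["task-1"]
```

The point is that once tasks live in a typed graph rather than a chat transcript, "what can I work on next?" becomes a query instead of a guess.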
But the graph is just the beginning. mcp-graph is a complete ecosystem:
- 26 MCP tools that any AI assistant can use - Claude, Copilot, Cursor, Windsurf, Zed
- 44 REST endpoints with a full API for programmatic integration
- Interactive dashboard built with React 19 + React Flow with 6 analytical tabs
- 100% local RAG pipeline - no calls to external embedding APIs
- Zero external infrastructure - runs entirely on your machine
The central idea is simple: the graph is the project’s source of truth. All tools query and enrich the same graph. All decisions are traceable. All context is persistent.
The 3 technical differentiators
1. Token Economy - 70-85% context reduction
This is the differentiator that has the most day-to-day impact, and what motivated me to build the entire system.
When an agent needs context about your project, the naive approach is to dump everything into the prompt: entire files, complete history, all requirements. This works for small projects but scales terribly. In real projects, the context window overflows before you can include everything that’s relevant.
mcp-graph solves this with 3-tier compression:
- Tier 1 (Summary): approximately 20 tokens per node - just ID, type, title, and status. Perfect for the agent to understand the overall project structure without consuming context.
- Tier 2 (Standard): approximately 150 tokens per node - full context with resolved dependencies and relevant knowledge snippets via BM25. The ideal level for most interactions.
- Tier 3 (Deep): 500+ tokens per node - everything from Tier 2 plus complete documents, expanded acceptance criteria, and decision history. Used only when the agent needs full depth on a specific node.
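The three tiers can be sketched as a single renderer that emits progressively richer views of the same node. Field names and the rendering format here are assumptions for illustration, not the actual API:

```typescript
// Illustrative 3-tier node renderer (field names are assumptions).
type Tier = "summary" | "standard" | "deep";

interface TaskNode {
  id: string; type: string; title: string; status: string;
  dependencies?: string[]; snippets?: string[]; documents?: string[];
}

function renderNode(node: TaskNode, tier: Tier): string {
  // Tier 1: ~20 tokens — just enough for the agent to orient itself.
  const summary = `${node.id} [${node.type}] ${node.title} (${node.status})`;
  if (tier === "summary") return summary;

  // Tier 2: ~150 tokens — resolved deps plus BM25-ranked knowledge snippets.
  const standard = [
    summary,
    `deps: ${(node.dependencies ?? []).join(", ") || "none"}`,
    ...(node.snippets ?? []),
  ].join("\n");
  if (tier === "standard") return standard;

  // Tier 3: 500+ tokens — everything above plus full documents and history.
  return [standard, ...(node.documents ?? [])].join("\n");
}

const n: TaskNode = {
  id: "task-1", type: "task", title: "Reset endpoint", status: "pending",
  dependencies: ["req-1"], snippets: ["POST /reset handles token expiry"],
};
console.log(renderNode(n, "summary")); // → "task-1 [task] Reset endpoint (pending)"
```

The agent starts every node at Tier 1 and only pays for deeper tiers when it actually descends into a node.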
The token budget is allocated intelligently: 60% for graph data, 30% for contextual knowledge, 10% for navigation metadata. This allocation is dynamic - if the agent is focused on a specific subtask, more budget goes to knowledge related to that node.
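The 60/30/10 split described above can be expressed in a few lines. The dynamic re-weighting toward a focused node is simplified here to a single boost parameter, which is my own simplification, not the shipped behavior:

```typescript
// Sketch of the 60/30/10 token budget split. The `focusBoost` parameter is a
// simplification of the dynamic reallocation described in the article.
function allocateBudget(total: number, focusBoost = 0) {
  const graph = Math.round(total * (0.6 - focusBoost));
  const knowledge = Math.round(total * (0.3 + focusBoost));
  const navigation = total - graph - knowledge; // remainder ≈ 10%
  return { graph, knowledge, navigation };
}

console.log(allocateBudget(10_000));      // → { graph: 6000, knowledge: 3000, navigation: 1000 }
// When the agent is focused on a subtask, budget shifts toward its knowledge:
console.log(allocateBudget(10_000, 0.1)); // → { graph: 5000, knowledge: 4000, navigation: 1000 }
```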
The practical result? 70-85% fewer tokens compared to the “paste everything into the prompt” approach. Faster responses, lower costs, and context windows that don’t overflow in the middle of an important conversation.
To make this concrete: in a project with 200 tasks, the naive approach would consume about 100,000 tokens just to load context. With mcp-graph on Tier 1, the same overview consumes fewer than 4,000 tokens (200 nodes × ~20 tokens). The difference is an order of magnitude.
2. Multi-Agent Intelligence Mesh
mcp-graph doesn’t work alone. It coordinates 5 MCPs through a reactive EventBus that ensures all tools share information in real time:
- mcp-graph → project source of truth (execution graph + knowledge store)
- Serena → code analysis, semantic navigation, persistent agent memory
- GitNexus → code intelligence, change impact analysis, blast radius calculation
- Context7 → up-to-date library and framework documentation, cached locally
- Playwright → browser validation, screenshot capture, interface testing
When you import a PRD, the system fires events that automatically trigger reindexation in GitNexus, documentation sync via Context7, and embedding rebuilds in the knowledge store. All coordinated by the EventBus, with no manual intervention.
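The event-driven coordination can be sketched with a minimal publish/subscribe bus. The event names and handler wiring below are illustrative assumptions; the real mcp-graph EventBus is richer:

```typescript
// Minimal reactive EventBus sketch (event names are illustrative).
type Handler = (payload: unknown) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();

  on(event: string, fn: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(fn);
    this.handlers.set(event, list);
  }

  emit(event: string, payload: unknown): void {
    for (const fn of this.handlers.get(event) ?? []) fn(payload);
  }
}

const bus = new EventBus();
const log: string[] = [];

// A PRD import triggers downstream tools with no manual intervention:
bus.on("prd:imported", () => log.push("gitnexus:reindex"));
bus.on("prd:imported", () => log.push("context7:sync-docs"));
bus.on("prd:imported", () => log.push("knowledge:rebuild-embeddings"));

bus.emit("prd:imported", { file: "requirements.md" });
console.log(log); // → ["gitnexus:reindex", "context7:sync-docs", "knowledge:rebuild-embeddings"]
```

The design choice is that producers never know who consumes their events, so a new tool can join the mesh by subscribing, without touching the importer.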
The result is a living knowledge graph that gets smarter with every interaction. When GitNexus detects that a critical module has been modified, that information automatically appears in the context of related tasks. When Serena identifies a recurring pattern in the code, that information enriches future suggestions.
This coordination isn’t just convenience - it’s what enables agents to make informed decisions. An agent that only sees code is limited. An agent that sees code, dependencies, requirements, risks, and up-to-date documentation makes radically better decisions.
3. 100% Local Knowledge Pipeline
While most RAG solutions depend on paid embedding APIs (OpenAI, Cohere), mcp-graph implements everything locally. This was a conscious decision, motivated by three reasons: cost, privacy, and latency.
The knowledge pipeline combines multiple search strategies:
- Keyword search via SQLite FTS5 with BM25 ranking - fast, precise for technical terms
- Semantic search via TF-IDF with approximately 10 MB of overhead, versus the 400+ MB typical of transformer models
- Hybrid mode that combines both strategies with automatic relevance-based deduplication
- 5 indexed sources: project uploads, Serena memories, source code, external documentation, and web captures
- SHA-256 deduplication - the system never stores duplicate content
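The hybrid merge with SHA-256 deduplication can be sketched as follows. The scores are fabricated inputs for illustration; in the real pipeline they come from FTS5/BM25 and TF-IDF:

```typescript
import { createHash } from "node:crypto";

// Sketch of hybrid retrieval: merge keyword (BM25) and semantic (TF-IDF)
// result lists, dedupe by content hash, and keep the best score per document.
interface Hit { content: string; score: number }

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

function mergeHybrid(keyword: Hit[], semantic: Hit[]): Hit[] {
  const best = new Map<string, Hit>();
  for (const hit of [...keyword, ...semantic]) {
    const key = sha256(hit.content);        // content-hash deduplication
    const prev = best.get(key);
    if (!prev || hit.score > prev.score) best.set(key, hit);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}

const merged = mergeHybrid(
  [{ content: "FTS5 docs", score: 0.9 }, { content: "BM25 primer", score: 0.4 }],
  [{ content: "FTS5 docs", score: 0.7 }], // duplicate content, lower score
);
console.log(merged.map(h => h.content)); // → ["FTS5 docs", "BM25 primer"]
```

Hashing the content rather than comparing strings directly is what lets the same dedup key work at ingestion time too: a document already in the store is never indexed twice.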
In practice, this means: no API keys to configure, no per-query costs, no data leaving your environment, no dependency on external services that might go down or change prices. Everything runs on your machine, with sufficient performance for real-time use during development sessions.
The search quality is surprisingly good. BM25 combined with TF-IDF covers the vast majority of use cases in software projects, where terminology is consistent and search patterns are predictable. For the rare cases where deep semantic search would be superior, the system compensates with the graph’s richness - the connections between nodes provide context that pure embedding doesn’t capture.
The Anti-Vibe-Coding methodology
mcp-graph isn’t just a tool - it embeds a methodology I call Anti-Vibe-Coding, inspired by Extreme Programming (XP) principles adapted for AI-augmented development.
The central principle is simple: discipline over intuition, structure before code.
TDD enforced
Every feature requires a test written BEFORE the implementation. Red → Green → Refactor. It’s not a suggestion - it’s a requirement. If the agent suggests code without a corresponding test, the correct response is to refuse and ask for the test first.
This might seem slow, but in practice it accelerates development. Tests written before code serve as executable specifications. The agent knows exactly what it needs to implement, without ambiguity. The result is code that works on the first attempt much more frequently.
In mcp-graph, each task node in the graph can have acceptance criteria linked to specific tests. When the test passes, the criterion is automatically marked. Complete traceability from requirement to test.
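Here is what an executable acceptance criterion in this style might look like: assertions written first, then just enough implementation to turn them green. The validator and its rules are hypothetical examples, not code from mcp-graph:

```typescript
// Red → Green: the assertions below were "written" before the function.
// isStrongPassword and its rules are hypothetical, for illustration only.
function isStrongPassword(pw: string): boolean {
  return pw.length >= 12 && /[A-Z]/.test(pw) && /[0-9]/.test(pw);
}

// Acceptance criterion AC-1, expressed as executable cases:
const cases: Array<[input: string, expected: boolean]> = [
  ["short1A", false],                // fails length requirement
  ["CorrectHorse42Battery", true],   // meets length, uppercase, digit
];

for (const [pw, expected] of cases) {
  if (isStrongPassword(pw) !== expected) throw new Error(`AC-1 failed for "${pw}"`);
}
console.log("AC-1 passed"); // → "AC-1 passed"
```

Because the criterion is code, "when the test passes, the criterion is marked" becomes a mechanical check rather than a judgment call.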
Skeleton & Organs
The human defines the architecture - the skeleton. The AI implements with discipline - the organs. Never the other way around.
You never ask “build me a project management SaaS.” You define: stack (TypeScript + Fastify + PostgreSQL), services (auth, billing, projects), domain (entities, value objects, aggregates), infrastructure (Docker, CI/CD, monitoring). The agent receives a clear skeleton and fills in each module following the defined patterns.
This separation of responsibilities is crucial. The human brings architectural vision, domain knowledge, and judgment about trade-offs. The AI brings implementation speed, consistency, and the ability to process large volumes of code. Each does what they do best.
Anti-one-shot
Never generate an entire system in a single prompt. Decompose into atomic tasks - each estimated at less than a day’s work - tracked in the graph, with explicit dependencies between them.
This decomposition isn’t bureaucracy. It’s what allows the agent to work with manageable context, errors to be isolated and fixed quickly, and progress to be measured objectively. A graph with 200 atomic tasks is infinitely more manageable than a 5,000-word prompt asking to “build everything.”
8-phase cycle
Development follows a structured cycle with 8 phases, each with dedicated tools and specific agent coordination:
- ANALYZE → understand the problem, collect requirements, identify risks
- DESIGN → define architecture, interfaces, contracts between modules
- PLAN → decompose into atomic tasks, establish dependencies, estimate effort
- IMPLEMENT → write code following TDD, using the graph as a guide
- VALIDATE → run tests, verify acceptance criteria, check coverage
- REVIEW → analyze impact, verify blast radius, validate against project standards
- HANDOFF → document decisions, update the graph, prepare for the next iteration
- LISTENING → await new demands, consolidate learnings
Each phase feeds the next. The graph records everything. The CLAUDE.md file works as an evolutionary specification - every error, pattern, or architectural decision is documented to cumulatively train the agent. The agent improves with every iteration. It doesn’t start from scratch.
Numbers that matter
mcp-graph is at version 7.0, published on npm as @mcp-graph-workflow/mcp-graph, under MIT license.
Some numbers that reflect intentional engineering decisions:
- 910+ tests across 105 files, totaling 1,337+ test cases
- 1,017 symbols indexed by GitNexus for semantic navigation
- 2,650 relationships mapped in the code graph
- 67 execution flows tracked end-to-end
- 26 MCP tools + 44 REST endpoints
- 155 multimodal skills (audio, video, computer vision, autonomous orchestration)
- 6 analytical tabs in the interactive dashboard
- TypeScript strict mode - zero `any`, zero `console.log` in production code
Each of these numbers represents a deliberate choice. The 910+ tests don’t exist because an agent generated generic tests - they exist because each feature was developed with TDD, each edge case was identified and covered, each regression was prevented.
TypeScript strict mode with zero any isn’t vanity - it’s the guarantee that the type system works in the developer’s favor, not against it. When the compiler complains, it’s because it found a real bug, not because the type was relaxed to “anything goes.”
What changed in v7: Harness Engineering
v7.0 represented the project’s biggest architectural evolution. mcp-graph went from being just a support tool to becoming a hardened execution environment, designed to be auditable and controllable by machines.
Harnessability Score
Rather than measuring test coverage alone, v7 calculates a composite score across 4 dimensions that captures how "harnessable" (safe for agents to manipulate) the code is:
- Type Coverage (30%): absence of `any` in TypeScript. If the agent doesn't know the type, it hallucinates.
- Test Coverage (30%): presence of unit tests for each module.
- Fitness Score (20%): passing Architecture Fitness Functions (barrel integrity, dependency direction).
- Docs Coverage (20%): presence of CLAUDE.md, rules, and clear documentation.
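The composite is a straightforward weighted sum of the four dimensions. The function and field names below are my assumptions; the weights are the ones stated above:

```typescript
// Weighted Harnessability Score from the four dimensions above
// (names are assumptions; weights come from the article).
interface Dimensions {
  typeCoverage: number;  // 0..1 — absence of `any`
  testCoverage: number;  // 0..1 — unit tests per module
  fitnessScore: number;  // 0..1 — architecture fitness functions passing
  docsCoverage: number;  // 0..1 — CLAUDE.md, rules, documentation present
}

function harnessabilityScore(d: Dimensions): number {
  const raw = 0.3 * d.typeCoverage + 0.3 * d.testCoverage
            + 0.2 * d.fitnessScore + 0.2 * d.docsCoverage;
  return Math.round(raw * 100) / 100; // two decimal places
}

console.log(harnessabilityScore({
  typeCoverage: 1, testCoverage: 0.9, fitnessScore: 0.8, docsCoverage: 1,
})); // → 0.93
```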
Unified Gate System
In v6.3, there were multiple wrappers (lifecycle, code intelligence) that caused deadlock bugs and redundancy. v7 centralized everything into a single gate, removing about 400 lines of code and eliminating IO bottlenecks. The result: 50% less overhead per MCP call and 50% fewer database reads.
Knowledge Autoprune and Token Economy
The knowledge base now has a “token budget.” In v6, it grew indefinitely, degrading RAG performance. v7 introduced autoprune that deletes documents based on Quality, Usage, and Age, keeping RAG consistently fast (under 20ms for 500 documents).
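A pruning policy over Quality, Usage, and Age could be sketched like this. The weights, decay curve, and threshold are illustrative assumptions, not the shipped defaults:

```typescript
// Sketch of a quality/usage/age autoprune policy (weights, decay, and
// threshold are illustrative assumptions).
interface Doc { id: string; quality: number; hits: number; ageDays: number }

function retentionScore(d: Doc): number {
  const usage = Math.min(d.hits / 10, 1);              // saturates at 10 hits
  const freshness = Math.max(0, 1 - d.ageDays / 180);  // linear decay, ~6 months
  return 0.5 * d.quality + 0.3 * usage + 0.2 * freshness;
}

function autoprune(docs: Doc[], threshold = 0.35): Doc[] {
  return docs.filter(d => retentionScore(d) >= threshold);
}

const kept = autoprune([
  { id: "a", quality: 0.9, hits: 12, ageDays: 10 },  // high value → kept
  { id: "b", quality: 0.2, hits: 0, ageDays: 400 },  // stale, unused → pruned
]);
console.log(kept.map(d => d.id)); // → ["a"]
```

Keeping the corpus bounded is what holds query latency flat: retrieval cost scales with store size, so pruning low-value documents is a performance feature, not just housekeeping.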
Performance: v6.3 vs v7.0
| Operation | v6.3 | v7.0 | Improvement |
|---|---|---|---|
| Overhead per MCP call | 2 wrappers (100ms) | 1 Unified Gate (50ms) | -50% |
| Database reads (SQLite) | 2x per call | 1x per call | -50% |
| FTS5 search (10k nodes) | ~350ms | under 200ms | ~-43% |
| Determinism (AI-Free) | ~80% | 100% | Total |
DORA Metrics (Elite Level)
On a real project with 415 nodes and 923 edges:
- Deployment Frequency: 25.4 tasks/day (Elite)
- MTTR: 0 hours (thanks to local RAG and self-healing)
- Change Failure Rate: 0% (guaranteed by the Unified Gate)
Where this is going
I believe structured agentic workflows will become the industry standard soon. Not because AI agents are trendy, but because complex software demands traceability, persistent context, and coordination between tools. The same reasons that made CI/CD, automated testing, and code review become standard.
Vibe coding will die the same way cowboy coding died: not for lack of productivity, but for lack of sustainability. Code generated without discipline works in the demo but breaks in production. It works in the prototype but doesn’t scale. It works when one person understands the system but fails when the team grows.
mcp-graph is open-source because I believe this kind of infrastructure needs to belong to the community. There shouldn’t be vendor lock-in on how you organize your AI workflow. There shouldn’t be a paid API between you and the ability to give persistent context to your agent.
If you work with AI agents and feel that your workflow lacks structure, I invite you to try it. The repository is on GitHub: github.com/DiegoNogueiraDev/mcp-graph-workflow
Contributions, issues, and feedback are welcome. The project grows with the community.
Conclusion
The question isn’t whether you should use AI to code - the answer is obviously yes. The question is how you use it.
Vibe coding is the easy path: paste the prompt, accept the result, move on. It works until it doesn’t. And when it fails, it fails spectacularly, with no traceability to understand what went wrong.
The path I propose is more disciplined: decompose requirements into graphs, persist context between sessions, coordinate tools, enforce TDD, track decisions. It’s more work upfront, but the ROI is exponential as the project grows.
mcp-graph is the tool I wished existed when I started working with AI agents. Now it exists, it’s open-source, and it’s ready to use.
The question remains: will you keep vibe coding, or will you structure your AI workflow?