When AI Gives Garbage Output: A Systematic Troubleshooting Guide
Scenario
Context: AI output isn't working - tests fail, output is wrong, or behavior is unexpected
Goal: Systematically diagnose AI failure modes using a decision tree approach
Anti-pattern: Repeatedly prompting 'fix it' without understanding the root cause
Conversation
AI Failure Triage
When AI output is wrong, the problem falls into one of five categories. This guide helps you identify which one and route to the right fix.
| Symptom | Likely Cause | Section |
|---|---|---|
| Hallucinated APIs, fake URLs, non-existent methods | Hallucination | A |
| Security holes, obvious bugs in generated code | Vibe Coding (insufficient review) | B |
| "Works" in dev, fails in prod | Silent Substitution | C |
| Output quality degrades over long session | Context/Review Fatigue | D |
| Confidently wrong about logic or requirements | Reasoning Limits | E |
Identify your symptom, then jump to the relevant section.
Section A: Hallucination
Symptoms:
- Method names that don't exist in the library
- URLs that 404
- Citations to papers/articles that can't be found
- API parameters that aren't valid
- Package names that don't exist
Quick Test:
Ask: "Can you show me the documentation link for [method]?"
If AI corrects itself or provides a broken link, it was hallucinating.
Deep Dive: See conversation: catching_hallucination
The fs.watchDirectory() method you suggested doesn't exist
You're right, I apologize. There is no fs.watchDirectory() in Node.js. The correct method is fs.watch() for watching files/directories, or fs.watchFile() for polling-based watching of individual files.
Here's the corrected code:
const fs = require('fs');

fs.watch('./data', (eventType, filename) => {
  console.log(`${eventType}: ${filename}`);
});
Hallucination Prevention:
- Verify method names in official docs before using
- For unfamiliar APIs, ask AI to provide documentation links
- Test in REPL before integrating (see the sketch after this list)
- Use TypeScript/IDE autocomplete to catch non-existent methods
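A minimal REPL sketch of that check, using the fs example from above; a hallucinated method shows up as 'undefined', a real one as 'function':
// Run in an interactive node session before integrating an unfamiliar API
const fs = require('fs');

console.log(typeof fs.watchDirectory); // 'undefined' -> the method was hallucinated
console.log(typeof fs.watch);          // 'function'  -> the method is real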
Section B: Vibe Coding (Insufficient Review)
Symptoms:
- Security vulnerabilities in generated code (SQL injection, XSS)
- Obvious logic bugs that would be caught by reading the code
- Hardcoded secrets or credentials
- Missing error handling
Root Cause: Accepting AI changes without reading the diff.
Deep Dive: See conversation: vibe_coding_failures
I accepted 200 lines of login code and now I'm seeing SQL injection warnings in my security scan
Let me review the login code for SQL injection vulnerabilities.
I found the issue on line 47:
// VULNERABLE
db.query(`SELECT * FROM users WHERE email = '${email}'`)
This should use parameterized queries:
// FIXED
db.query('SELECT * FROM users WHERE email = $1', [email])
I also found 3 other security issues in this code:
- Password comparison using === (timing-attack vulnerable)
- JWT secret hardcoded as 'secret123'
- No rate limiting on login attempts
Would you like me to fix all of these?
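A minimal sketch combining the parameterized query and a constant-time password check, assuming node-postgres (pg) and bcrypt; the helper name and the password_hash column are illustrative, and the JWT-secret and rate-limiting fixes are not shown:
// Illustrative sketch, not the original login code
const bcrypt = require('bcrypt');
const { Pool } = require('pg');
const db = new Pool(); // connection settings come from PG* environment variables

async function findUserByCredentials(email, password) {
  // Parameterized query: user input is never concatenated into the SQL string
  const result = await db.query('SELECT * FROM users WHERE email = $1', [email]);
  const user = result.rows[0];

  // bcrypt.compare does a constant-time hash comparison, unlike === on raw strings
  if (!user || !(await bcrypt.compare(password, user.password_hash))) {
    return null; // caller responds with 401; rate limiting is handled elsewhere
  }
  return user;
}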
Vibe Coding Prevention:
- For security-sensitive code, always read the full diff
- Ask AI to review its own code for OWASP Top 10 before accepting
- Large diffs (100+ lines) need proportionally more review time
- Run security scanners (Snyk, Semgrep) on AI-generated code
Section C: Silent Substitution
Symptoms:
- Tests pass locally but fail in production
- AI says "I notice X isn't available, so I..."
- Mocked dependencies that should be real
- Environment-specific behavior
Root Cause: AI observes environment state and makes silent architectural decisions.
Deep Dive: See conversation: silent_substitution
My tests pass but production crashes saying 'database connection failed'
Let me check your database connection code and test setup.
I found the issue. Your tests are using an in-memory mock:
// In tests
jest.mock('./db', () => ({
  query: jest.fn().mockResolvedValue({ rows: [] })
}));
But your production code expects a real PostgreSQL connection. The mock was added because the test environment doesn't have database credentials.
Options:
- Use a test database with Docker
- Create integration tests that run against a real DB
- Add a startup check that fails fast if DB is unavailable (sketch below)
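A minimal sketch of the fail-fast option, assuming node-postgres (pg); the exact client and startup hook depend on your stack:
// Fail-fast startup check: exit immediately if the real database is unreachable,
// instead of silently limping along on mocks and failing later in production
const { Pool } = require('pg');
const pool = new Pool(); // connection details come from PG* environment variables

async function assertDatabaseReachable() {
  try {
    await pool.query('SELECT 1');
    console.log('Database connection OK');
  } catch (err) {
    console.error('Database connection failed at startup:', err.message);
    process.exit(1);
  }
}

assertDatabaseReachable();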
Silent Substitution Red Flag:
"When AI says 'I notice X isn't available, so I...' - STOP."
This phrase indicates the AI is making architectural decisions based on environment state. Always ask: "What did you substitute and why?"
Section D: Context/Review Fatigue
Symptoms:
- Output quality was good at session start, now degrading
- AI seems to forget earlier instructions
- Repeated similar mistakes
- You're rubber-stamping changes without reading them
Root Cause: Long sessions exhaust context window OR human attention.
Deep Dive: See conversation: review_fatigue
We've been working on this for 3 hours and now the AI keeps suggesting changes that break things we already fixed
After a long session, both AI context and human attention degrade. Let's reset:
- Summarize current state: What works, what's broken?
- Start fresh session: Clear context, provide summary
- Take a break: Your review quality matters too
Would you like me to generate a summary of our current progress that you can paste into a fresh session?
Fatigue Prevention:
- Use the Document-Clear-Resume pattern for multi-day work
- Commit working checkpoints before long refactoring sessions
- Take breaks - your review quality is part of the system
- For critical changes, review in a fresh mental state
Section E: Reasoning Limits
Symptoms:
- AI confidently explains incorrect logic
- Off-by-one errors, boundary condition bugs
- Misunderstanding of requirements
- Plausible-sounding but wrong explanations
Root Cause: LLMs predict plausible text, not correct logic.
No deep dive link - this section covers the pattern.
The AI insists this sorting algorithm is O(n log n) but I think it's O(n^2)
You're right to question this. Let me trace through the algorithm:
for i in range(n):           # O(n)
    for j in range(i, n):    # O(n) in worst case
        if arr[j] < arr[i]:
            arr[i], arr[j] = arr[j], arr[i]
This is a nested loop: the inner loop runs n - i times on the i-th outer pass, so the total work is roughly n(n+1)/2 comparisons, which is O(n^2).
I apologize for the error. This is selection sort, not a divide-and-conquer algorithm. The O(n log n) claim was incorrect.
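That count is easy to verify empirically: doubling n should roughly quadruple the number of comparisons. A small sketch (in JavaScript, to match the other examples in this guide):
// Count the dominant operation instead of trusting the explanation
function countComparisons(n) {
  const arr = Array.from({ length: n }, () => Math.random());
  let comparisons = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i; j < n; j++) {
      comparisons++;
      if (arr[j] < arr[i]) {
        [arr[i], arr[j]] = [arr[j], arr[i]]; // same swap as the Python version
      }
    }
  }
  return comparisons;
}

console.log(countComparisons(100)); // 5050
console.log(countComparisons(200)); // 20100 -- about 4x, i.e. quadratic growth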
Reasoning Limits Prevention:
- For algorithms/math, trace through examples by hand
- Ask AI to "show your work" step by step
- Be skeptical of confident explanations for complex logic
- Use tests to verify behavior, not AI explanations
Quick Checks Checklist
Before accepting AI output, run through this checklist:
- Hallucination: Did I verify the API/method exists?
- Security: Did I review for OWASP Top 10 issues?
- Substitution: Did I confirm the AI made no silent substitutions (no "I notice X isn't available" moments)?
- Freshness: Is my review quality still high?
- Logic: Does the explanation match what the code does?
One "no" answer = investigate before accepting.
Key Takeaways
- AI failures fall into 5 categories: hallucination, vibe coding, silent substitution, fatigue, and reasoning limits
- Each category has distinct symptoms and prevention strategies
- "When AI says 'I notice X isn't available, so I...' - STOP" is the key silent substitution heuristic
- Repeated 'fix it' prompts without understanding indicate you're in a failure loop
- Your review quality is part of the system - fatigue affects both AI and human
Try It Yourself
Prompt Template
Next time AI output doesn't work, identify which of the 5 categories it falls into before asking for a fix.
Variations to Try
- For hallucinations: 'Show me the documentation link for [method]'
- For security: 'Review this for OWASP Top 10 before I accept'
- For substitution: 'What assumptions did you make about my environment?'
- For reasoning: 'Trace through this step by step with example input'
Sources
October 2025 Update: GenAI Code Security Report
Primary source for AI code security statistics: 45% overall failure rate, 72% for Java specifically. The 'bigger models ≠ more secure code' finding is critical for model_routing - security scanning is needed regardless of model. Java's 72% rate makes it the riskiest language for AI-generated code.
- AI-generated code introduced risky security flaws in 45% of tests
- Java was the riskiest language with 72% security failure rate
- XSS (CWE-80) defense failed in 86% of relevant code samples
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
This is the most rigorous 2025 study on AI coding productivity. The RCT methodology (16 experienced developers, 246 tasks, $150/hr compensation) makes this highly credible. The 39-44 percentage point gap between perceived and actual productivity is the key insight for our trust_calibration dimension. This directly supports recommendations about not over-trusting AI suggestions and maintaining verification practices.
- Experienced developers were 19% slower with AI
- Developers perceived 20% speedup (39-44 percentage point gap)
- Self-reported productivity may not reflect reality
The 'Trust, But Verify' Pattern For AI-Assisted Engineering
This article provides the conceptual framework for our trust_calibration dimension. The three principles (Blind Trust is Vulnerability, Copilot Not Autopilot, Human Accountability Remains) directly inform our survey questions. The emphasis on verification over speed aligns with METR findings. Practical guidance includes starting conservatively with AI on low-stakes tasks.
- Blind trust in AI-generated code is a vulnerability
- AI tools function as 'Copilot, Not Autopilot'
- Human verification is the new development bottleneck