Build a Golden Dataset from Git History

Build · Intermediate · 40 min · TypeScript

Sources verified Dec 23

Learn to mine git commit history for evaluation test cases, creating a robust dataset that captures real-world code patterns and edge cases.

1. Understand the Scenario

You're building an AI code review tool. You need test cases to evaluate whether it catches real bugs. Your git history contains years of bug fixes - each one is a potential test case where the 'before' is buggy code and 'after' is the fix.

Learning Objectives

Extract before/after code pairs from bug-fix commits
Structure test cases with metadata (difficulty, category, source)
Generate synthetic edge cases for rare scenarios
Balance dataset difficulty for meaningful evaluation
Version and manage dataset evolution

Concepts You'll Practice

Golden Dataset Curation
TDD for AI Agents

2. Follow the Instructions

The Challenge: Finding Good Test Cases

Your AI code review tool needs test cases that represent real bugs. Unit tests with contrived examples won't catch the subtle bugs that actually happen in production.

The Solution: Mine your git history. Every bug fix commit contains:

Before: The buggy code (input)
After: The fixed code (expected output)
Message: Description of the bug (metadata)

Step 1: Find Bug-Fix Commits

Search for commits that likely contain bug fixes.

step1_find_commits.sh
#!/bin/bash
# find_bug_fixes.sh - Extract bug-fix commits

# Search for common bug-fix patterns in commit messages.
# Multiple --grep patterns are OR-ed together; -i makes the match
# case-insensitive so that "Fix" and "BUG" are found too.
git log --oneline --all -i \
  --grep='fix' \
  --grep='bug' \
  --grep='issue' \
  --grep='patch' \
  --grep='resolve' \
  --since='2023-01-01' \
  > /tmp/bug_commits.txt

echo "Found $(wc -l < /tmp/bug_commits.txt) potential bug-fix commits"

# Show sample
head -20 /tmp/bug_commits.txt 
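
Grepping for 'fix' also matches commits such as "fix typo in README", which are not useful test cases. The sketch below shows one way to run a second-pass filter over the `--oneline` output. The function names `isLikelyBugFix` and `filterBugFixLines` and both keyword lists are assumptions for illustration, not part of the original script; tune them for your repository.

```typescript
// Hypothetical second-pass filter for lines produced by `git log --oneline`.
// Both keyword lists are illustrative assumptions.
const BUG_WORDS = /\b(bugfix|hotfix|fix(es|ed)?|bugs?|issue|patch|resolve[sd]?)\b/i;
const NOISE_WORDS = /\b(typo|lint|formatting|readme|docs?|changelog|bump)\b/i;

function isLikelyBugFix(subject: string): boolean {
  return BUG_WORDS.test(subject) && !NOISE_WORDS.test(subject);
}

// Each input line looks like "<sha> <subject>". Returns the SHAs worth keeping.
function filterBugFixLines(onelineOutput: string): string[] {
  const shas: string[] = [];
  for (const rawLine of onelineOutput.split('\n')) {
    const line = rawLine.trim();
    if (line.length === 0) continue;
    const space = line.indexOf(' ');
    const sha = space === -1 ? line : line.slice(0, space);
    const subject = space === -1 ? '' : line.slice(space + 1);
    if (isLikelyBugFix(subject)) shas.push(sha);
  }
  return shas;
}
```

A filter like this trades recall for precision, so it is worth spot-checking a sample of the rejected commits before trusting it.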

Step 2: Extract Before/After Pairs

For each bug-fix commit, extract the code before and after the change.

step2_extract_pairs.ts
import { execSync } from 'child_process';
import * as fs from 'fs';

interface CodePair {
  commitSha: string;
  message: string;
  filePath: string;
  before: string;
  after: string;
  linesChanged: number;
}

function extractCodePairs(commitSha: string): CodePair[] {
  const pairs: CodePair[] = [];
  
  // Get commit message
  const message = execSync(
    `git log -1 --format='%s' ${commitSha}`,
    { encoding: 'utf-8' }
  ).trim();
  
  // Get list of modified files
  const files = execSync(
    `git diff-tree --no-commit-id --name-only -r ${commitSha}`,
    { encoding: 'utf-8' }
  ).trim().split('\n').filter(f => f.endsWith('.ts') || f.endsWith('.js'));
  
  for (const filePath of files) {
    try {
      // Get file content BEFORE commit (parent)
      const before = execSync(
        `git show ${commitSha}^:${filePath}`,
        { encoding: 'utf-8' }
      );
      
      // Get file content AFTER commit
      const after = execSync(
        `git show ${commitSha}:${filePath}`,
        { encoding: 'utf-8' }
      );
      
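      // Count lines changed (added + deleted).
      // --numstat prints "added<TAB>deleted<TAB>path", which is safer to parse
      // than --stat, where the first number can come from the file path.
      const numstat = execSync(
        `git diff --numstat ${commitSha}^ ${commitSha} -- ${filePath}`,
        { encoding: 'utf-8' }
      );
      const [added, deleted] = numstat.trim().split('\t');
      const linesChanged = (parseInt(added, 10) || 0) + (parseInt(deleted, 10) || 0);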
      
      pairs.push({
        commitSha,
        message,
        filePath,
        before,
        after,
        linesChanged
      });
    } catch (e) {
      // git show fails when the file was added or deleted by this commit,
      // or when the commit has no parent - skip these files
    }
  }
  
  return pairs;
} 
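
If you want line counts for every file in a commit at once, `git diff --numstat` prints one tab-separated line per file (`added`, `deleted`, `path`), with `-` in both count columns for binary files. The following is a minimal sketch of a parser for that output; `FileChange` and `parseNumstat` are hypothetical names introduced here.

```typescript
// Hypothetical helper: parse the text printed by `git diff --numstat`.
interface FileChange {
  path: string;
  added: number;
  deleted: number;
  binary: boolean;
}

function parseNumstat(output: string): FileChange[] {
  const changes: FileChange[] = [];
  for (const line of output.split('\n')) {
    const parts = line.split('\t');
    if (parts.length < 3) continue; // blank or malformed line
    const [added, deleted, ...rest] = parts;
    const binary = added === '-' && deleted === '-';
    changes.push({
      path: rest.join('\t'), // rejoin in case the path itself contains a tab
      added: binary ? 0 : Number(added),
      deleted: binary ? 0 : Number(deleted),
      binary
    });
  }
  return changes;
}
```

Keeping the parser separate from the `execSync` call makes it easy to unit test without a repository.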

Step 3: Structure as Golden Dataset

Convert raw pairs into structured test cases with metadata.

step3_structure_dataset.ts
interface GoldenExample {
  id: string;
  input: {
    code: string;
    context: string;
  };
  expected: {
    hasIssue: boolean;
    issueType?: string;
    fixedCode?: string;
  };
  metadata: {
    source: 'git_history' | 'synthetic' | 'production';
    difficulty: 'easy' | 'medium' | 'hard';
    category: string;
    commitSha?: string;
    addedDate: string;
  };
}

function pairToGoldenExample(pair: CodePair): GoldenExample {
  // Classify difficulty based on lines changed
  const difficulty = pair.linesChanged < 10 ? 'easy' 
    : pair.linesChanged < 50 ? 'medium' 
    : 'hard';
  
  // Extract issue type from commit message
  const issueType = extractIssueType(pair.message);
  
  return {
    id: `git_${pair.commitSha.slice(0, 8)}_${pair.filePath.replace(/\//g, '_')}`,
    input: {
      code: pair.before,
      context: `File: ${pair.filePath}\nReview this code for potential issues.`
    },
    expected: {
      hasIssue: true, // Assumes the commit really is a bug fix; spot-check grep matches
      issueType,
      fixedCode: pair.after
    },
    metadata: {
      source: 'git_history',
      difficulty,
      category: issueType,
      commitSha: pair.commitSha,
      addedDate: new Date().toISOString().split('T')[0]
    }
  };
}

function extractIssueType(message: string): string {
  const lower = message.toLowerCase();
  if (lower.includes('null') || lower.includes('undefined')) return 'null_safety';
  if (lower.includes('type') || lower.includes('typescript')) return 'type_error';
  if (lower.includes('security') || lower.includes('xss') || lower.includes('injection')) return 'security';
  if (lower.includes('performance') || lower.includes('memory')) return 'performance';
  if (lower.includes('race') || lower.includes('async')) return 'concurrency';
  return 'general';
} 
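
Substring checks can misfire: `includes('type')` also matches "prototype", and `includes('race')` matches "trace". One possible refinement, sketched below, is to match whole words with regular expressions. The rule table `ISSUE_RULES` and the name `classifyIssueType` are assumptions introduced here; the keywords mirror the ones in `extractIssueType`.

```typescript
// Hypothetical variant of extractIssueType that matches whole words only.
// Rules are checked in order; the first match wins.
const ISSUE_RULES: Array<[string, RegExp]> = [
  ['null_safety', /\b(null|undefined)\b/i],
  ['type_error', /\b(type|typescript)\b/i],
  ['security', /\b(security|xss|injection)\b/i],
  ['performance', /\b(performance|memory)\b/i],
  ['concurrency', /\b(race|async)\b/i]
];

function classifyIssueType(message: string): string {
  for (const [category, pattern] of ISSUE_RULES) {
    if (pattern.test(message)) return category;
  }
  return 'general';
}
```

Whole-word matching misses plurals such as "types", so whether it is an improvement depends on how your team writes commit messages.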

Step 4: Generate Synthetic Edge Cases

For rare scenarios, generate synthetic test cases using a strong model.

step4_synthetic_generation.ts
import OpenAI from 'openai';

const openai = new OpenAI();

async function generateSyntheticEdgeCases(
  existingExamples: GoldenExample[],
  category: string,
  count: number
): Promise<GoldenExample[]> {
  // Find examples in this category for context
  const categoryExamples = existingExamples
    .filter(e => e.metadata.category === category)
    .slice(0, 3);
  
  const prompt = `You are generating test cases for an AI code review tool.

Category: ${category}

Here are some real examples from git history:
${categoryExamples.map(e => `Input:\n${e.input.code.slice(0, 500)}\n\nIssue: ${e.expected.issueType}`).join('\n\n---\n\n')}

Generate ${count} NEW synthetic examples that are:
1. Challenging but realistic
2. Different from the examples above
3. Cover edge cases not represented

For each example, provide:
- buggy_code: The code with the bug
- fixed_code: The corrected version
- issue_description: What's wrong
- difficulty: easy/medium/hard

Respond as a JSON object of the form {"examples": [...]}.`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
  });
  
  const generated = JSON.parse(response.choices[0].message.content!);
  
  return generated.examples.map((ex: any, i: number) => ({
    id: `synthetic_${category}_${Date.now()}_${i}`,
    input: {
      code: ex.buggy_code,
      context: 'Review this code for potential issues.'
    },
    expected: {
      hasIssue: true,
      issueType: category,
      fixedCode: ex.fixed_code
    },
    metadata: {
      source: 'synthetic' as const,
      difficulty: ex.difficulty,
      category,
      addedDate: new Date().toISOString().split('T')[0]
    }
  }));
} 
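
Model output should not go into a golden dataset unchecked. The sketch below validates the JSON before it is mapped to examples, assuming the response is an object with an `examples` array whose items use the field names requested in the prompt. `GeneratedExample` and `parseGeneratedExamples` are hypothetical names introduced here.

```typescript
// Hypothetical validator for the JSON returned by the generation step.
type Difficulty = 'easy' | 'medium' | 'hard';

interface GeneratedExample {
  buggy_code: string;
  fixed_code: string;
  issue_description: string;
  difficulty: Difficulty;
}

function isDifficulty(value: unknown): value is Difficulty {
  return value === 'easy' || value === 'medium' || value === 'hard';
}

function parseGeneratedExamples(raw: string): GeneratedExample[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // not valid JSON at all
  }
  if (typeof parsed !== 'object' || parsed === null) return [];
  const examples = (parsed as { examples?: unknown }).examples;
  if (!Array.isArray(examples)) return [];

  const valid: GeneratedExample[] = [];
  for (const item of examples) {
    if (typeof item !== 'object' || item === null) continue;
    const { buggy_code, fixed_code, issue_description, difficulty } =
      item as Record<string, unknown>;
    if (typeof buggy_code !== 'string' || typeof fixed_code !== 'string') continue;
    if (typeof issue_description !== 'string' || !isDifficulty(difficulty)) continue;
    // Drop "fixes" that do not change the code
    if (buggy_code.trim() === fixed_code.trim()) continue;
    valid.push({ buggy_code, fixed_code, issue_description, difficulty });
  }
  return valid;
}
```

Structural checks like these catch malformed items only. Whether a generated bug is realistic still needs human review.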

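Step 5: Balance and Validate Dataset

Check how examples are distributed across sources, difficulty levels, and categories, then save a versioned snapshot.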

step5_balance_validate.ts
import * as fs from 'fs';

function analyzeDataset(examples: GoldenExample[]): void {
  const stats = {
    total: examples.length,
    bySource: {} as Record<string, number>,
    byDifficulty: {} as Record<string, number>,
    byCategory: {} as Record<string, number>
  };
  
  for (const ex of examples) {
    stats.bySource[ex.metadata.source] = (stats.bySource[ex.metadata.source] || 0) + 1;
    stats.byDifficulty[ex.metadata.difficulty] = (stats.byDifficulty[ex.metadata.difficulty] || 0) + 1;
    stats.byCategory[ex.metadata.category] = (stats.byCategory[ex.metadata.category] || 0) + 1;
  }
  
  console.log('Dataset Analysis:');
  console.log(`  Total examples: ${stats.total}`);
  console.log('\n  By Source:');
  for (const [source, count] of Object.entries(stats.bySource)) {
    console.log(`    ${source}: ${count} (${(count/stats.total*100).toFixed(1)}%)`);
  }
  console.log('\n  By Difficulty:');
  for (const [diff, count] of Object.entries(stats.byDifficulty)) {
    console.log(`    ${diff}: ${count} (${(count/stats.total*100).toFixed(1)}%)`);
  }
  
  // Check balance recommendations
  const easyPct = (stats.byDifficulty['easy'] || 0) / stats.total;
  const hardPct = (stats.byDifficulty['hard'] || 0) / stats.total;
  
  if (easyPct > 0.7) {
    console.log('\n  WARNING: Dataset too easy (>70% easy). Add harder examples.');
  }
  if (hardPct < 0.1) {
    console.log('\n  WARNING: Not enough hard examples (<10%). Generate adversarial cases.');
  }
}

// Save dataset with versioning
function saveDataset(examples: GoldenExample[], version: string): void {
  const dataset = {
    version,
    createdAt: new Date().toISOString(),
    stats: {
      total: examples.length,
      sources: [...new Set(examples.map(e => e.metadata.source))],
      categories: [...new Set(examples.map(e => e.metadata.category))]
    },
    examples
  };
  
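  // Create the output directory if it does not exist yet
  fs.mkdirSync('datasets', { recursive: true });
  fs.writeFileSync(
    `datasets/golden_v${version}.json`,
    JSON.stringify(dataset, null, 2)
  );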
  
  console.log(`Saved dataset v${version} with ${examples.length} examples`);
} 
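
Versioned files are most useful when you can tell what changed between two of them. As a rough sketch, two versions can be compared by example id. `VersionedExample`, `DatasetDiff`, and `diffDatasets` are hypothetical names, and only `id` and `input.code` are compared here.

```typescript
// Hypothetical helper: summarize what changed between two dataset versions.
interface VersionedExample {
  id: string;
  input: { code: string };
}

interface DatasetDiff {
  added: string[];   // ids only present in the newer version
  removed: string[]; // ids only present in the older version
  changed: string[]; // ids present in both, with different input code
}

function diffDatasets(previous: VersionedExample[], next: VersionedExample[]): DatasetDiff {
  const previousCode = new Map<string, string>();
  for (const e of previous) previousCode.set(e.id, e.input.code);

  const nextIds = new Set<string>();
  const diff: DatasetDiff = { added: [], removed: [], changed: [] };

  for (const e of next) {
    nextIds.add(e.id);
    const oldCode = previousCode.get(e.id);
    if (oldCode === undefined) diff.added.push(e.id);
    else if (oldCode !== e.input.code) diff.changed.push(e.id);
  }
  for (const e of previous) {
    if (!nextIds.has(e.id)) diff.removed.push(e.id);
  }
  return diff;
}
```

Recording this summary next to each saved version makes it easier to explain a shift in eval scores that comes from the dataset rather than from the tool.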

Your Task

Build a complete golden dataset pipeline:

Find bug-fix commits in a real repository
Extract before/after pairs for modified files
Structure as golden examples with proper metadata
Generate synthetic edge cases for underrepresented categories
Analyze and balance the final dataset

3. Try It Yourself

starter.ts
import { execSync } from 'child_process';
import * as fs from 'fs';

interface GoldenExample {
  id: string;
  input: { code: string; context: string };
  expected: { hasIssue: boolean; issueType?: string };
  metadata: {
    source: 'git_history' | 'synthetic';
    difficulty: 'easy' | 'medium' | 'hard';
    category: string;
    addedDate: string;
  };
}

// TODO: Implement these functions

function findBugFixCommits(): string[] {
  // Search git log for bug-fix commits
  throw new Error('Not implemented');
}

function extractCodePair(commitSha: string): { before: string; after: string } | null {
  // Get file content before and after commit
  throw new Error('Not implemented');
}

function classifyDifficulty(linesChanged: number): 'easy' | 'medium' | 'hard' {
  // Classify based on complexity
  throw new Error('Not implemented');
}

function analyzeDataset(examples: GoldenExample[]): void {
  // Print stats and balance warnings
  throw new Error('Not implemented');
}

// Main
const commits = findBugFixCommits();
console.log(`Found ${commits.length} bug-fix commits`); 

This TypeScript exercise requires local setup. Copy the code to your IDE to run it.

4. Get Help (If Needed)

Hint 1: Use git log --grep to search for commits with 'fix' or 'bug' in the message.

Hint 2: git show COMMIT^:file gets the file content from the parent commit (before the change).

Hint 3: Filter for code files only (for example .ts and .js). Configs and docs are not useful inputs for code review evaluation.

5. Check the Solution

solution.ts
import { execSync } from 'child_process';
import * as fs from 'fs';

interface GoldenExample {
  id: string;
  input: { code: string; context: string };
  expected: { hasIssue: boolean; issueType?: string; fixedCode?: string };
  metadata: {
    source: 'git_history' | 'synthetic';
    difficulty: 'easy' | 'medium' | 'hard';
    category: string;
    commitSha?: string;
    addedDate: string;
  };
}

function findBugFixCommits(): string[] {
  const output = execSync(
    // -i matches "Fix"/"BUG" too; --max-count avoids piping through head
    `git log --oneline --all -i --grep='fix' --grep='bug' --since='2023-01-01' --max-count=100`,
    { encoding: 'utf-8' }
  );
  return output.trim().split('\n').map(line => line.split(' ')[0]).filter(Boolean);
}

function extractCodePair(commitSha: string): { 
  before: string; 
  after: string; 
  filePath: string; 
  message: string;
  linesChanged: number;
} | null {
  try {
    const message = execSync(`git log -1 --format='%s' ${commitSha}`, { encoding: 'utf-8' }).trim();
    const files = execSync(
      `git diff-tree --no-commit-id --name-only -r ${commitSha}`,
      { encoding: 'utf-8' }
    ).trim().split('\n').filter(f => f.endsWith('.ts') || f.endsWith('.js'));
    
    if (files.length === 0) return null;
    
    const filePath = files[0]; // Take first TypeScript/JavaScript file
    const before = execSync(`git show ${commitSha}^:${filePath}`, { encoding: 'utf-8' });
    const after = execSync(`git show ${commitSha}:${filePath}`, { encoding: 'utf-8' });
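    // --numstat prints "added<TAB>deleted<TAB>path", which is safer to parse than --stat
    const numstat = execSync(`git diff --numstat ${commitSha}^ ${commitSha} -- ${filePath}`, { encoding: 'utf-8' });
    const [added, deleted] = numstat.trim().split('\t');
    const linesChanged = (parseInt(added, 10) || 0) + (parseInt(deleted, 10) || 0);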
    
    return { before, after, filePath, message, linesChanged };
  } catch {
    return null;
  }
}

function classifyDifficulty(linesChanged: number): 'easy' | 'medium' | 'hard' {
  if (linesChanged < 10) return 'easy';
  if (linesChanged < 50) return 'medium';
  return 'hard';
}

function extractIssueType(message: string): string {
  const lower = message.toLowerCase();
  if (lower.includes('null') || lower.includes('undefined')) return 'null_safety';
  if (lower.includes('type')) return 'type_error';
  if (lower.includes('security') || lower.includes('xss')) return 'security';
  return 'general';
}

function analyzeDataset(examples: GoldenExample[]): void {
  const byDiff: Record<string, number> = { easy: 0, medium: 0, hard: 0 };
  const byCat: Record<string, number> = {};
  
  for (const ex of examples) {
    byDiff[ex.metadata.difficulty]++;
    byCat[ex.metadata.category] = (byCat[ex.metadata.category] || 0) + 1;
  }
  
  console.log('\nDataset Analysis:');
  console.log(`  Total: ${examples.length}`);
  console.log('\n  By Difficulty:');
  for (const [d, c] of Object.entries(byDiff)) {
    console.log(`    ${d}: ${c} (${(c/examples.length*100).toFixed(0)}%)`);
  }
  console.log('\n  By Category:');
  for (const [cat, c] of Object.entries(byCat)) {
    console.log(`    ${cat}: ${c}`);
  }
  
  // Balance warnings
  if (byDiff.easy / examples.length > 0.7) {
    console.log('\n  ⚠️ Too easy - add harder examples');
  }
  if (byDiff.hard / examples.length < 0.1) {
    console.log('\n  ⚠️ Not enough hard examples');
  }
}

// Main execution
const commits = findBugFixCommits();
console.log(`Found ${commits.length} bug-fix commits`);

const examples: GoldenExample[] = [];

for (const sha of commits.slice(0, 50)) { // Process first 50
  const pair = extractCodePair(sha);
  if (!pair) continue;
  
  examples.push({
    id: `git_${sha.slice(0, 8)}`,
    input: {
      code: pair.before,
      context: `File: ${pair.filePath}\nReview for bugs.`
    },
    expected: {
      hasIssue: true,
      issueType: extractIssueType(pair.message),
      fixedCode: pair.after
    },
    metadata: {
      source: 'git_history',
      difficulty: classifyDifficulty(pair.linesChanged),
      category: extractIssueType(pair.message),
      commitSha: sha,
      addedDate: new Date().toISOString().split('T')[0]
    }
  });
}

analyzeDataset(examples);

// Save
fs.mkdirSync('datasets', { recursive: true });
fs.writeFileSync('datasets/golden_v1.json', JSON.stringify({ examples }, null, 2));
console.log(`\nSaved ${examples.length} examples to datasets/golden_v1.json`);

Use git log --grep to filter for bug-related commits
git show SHA^:file gets content BEFORE the commit (parent)
Classify difficulty by lines changed as a simple heuristic
Warn when dataset is unbalanced - too easy or not enough hard cases

Common Mistakes

Not handling commits where file doesn't exist in parent

Why it's wrong: New files have no parent version - git show will fail.

How to fix: Wrap in try/catch and skip commits that add new files.
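
An alternative to catching the failure is to skip added and deleted files up front. `git diff-tree --no-commit-id --name-status -r <sha>` prints a status letter and a path per line (`M` for modified, `A` for added, `D` for deleted). The sketch below keeps only modified paths; `modifiedPaths` is a hypothetical helper name.

```typescript
// Hypothetical helper: keep only files that exist both before and after the
// commit, given the text printed by `git diff-tree --name-status -r <sha>`.
function modifiedPaths(nameStatusOutput: string): string[] {
  const paths: string[] = [];
  for (const line of nameStatusOutput.split('\n')) {
    const [status, path] = line.split('\t');
    // 'M' = modified in place; 'A' (added) and 'D' (deleted) lack one side
    if (status === 'M' && path) paths.push(path);
  }
  return paths;
}
```

Passing `--diff-filter=M` to git should give the same result without any parsing. The try/catch is still worth keeping for root commits, which have no parent.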

Including all file types

Why it's wrong: Dataset for code review should only contain code files, not configs or docs.

How to fix: Filter for .ts, .js, .tsx, .jsx extensions only.
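
A minimal sketch of such a filter follows. `isReviewableSource` is a hypothetical name, and excluding `.d.ts` declaration files and `.min.js` bundles is an extra assumption made here, on the grounds that they are usually generated rather than hand-written.

```typescript
// Hypothetical extension filter for files worth turning into test cases.
const CODE_EXTENSIONS = ['.ts', '.tsx', '.js', '.jsx'];

function isReviewableSource(filePath: string): boolean {
  const lower = filePath.toLowerCase();
  // Assumption: declaration files and minified bundles are generated output
  if (lower.endsWith('.d.ts') || lower.endsWith('.min.js')) return false;
  return CODE_EXTENSIONS.some(ext => lower.endsWith(ext));
}
```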

No difficulty balancing

Why it's wrong: Dataset with 90% easy examples gives false confidence in eval results.

How to fix: Analyze distribution and generate synthetic hard cases to balance.
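
When generating more hard cases is not practical, a complementary option (not part of the steps above) is to cap the share of easy examples by dropping some of them. The sketch below is one rough way to do that; `capEasyShare` and `Balanceable` are hypothetical names, and which easy examples survive is decided by sorting on id so the result is reproducible.

```typescript
// Hypothetical balancing helper: keep at most `maxEasyShare` easy examples
// (as a fraction of the result) and leave medium/hard examples untouched.
type Difficulty = 'easy' | 'medium' | 'hard';

interface Balanceable {
  id: string;
  metadata: { difficulty: Difficulty };
}

function capEasyShare<T extends Balanceable>(examples: T[], maxEasyShare: number): T[] {
  const easy = examples.filter(e => e.metadata.difficulty === 'easy');
  const rest = examples.filter(e => e.metadata.difficulty !== 'easy');

  // easy / (easy + rest) <= s  is equivalent to  easy <= s * rest / (1 - s)
  const maxEasy = maxEasyShare >= 1
    ? easy.length
    : Math.floor((maxEasyShare * rest.length) / (1 - maxEasyShare));

  // Sort by id so the same input always keeps the same examples
  const keptEasy = easy
    .slice()
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
    .slice(0, maxEasy);

  return keptEasy.concat(rest);
}
```

Downsampling shrinks the dataset, so it suits cases where easy examples are plentiful. It does nothing about a shortage of hard ones.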

Test Cases

Finds bug-fix commits

Should return array of commit SHAs from git history

Input: Repository with bug-fix commits

Expected: Array of SHA strings, length > 0

Extracts before/after pairs

Should extract code from commit and its parent

Input: Valid commit SHA with TypeScript file changes

Expected: Object with before, after, filePath, linesChanged

Classifies difficulty correctly

Should classify based on lines changed

Input: linesChanged: 5

Expected: easy
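
The difficulty test case can be turned into an executable check, sketched below. The copy of `classifyDifficulty` matches the solution; the boundary inputs (9, 10, 49, 50) and the name `runDifficultyChecks` are additions made here for illustration.

```typescript
// Same thresholds as the solution's classifyDifficulty
function classifyDifficulty(linesChanged: number): 'easy' | 'medium' | 'hard' {
  if (linesChanged < 10) return 'easy';
  if (linesChanged < 50) return 'medium';
  return 'hard';
}

// Input/expected pairs, including values on both sides of each threshold
const difficultyCases: Array<[number, 'easy' | 'medium' | 'hard']> = [
  [5, 'easy'],
  [9, 'easy'],
  [10, 'medium'],
  [49, 'medium'],
  [50, 'hard']
];

// Returns one message per failing case; an empty array means all passed
function runDifficultyChecks(): string[] {
  const failures: string[] = [];
  for (const [input, expected] of difficultyCases) {
    const actual = classifyDifficulty(input);
    if (actual !== expected) {
      failures.push(`linesChanged=${input}: expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}
```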

Tempered AI — Forged Through Practice, Not Hype
