Evaluator-Optimizer Loop: Continuous AI Agent Improvement
Here's a dirty secret about AI agents: their first output is rarely good enough. But most agents just ship it anyway.
The Evaluator-Optimizer Loop fixes this. It's a pattern where one component evaluates the output and another component improves it based on that evaluation. Repeat until quality meets the bar.
This is how you build agents that consistently produce high-quality outputs, not just occasionally good ones.
What Is the Evaluator-Optimizer Loop?
The pattern separates evaluation from generation:
```
                    Evaluator-Optimizer Loop

   ┌───────────────┐
   │   Generator   │──────────────┐
   │               │              │
   │   Creates     │              ▼
   │   initial     │      ┌───────────────┐
   │   output      │      │   Evaluator   │
   └───────────────┘      │               │
           ▲              │ Scores output │
           │              │ Finds issues  │
           │              └───────┬───────┘
           │                      │
           │                      ▼
           │              ┌───────────────┐
           │         No   │ Good enough?  │
           │      ┌───────┤               │
           │      │       └───────┬───────┘
           │      │               │ Yes
   ┌───────┴────┐ │               ▼
   │ Optimizer  │◄┘        ┌───────────┐
   │            │          │  Output   │
   │  Fixes     │          └───────────┘
   │  issues    │
   └────────────┘
```
Three components work together:
- Generator: Creates the initial output
- Evaluator: Scores the output and identifies issues
- Optimizer: Improves the output based on feedback
The loop continues until the evaluator says "good enough" or max iterations are reached.
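In code, that control flow is just a bounded loop around the three roles. Here is a minimal sketch; `generate`, `evaluate`, and `optimize` are placeholders for components like the ones implemented later in this article, and `evaluation.score` stands for whatever quality signal your evaluator returns:

```python
def evaluator_optimizer_loop(task, generate, evaluate, optimize,
                             min_score=8.0, max_iterations=5):
    """Generic loop: generate once, then evaluate and optimize until good enough."""
    output = generate(task)

    for _ in range(max_iterations):
        evaluation = evaluate(task, output)   # returns a score plus a list of issues
        if evaluation.score >= min_score:     # the quality gate
            return output
        output = optimize(task, output, evaluation)

    return output  # best effort after max_iterations
```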
Why This Pattern Works
1. Separation of Concerns
Generation and evaluation are different cognitive tasks. Separating them lets each component focus:
```python
# Generator mindset: "Create something that works"
# Evaluator mindset: "Find everything wrong with this"
# Optimizer mindset: "Fix these specific issues"
```
An LLM trying to do all three at once often compromises on each.
2. Explicit Quality Gates
Instead of hoping output is good, you define what "good" means:
```python
quality_criteria = {
    "accuracy": "All facts must be verifiable",
    "completeness": "Must address all parts of the question",
    "clarity": "A non-expert should understand",
    "conciseness": "No unnecessary content"
}
```
The evaluator checks each criterion explicitly.
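One way to make the evaluator honor those criteria, shown here as an illustrative sketch rather than a fixed API, is to render the dictionary straight into its system prompt:

```python
def build_evaluator_prompt(quality_criteria: dict) -> str:
    """Turn explicit quality criteria into evaluator instructions (illustrative helper)."""
    lines = [f"- {name}: {rule}" for name, rule in quality_criteria.items()]
    return (
        "Evaluate the output against each criterion below.\n"
        + "\n".join(lines)
        + "\nReturn JSON with a 0-10 score and a list of issues per criterion."
    )
```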
3. Measurable Improvement
Each iteration addresses specific issues. Progress is measurable:
```
Iteration 1: Score 6/10 - Issues: missing examples, too technical
Iteration 2: Score 8/10 - Issues: one factual error
Iteration 3: Score 9/10 - Issues: none critical
→ Output accepted
```
Basic Implementation
Here's a complete evaluator-optimizer loop:
```python
import openai
import json
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float  # 0-10
    passed: bool
    issues: list[str]
    suggestions: list[str]

class EvaluatorOptimizerAgent:
    def __init__(self, min_score: float = 8.0, max_iterations: int = 5):
        self.client = openai.OpenAI()
        self.min_score = min_score
        self.max_iterations = max_iterations

    def run(self, task: str) -> dict:
        """Generate, evaluate, and optimize until quality threshold met"""

        # Initial generation
        output = self._generate(task)
        iterations = []

        for i in range(self.max_iterations):
            # Evaluate current output
            evaluation = self._evaluate(task, output)

            iterations.append({
                "iteration": i + 1,
                "output_preview": output[:200],
                "score": evaluation.score,
                "issues": evaluation.issues
            })

            # Check if good enough
            if evaluation.passed:
                return {
                    "success": True,
                    "output": output,
                    "final_score": evaluation.score,
                    "iterations": len(iterations),
                    "history": iterations
                }

            # Optimize based on feedback
            output = self._optimize(task, output, evaluation)

        # Max iterations reached
        # Note: final_score reflects the last evaluated draft, taken before the final optimize pass
        return {
            "success": False,
            "output": output,
            "final_score": evaluation.score,
            "iterations": len(iterations),
            "history": iterations,
            "note": "Max iterations reached"
        }

    def _generate(self, task: str) -> str:
        """Initial generation"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Generate a high-quality response to the task."
            }, {
                "role": "user",
                "content": task
            }]
        )
        return response.choices[0].message.content

    def _evaluate(self, task: str, output: str) -> Evaluation:
        """Evaluate the output quality"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Evaluate this output against the original task.

Score from 0-10 based on:
- Accuracy (are facts correct?)
- Completeness (does it fully address the task?)
- Clarity (is it easy to understand?)
- Quality (is it well-written?)

Return JSON:
{{
    "score": 7.5,
    "issues": ["issue 1", "issue 2"],
    "suggestions": ["suggestion 1", "suggestion 2"]
}}

A score of {self.min_score}+ means it passes."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nOutput to evaluate:\n{output}"
            }],
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)

        return Evaluation(
            score=data["score"],
            passed=data["score"] >= self.min_score,
            issues=data.get("issues", []),
            suggestions=data.get("suggestions", [])
        )

    def _optimize(self, task: str, output: str, evaluation: Evaluation) -> str:
        """Improve output based on evaluation"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Improve the output by addressing the issues identified.
Keep what's already good. Only fix what's broken."""
            }, {
                "role": "user",
                "content": f"""Original task: {task}

Current output:
{output}

Issues to fix:
{json.dumps(evaluation.issues, indent=2)}

Suggestions:
{json.dumps(evaluation.suggestions, indent=2)}

Provide the improved output:"""
            }]
        )
        return response.choices[0].message.content


# Usage
agent = EvaluatorOptimizerAgent(min_score=8.0, max_iterations=3)

result = agent.run(
    "Write a technical explanation of how HTTPS works for a junior developer"
)

print(f"Success: {result['success']}")
print(f"Final score: {result['final_score']}")
print(f"Iterations: {result['iterations']}")
print(f"\nOutput:\n{result['output']}")
```
Specialized Evaluators
Code Quality Evaluator
```python
from hopx import Sandbox

class CodeEvaluator:
    def __init__(self):
        self.client = openai.OpenAI()

    def evaluate(self, code: str, requirements: str) -> Evaluation:
        """Evaluate code quality with actual execution"""

        # Test 1: Does it run?
        execution_result = self._execute_code(code)

        # Test 2: Does it pass tests?
        test_result = self._run_tests(code, requirements)

        # Test 3: Code quality analysis
        quality_result = self._analyze_quality(code)

        # Combine scores
        score = self._calculate_score(execution_result, test_result, quality_result)

        issues = []
        if not execution_result["success"]:
            issues.append(f"Execution error: {execution_result['error']}")
        if not test_result["passed"]:
            issues.extend(test_result["failures"])
        issues.extend(quality_result["issues"])

        return Evaluation(
            score=score,
            passed=score >= 8.0 and execution_result["success"],
            issues=issues,
            suggestions=quality_result.get("suggestions", [])
        )

    def _execute_code(self, code: str) -> dict:
        """Actually run the code"""
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            sandbox.files.write("/app/code.py", code)
            result = sandbox.commands.run("python /app/code.py", timeout=30)

            return {
                "success": result.exit_code == 0,
                "output": result.stdout,
                "error": result.stderr if result.exit_code != 0 else None
            }
        finally:
            sandbox.kill()

    def _run_tests(self, code: str, requirements: str) -> dict:
        """Generate and run tests"""
        # Generate tests based on requirements
        test_code = self._generate_tests(code, requirements)

        sandbox = Sandbox.create(template="code-interpreter")

        try:
            sandbox.files.write("/app/solution.py", code)
            sandbox.files.write("/app/test_solution.py", test_code)
            sandbox.commands.run("pip install pytest -q")

            result = sandbox.commands.run("python -m pytest /app/test_solution.py -v")

            passed = result.exit_code == 0
            failures = self._parse_test_failures(result.stdout) if not passed else []

            return {"passed": passed, "failures": failures}
        finally:
            sandbox.kill()

    def _analyze_quality(self, code: str) -> dict:
        """LLM-based code quality analysis"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Analyze code quality. Check for:
- Bugs and logic errors
- Security issues
- Performance problems
- Readability issues
- Missing error handling

Return JSON: {"score": 0-10, "issues": [...], "suggestions": [...]}"""
            }, {
                "role": "user",
                "content": code
            }],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)
```
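The `_calculate_score`, `_generate_tests`, and `_parse_test_failures` helpers are left out above. As one possible way to combine the three signals (the weights below are an assumption for illustration, not part of the original class):

```python
def _calculate_score(self, execution_result: dict, test_result: dict,
                     quality_result: dict) -> float:
    """Combine execution, test, and LLM quality signals into one 0-10 score."""
    if not execution_result["success"]:
        return 0.0                                   # code that doesn't run scores zero
    test_score = 10.0 if test_result["passed"] else 4.0
    quality_score = float(quality_result.get("score", 5))
    return round(0.5 * test_score + 0.5 * quality_score, 1)
```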
Writing Quality Evaluator
```python
class WritingEvaluator:
    def __init__(self):
        self.client = openai.OpenAI()
        self.criteria = {
            "accuracy": {"weight": 0.25, "description": "Facts are correct and verifiable"},
            "clarity": {"weight": 0.25, "description": "Easy to understand"},
            "structure": {"weight": 0.20, "description": "Well-organized with clear flow"},
            "engagement": {"weight": 0.15, "description": "Interesting and holds attention"},
            "grammar": {"weight": 0.15, "description": "No spelling or grammar errors"}
        }

    def evaluate(self, text: str, context: str) -> Evaluation:
        """Multi-dimensional writing evaluation"""

        scores = {}
        all_issues = []
        all_suggestions = []

        # Evaluate each criterion
        for criterion, config in self.criteria.items():
            result = self._evaluate_criterion(text, context, criterion, config["description"])
            scores[criterion] = result["score"]
            all_issues.extend(result.get("issues", []))
            all_suggestions.extend(result.get("suggestions", []))

        # Calculate weighted score
        total_score = sum(
            scores[c] * self.criteria[c]["weight"]
            for c in self.criteria
        )

        return Evaluation(
            score=total_score,
            passed=total_score >= 8.0 and all(s >= 6.0 for s in scores.values()),
            issues=all_issues,
            suggestions=all_suggestions
        )

    def _evaluate_criterion(self, text: str, context: str, criterion: str, description: str) -> dict:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Evaluate this text for {criterion}: {description}

Context: {context}

Text:
{text}

Return JSON: {{"score": 0-10, "issues": [...], "suggestions": [...]}}"""
            }],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)
```
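Usage mirrors the general evaluator. Note that each call scores all five criteria separately, so one evaluation costs five LLM requests (`draft_article` below stands in for whatever your generator produced):

```python
evaluator = WritingEvaluator()

evaluation = evaluator.evaluate(
    text=draft_article,   # hypothetical variable holding the generated draft
    context="Blog post explaining HTTPS to junior developers"
)

print(f"Weighted score: {evaluation.score:.1f}")
print(f"Passed: {evaluation.passed}")
for issue in evaluation.issues:
    print(f"- {issue}")
```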
Advanced Patterns
Multi-Evaluator Ensemble
Use multiple evaluators and combine their judgments:
```python
class EnsembleEvaluator:
    def __init__(self, evaluators: list):
        self.evaluators = evaluators

    def evaluate(self, output: str, context: str) -> Evaluation:
        """Combine multiple evaluator opinions"""

        all_evaluations = []

        for evaluator in self.evaluators:
            eval_result = evaluator.evaluate(output, context)
            all_evaluations.append(eval_result)

        # Aggregate scores (simple average here; weighting or voting also works)
        avg_score = sum(e.score for e in all_evaluations) / len(all_evaluations)

        # Collect all unique issues
        all_issues = list(set(
            issue for e in all_evaluations for issue in e.issues
        ))

        # Consensus on pass/fail
        passes = sum(1 for e in all_evaluations if e.passed)
        majority_pass = passes > len(all_evaluations) / 2

        return Evaluation(
            score=avg_score,
            passed=majority_pass,
            issues=all_issues,
            suggestions=[s for e in all_evaluations for s in e.suggestions]
        )


# Usage
ensemble = EnsembleEvaluator([
    AccuracyEvaluator(),
    ClarityEvaluator(),
    StyleEvaluator()
])
```
Progressive Quality Gates
Different quality bars for different stages:
```python
class ProgressiveOptimizer:
    def __init__(self):
        self.quality_gates = [
            {"name": "basic", "min_score": 5.0, "focus": ["correctness"]},
            {"name": "good", "min_score": 7.0, "focus": ["correctness", "clarity"]},
            {"name": "excellent", "min_score": 9.0, "focus": ["correctness", "clarity", "polish"]}
        ]

    def run(self, task: str, target_quality: str = "good") -> str:
        """Progressively improve through quality gates"""

        output = self._generate(task)

        target_gate = next(g for g in self.quality_gates if g["name"] == target_quality)
        target_index = self.quality_gates.index(target_gate)

        # Progress through each gate up to target
        for gate in self.quality_gates[:target_index + 1]:
            output = self._optimize_for_gate(task, output, gate)

        return output

    def _optimize_for_gate(self, task: str, output: str, gate: dict) -> str:
        """Optimize until this gate's criteria are met"""

        for _ in range(3):  # Max attempts per gate
            evaluation = self._evaluate_for_gate(output, gate)

            if evaluation.score >= gate["min_score"]:
                print(f"✓ Passed {gate['name']} gate ({evaluation.score:.1f})")
                return output

            output = self._optimize(task, output, evaluation, gate["focus"])

        print(f"⚠ Could not pass {gate['name']} gate")
        return output
```
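Intended usage, assuming the elided `_generate`, `_evaluate_for_gate`, and `_optimize` helpers are wired up like the earlier examples:

```python
optimizer = ProgressiveOptimizer()

# Stop once the "good" gate (7.0) is cleared; pass "excellent" to aim for the 9.0 bar
output = optimizer.run(
    "Summarize the trade-offs between REST and gRPC for internal services",
    target_quality="good"
)
print(output)
```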
Optimization with Memory
Remember what works and what doesn't:
```python
class LearningOptimizer:
    def __init__(self):
        self.client = openai.OpenAI()
        self.improvement_history = []  # What worked before
        self.failure_patterns = []     # What didn't work

    def optimize(self, task: str, output: str, evaluation: Evaluation) -> str:
        # Learn from history
        relevant_successes = self._find_relevant_successes(evaluation.issues)
        patterns_to_avoid = self._find_failure_patterns(evaluation.issues)

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Improve this output.

Issues to fix:
{json.dumps(evaluation.issues)}

Strategies that worked before for similar issues:
{json.dumps(relevant_successes)}

Approaches to AVOID (they didn't work):
{json.dumps(patterns_to_avoid)}"""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nCurrent output:\n{output}"
            }]
        )

        improved = response.choices[0].message.content

        # Track this attempt
        self._record_attempt(evaluation.issues, improved)

        return improved

    def record_success(self, issues: list, solution: str):
        """Record a successful optimization for future reference"""
        self.improvement_history.append({
            "issues": issues,
            "solution_approach": self._extract_approach(solution)
        })

    def record_failure(self, issues: list, failed_approach: str):
        """Record what didn't work"""
        self.failure_patterns.append({
            "issues": issues,
            "failed_approach": failed_approach
        })
```
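How "relevant" history gets retrieved is left open above. A naive keyword-overlap version is sketched below as an assumption for illustration; embedding similarity or a vector store would scale better:

```python
def _find_relevant_successes(self, current_issues: list[str]) -> list[dict]:
    """Return past successes whose issues share words with the current ones (naive match)."""
    current_words = {w.lower() for issue in current_issues for w in issue.split()}
    relevant = []
    for record in self.improvement_history:
        past_words = {w.lower() for issue in record["issues"] for w in issue.split()}
        if current_words & past_words:   # any shared vocabulary counts as "similar"
            relevant.append(record)
    return relevant
```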
Real-World Example: Article Generator
A complete article generator with evaluation and optimization:
```python
from hopx import Sandbox
import openai
import json

class ArticleGenerator:
    def __init__(self):
        self.client = openai.OpenAI()
        self.min_score = 8.5
        self.max_iterations = 4

    def generate(self, topic: str, requirements: dict) -> dict:
        """Generate a high-quality article through iterative improvement"""

        # Phase 1: Initial draft
        draft = self._create_draft(topic, requirements)

        # Phase 2: Iterative improvement
        for iteration in range(self.max_iterations):
            print(f"\n--- Iteration {iteration + 1} ---")

            # Evaluate
            evaluation = self._evaluate_article(draft, topic, requirements)
            print(f"Score: {evaluation.score}/10")
            print(f"Issues: {evaluation.issues}")

            if evaluation.passed:
                print("✓ Article meets quality bar")
                break

            # Optimize
            draft = self._improve_article(draft, evaluation, requirements)

        # Phase 3: Final polish
        final = self._polish(draft)

        # Verify code examples if present
        if "```python" in final:
            final = self._verify_code_examples(final)

        return {
            "article": final,
            "iterations": iteration + 1,
            "final_score": evaluation.score
        }

    def _create_draft(self, topic: str, requirements: dict) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Write a technical article.

Requirements:
- Length: {requirements.get('length', '1500-2000')} words
- Audience: {requirements.get('audience', 'developers')}
- Style: {requirements.get('style', 'informative but engaging')}
- Include: code examples, practical tips

Structure:
1. Hook/Introduction
2. Main content (3-5 sections)
3. Practical examples
4. Conclusion with actionable takeaways"""
            }, {
                "role": "user",
                "content": f"Topic: {topic}"
            }]
        )
        return response.choices[0].message.content

    def _evaluate_article(self, article: str, topic: str, requirements: dict) -> Evaluation:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Evaluate this article rigorously.

Criteria (score each 0-10):
1. Technical accuracy - Are all facts and code correct?
2. Completeness - Does it cover the topic adequately?
3. Clarity - Is it easy to follow?
4. Engagement - Is it interesting to read?
5. Actionability - Can readers apply what they learned?
6. SEO - Are headings and structure optimized?

Requirements to check:
{json.dumps(requirements)}

Return JSON:
{{
    "scores": {{"accuracy": 8, "completeness": 7, ...}},
    "overall_score": 7.5,
    "issues": ["specific issue 1", "specific issue 2"],
    "suggestions": ["specific suggestion 1"]
}}"""
            }, {
                "role": "user",
                "content": f"Topic: {topic}\n\nArticle:\n{article}"
            }],
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)

        return Evaluation(
            score=data["overall_score"],
            passed=data["overall_score"] >= self.min_score,
            issues=data["issues"],
            suggestions=data["suggestions"]
        )

    def _improve_article(self, article: str, evaluation: Evaluation, requirements: dict) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Improve the article by fixing the identified issues.
Maintain the overall structure and good parts.
Focus specifically on the issues listed."""
            }, {
                "role": "user",
                "content": f"""Current article:
{article}

Issues to fix:
{json.dumps(evaluation.issues, indent=2)}

Suggestions to consider:
{json.dumps(evaluation.suggestions, indent=2)}

Provide the improved article:"""
            }]
        )
        return response.choices[0].message.content

    def _polish(self, article: str) -> str:
        """Final polish pass"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Polish this article:
- Fix any remaining typos or grammar issues
- Ensure smooth transitions between sections
- Verify formatting is consistent

Article:
{article}"""
            }]
        )
        return response.choices[0].message.content

    def _verify_code_examples(self, article: str) -> str:
        """Extract and test all code examples"""
        import re

        code_blocks = re.findall(r'```python\n(.*?)```', article, re.DOTALL)

        sandbox = Sandbox.create(template="code-interpreter")

        try:
            for i, code in enumerate(code_blocks):
                sandbox.files.write(f"/app/example_{i}.py", code)
                result = sandbox.commands.run(f"python /app/example_{i}.py")

                if result.exit_code != 0:
                    # Fix the code
                    fixed_code = self._fix_code(code, result.stderr)
                    article = article.replace(f"```python\n{code}```", f"```python\n{fixed_code}```")

            return article
        finally:
            sandbox.kill()


# Usage
generator = ArticleGenerator()

result = generator.generate(
    topic="Building RESTful APIs with FastAPI",
    requirements={
        "length": "2000-2500 words",
        "audience": "intermediate Python developers",
        "style": "practical tutorial",
        "must_include": ["authentication", "database integration", "testing"]
    }
)

print(f"Generated in {result['iterations']} iterations")
print(f"Final score: {result['final_score']}")
print(result["article"])
```
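The `_fix_code` helper is not shown above. A minimal sketch, assuming it simply asks the model to repair the failing snippet using the captured stderr:

```python
def _fix_code(self, code: str, error: str) -> str:
    """Ask the model to repair a code example that failed in the sandbox."""
    response = self.client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"This code example fails with the error below. "
                       f"Return only the corrected code, no explanation.\n\n"
                       f"Code:\n{code}\n\nError:\n{error}"
        }]
    )
    return response.choices[0].message.content
```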
Best Practices
1. Define Clear Evaluation Criteria
```python
# ❌ Vague criteria
criteria = ["make it good", "improve quality"]

# ✅ Specific, measurable criteria
criteria = {
    "accuracy": {
        "description": "All facts verifiable, no hallucinations",
        "min_score": 9,
        "examples": ["dates correct", "quotes accurate", "statistics cited"]
    },
    "completeness": {
        "description": "Addresses all aspects of the prompt",
        "min_score": 8,
        "examples": ["all questions answered", "no missing sections"]
    }
}
```
2. Limit Iterations
```python
import time

class BoundedOptimizer:
    def __init__(self, max_iterations: int = 5, timeout_seconds: int = 60):
        self.max_iterations = max_iterations
        self.timeout = timeout_seconds

    def run(self, task: str) -> str:
        start_time = time.time()
        output = self._generate(task)                  # generation helper as in earlier examples
        prev_score = 0.0

        for i in range(self.max_iterations):
            # Check timeout
            if time.time() - start_time > self.timeout:
                print("Timeout reached")
                break

            evaluation = self._evaluate(task, output)  # evaluation helper as in earlier examples
            score_improvement = evaluation.score - prev_score
            prev_score = evaluation.score

            if evaluation.passed:
                break

            # Check diminishing returns
            if i > 2 and score_improvement < 0.5:
                print("Diminishing returns, stopping")
                break

            output = self._optimize(task, output, evaluation)

        return output
```
3. Track Optimization History
```python
from datetime import datetime

def run_with_tracking(self, task: str) -> dict:
    output = self._generate(task)                 # generation helper as in earlier examples
    history = []

    for i in range(self.max_iterations):
        evaluation = self._evaluate(task, output)  # evaluation helper as in earlier examples

        history.append({
            "iteration": i,
            "score": evaluation.score,
            "issues_count": len(evaluation.issues),
            "output_length": len(output),
            "timestamp": datetime.now().isoformat()
        })

        if evaluation.passed:
            break

        # Detect if stuck
        if i > 1 and history[-1]["score"] == history[-2]["score"]:
            # Try a different optimization strategy
            output = self._alternative_optimize(output, evaluation)
        else:
            output = self._optimize(task, output, evaluation)

    return {"output": output, "history": history}
```
4. Fail Gracefully
```python
def run_with_fallback(self, task: str) -> dict:
    try:
        result = self._optimize_loop(task)

        if not result["success"]:
            # Return best attempt even if didn't meet threshold
            return {
                "output": result["output"],
                "warning": "Did not meet quality threshold",
                "score": result["final_score"]
            }

        return result

    except Exception as e:
        # Return initial generation on failure
        return {
            "output": self._generate(task),
            "error": str(e),
            "fallback": True
        }
```
When to Use This Pattern
✅ Use Evaluator-Optimizer when:
- Output quality is critical
- You can define clear quality criteria
- You have token budget for multiple iterations
- Task is complex enough to benefit from iteration
❌ Avoid when:
- Speed is the priority
- Quality criteria are subjective/unclear
- Output is simple and usually correct
- Token costs are a major concern
Conclusion
The Evaluator-Optimizer Loop transforms inconsistent outputs into consistently high-quality ones:
- Explicit evaluation — Define what "good" means
- Iterative improvement — Fix issues systematically
- Quality guarantees — Meet defined thresholds
Start with simple evaluation criteria. Add specialized evaluators for specific domains. Track optimization history to learn what works.
The agent that evaluates and improves beats the agent that hopes for the best. Every time.
Ready to build self-improving agents? Get started with HopX — sandboxes that let you test and verify outputs in isolation.
Further Reading
- The Reflection Pattern — Self-review without separate optimizer
- ReAct Pattern — Reasoning before each action
- Tool Use Pattern — Tools for evaluation
- Memory Pattern — Remember what improvements work