
Evaluator-Optimizer Loop: Continuous AI Agent Improvement

AI Agents · Alin Dobra · 14 min read


Here's a dirty secret about AI agents: their first output is rarely good enough. But most agents just ship it anyway.

The Evaluator-Optimizer Loop fixes this. It's a pattern where one component evaluates the output and another component improves it based on that evaluation. Repeat until quality meets the bar.

This is how you build agents that consistently produce high-quality outputs, not just occasionally good ones.

What Is the Evaluator-Optimizer Loop?

The pattern separates evaluation from generation:

```text
                Evaluator-Optimizer Loop

   ┌───────────┐        ┌────────────────┐
   │ Generator │───────▶│   Evaluator    │
   │  Creates  │        │  Scores output │
   │  initial  │        │  Finds issues  │
   │  output   │        └───────┬────────┘
   └───────────┘                │
                          Good enough?
                           │         │
                           No       Yes
                           │         │
                           ▼         ▼
                    ┌──────────┐   Output
                    │ Optimizer│
                    │  Fixes   │──▶ back to Evaluator
                    │  issues  │
                    └──────────┘
```

Three components work together:

  1. Generator: Creates the initial output
  2. Evaluator: Scores the output and identifies issues
  3. Optimizer: Improves the output based on feedback

The loop continues until the evaluator says "good enough" or max iterations are reached.
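
Here's the loop's skeleton, stripped of any LLM specifics. The function arguments (`generate`, `evaluate`, `optimize`) and the dict-shaped evaluation are placeholders for whatever components you plug in, not a fixed API:

```python
# Minimal sketch of the control flow; helper functions are placeholders
def evaluator_optimizer_loop(task, generate, evaluate, optimize,
                             min_score=8.0, max_iterations=5):
    output = generate(task)
    for _ in range(max_iterations):
        evaluation = evaluate(task, output)          # score + list of issues
        if evaluation["score"] >= min_score:         # the quality gate
            return output
        output = optimize(task, output, evaluation)  # fix only what's flagged
    return output  # best effort after max_iterations
```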

Why This Pattern Works

1. Separation of Concerns

Generation and evaluation are different cognitive tasks. Separating them lets each component focus:

```python
# Generator mindset: "Create something that works"
# Evaluator mindset: "Find everything wrong with this"
# Optimizer mindset: "Fix these specific issues"
```

An LLM trying to do all three at once often compromises on each.

2. Explicit Quality Gates

Instead of hoping output is good, you define what "good" means:

```python
quality_criteria = {
    "accuracy": "All facts must be verifiable",
    "completeness": "Must address all parts of the question",
    "clarity": "A non-expert should understand",
    "conciseness": "No unnecessary content"
}
```

The evaluator checks each criterion explicitly.
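
One way to keep those checks explicit is to fold the criteria directly into the evaluator's prompt. The prompt wording below is illustrative, not the only way to phrase it:

```python
# Illustrative: build an evaluation prompt straight from the criteria dict
criteria_text = "\n".join(
    f"- {name}: {rule}" for name, rule in quality_criteria.items()
)

evaluation_prompt = (
    "Evaluate the output against each criterion and score it 0-10:\n"
    f"{criteria_text}\n"
    'Return JSON: {"score": 0-10, "issues": [...], "suggestions": [...]}'
)
```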

3. Measurable Improvement

Each iteration targets specific, named issues, so progress is measurable:

```text
Iteration 1: Score 6/10 - Issues: missing examples, too technical
Iteration 2: Score 8/10 - Issues: one factual error
Iteration 3: Score 9/10 - Issues: none critical
✓ Output accepted
```

Basic Implementation

Here's a complete evaluator-optimizer loop:

```python
import openai
import json
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float  # 0-10
    passed: bool
    issues: list[str]
    suggestions: list[str]

class EvaluatorOptimizerAgent:
    def __init__(self, min_score: float = 8.0, max_iterations: int = 5):
        self.client = openai.OpenAI()
        self.min_score = min_score
        self.max_iterations = max_iterations

    def run(self, task: str) -> dict:
        """Generate, evaluate, and optimize until quality threshold met"""

        # Initial generation
        output = self._generate(task)
        iterations = []

        for i in range(self.max_iterations):
            # Evaluate current output
            evaluation = self._evaluate(task, output)

            iterations.append({
                "iteration": i + 1,
                "output_preview": output[:200],
                "score": evaluation.score,
                "issues": evaluation.issues
            })

            # Check if good enough
            if evaluation.passed:
                return {
                    "success": True,
                    "output": output,
                    "final_score": evaluation.score,
                    "iterations": len(iterations),
                    "history": iterations
                }

            # Optimize based on feedback
            output = self._optimize(task, output, evaluation)

        # Max iterations reached: score the final optimized output once more
        # so the reported score matches what is actually returned
        evaluation = self._evaluate(task, output)
        return {
            "success": evaluation.passed,
            "output": output,
            "final_score": evaluation.score,
            "iterations": len(iterations),
            "history": iterations,
            "note": "Max iterations reached"
        }

    def _generate(self, task: str) -> str:
        """Initial generation"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Generate a high-quality response to the task."
            }, {
                "role": "user",
                "content": task
            }]
        )
        return response.choices[0].message.content

    def _evaluate(self, task: str, output: str) -> Evaluation:
        """Evaluate the output quality"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Evaluate this output against the original task.

Score from 0-10 based on:
- Accuracy (are facts correct?)
- Completeness (does it fully address the task?)
- Clarity (is it easy to understand?)
- Quality (is it well-written?)

Return JSON:
{{
    "score": 7.5,
    "issues": ["issue 1", "issue 2"],
    "suggestions": ["suggestion 1", "suggestion 2"]
}}

A score of {self.min_score}+ means it passes."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nOutput to evaluate:\n{output}"
            }],
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)

        return Evaluation(
            score=data["score"],
            passed=data["score"] >= self.min_score,
            issues=data.get("issues", []),
            suggestions=data.get("suggestions", [])
        )

    def _optimize(self, task: str, output: str, evaluation: Evaluation) -> str:
        """Improve output based on evaluation"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Improve the output by addressing the issues identified.
Keep what's already good. Only fix what's broken."""
            }, {
                "role": "user",
                "content": f"""Original task: {task}

Current output:
{output}

Issues to fix:
{json.dumps(evaluation.issues, indent=2)}

Suggestions:
{json.dumps(evaluation.suggestions, indent=2)}

Provide the improved output:"""
            }]
        )
        return response.choices[0].message.content


# Usage
agent = EvaluatorOptimizerAgent(min_score=8.0, max_iterations=3)

result = agent.run(
    "Write a technical explanation of how HTTPS works for a junior developer"
)

print(f"Success: {result['success']}")
print(f"Final score: {result['final_score']}")
print(f"Iterations: {result['iterations']}")
print(f"\nOutput:\n{result['output']}")
```

Specialized Evaluators

Code Quality Evaluator

```python
import json
import openai
from hopx import Sandbox

# Reuses the Evaluation dataclass defined in the basic implementation above
class CodeEvaluator:
    def __init__(self):
        self.client = openai.OpenAI()

    def evaluate(self, code: str, requirements: str) -> Evaluation:
        """Evaluate code quality with actual execution"""

        # Test 1: Does it run?
        execution_result = self._execute_code(code)

        # Test 2: Does it pass tests?
        test_result = self._run_tests(code, requirements)

        # Test 3: Code quality analysis
        quality_result = self._analyze_quality(code)

        # Combine scores
        score = self._calculate_score(execution_result, test_result, quality_result)

        issues = []
        if not execution_result["success"]:
            issues.append(f"Execution error: {execution_result['error']}")
        if not test_result["passed"]:
            issues.extend(test_result["failures"])
        issues.extend(quality_result["issues"])

        return Evaluation(
            score=score,
            passed=score >= 8.0 and execution_result["success"],
            issues=issues,
            suggestions=quality_result.get("suggestions", [])
        )

    def _execute_code(self, code: str) -> dict:
        """Actually run the code"""
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            sandbox.files.write("/app/code.py", code)
            result = sandbox.commands.run("python /app/code.py", timeout=30)

            return {
                "success": result.exit_code == 0,
                "output": result.stdout,
                "error": result.stderr if result.exit_code != 0 else None
            }
        finally:
            sandbox.kill()

    def _run_tests(self, code: str, requirements: str) -> dict:
        """Generate and run tests"""
        # Generate tests based on requirements
        test_code = self._generate_tests(code, requirements)

        sandbox = Sandbox.create(template="code-interpreter")

        try:
            sandbox.files.write("/app/solution.py", code)
            sandbox.files.write("/app/test_solution.py", test_code)
            sandbox.commands.run("pip install pytest -q")

            result = sandbox.commands.run("python -m pytest /app/test_solution.py -v")

            passed = result.exit_code == 0
            failures = self._parse_test_failures(result.stdout) if not passed else []

            return {"passed": passed, "failures": failures}
        finally:
            sandbox.kill()

    def _analyze_quality(self, code: str) -> dict:
        """LLM-based code quality analysis"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Analyze code quality. Check for:
- Bugs and logic errors
- Security issues
- Performance problems
- Readability issues
- Missing error handling

Return JSON: {"score": 0-10, "issues": [...], "suggestions": [...]}"""
            }, {
                "role": "user",
                "content": code
            }],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)
```
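
The class above leaves `_generate_tests`, `_parse_test_failures`, and `_calculate_score` undefined. Here's one possible shape for the last two, purely as an illustration; the weights and string matching are assumptions, not a prescribed scheme:

```python
    # Illustrative CodeEvaluator helpers; the weights below are arbitrary
    def _calculate_score(self, execution_result, test_result, quality_result) -> float:
        """Blend runtime success, test results, and the LLM quality score into 0-10."""
        score = 0.0
        if execution_result["success"]:
            score += 4.0                               # it runs at all
        if test_result["passed"]:
            score += 3.0                               # generated tests pass
        score += quality_result.get("score", 0) * 0.3  # scale the LLM's 0-10 down to 0-3
        return round(score, 1)

    def _parse_test_failures(self, pytest_output: str) -> list[str]:
        """Pull the FAILED lines out of pytest's verbose output."""
        return [line.strip() for line in pytest_output.splitlines() if "FAILED" in line]
```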
 

Writing Quality Evaluator

```python
class WritingEvaluator:
    def __init__(self):
        self.client = openai.OpenAI()
        self.criteria = {
            "accuracy": {"weight": 0.25, "description": "Facts are correct and verifiable"},
            "clarity": {"weight": 0.25, "description": "Easy to understand"},
            "structure": {"weight": 0.20, "description": "Well-organized with clear flow"},
            "engagement": {"weight": 0.15, "description": "Interesting and holds attention"},
            "grammar": {"weight": 0.15, "description": "No spelling or grammar errors"}
        }

    def evaluate(self, text: str, context: str) -> Evaluation:
        """Multi-dimensional writing evaluation"""

        scores = {}
        all_issues = []
        all_suggestions = []

        # Evaluate each criterion
        for criterion, config in self.criteria.items():
            result = self._evaluate_criterion(text, context, criterion, config["description"])
            scores[criterion] = result["score"]
            all_issues.extend(result.get("issues", []))
            all_suggestions.extend(result.get("suggestions", []))

        # Calculate weighted score
        total_score = sum(
            scores[c] * self.criteria[c]["weight"]
            for c in self.criteria
        )

        return Evaluation(
            score=total_score,
            passed=total_score >= 8.0 and all(s >= 6.0 for s in scores.values()),
            issues=all_issues,
            suggestions=all_suggestions
        )

    def _evaluate_criterion(self, text: str, context: str, criterion: str, description: str) -> dict:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Evaluate this text for {criterion}: {description}

Context: {context}

Text:
{text}

Return JSON: {{"score": 0-10, "issues": [...], "suggestions": [...]}}"""
            }],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)
```

Advanced Patterns

Multi-Evaluator Ensemble

Use multiple evaluators and combine their judgments:

```python
class EnsembleEvaluator:
    def __init__(self, evaluators: list):
        self.evaluators = evaluators

    def evaluate(self, output: str, context: str) -> Evaluation:
        """Combine multiple evaluator opinions"""

        all_evaluations = []

        for evaluator in self.evaluators:
            eval_result = evaluator.evaluate(output, context)
            all_evaluations.append(eval_result)

        # Aggregate scores (weighted average or voting)
        avg_score = sum(e.score for e in all_evaluations) / len(all_evaluations)

        # Collect all unique issues
        all_issues = list(set(
            issue for e in all_evaluations for issue in e.issues
        ))

        # Consensus on pass/fail
        passes = sum(1 for e in all_evaluations if e.passed)
        majority_pass = passes > len(all_evaluations) / 2

        return Evaluation(
            score=avg_score,
            passed=majority_pass,
            issues=all_issues,
            suggestions=[s for e in all_evaluations for s in e.suggestions]
        )


# Usage (AccuracyEvaluator, ClarityEvaluator, StyleEvaluator are
# domain-specific evaluators you define yourself)
ensemble = EnsembleEvaluator([
    AccuracyEvaluator(),
    ClarityEvaluator(),
    StyleEvaluator()
])
```
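
The evaluators you pass in just need an `evaluate(output, context)` method that returns an `Evaluation`. A minimal single-criterion evaluator might look like this; the class name, prompt, and threshold are illustrative:

```python
class ClarityEvaluator:
    """Illustrative single-criterion evaluator; prompt and threshold are assumptions."""
    def __init__(self, min_score: float = 8.0):
        self.client = openai.OpenAI()
        self.min_score = min_score

    def evaluate(self, output: str, context: str) -> Evaluation:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Rate the clarity of this text for the given context.\n\n"
                           f"Context: {context}\n\nText:\n{output}\n\n"
                           'Return JSON: {"score": 0-10, "issues": [...], "suggestions": [...]}'
            }],
            response_format={"type": "json_object"}
        )
        data = json.loads(response.choices[0].message.content)
        return Evaluation(
            score=data["score"],
            passed=data["score"] >= self.min_score,
            issues=data.get("issues", []),
            suggestions=data.get("suggestions", [])
        )
```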
 

Progressive Quality Gates

Different quality bars for different stages:

```python
class ProgressiveOptimizer:
    def __init__(self):
        self.client = openai.OpenAI()  # used by the evaluation/optimization helpers
        self.quality_gates = [
            {"name": "basic", "min_score": 5.0, "focus": ["correctness"]},
            {"name": "good", "min_score": 7.0, "focus": ["correctness", "clarity"]},
            {"name": "excellent", "min_score": 9.0, "focus": ["correctness", "clarity", "polish"]}
        ]

    def run(self, task: str, target_quality: str = "good") -> str:
        """Progressively improve through quality gates"""

        output = self._generate(task)

        target_gate = next(g for g in self.quality_gates if g["name"] == target_quality)
        target_index = self.quality_gates.index(target_gate)

        # Progress through each gate up to target
        for gate in self.quality_gates[:target_index + 1]:
            output = self._optimize_for_gate(task, output, gate)

        return output

    def _optimize_for_gate(self, task: str, output: str, gate: dict) -> str:
        """Optimize until this gate's criteria are met"""

        for _ in range(3):  # Max attempts per gate
            evaluation = self._evaluate_for_gate(output, gate)

            if evaluation.score >= gate["min_score"]:
                print(f"✓ Passed {gate['name']} gate ({evaluation.score:.1f})")
                return output

            output = self._optimize(task, output, evaluation, gate["focus"])

        print(f"⚠ Could not pass {gate['name']} gate")
        return output
```
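
`_generate`, `_evaluate_for_gate`, and `_optimize` follow the same shape as the basic implementation. Here's a sketch of the gate-aware evaluator; the prompt wording is an assumption:

```python
    # Illustrative ProgressiveOptimizer helper
    def _evaluate_for_gate(self, output: str, gate: dict) -> Evaluation:
        """Score the output only against the criteria this gate cares about."""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Score this output 0-10 on these criteria only: {', '.join(gate['focus'])}.\n\n"
                           f"{output}\n\n"
                           'Return JSON: {"score": 0-10, "issues": [...], "suggestions": [...]}'
            }],
            response_format={"type": "json_object"}
        )
        data = json.loads(response.choices[0].message.content)
        return Evaluation(
            score=data["score"],
            passed=data["score"] >= gate["min_score"],
            issues=data.get("issues", []),
            suggestions=data.get("suggestions", [])
        )
```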
 

Optimization with Memory

Remember what works and what doesn't:

```python
class LearningOptimizer:
    def __init__(self):
        self.client = openai.OpenAI()
        self.improvement_history = []  # What worked before
        self.failure_patterns = []     # What didn't work

    def optimize(self, task: str, output: str, evaluation: Evaluation) -> str:
        # Learn from history
        relevant_successes = self._find_relevant_successes(evaluation.issues)
        patterns_to_avoid = self._find_failure_patterns(evaluation.issues)

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Improve this output.

Issues to fix:
{json.dumps(evaluation.issues)}

Strategies that worked before for similar issues:
{json.dumps(relevant_successes)}

Approaches to AVOID (they didn't work):
{json.dumps(patterns_to_avoid)}"""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nCurrent output:\n{output}"
            }]
        )

        improved = response.choices[0].message.content

        # Track this attempt
        self._record_attempt(evaluation.issues, improved)

        return improved

    def record_success(self, issues: list, solution: str):
        """Record a successful optimization for future reference"""
        self.improvement_history.append({
            "issues": issues,
            "solution_approach": self._extract_approach(solution)
        })

    def record_failure(self, issues: list, failed_approach: str):
        """Record what didn't work"""
        self.failure_patterns.append({
            "issues": issues,
            "failed_approach": failed_approach
        })
```
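
The retrieval helpers (`_find_relevant_successes`, `_find_failure_patterns`) and the bookkeeping helpers (`_record_attempt`, `_extract_approach`) aren't shown. A crude but workable sketch of the retrieval side, using keyword overlap; the matching strategy is an assumption, nothing more:

```python
    # Illustrative LearningOptimizer helpers; keyword overlap is just one option
    def _find_relevant_successes(self, issues: list[str]) -> list[dict]:
        """Return past successes whose issues share words with the current ones."""
        current_words = {w.lower() for issue in issues for w in issue.split()}
        return [
            entry for entry in self.improvement_history
            if current_words & {w.lower() for i in entry["issues"] for w in i.split()}
        ]

    def _find_failure_patterns(self, issues: list[str]) -> list[dict]:
        """Same matching, but against recorded failures."""
        current_words = {w.lower() for issue in issues for w in issue.split()}
        return [
            entry for entry in self.failure_patterns
            if current_words & {w.lower() for i in entry["issues"] for w in i.split()}
        ]
```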
 

Real-World Example: Article Generator

A complete article generator with evaluation and optimization:

```python
from hopx import Sandbox
import openai
import json

class ArticleGenerator:
    def __init__(self):
        self.client = openai.OpenAI()
        self.min_score = 8.5
        self.max_iterations = 4

    def generate(self, topic: str, requirements: dict) -> dict:
        """Generate a high-quality article through iterative improvement"""

        # Phase 1: Initial draft
        draft = self._create_draft(topic, requirements)

        # Phase 2: Iterative improvement
        for iteration in range(self.max_iterations):
            print(f"\n--- Iteration {iteration + 1} ---")

            # Evaluate
            evaluation = self._evaluate_article(draft, topic, requirements)
            print(f"Score: {evaluation.score}/10")
            print(f"Issues: {evaluation.issues}")

            if evaluation.passed:
                print("✓ Article meets quality bar")
                break

            # Optimize
            draft = self._improve_article(draft, evaluation, requirements)

        # Phase 3: Final polish
        final = self._polish(draft)

        # Verify code examples if present
        if "```python" in final:
            final = self._verify_code_examples(final)

        return {
            "article": final,
            "iterations": iteration + 1,
            "final_score": evaluation.score
        }

    def _create_draft(self, topic: str, requirements: dict) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Write a technical article.

Requirements:
- Length: {requirements.get('length', '1500-2000')} words
- Audience: {requirements.get('audience', 'developers')}
- Style: {requirements.get('style', 'informative but engaging')}
- Include: code examples, practical tips

Structure:
1. Hook/Introduction
2. Main content (3-5 sections)
3. Practical examples
4. Conclusion with actionable takeaways"""
            }, {
                "role": "user",
                "content": f"Topic: {topic}"
            }]
        )
        return response.choices[0].message.content

    def _evaluate_article(self, article: str, topic: str, requirements: dict) -> Evaluation:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Evaluate this article rigorously.

Criteria (score each 0-10):
1. Technical accuracy - Are all facts and code correct?
2. Completeness - Does it cover the topic adequately?
3. Clarity - Is it easy to follow?
4. Engagement - Is it interesting to read?
5. Actionability - Can readers apply what they learned?
6. SEO - Are headings and structure optimized?

Requirements to check:
{json.dumps(requirements)}

Return JSON:
{{
    "scores": {{"accuracy": 8, "completeness": 7, ...}},
    "overall_score": 7.5,
    "issues": ["specific issue 1", "specific issue 2"],
    "suggestions": ["specific suggestion 1"]
}}"""
            }, {
                "role": "user",
                "content": f"Topic: {topic}\n\nArticle:\n{article}"
            }],
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)

        return Evaluation(
            score=data["overall_score"],
            passed=data["overall_score"] >= self.min_score,
            issues=data["issues"],
            suggestions=data["suggestions"]
        )

    def _improve_article(self, article: str, evaluation: Evaluation, requirements: dict) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Improve the article by fixing the identified issues.
Maintain the overall structure and good parts.
Focus specifically on the issues listed."""
            }, {
                "role": "user",
                "content": f"""Current article:
{article}

Issues to fix:
{json.dumps(evaluation.issues, indent=2)}

Suggestions to consider:
{json.dumps(evaluation.suggestions, indent=2)}

Provide the improved article:"""
            }]
        )
        return response.choices[0].message.content

    def _polish(self, article: str) -> str:
        """Final polish pass"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Polish this article:
- Fix any remaining typos or grammar issues
- Ensure smooth transitions between sections
- Verify formatting is consistent

Article:
{article}"""
            }]
        )
        return response.choices[0].message.content

    def _verify_code_examples(self, article: str) -> str:
        """Extract and test all code examples"""
        import re

        code_blocks = re.findall(r'```python\n(.*?)```', article, re.DOTALL)

        sandbox = Sandbox.create(template="code-interpreter")

        try:
            for i, code in enumerate(code_blocks):
                sandbox.files.write(f"/app/example_{i}.py", code)
                result = sandbox.commands.run(f"python /app/example_{i}.py")

                if result.exit_code != 0:
                    # Fix the code
                    fixed_code = self._fix_code(code, result.stderr)
                    article = article.replace(f"```python\n{code}```", f"```python\n{fixed_code}```")

            return article
        finally:
            sandbox.kill()


# Usage
generator = ArticleGenerator()

result = generator.generate(
    topic="Building RESTful APIs with FastAPI",
    requirements={
        "length": "2000-2500 words",
        "audience": "intermediate Python developers",
        "style": "practical tutorial",
        "must_include": ["authentication", "database integration", "testing"]
    }
)

print(f"Generated in {result['iterations']} iterations")
print(f"Final score: {result['final_score']}")
print(result["article"])
```
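
`_fix_code` is referenced inside `_verify_code_examples` but never shown. A minimal sketch, with the prompt wording being an assumption:

```python
    # Illustrative ArticleGenerator helper
    def _fix_code(self, code: str, error: str) -> str:
        """Ask the model to repair a code example given the error it produced."""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"This code example fails with the error below. "
                           f"Return only the corrected code, no explanation.\n\n"
                           f"Code:\n{code}\n\nError:\n{error}"
            }]
        )
        return response.choices[0].message.content
```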
 

Best Practices

1. Define Clear Evaluation Criteria

```python
# ❌ Vague criteria
criteria = ["make it good", "improve quality"]

# ✅ Specific, measurable criteria
criteria = {
    "accuracy": {
        "description": "All facts verifiable, no hallucinations",
        "min_score": 9,
        "examples": ["dates correct", "quotes accurate", "statistics cited"]
    },
    "completeness": {
        "description": "Addresses all aspects of the prompt",
        "min_score": 8,
        "examples": ["all questions answered", "no missing sections"]
    }
}
```

2. Limit Iterations

```python
import time

class BoundedOptimizer:
    def __init__(self, max_iterations: int = 5, timeout_seconds: int = 60):
        self.max_iterations = max_iterations
        self.timeout = timeout_seconds

    def run(self, task: str) -> str:
        start_time = time.time()
        output = self._generate(task)
        previous_score = 0.0

        for i in range(self.max_iterations):
            # Check timeout
            if time.time() - start_time > self.timeout:
                print("Timeout reached")
                break

            evaluation = self._evaluate(task, output)
            score_improvement = evaluation.score - previous_score
            previous_score = evaluation.score

            # Check diminishing returns
            if i > 2 and score_improvement < 0.5:
                print("Diminishing returns, stopping")
                break

            output = self._optimize(task, output, evaluation)

        return output
```

3. Track Optimization History

```python
from datetime import datetime

def run_with_tracking(self, task: str) -> dict:
    history = []
    output = self._generate(task)

    for i in range(self.max_iterations):
        evaluation = self._evaluate(output)

        history.append({
            "iteration": i,
            "score": evaluation.score,
            "issues_count": len(evaluation.issues),
            "output_length": len(output),
            "timestamp": datetime.now().isoformat()
        })

        if evaluation.passed:
            break

        # Detect if stuck
        if i > 1 and history[-1]["score"] == history[-2]["score"]:
            # Try a different optimization strategy
            output = self._alternative_optimize(output, evaluation)
        else:
            output = self._optimize(task, output, evaluation)

    return {"output": output, "history": history}
```

4. Fail Gracefully

```python
def run_with_fallback(self, task: str) -> dict:
    try:
        result = self._optimize_loop(task)

        if not result["success"]:
            # Return the best attempt even if it didn't meet the threshold
            return {
                "output": result["output"],
                "warning": "Did not meet quality threshold",
                "score": result["final_score"]
            }

        return result

    except Exception as e:
        # Return initial generation on failure
        return {
            "output": self._generate(task),
            "error": str(e),
            "fallback": True
        }
```

When to Use This Pattern

Use Evaluator-Optimizer when:

  • Output quality is critical
  • You can define clear quality criteria
  • You have token budget for multiple iterations
  • Task is complex enough to benefit from iteration

Avoid when:

  • Speed is the priority
  • Quality criteria are subjective/unclear
  • Output is simple and usually correct
  • Token costs are a major concern

Conclusion

The Evaluator-Optimizer Loop transforms inconsistent outputs into consistently high-quality ones:

  • Explicit evaluation — Define what "good" means
  • Iterative improvement — Fix issues systematically
  • Quality guarantees — Meet defined thresholds

Start with simple evaluation criteria. Add specialized evaluators for specific domains. Track optimization history to learn what works.

The agent that evaluates and improves beats the agent that hopes for the best. Every time.


Ready to build self-improving agents? Get started with HopX — sandboxes that let you test and verify outputs in isolation.

Further Reading