Skip to content

Automated Prompt Optimization Script

Appendix: Implementation Reference

This script implements the automated I/T/E optimization workflow described in the Prompt Engineering Field Guide. It takes a failing prompt, runs diagnostic analysis, generates variations, tests them systematically, and returns the optimized prompt.

What This Script Does

  1. Accepts a failing prompt + test cases from the user
  2. Uses a reasoning model (Claude Opus) to diagnose which variable (I/T/E) is broken
  3. Generates 3 variations (changing only the diagnosed variable)
  4. Runs all variations against your test set
  5. Uses LLM-as-Judge to score results
  6. Returns the winning variation with performance metrics

Requirements

  • Python 3.8+
  • Anthropic API key
  • Libraries: anthropic, json, typing

Installation

pip install anthropic
export ANTHROPIC_API_KEY='your-api-key-here'

Usage

python Automated_Prompt_Optimization.py --prompt "your_prompt.txt" --tests "test_cases.json"

Source Code

The full implementation is available in the repository:

View Automated_Prompt_Optimization.py on GitHub

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
"""Appendix A: Automated Prompt Optimization Script
Overview
This script implements the automated I/T/E optimization workflow described in Section 5.5. It takes a failing prompt, runs diagnostic analysis, generates variations, tests them systematically, and returns the optimized prompt.
What this script does:

Accepts a failing prompt + test cases from the user
Uses a reasoning model (Claude Opus) to diagnose which variable (I/T/E) is broken
Generates 3 variations (changing only the diagnosed variable)
Runs all variations against your test set
Uses LLM-as-Judge to score results
Returns the winning variation with performance metrics

Requirements:

Python 3.8+
Anthropic API key
Libraries: anthropic, json, typing"""

"""Installation:

    pip install anthropic

Set your API key:

    export ANTHROPIC_API_KEY='your-api-key-here'
"""

# The complete script
#!/usr/bin/env python3
"""
Automated I/T/E Prompt Optimizer
Implements the optimization framework from Section 5.5

Usage:
    python optimize_prompt.py --prompt "your_prompt.txt" --tests "test_cases.json"
"""

import argparse
import json
import os
import time
import random
from typing import List, Dict, Tuple, Optional
from anthropic import Anthropic, APIError, APITimeoutError, RateLimitError

# Initialize Anthropic client
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Configuration - Current as of January 2026
META_MODEL = "claude-opus-4-5-20251101"      # Opus 4.5 - Most capable for diagnosis
TEST_MODEL = "claude-sonnet-4-5-20250929"    # Sonnet 4.5 - Balanced for testing
JUDGE_MODEL = "claude-sonnet-4-5-20250929"   # Sonnet 4.5 - Fast, reliable judging


def call_with_retry(api_call_func, max_retries=3, context="API call"):
    """Call API with exponential backoff on transient failures.

    Args:
        api_call_func: Lambda/function that makes the API call
        max_retries: Maximum retry attempts
        context: Description for logging

    Returns:
        API response

    Raises:
        Last encountered error if all retries fail
    """
    for attempt in range(max_retries):
        try:
            return api_call_func()

        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Rate limits need longer backoff
            wait = min(60, (2 ** attempt) * 10)  # 10s, 20s, 40s (max 60s)
            jitter = random.uniform(0, 0.1 * wait)
            total_wait = wait + jitter
            print(f"⚠️  Rate limit hit during {context}")
            print(f"   Waiting {total_wait:.1f}s before retry {attempt + 2}/{max_retries}...")
            time.sleep(total_wait)

        except (APITimeoutError, APIError) as e:
            if attempt == max_retries - 1:
                raise
            # Network errors need shorter backoff
            wait = 2 ** attempt  # 1s, 2s, 4s
            jitter = random.uniform(0, 0.1 * wait)
            total_wait = wait + jitter
            print(f"⚠️  {type(e).__name__} during {context}: {e}")
            print(f"   Retrying in {total_wait:.1f}s... ({attempt + 2}/{max_retries})")
            time.sleep(total_wait)

        except Exception as e:
            # Unknown errors - don't retry
            print(f"❌ Unexpected error during {context}: {e}")
            raise


class TestCase:
    """Represents a single test case with input, expected output, and actual output."""

    def __init__(self, input_text: str, expected_output: str, actual_output: str = None):
        self.input = input_text
        self.expected = expected_output
        self.actual = actual_output

    def to_dict(self) -> Dict:
        return {
            "input": self.input,
            "expected": self.expected,
            "actual": self.actual
        }


class PromptOptimizer:
    """Main optimizer class that orchestrates the I/T/E optimization workflow."""

    def __init__(self, failing_prompt: str, test_cases: List[TestCase],
                 success_criteria: str, num_variations: int = 3,
                 test_combinations: bool = False, skip_baseline: bool = False):
        self.original_prompt = failing_prompt
        self.test_cases = test_cases
        self.success_criteria = success_criteria
        self.num_variations = num_variations
        self.test_combinations = test_combinations
        self.skip_baseline = skip_baseline
        self.diagnosis_result = None
        self.variations = []
        self.results = []
        self.baseline_score = None

    def validate_test_cases(self) -> Tuple[bool, List[str]]:
        """Validate test case quality and coverage.

        Returns:
            (is_valid, list_of_warnings)
        """
        warnings = []

        # Rule 1: Minimum quantity
        if len(self.test_cases) < 5:
            warnings.append(
                f"Only {len(self.test_cases)} test cases. Recommend 10+ for reliable optimization."
            )

        # Rule 2: Input diversity (simple heuristic)
        inputs = [tc.input for tc in self.test_cases]
        unique_words = set()
        for inp in inputs:
            unique_words.update(inp.lower().split())

        if len(unique_words) < len(inputs) * 3:  # Expect ~3 unique words per test
            warnings.append(
                "Test inputs may lack diversity. Consider adding more varied scenarios."
            )

        # Rule 3: Expected output specificity
        for i, tc in enumerate(self.test_cases):
            if len(tc.expected.split()) < 5:
                warnings.append(
                    f"Test {i+1}: Expected output is very short ('{tc.expected}'). "
                    "Vague expectations make judging difficult."
                )

        # Rule 4: Check for contradictions (LLM-based, optional)
        if len(self.test_cases) >= 3:
            contradiction_check = self._check_contradictions()
            if contradiction_check:
                warnings.append(f"Potential contradiction: {contradiction_check}")

        # Rule 5: Edge case coverage (heuristic)
        edge_case_keywords = ['empty', 'null', 'invalid', 'missing', 'error',
                             'edge', 'boundary', 'max', 'min']
        has_edge_cases = any(
            any(keyword in tc.input.lower() for keyword in edge_case_keywords)
            for tc in self.test_cases
        )
        if not has_edge_cases:
            warnings.append(
                "No obvious edge cases detected. Consider adding tests for "
                "empty inputs, errors, or boundary conditions."
            )

        is_valid = len(warnings) == 0 or len(self.test_cases) >= 5
        return is_valid, warnings

    def _check_contradictions(self) -> Optional[str]:
        """Use LLM to detect contradictory test cases (optional, costs 1 API call)."""

        test_summary = "\n".join([
            f"{i+1}. Input: {tc.input} → Expected: {tc.expected}"
            for i, tc in enumerate(self.test_cases[:10])  # Check first 10
        ])

        prompt = f"""Review these test cases for contradictions:

{test_summary}

Do any test cases contradict each other? For example:
- Same input expecting different outputs
- Requirements that are mutually exclusive
- Inconsistent expectations

If you find contradictions, return JSON:
{{"has_contradiction": true, "description": "Test 2 expects X but Test 5 expects Y for similar input"}}

If no contradictions, return:
{{"has_contradiction": false}}

Output ONLY valid JSON."""

        try:
            response = call_with_retry(
                lambda: client.messages.create(
                    model=JUDGE_MODEL,
                    max_tokens=500,
                    messages=[{"role": "user", "content": prompt}]
                ),
                context="contradiction check"
            )
            result = json.loads(response.content[0].text)
            if result.get("has_contradiction"):
                return result["description"]
        except:
            pass  # Contradiction check is optional, fail silently

        return None

    def run_optimization(self) -> Dict:
        """Main optimization pipeline."""
        print("=" * 80)
        print("AUTOMATED I/T/E PROMPT OPTIMIZATION")
        print("=" * 80)

        # Validation: Check test case quality
        print("\n[Validation] Checking test case quality...")
        is_valid, warnings = self.validate_test_cases()

        if warnings:
            print("⚠️  Test Case Warnings:")
            for w in warnings:
                print(f"   - {w}")

            if not is_valid:
                response = input("\nContinue anyway? (y/n): ")
                if response.lower() != 'y':
                    print("Aborted. Please improve test cases and re-run.")
                    return None
        else:
            print("✓ Test cases look good")

        # Step 0: Test baseline (unless skipped)
        if not self.skip_baseline:
            print("\n[Step 0/6] Testing original prompt for baseline...")
            baseline_result = self._test_single_variation({
                "variation_id": 0,
                "modified_prompt": self.original_prompt,
                "change_description": "Original (baseline)",
                "test_hypothesis": "Baseline performance"
            })
            baseline_score = self._score_single_result(baseline_result)
            print(f"✓ Baseline score: {baseline_score['overall_score']:.1f}/100")

            self.baseline_score = baseline_score['overall_score']

            # Early exit if baseline is already good
            if baseline_score['overall_score'] >= 90:
                print("\n🎉 Original prompt already performs well (≥90%). No optimization needed.")
                return self._format_final_report({
                    "variation_id": 0,
                    "optimized_prompt": self.original_prompt,
                    "overall_score": baseline_score['overall_score'],
                    "improvement": 0,
                    "change_description": "No changes needed",
                    "pass_rate": baseline_score.get('pass_rate', 'N/A'),
                    "strengths": baseline_score.get('strengths', []),
                    "weaknesses": baseline_score.get('weaknesses', [])
                })
        else:
            self.baseline_score = 50  # Fallback assumption if baseline skipped

        # Step 1: Diagnose the failure
        print("\n[Step 1/6] Diagnosing failure type...")
        self.diagnosis_result = self._diagnose_failure()
        print(f"✓ Diagnosis complete: {self.diagnosis_result['failure_type']}")
        print(f"  Broken Variable: {self.diagnosis_result['broken_variable']}")
        print(f"  Confidence: {self.diagnosis_result['confidence']}")

        # Step 2: Generate variations
        print(f"\n[Step 2/6] Generating {self.num_variations} prompt variations...")
        self.variations = self._generate_variations()
        print(f"✓ Generated {len(self.variations)} variations")

        # Step 3: Test all variations
        print("\n[Step 3/6] Testing variations against test set...")
        self.results = self._test_variations()
        print(f"✓ Tested all variations ({len(self.test_cases)} test cases each)")

        # Step 4: Score with LLM-as-Judge
        print("\n[Step 4/6] Scoring results with LLM-as-Judge...")
        scored_results = self._score_results()
        print("✓ Scoring complete")

        # Step 5: Select winner
        print("\n[Step 5/6] Selecting best variation...")
        winner = self._select_winner(scored_results)
        print(f"✓ Winner: Variation {winner['variation_id']}")
        print(f"  Score: {winner['score']:.2f}/100")
        print(f"  Improvement: +{winner['improvement']:.1f}%")

        # Step 6: Optional combination testing
        if self.test_combinations:
            print("\n[Step 6/6] Testing variable combinations (Phase 2)...")
            combo_winner = self._test_combinations(winner)
            if combo_winner and combo_winner['overall_score'] > winner['overall_score']:
                print(f"✓ Combination improved score to {combo_winner['overall_score']:.2f}/100")
                winner = combo_winner
            else:
                print("✓ No combination outperformed the single-variable optimization")
        else:
            print("\n[Step 6/6] Skipping combination testing (use --test-combinations to enable)")

        return self._format_final_report(winner)

    def _test_single_variation(self, variation: Dict) -> Dict:
        """Test a single variation against all test cases."""
        variation_results = {
            "variation_id": variation["variation_id"],
            "test_outputs": []
        }

        for test_case in self.test_cases:
            # Run the test with this variation's prompt
            test_prompt = variation["modified_prompt"]

            response = call_with_retry(
                lambda: client.messages.create(
                    model=TEST_MODEL,
                    max_tokens=2000,
                    messages=[{
                        "role": "user",
                        "content": f"{test_prompt}\n\nInput: {test_case.input}"
                    }]
                ),
                context=f"testing variation {variation['variation_id']}"
            )

            output = response.content[0].text

            variation_results["test_outputs"].append({
                "input": test_case.input,
                "expected": test_case.expected,
                "actual": output
            })

        return variation_results

    def _score_single_result(self, result: Dict) -> Dict:
        """Score a single variation's results."""
        judge_prompt = f"""You are an LLM Judge evaluating prompt optimization results.

Success Criteria:
{self.success_criteria}

Evaluate these test results and provide a score:

{json.dumps(result['test_outputs'], indent=2)}

Provide your evaluation as JSON:
{{
  "overall_score": 0-100,
  "pass_rate": "X/Y tests passed",
  "strengths": ["what this variation does well"],
  "weaknesses": ["remaining issues"],
  "meets_criteria": true/false
}}

Output ONLY valid JSON."""

        response = call_with_retry(
            lambda: client.messages.create(
                model=JUDGE_MODEL,
                max_tokens=1500,
                messages=[{"role": "user", "content": judge_prompt}]
            ),
            context="judging results"
        )

        score = json.loads(response.content[0].text)
        score["variation_id"] = result["variation_id"]
        return score

    def _diagnose_failure(self) -> Dict:
        """Use meta-model to diagnose which variable (I/T/E) is broken."""

        # Format test case failures for diagnosis
        failures_text = "\n\n".join([
            f"Test Case {i+1}:\nInput: {tc.input}\nExpected: {tc.expected}\nActual: {tc.actual}"
            for i, tc in enumerate(self.test_cases)
        ])

        diagnostic_prompt = f"""You are an Expert Prompt Optimization Agent specializing in the I/T/E framework.

# Background: The I/T/E Framework
Every agent prompt decomposes into three independent variables:
- I (Instructions): The task directive and behavioral rules
- T (Thoughts/State): The reasoning structure or state file format  
- E (Exemplars): Few-shot examples demonstrating correct behavior

# Current Failing Prompt
{self.original_prompt}

# Observed Failures
{failures_text}

# Success Criteria
{self.success_criteria}

# Your Task
Analyze the failures and provide a diagnosis in JSON format:

{{
  "failure_type": "Structure Failure" | "Logic Failure" | "Syntax Failure",
  "broken_variable": "I" | "T" | "E",
  "confidence": "High" | "Medium" | "Low",
  "reasoning": "Explain what in the failure examples led to this diagnosis",
  "diagnostic_evidence": ["specific symptom 1", "specific symptom 2", ...]
}}

Output ONLY valid JSON, no other text."""

        response = call_with_retry(
            lambda: client.messages.create(
                model=META_MODEL,
                max_tokens=2000,
                messages=[{"role": "user", "content": diagnostic_prompt}]
            ),
            context="diagnosis"
        )

        # Parse JSON response
        diagnosis = json.loads(response.content[0].text)
        return diagnosis

    def _generate_variations(self) -> List[Dict]:
        """Generate variations based on the diagnosed broken variable."""

        variation_prompt = f"""Based on this diagnosis:

Failure Type: {self.diagnosis_result['failure_type']}
Broken Variable: {self.diagnosis_result['broken_variable']}

Original Prompt:
{self.original_prompt}

Generate {self.num_variations} variations that ONLY modify Variable {self.diagnosis_result['broken_variable']}.
Lock the other two variables (keep them unchanged).

Return as JSON array:
[
  {{
    "variation_id": 1,
    "modified_prompt": "full revised prompt",
    "change_description": "what specifically changed",
    "test_hypothesis": "what this should fix"
  }},
  ...
]

Output ONLY valid JSON, no other text."""

        response = call_with_retry(
            lambda: client.messages.create(
                model=META_MODEL,
                max_tokens=4000,
                messages=[{"role": "user", "content": variation_prompt}]
            ),
            context="variation generation"
        )

        variations = json.loads(response.content[0].text)
        return variations

    def _test_variations(self) -> List[Dict]:
        """Run each variation against all test cases."""

        results = []

        for var_idx, variation in enumerate(self.variations):
            print(f"  Testing Variation {var_idx + 1}/{len(self.variations)}...")
            variation_results = self._test_single_variation(variation)
            results.append(variation_results)

        return results

    def _score_results(self) -> List[Dict]:
        """Use LLM-as-Judge to score each variation's performance."""

        scored = []

        for result in self.results:
            score = self._score_single_result(result)
            scored.append(score)

        return scored

    def _select_winner(self, scored_results: List[Dict]) -> Dict:
        """Select the best-performing variation."""

        # Sort by overall_score
        sorted_results = sorted(scored_results,
                              key=lambda x: x["overall_score"],
                              reverse=True)

        winner = sorted_results[0]

        # Calculate improvement over baseline
        if self.baseline_score and self.baseline_score > 0:
            improvement = ((winner["overall_score"] - self.baseline_score) / self.baseline_score) * 100
        else:
            improvement = 0

        winner["improvement"] = improvement
        winner["score"] = winner["overall_score"]

        # Attach the winning variation's prompt
        winning_variation = next(v for v in self.variations
                                if v["variation_id"] == winner["variation_id"])
        winner["optimized_prompt"] = winning_variation["modified_prompt"]
        winner["change_description"] = winning_variation["change_description"]

        return winner

    def _test_combinations(self, current_winner: Dict) -> Optional[Dict]:
        """Phase 2: Test combinations of winning variables (optional).

        Tests:
        - Best I + Best T (keep original E)
        - Best I + Best E (keep original T)
        - Best T + Best E (keep original I)
        """
        print("  Generating combination variations...")

        # Extract the best variation for each variable type
        # For now, we'll use the current winner for the diagnosed variable
        # and keep originals for the others
        broken_var = self.diagnosis_result['broken_variable']
        best_variation = next(v for v in self.variations
                             if v["variation_id"] == current_winner["variation_id"])

        # Create combinations by merging the optimized variable with originals
        # Note: This is a simplified implementation. A full implementation would
        # need to parse and extract I/T/E from prompts, which is complex.
        # For now, we'll generate new combination prompts via LLM

        combo_prompt = f"""You previously optimized variable {broken_var} in this prompt:

Original Prompt:
{self.original_prompt}

Optimized {broken_var}:
{best_variation['modified_prompt']}

Now generate 3 combination variations that test interactions:
1. Optimize both I and T (keep original E)
2. Optimize both I and E (keep original T)
3. Optimize both T and E (keep original I)

Since you previously optimized {broken_var}, incorporate that improvement into the relevant combinations.

Return as JSON array:
[
  {{
    "variation_id": "combo_1",
    "modified_prompt": "full combined prompt",
    "change_description": "I+T combination",
    "test_hypothesis": "Tests I/T interaction"
  }},
  ...
]

Output ONLY valid JSON, no other text."""

        try:
            response = call_with_retry(
                lambda: client.messages.create(
                    model=META_MODEL,
                    max_tokens=4000,
                    messages=[{"role": "user", "content": combo_prompt}]
                ),
                context="combination generation"
            )

            combinations = json.loads(response.content[0].text)

            # Test each combination
            print(f"  Testing {len(combinations)} combinations...")
            combo_results = []

            for combo in combinations:
                result = self._test_single_variation(combo)
                score = self._score_single_result(result)
                combo_results.append(score)

            # Find best combination
            best_combo = max(combo_results, key=lambda x: x["overall_score"])

            # Attach prompt and metadata
            combo_variation = next(c for c in combinations
                                  if c["variation_id"] == best_combo["variation_id"])
            best_combo["optimized_prompt"] = combo_variation["modified_prompt"]
            best_combo["change_description"] = combo_variation["change_description"]
            best_combo["score"] = best_combo["overall_score"]

            # Calculate improvement over baseline
            if self.baseline_score and self.baseline_score > 0:
                best_combo["improvement"] = ((best_combo["overall_score"] - self.baseline_score) / self.baseline_score) * 100
            else:
                best_combo["improvement"] = 0

            return best_combo

        except Exception as e:
            print(f"⚠️  Combination testing failed: {e}")
            return None

    def _format_final_report(self, winner: Dict) -> Dict:
        """Format the final optimization report."""

        report = {
            "original_prompt": self.original_prompt,
            "diagnosis": self.diagnosis_result,
            "optimized_prompt": winner["optimized_prompt"],
            "performance": {
                "score": winner["overall_score"],
                "improvement": winner["improvement"],
                "pass_rate": winner["pass_rate"]
            },
            "changes_made": winner["change_description"],
            "recommendations": winner.get("weaknesses", [])
        }

        return report


def load_test_cases_from_json(filepath: str) -> List[TestCase]:
    """Load test cases from JSON file."""
    with open(filepath, 'r') as f:
        data = json.load(f)

    return [TestCase(tc["input"], tc["expected"], tc.get("actual")) 
            for tc in data["test_cases"]]


def main():
    """Main entry point with argument parsing."""
    parser = argparse.ArgumentParser(
        description='Automated I/T/E Prompt Optimizer',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Run with built-in example
  python optimize_prompt.py

  # Use custom prompt and test cases
  python optimize_prompt.py --prompt my_prompt.txt --tests test_cases.json

  # Generate more variations and test combinations
  python optimize_prompt.py --num-variations 5 --test-combinations

  # Skip baseline testing to save cost
  python optimize_prompt.py --skip-baseline
        """
    )

    parser.add_argument(
        '--prompt',
        type=str,
        help='Path to file containing the failing prompt (default: use built-in example)'
    )

    parser.add_argument(
        '--tests',
        type=str,
        help='Path to JSON file containing test cases (default: use built-in example)'
    )

    parser.add_argument(
        '--success-criteria',
        type=str,
        help='Success criteria description (default: use built-in example)'
    )

    parser.add_argument(
        '--num-variations',
        type=int,
        default=3,
        help='Number of variations to generate (default: 3)'
    )

    parser.add_argument(
        '--test-combinations',
        action='store_true',
        help='Enable Phase 2 combination testing (tests I+T, I+E, T+E interactions)'
    )

    parser.add_argument(
        '--skip-baseline',
        action='store_true',
        help='Skip baseline testing (saves cost but less accurate improvement metrics)'
    )

    parser.add_argument(
        '--output',
        type=str,
        default='optimization_result.json',
        help='Output file for optimization report (default: optimization_result.json)'
    )

    args = parser.parse_args()

    # Load prompt
    if args.prompt:
        with open(args.prompt, 'r') as f:
            failing_prompt = f.read().strip()
    else:
        # Use built-in example
        failing_prompt = """You are a Worker Agent. Read feature_list.md and complete one task."""

    # Load test cases
    if args.tests:
        test_cases = load_test_cases_from_json(args.tests)
    else:
        # Use built-in example
        test_cases = [
            TestCase(
                input="feature_list.md contains '- [ ] Add login button'",
                expected="Agent writes code AND marks task as [x]",
                actual="Agent writes code but doesn't update the file"
            ),
            TestCase(
                input="feature_list.md contains '- [ ] Write tests'",
                expected="Agent writes tests THEN marks [x]",
                actual="Agent marks [x] without writing tests"
            ),
            TestCase(
                input="feature_list.md contains '- [ ] Fix bug in auth.py'",
                expected="Agent fixes bug, writes test, marks [x]",
                actual="Agent fixes bug but skips test and marking"
            )
        ]

    # Load success criteria
    if args.success_criteria:
        success_criteria = args.success_criteria
    else:
        # Use built-in example
        success_criteria = """
        - Agent must update feature_list.md to [x] after completing task
        - Agent must not mark [x] until work is actually done
        - Agent must write tests for any code changes
        """

    # Run optimization
    optimizer = PromptOptimizer(
        failing_prompt,
        test_cases,
        success_criteria,
        num_variations=args.num_variations,
        test_combinations=args.test_combinations,
        skip_baseline=args.skip_baseline
    )
    result = optimizer.run_optimization()

    # Handle early exit (e.g., user aborted or prompt already optimal)
    if result is None:
        return

    # Print final report
    print("\n" + "=" * 80)
    print("OPTIMIZATION REPORT")
    print("=" * 80)
    print(f"\n📊 Performance Score: {result['performance']['score']}/100")
    print(f"📈 Improvement: +{result['performance']['improvement']:.1f}%")
    print(f"✅ Pass Rate: {result['performance']['pass_rate']}")

    print(f"\n🔧 Changes Made:")
    print(f"   {result['changes_made']}")

    print(f"\n✨ Optimized Prompt:")
    print("   " + "-" * 76)
    for line in result['optimized_prompt'].split('\n'):
        print(f"   {line}")
    print("   " + "-" * 76)

    if result.get('recommendations'):
        print(f"\n⚠️  Remaining Issues:")
        for rec in result['recommendations']:
            print(f"   - {rec}")

    # Save to file
    with open(args.output, 'w') as f:
        json.dump(result, f, indent=2)

    print(f"\n💾 Full report saved to: {args.output}")


if __name__ == "__main__":
    main()

"""Usage Guide

Basic Usage:
    # Run with the built-in example
    python optimize_prompt.py

    # Provide your own prompt and test cases
    python optimize_prompt.py --prompt my_prompt.txt --tests my_test_cases.json

    # Generate 5 variations instead of 3
    python optimize_prompt.py --num-variations 5

    # Enable Phase 2 combination testing
    python optimize_prompt.py --test-combinations

    # Skip baseline testing to save API costs
    python optimize_prompt.py --skip-baseline

    # Combine multiple options
    python optimize_prompt.py --prompt my_prompt.txt --tests my_tests.json --num-variations 5 --test-combinations --output my_result.json
"""

"""Test Cases JSON Format:

{
  "test_cases": [
    {
      "input": "Description of what the agent receives",
      "expected": "What the agent should do/output",
      "actual": "What the agent actually did (if known)"
    }
  ]
}

Expected Output:
"""

"""
================================================================================
AUTOMATED I/T/E PROMPT OPTIMIZATION
================================================================================

[Step 1/5] Diagnosing failure type...
✓ Diagnosis complete: Logic Failure
  Broken Variable: I
  Confidence: High

[Step 2/5] Generating prompt variations...
✓ Generated 3 variations

[Step 3/5] Testing variations against test set...
  Testing Variation 1/3...
  Testing Variation 2/3...
  Testing Variation 3/3...
✓ Tested all variations (3 test cases each)

[Step 4/5] Scoring results with LLM-as-Judge...
✓ Scoring complete

[Step 5/5] Selecting best variation...
✓ Winner: Variation 1
  Score: 92.00/100
  Improvement: +84.0%

================================================================================
OPTIMIZATION REPORT
================================================================================

📊 Performance Score: 92/100
📈 Improvement: +84.0%
✅ Pass Rate: 3/3 tests passed

🔧 Changes Made:
   Added explicit 3-phase loop with timing constraints

✨ Optimized Prompt:
   --------------------------------------------------------------------------
   You are a Worker Agent. Follow this protocol:
   PHASE 1 (Read): Read feature_list.md, identify next [ ] task
   PHASE 2 (Execute): Complete that task fully
   PHASE 3 (Update): ONLY after work is complete, mark it [x]

   Constraint: You may NOT mark [x] until Phase 2 is complete.
   --------------------------------------------------------------------------

💾 Full report saved to: optimization_result.json
"""

"""Customization Options

Change Models:
    # Use different models for different stages in the configuration at the top of the file
    META_MODEL = "claude-opus-4-5-20251101"      # Most capable for diagnosis
    TEST_MODEL = "claude-sonnet-4-5-20250929"    # Balanced for testing
    JUDGE_MODEL = "claude-sonnet-4-5-20250929"   # Fast, reliable judging

Add More Test Cases:
    The more test cases, the more reliable the optimization
    - Minimum: 5 test cases
    - Recommended: 10-20 test cases
    - Production: 50+ test cases

Adjust Scoring Criteria:
    Modify the _score_single_result() method to weight different factors
    Example: Weight correctness more than format

    judge_prompt = '''...
    Scoring weights:
    - Correctness: 70%
    - Format compliance: 20%
    - Efficiency: 10%
    ...'''

Cost Estimation:
    For a typical optimization run:

    - Diagnosis: 1 Opus call (~2K tokens) = $0.03
    - Variation Generation: 1 Opus call (~4K tokens) = $0.06
    - Testing: 3 variations × 5 test cases × Sonnet = ~$0.15
    - Judging: 3 Sonnet calls (~3K tokens each) = $0.03

    Total per optimization: ~$0.27 (with 5 test cases)
    Recommended: Start with 5 test cases during development,
    scale to 20+ for production validation.

Limitations & Future Enhancements:

    Current Limitations:
    - Requires manual creation of test cases (can't auto-generate them yet)
    - Assumes test cases can be run via simple API calls (doesn't handle complex environments)
    - LLM-as-Judge may have biases (validate against human judgment initially)

    Possible Enhancements:
    - Auto-generate test cases from prompt description
    - Support for multi-turn agent conversations (not just single prompts)
    - Integration with actual agent harnesses (test real Workers, not simulated)
    - Iterative optimization (if first round doesn't hit 95%, run again)
"""