GA Tips & Recipes

This page distills practical advice for using the GA in evoproc.ga_scaffold_structured effectively.

Scorers: structural vs task-eval

  • Structural hygiene (default):
    Uses evoproc.scorers’ adapter over the validator suite. Great for bootstrapping before you wire up execution.

  • Task-eval scoring (recommended for benchmarks):
    Pass all three to ga.run(...):

    • final_answer_schema

    • run_steps_fn(proc, question, final_answer_schema, model, print_bool=False) -> state

    • eval_fn(state, proc) -> float (e.g., exact-match or numeric closeness)

best, _ = ga.run(
    task_description=question,
    final_answer_schema=get_schema("gsm"),
    run_steps_fn=run_steps_fn,   # calls your runner
    eval_fn=eval_fn,             # returns [0,1]
)

Sensible hyperparameters (start here)

GAConfig(
  population_size=6,         # 6–12 for quick loops
  elitism=2,                 # keep top 2 unchanged
  crossover_rate=0.7,        # favor crossover
  mutation_rate=0.3,         # still allow edits
  max_generations=3,         # 3–6 to start
  tournament_k=3,
  random_immigrant_rate=0.10,# diversity without chaos
  seed=42,
)
  • Increase generations first, then population.

  • Keep elitism ≥ 1 so progress isn’t lost.

  • Immigrants help escape local minima; 5–20% is typical.

Reproducibility

  • Set seed in both GAConfig and your backend query options.

  • Log the chosen model and any temperature/top-p settings.

  • Persist runs to JSONL: keep best.proc, fitness, and final state.

import json
with open("runs.jsonl","a") as f:
    f.write(json.dumps({
        "qid": qid,
        "fitness": best.fitness,
        "proc": best.proc,
        "state": final_state,
    }) + "\n")

Efficient task-eval loops

  • Use a lightweight run_steps_fn: small schemas, minimal I/O, strict JSON parsing.

  • Cache step prompts if you’re experimenting; only the best individual per gen must be executed.

  • If your runner is expensive, consider generation-end evaluation only (evaluate after reproduction, not for every candidate mid-gen).

Operators: practical notes

  • Crossover tends to clean up naming and structure—keep it dominant (0.6–0.8).

  • Mutation is single, small edits by design (split/insert/rename/verify). If progress stalls, slightly raise mutation_rate.

Diagnosing stagnation

  • Flat fitness across generations → Scorer too coarse. Switch to task-eval or add a shaping term (e.g., small penalty for > N steps).

  • Validator ping-pong → Tighten repair prompts; prefer minimal fixes and renumber steps after repair.

  • Degenerate loops (same child every time) → add random_immigrant_rate and ensure seed is not globally constant if you want exploration.

GSM8K recipe (mini)

FINAL_SCHEMA = get_schema("gsm")

def run_steps_fn(proc, q, s, m, print_bool=False):
    return run_steps_stateful_minimal(proc, q, s, m, query_fn=query, print_bool=print_bool)

def eval_fn(state, proc):
    import math, re
    nums = re.findall(r"-?\d+(?:\.\d+)?", state.get("answer",""))
    pred = state.get("answer_numerical") or (float(nums[-1]) if nums else None)
    gold = state.get("_gold_num")
    return 1.0 if (pred is not None and gold is not None and math.isclose(pred, gold, abs_tol=1e-6)) else 0.0

# loop
for ex in examples:  # list of {id, question, answer}
    gold_num = extract_number(ex["answer"])
    best, _ = ga.run(
        task_description=ex["question"],
        final_answer_schema=FINAL_SCHEMA,
        run_steps_fn=run_steps_fn,
        eval_fn=lambda st, pr: eval_fn({**st, "_gold_num": gold_num}, pr),
    )
    state = run_steps_fn(best.proc, ex["question"], FINAL_SCHEMA, ga.model)

Troubleshooting

  • ModuleNotFoundError during docs build Make sure both packages are installed in the docs env and conf.py adds their src/ paths.

  • LLM returns prose, not JSON Lower temperature and include the schema verbatim (format= parameter if your backend supports it).

  • “Missing required output …” at runtime Keep strict_missing=True to catch it early; if the model is flaky, allow one retry with the same prompt/seed.

When to stop evolving

  • Fitness plateaus for ≥2 generations.

  • Best procedure stabilizes (diff of steps/variables is small).

  • Wall-clock budget reached (log total tokens/calls if possible).