GA Tips & Recipes¶
This page distills practical advice for using the GA in evoproc.ga_scaffold_structured effectively.
Scorers: structural vs task-eval¶
Structural hygiene (default):
Usesevoproc.scorers’ adapter over the validator suite. Great for bootstrapping before you wire up execution.Task-eval scoring (recommended for benchmarks):
Pass all three toga.run(...):final_answer_schemarun_steps_fn(proc, question, final_answer_schema, model, print_bool=False) -> stateeval_fn(state, proc) -> float(e.g., exact-match or numeric closeness)
best, _ = ga.run(
task_description=question,
final_answer_schema=get_schema("gsm"),
run_steps_fn=run_steps_fn, # calls your runner
eval_fn=eval_fn, # returns [0,1]
)
Sensible hyperparameters (start here)¶
GAConfig(
population_size=6, # 6–12 for quick loops
elitism=2, # keep top 2 unchanged
crossover_rate=0.7, # favor crossover
mutation_rate=0.3, # still allow edits
max_generations=3, # 3–6 to start
tournament_k=3,
random_immigrant_rate=0.10,# diversity without chaos
seed=42,
)
Increase generations first, then population.
Keep elitism ≥ 1 so progress isn’t lost.
Immigrants help escape local minima; 5–20% is typical.
Reproducibility¶
Set seed in both GAConfig and your backend query options.
Log the chosen model and any temperature/top-p settings.
Persist runs to JSONL: keep best.proc, fitness, and final state.
import json
with open("runs.jsonl","a") as f:
f.write(json.dumps({
"qid": qid,
"fitness": best.fitness,
"proc": best.proc,
"state": final_state,
}) + "\n")
Efficient task-eval loops¶
Use a lightweight run_steps_fn: small schemas, minimal I/O, strict JSON parsing.
Cache step prompts if you’re experimenting; only the best individual per gen must be executed.
If your runner is expensive, consider generation-end evaluation only (evaluate after reproduction, not for every candidate mid-gen).
Operators: practical notes¶
Crossover tends to clean up naming and structure—keep it dominant (0.6–0.8).
Mutation is single, small edits by design (split/insert/rename/verify). If progress stalls, slightly raise mutation_rate.
Diagnosing stagnation¶
Flat fitness across generations → Scorer too coarse. Switch to task-eval or add a shaping term (e.g., small penalty for > N steps).
Validator ping-pong → Tighten repair prompts; prefer minimal fixes and renumber steps after repair.
Degenerate loops (same child every time) → add random_immigrant_rate and ensure seed is not globally constant if you want exploration.
GSM8K recipe (mini)¶
FINAL_SCHEMA = get_schema("gsm")
def run_steps_fn(proc, q, s, m, print_bool=False):
return run_steps_stateful_minimal(proc, q, s, m, query_fn=query, print_bool=print_bool)
def eval_fn(state, proc):
import math, re
nums = re.findall(r"-?\d+(?:\.\d+)?", state.get("answer",""))
pred = state.get("answer_numerical") or (float(nums[-1]) if nums else None)
gold = state.get("_gold_num")
return 1.0 if (pred is not None and gold is not None and math.isclose(pred, gold, abs_tol=1e-6)) else 0.0
# loop
for ex in examples: # list of {id, question, answer}
gold_num = extract_number(ex["answer"])
best, _ = ga.run(
task_description=ex["question"],
final_answer_schema=FINAL_SCHEMA,
run_steps_fn=run_steps_fn,
eval_fn=lambda st, pr: eval_fn({**st, "_gold_num": gold_num}, pr),
)
state = run_steps_fn(best.proc, ex["question"], FINAL_SCHEMA, ga.model)
Troubleshooting¶
ModuleNotFoundError during docs build Make sure both packages are installed in the docs env and conf.py adds their src/ paths.
LLM returns prose, not JSON Lower temperature and include the schema verbatim (format= parameter if your backend supports it).
“Missing required output …” at runtime Keep strict_missing=True to catch it early; if the model is flaky, allow one retry with the same prompt/seed.
When to stop evolving¶
Fitness plateaus for ≥2 generations.
Best procedure stabilizes (diff of steps/variables is small).
Wall-clock budget reached (log total tokens/calls if possible).