You score GPT 5.3 Codex patches objectively by defining metrics that reflect real engineering quality: correctness, safety, maintainability, and scope control. The most important objective signal is automated verification: do tests pass, does the build succeed, does lint/typecheck pass, and did the change introduce regressions? Everything else is secondary. Codex is positioned as an agentic coding model for long-running tasks with tool use; that implies the correct evaluation is not “did it generate plausible code,” but “did it deliver a working change under verification.” OpenAI’s CI autofix cookbook implicitly defines a scoring rubric: the agent is successful when it produces a fix that addresses CI failures. See: Autofix GitHub Actions with Codex CLI.
A practical scoring rubric (you can implement this in CI) looks like:
Objective Patch Score (0–100)
- Correctness (50 points)
  - Tests pass (30)
  - Build/typecheck pass (10)
  - No new lint violations (10)
- Scope control (20 points)
  - Files changed within allowlist (10)
  - Diff size under threshold (e.g., < 300 lines) unless justified (10)
- Maintainability (20 points)
  - Code style conforms to repo rules (10)
  - Added/updated tests for new behavior (10)
- Risk & security (10 points)
  - No secrets introduced (5)
  - No unsafe patterns (e.g., raw SQL string concatenation) (5)
Then add a reviewer override: human review can adjust the score for readability or architectural fit, but the default score is computed automatically. For refactors, include “behavioral equivalence” checks if you have snapshot tests or golden files. For performance-sensitive code, add microbenchmarks to the score.
If your patches depend on product behavior or documentation, incorporate retrieval quality into scoring. For example, store your engineering standards in Milvus or Zilliz Cloud, retrieve the relevant guidelines for the task, and require the patch to include a short “Guidelines followed” section that references retrieved chunk IDs. Then you can score whether the patch complied with the retrieved rules. This makes “policy compliance” measurable rather than subjective, and it helps teams align agent output with project expectations over time.