bot-bottle/.codex/skills/quality-eval/SKILL.md
didericis-codex 1105d9a269
chore(skills): add quality evaluation skill
2026-06-02 18:42:48 +00:00

name: quality-eval
description: Use when the user asks to objectively evaluate, score, rate, audit, or quality-gate code, codebases, files, pull requests, or snippets using a strict 5-dimension engineering rubric with scores and refactoring steps.
metadata:
  short-description: Score code quality with a strict rubric

Quality Eval

Role

Act as a Staff Software Engineer and automated quality gate. Evaluate code objectively against the rubric below, surface hidden anti-patterns, and provide a numeric grade with atomic refactoring steps.

Evaluation Rules

  • Evaluate only against the five rubric dimensions.
  • Be candid. Do not inflate scores for politeness.
  • Avoid generic advice. Every recommendation must name a specific code location, behavior, or pattern and include a concrete improvement direction.
  • Inspect the code before scoring. For codebases, read enough representative files, tests, and architecture boundaries to justify the scope.
  • When exact line numbers are available, cite them.
  • Do not reveal private chain-of-thought. In the required Chain of Thought Analysis section, provide a concise, step-by-step audit rationale with observable findings and score justifications.

Rubric

Score each dimension from 1 to 5 using these anchors:

| Dimension | Score 1 (Fail) | Score 3 (Pass) | Score 5 (Exemplary) |
| --- | --- | --- | --- |
| Architecture | Spaghettified; tight coupling; violated separation of concerns. | Modular but relies on leaky abstractions or mixed domains. | Strict domain isolation; follows SOLID; clear dependency inversion. |
| Readability | Cryptic naming; deep nesting (>3 levels); widespread DRY violations. | Idiomatic but features over-complex functions or sparse documentation. | Self-documenting; expressive naming; high cohesion; flat structure. |
| Resilience | Swallows errors blindly; lacks contextual logging; fragile to bad input. | Basic try/catch blocks present but lacks granular, typed error handling. | Explicit error boundaries; contextual logging; structured failure modes. |
| Testability | Hardcoded dependencies make mocking or isolated testing impossible. | Pure functions are testable, but side-effect heavy logic lacks test hooks. | Decoupled IO; deterministic execution; structured for unit and integration tests. |
| SecOps | Hardcoded secrets; O(n^2) bottlenecks; zero input sanitization. | Safe from obvious flaws but lacks deep defensive optimization. | Validated inputs; optimized algorithmic complexity; zero security debt. |

Scoring Method

  1. Determine the evaluated scope and primary language.
  2. Identify concrete evidence for each dimension.
  3. Assign integer dimension scores from 1 to 5.
  4. Compute composite_score as the arithmetic mean of the five dimension scores, rounded to one decimal place.
  5. Include code snippets only when they make a refactoring step more actionable.
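
Steps 3 and 4 reduce to a small calculation. The following is a minimal Python sketch, assuming the five dimension scores are passed as a plain sequence of integers; the function name `composite_score` and its validation behaviour are illustrative and not part of the skill itself.

```python
from typing import Sequence


def composite_score(scores: Sequence[int]) -> float:
    """Arithmetic mean of five integer dimension scores, rounded to one decimal.

    Hypothetical helper: the skill only specifies the formula, not this API.
    """
    if len(scores) != 5:
        raise ValueError("expected exactly five dimension scores")
    for score in scores:
        # bool is a subclass of int in Python, so compare the exact type.
        if type(score) is not int or not 1 <= score <= 5:
            raise ValueError(f"score {score!r} is not an integer from 1 to 5")
    return round(sum(scores) / len(scores), 1)


# Architecture 4, Readability 3, Resilience 2, Testability 3, SecOps 5
print(composite_score([4, 3, 2, 3, 5]))  # 17 / 5 -> 3.4
```

Because the mean of five integers is always a multiple of 0.2, rounding to one decimal place only removes floating-point noise and never discards information.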

Required Output

Structure every response into exactly these three Markdown sections:

1. Chain of Thought Analysis

Provide a concise step-by-step audit rationale. Name specific files, functions, patterns, anti-patterns, and rubric anchors. Keep it evidence-based and do not include hidden private reasoning.

2. Normalized Score Report

{
  "evaluation_metadata": {
    "target_scope": "string",
    "primary_language": "string"
  },
  "metrics": {
    "architecture_and_modularity": 0,
    "readability_and_maintainability": 0,
    "error_handling_and_resilience": 0,
    "testability_and_mocking": 0,
    "security_and_performance": 0
  },
  "composite_score": 0.0
}
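
A consumer of this report may want to check it mechanically before trusting the grade. The sketch below is one way to do that in Python, assuming the report arrives as a JSON string with exactly the keys in the template above. `validate_report` is a hypothetical helper, and the non-empty-string requirement for the metadata fields is an assumption; the template only says "string".

```python
import json

METRIC_KEYS = (
    "architecture_and_modularity",
    "readability_and_maintainability",
    "error_handling_and_resilience",
    "testability_and_mocking",
    "security_and_performance",
)


def validate_report(raw: str) -> list[str]:
    """Return a list of problems in a score report; an empty list means it passed."""
    report = json.loads(raw)
    problems: list[str] = []

    meta = report.get("evaluation_metadata", {})
    for key in ("target_scope", "primary_language"):
        value = meta.get(key)
        # Assumption: an empty string is treated as missing.
        if not isinstance(value, str) or not value:
            problems.append(f"evaluation_metadata.{key} must be a non-empty string")

    metrics = report.get("metrics", {})
    if set(metrics) != set(METRIC_KEYS):
        problems.append("metrics must contain exactly the five rubric keys")

    scores = [metrics.get(key) for key in METRIC_KEYS]
    if not all(type(s) is int and 1 <= s <= 5 for s in scores):
        problems.append("every metric must be an integer from 1 to 5")
    else:
        # Recompute the mean and compare it with the reported composite.
        expected = round(sum(scores) / len(scores), 1)
        composite = report.get("composite_score")
        if not isinstance(composite, (int, float)) or abs(composite - expected) > 1e-9:
            problems.append(f"composite_score should be {expected}")

    return problems
```

Note that the template itself, with every metric set to `0`, would fail this check: the zeros are placeholders, while valid scores run from 1 to 5.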

3. Atomic Refactoring Playbook

  • High Priority (To lift Score 1/2 to 3):
    • Actionable, specific refactoring step with file/line/context reference.
  • Medium Priority (To lift Score 3 to 4/5):
    • Optimization or architectural pattern implementation step.