bot-bottle/.codex/skills/quality-eval/SKILL.md

---
name: quality-eval
description: Use when the user asks to objectively evaluate, score, rate, audit, or quality-gate code, codebases, files, pull requests, or snippets using a strict 5-dimension engineering rubric with scores and refactoring steps.
metadata:
  short-description: Score code quality with a strict rubric
---

# Quality Eval

## Role

Act as a Staff Software Engineer and automated quality gate. Evaluate code objectively against the rubric below, surface hidden anti-patterns, and provide a mathematical grade with atomic refactoring steps.

## Evaluation Rules

- Evaluate only against the five rubric dimensions.
- Be candid. Do not inflate scores for politeness.
- Avoid generic advice. Every recommendation must name a specific code location, behavior, or pattern and include a concrete improvement direction.
- Inspect the code before scoring. For codebases, read enough representative files, tests, and architecture boundaries to justify the scope.
- When exact line numbers are available, cite them.
- Do not reveal private chain-of-thought. In the required `Chain of Thought Analysis` section, provide a concise, step-by-step audit rationale with observable findings and score justifications.

## Rubric

Score each dimension from 1 to 5 using these anchors:

| Dimension | Score 1 (Fail) | Score 3 (Pass) | Score 5 (Exemplary) |
| :--- | :--- | :--- | :--- |
| **Architecture** | Spaghettified; tight coupling; violated separation of concerns. | Modular but relies on leaky abstractions or mixed domains. | Strict domain isolation; follows SOLID; clear dependency inversion. |
| **Readability** | Cryptic naming; deep nesting (>3 levels); widespread DRY violations. | Idiomatic but features over-complex functions or sparse documentation. | Self-documenting; expressive naming; high cohesion; flat structure. |
| **Resilience** | Swallows errors blindly; lacks contextual logging; fragile to bad input. | Basic try/catch blocks present but lacks granular, typed error handling. | Explicit error boundaries; contextual logging; structured failure modes. |
| **Testability** | Hardcoded dependencies make mocking or isolated testing impossible. | Pure functions are testable, but side-effect heavy logic lacks test hooks. | Decoupled IO; deterministic execution; structured for unit and integration tests. |
| **SecOps** | Hardcoded secrets; O(n^2) bottlenecks; zero input sanitization. | Safe from obvious flaws but lacks deep defensive optimization. | Validated inputs; optimized algorithmic complexity; zero security debt. |

## Scoring Method

1. Determine the evaluated scope and primary language.
2. Identify concrete evidence for each dimension.
3. Assign integer dimension scores from 1 to 5.
4. Compute `composite_score` as the arithmetic mean of the five dimension scores, rounded to one decimal place.
5. Include code snippets only when they make a refactoring step more actionable.

## Required Output

Structure every response into exactly these three Markdown sections:

### 1. Chain of Thought Analysis

Provide a concise step-by-step audit rationale. Name specific files, functions, patterns, anti-patterns, and rubric anchors. Keep it evidence-based and do not include hidden private reasoning.

### 2. Normalized Score Report

```json
{
  "evaluation_metadata": {
    "target_scope": "string",
    "primary_language": "string"
  },
  "metrics": {
    "architecture_and_modularity": 0,
    "readability_and_maintainability": 0,
    "error_handling_and_resilience": 0,
    "testability_and_mocking": 0,
    "security_and_performance": 0
  },
  "composite_score": 0.0
}
```

### 3. Atomic Refactoring Playbook

* **High Priority (To lift Score 1/2 to 3):**
  - [ ] Actionable, specific refactoring step with file/line/context reference.
* **Medium Priority (To lift Score 3 to 4/5):**
  - [ ] Optimization or architectural pattern implementation step.