# Judges Framework
The Glacis judges framework provides a pluggable pipeline for evaluating AI outputs using LLM-as-judge patterns. Multiple judges can run on the same item, and their scores are aggregated into a recommendation (uphold, borderline, or escalate). Judge results are attested alongside the original output, creating an auditable review record.
## Overview

The framework consists of four components:
| Component | Description |
|---|---|
| `BaseJudge` | Abstract class for implementing judges |
| `JudgeVerdict` | Result from a single judge on a single item |
| `JudgeRunner` | Runs multiple judges and aggregates scores |
| `Review` | Aggregated result with final score and recommendation |
## BaseJudge

All judges subclass `BaseJudge` and implement the `evaluate()` method:
```python
from glacis.judges import BaseJudge, JudgeVerdict

class MyJudge(BaseJudge):
    judge_id = "my-judge-v1"

    def evaluate(self, item, reference=None, rubric=None):
        # Your evaluation logic here
        return JudgeVerdict(
            judge_id=self.judge_id,
            score=2.5,
            rationale="The answer is accurate and well-structured.",
            latency_ms=150,
            metadata={"model": "gpt-4o"},
        )
```

### evaluate() Parameters
| Parameter | Type | Description |
|---|---|---|
| `item` | `dict[str, Any]` | The item to evaluate (structure depends on use case) |
| `reference` | `str \| None` | Optional reference data for evaluation context |
| `rubric` | `str \| None` | Optional scoring rubric override (prompt text) |
Judges also support the context manager protocol. Override close() to release expensive resources like API clients or model handles.
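A minimal sketch of that lifecycle, using a hypothetical stand-in class rather than the real `BaseJudge` (the class name and its `client` attribute are illustrative, not part of the library):

```python
class ClientBackedJudge:
    """Stand-in for a judge that owns an expensive resource."""

    judge_id = "demo-judge-v1"

    def __init__(self):
        self.client = object()  # placeholder for an API client or model handle

    def close(self):
        # Release the resource; with the real library you would override
        # BaseJudge.close() instead of writing the protocol by hand.
        self.client = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
        return False


with ClientBackedJudge() as judge:
    assert judge.client is not None  # resource is available inside the block
# close() ran automatically on exit, so the client has been released
```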
## JudgeVerdict

Each judge returns a `JudgeVerdict`:
| Field | Type | Default | Description |
|---|---|---|---|
| `judge_id` | `str` | — | Identifier for the judge |
| `score` | `float` | — | Numeric score (>= 0). Scale defined by `JudgesConfig.max_score` |
| `rationale` | `str` | — | Judge's explanation for the score |
| `latency_ms` | `int` | `0` | Processing time in milliseconds |
| `metadata` | `dict` | `{}` | Judge-specific metadata for audit trail |
## JudgeRunner

`JudgeRunner` orchestrates multiple judges on a single item and aggregates their results:
```python
from glacis.judges import JudgeRunner

runner = JudgeRunner(judges=[judge_a, judge_b])
result = runner.run(
    item={"question": "What is AI?", "answer": "AI is..."},
    reference="Source document text...",
)

print(f"Final score: {result.final_score}")
print(f"Recommendation: {result.recommendation}")
print(f"Consensus: {result.consensus}")
```

If a judge raises an exception, it is caught and recorded as a verdict with `score=0` and the error message as the rationale. The pipeline continues with the remaining judges.
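That error-handling behavior can be illustrated with a self-contained sketch. Plain dicts stand in for `JudgeVerdict`, and the two demo judge classes are hypothetical; this is not the library's actual implementation:

```python
class FailingJudge:
    judge_id = "flaky-judge"

    def evaluate(self, item):
        raise RuntimeError("upstream API timeout")


class ConstantJudge:
    judge_id = "constant-judge"

    def evaluate(self, item):
        return {"judge_id": self.judge_id, "score": 2.0, "rationale": "ok"}


def run_safely(judges, item):
    """Collect verdicts, converting exceptions to zero-score verdicts."""
    verdicts = []
    for judge in judges:
        try:
            verdicts.append(judge.evaluate(item))
        except Exception as exc:
            # The failure becomes a verdict: score 0, error text as rationale
            verdicts.append({"judge_id": judge.judge_id,
                             "score": 0.0,
                             "rationale": str(exc)})
    return verdicts


verdicts = run_safely([FailingJudge(), ConstantJudge()], item={})
print([v["score"] for v in verdicts])  # [0.0, 2.0]
```

The key property is that one misbehaving judge never aborts the run; its failure is preserved in the audit trail instead.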
### JudgeRunner Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `judges` | `list[BaseJudge]` | (required) | List of judge instances to run |
| `consensus_threshold` | `float` | `1.0` | Max score spread before flagging disagreement (ignored if `config` provided) |
| `config` | `JudgesConfig \| None` | `None` | Full configuration with all thresholds |
## Review

The aggregated result from running all judges:
| Field | Type | Description |
|---|---|---|
| `verdicts` | `list[JudgeVerdict]` | Individual judge results |
| `final_score` | `float` | Average of all judge scores |
| `max_score` | `float` | Maximum possible score (from config) |
| `consensus` | `bool` | Whether judges agree within the threshold |
| `recommendation` | `str` | `"uphold"`, `"borderline"`, or `"escalate"` |
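The `final_score` aggregation is a plain average, rounded per `score_precision` from `JudgesConfig`. A sketch of the arithmetic (a reimplementation for illustration, not the library code):

```python
def aggregate(scores: list[float], precision: int = 4) -> float:
    """Average the judge scores and round to the configured precision."""
    return round(sum(scores) / len(scores), precision)


print(aggregate([3.0, 2.5]))       # 2.75
print(aggregate([2.0, 1.0, 1.0]))  # 1.3333
```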
### to_wire_review()

Convert a `Review` to a dict matching the `glacis.models.Review` wire format (for L2 attestation):
```python
review = runner.run(item, reference=source_doc)
wire = review.to_wire_review(sample_probability=0.1)
# wire = {
#     "sample_probability": 0.1,
#     "judge_ids": ["gpt-4o-mini", "claude-haiku"],
#     "conformity_score": 0.9167,  # final_score / max_score, clamped to [0, 1]
#     "recommendation": "uphold",
#     "rationale": "correct; mostly correct",
# }
```

| Parameter | Type | Description |
|---|---|---|
| `sample_probability` | `float` | Probability this item was sampled (0.0-1.0) |
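The `conformity_score` normalization can be sketched as follows (a standalone reimplementation of the described formula, not the library's code; the rounding precision is an assumption carried over from `JudgesConfig.score_precision`):

```python
def conformity_score(final_score: float, max_score: float,
                     precision: int = 4) -> float:
    """final_score / max_score, clamped to [0, 1] and rounded."""
    ratio = final_score / max_score if max_score > 0 else 0.0
    return round(min(max(ratio, 0.0), 1.0), precision)


print(conformity_score(2.75, 3.0))  # 0.9167, as in the wire example above
```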
## Recommendation Logic

The recommendation is derived from the `final_score` (average of all judge scores) using configurable thresholds:

```python
if final_score >= uphold_threshold:
    recommendation = "uphold"
elif final_score >= borderline_threshold:
    recommendation = "borderline"
else:
    recommendation = "escalate"
```

With default thresholds (0-3 scale):
| Score Range | Recommendation | Meaning |
|---|---|---|
| >= 2.0 | "uphold" | Quality is acceptable |
| >= 1.0 | "borderline" | Needs human review |
| < 1.0 | "escalate" | Quality concern, requires attention |
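Wrapped as a standalone function for experimentation (a sketch mirroring the threshold logic, with the default 0-3 thresholds baked in as defaults):

```python
def recommend(final_score: float,
              uphold_threshold: float = 2.0,
              borderline_threshold: float = 1.0) -> str:
    """Map an average judge score to a recommendation."""
    if final_score >= uphold_threshold:
        return "uphold"
    if final_score >= borderline_threshold:
        return "borderline"
    return "escalate"


print(recommend(2.4))  # uphold
print(recommend(1.5))  # borderline
print(recommend(0.5))  # escalate
```

Note that the thresholds are inclusive: a score of exactly 2.0 upholds, and exactly 1.0 is borderline.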
## JudgesConfig

All thresholds are configurable via `JudgesConfig`:
| Field | Type | Default | Description |
|---|---|---|---|
| `max_score` | `float` | `3.0` | Maximum score on the rubric scale |
| `consensus_threshold` | `float` | `1.0` | Max score spread between judges before flagging disagreement |
| `uphold_threshold` | `float` | `2.0` | Minimum average score for `"uphold"` |
| `borderline_threshold` | `float` | `1.0` | Minimum average score for `"borderline"` (below this means `"escalate"`) |
| `score_precision` | `int` | `4` | Decimal places for rounding `final_score` |
### Example: Custom Scale

```python
from glacis.judges import JudgesConfig, JudgeRunner

# Binary pass/fail scale (0-1)
config = JudgesConfig(
    max_score=1.0,
    uphold_threshold=0.7,
    borderline_threshold=0.4,
    consensus_threshold=0.2,
)

runner = JudgeRunner(judges=[judge_a, judge_b], config=config)
```

## Writing a Custom Judge
Here is a complete example of a fact-checking judge that uses an LLM to evaluate answer accuracy:

```python
import time
from typing import Any, Optional

from glacis.judges import BaseJudge, JudgeVerdict


class FactCheckJudge(BaseJudge):
    """Fact-checking judge using an LLM."""

    judge_id = "fact-check-gpt4o"

    def __init__(self, openai_client):
        self._client = openai_client

    def evaluate(
        self,
        item: dict[str, Any],
        reference: Optional[str] = None,
        rubric: Optional[str] = None,
    ) -> JudgeVerdict:
        start = time.perf_counter()

        prompt = rubric or (
            "Rate the factual accuracy of this answer on a 0-3 scale.\n"
            "0 = completely wrong, 1 = partially correct, "
            "2 = mostly correct, 3 = fully correct.\n"
            "Respond with just the score and a brief rationale."
        )

        messages = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": (
                f"Question: {item.get('question', '')}\n"
                f"Answer: {item.get('answer', '')}\n"
                f"Reference: {reference or 'N/A'}"
            )},
        ]

        response = self._client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.0,
        )

        text = response.choices[0].message.content or ""
        # Parse score from response (simplified)
        score = float(text[0]) if text and text[0].isdigit() else 0.0

        latency_ms = int((time.perf_counter() - start) * 1000)

        return JudgeVerdict(
            judge_id=self.judge_id,
            score=min(score, 3.0),
            rationale=text,
            latency_ms=latency_ms,
            metadata={"model": "gpt-4o"},
        )

    def close(self) -> None:
        """Release the OpenAI client if needed."""
        pass
```

## Multiple Judges and Consensus
When running multiple judges, the `consensus` flag indicates whether the judges agree:
```python
from glacis.judges import JudgeRunner, JudgesConfig

config = JudgesConfig(consensus_threshold=1.0)
runner = JudgeRunner(
    judges=[fact_check_judge, accuracy_judge],
    config=config,
)

result = runner.run(item={"question": "...", "answer": "..."})

if not result.consensus:
    print("Judges disagree significantly!")
    for v in result.verdicts:
        print(f"  {v.judge_id}: {v.score} - {v.rationale}")
```

Consensus is computed as `max(scores) - min(scores) <= consensus_threshold`. With a single judge, consensus is always `True`.
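That spread check can be written as a small standalone function (a sketch of the documented formula, not the library's implementation):

```python
def has_consensus(scores: list[float],
                  consensus_threshold: float = 1.0) -> bool:
    """Agreement check: score spread must stay within the threshold."""
    if len(scores) <= 1:
        return True  # a single judge always agrees with itself
    return max(scores) - min(scores) <= consensus_threshold


print(has_consensus([2.0, 2.5, 3.0]))  # True (spread is exactly 1.0)
print(has_consensus([1.0, 3.0]))       # False (spread is 2.0)
```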
## Configuration via glacis.yaml

Judge thresholds can be set in your config file:

```yaml
version: "1.3"
judges:
  max_score: 3.0
  consensus_threshold: 1.0
  uphold_threshold: 2.0
  borderline_threshold: 1.0
  score_precision: 4
```

Load the config and pass it to `JudgeRunner`:

```python
from glacis.config import load_config
from glacis.judges import JudgeRunner

cfg = load_config("glacis.yaml")
runner = JudgeRunner(judges=[my_judge], config=cfg.judges)
```

## See Also
- Sampling & Evidence — how L2 sampling identifies attestations eligible for judge evaluation
- Configuration — full `glacis.yaml` reference with judges section