Evaluation is the driving force of the AI development cycle. Large language models are stochastic systems: a one-off test or a tightly constrained prompt can’t guarantee they will meet your application’s bar for quality, and ten passing tests mean little if, on average, only ten out of every fifteen runs succeed. Monitoring your agent or workflow’s performance is therefore a continuous process, requiring reliable metrics to evaluate where your system needs to improve.

Pi Scorer is a fast, deterministic language model developed by Pi Labs that scores text against a rubric of questions. In contrast to LLM-as-a-judge, Pi Scorer gives well-distributed scores with minimal latency, forming a solid foundation for both online and offline evaluations. Instead of hacking at a system prompt to convince an LLM judge to score a certain way, you can define a rubric of clear, interpretable criteria and tune it using your application’s data.

Rubrics

Criteria

Pi Scorer accepts a rubric of criteria to evaluate data. Each criterion is a yes/no question that captures an essential aspect of what it means for the application to perform well. For example, the rubric below scores the output of an LLM that powers a search feature.
[
  { "question": "Is the response shorter than 6 sentences?" },
  { "question": "Did the response fulfill the intent of the user's query?" },
  { "question": "Did the response only present data relevant to the user's query?" }
]
To use this rubric, we can pass it along with an LLM generation to Pi Scorer via the API:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    { "question": "Is the response shorter than 6 sentences?" },
    { "question": "Did the response fulfill the intent of the user's query?" },
    { "question": "Did the response only present data relevant to the user's query?" }
  ]
)
print('Total Score:', scores.total_score)
print('Question Scores:', scores.question_scores)
Total Score: 0.7368
Question Scores: {
  "Did the response fulfill the intent of the user's query?": 0.7695,
  "Did the response only present data relevant to the user's query?": 0.7969,
  "Is the response shorter than 6 sentences?": 0.6523
}
Each question is scored between 0 and 1, where 0 means the question is false for the input and 1 means it is true. This is why each criterion must be phrased as a yes/no question: Pi Scorer determines how true each question is for the input. The total score is an aggregate of the individual question scores, by default their geometric mean; different aggregation methods penalize low scores to different degrees (see Aggregation below).
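You can verify the default aggregation by hand. A quick sanity check, assuming the total is the unweighted geometric mean of the three question scores above:
import math

question_scores = [0.7695, 0.7969, 0.6523]

# Geometric mean: the n-th root of the product of n scores.
total = math.prod(question_scores) ** (1 / len(question_scores))
print(round(total, 4))  # 0.7368, matching total_score above
Since questions can be long, you can pass a label with each criterion to use as a short identifier for the question: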
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    {
      "label": "Length Check",
      "question": "Is the response shorter than 6 sentences?"
    },
    {
      "label": "Intent Fulfillment",
      "question": "Did the response fulfill the intent of the user's query?"
    },
    {
      "label": "Relevance Check",
      "question": "Did the response only present data relevant to the user's query?"
    }
  ]
)
print('Total Score:', scores.total_score)
print('Question Scores:', scores.question_scores)
Total Score: 0.7368
Question Scores: {
  'Intent Fulfillment': 0.7695,
  'Length Check': 0.6523,
  'Relevance Check': 0.7969
}
By separating your quality goals into distinct criteria and aggregating evaluations, you can gain insight into how your application performs both along each dimension of quality and holistically.
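For instance, assuming question_scores behaves like a plain dict (as the printed output above suggests), you can flag the dimensions that fall below a quality bar:
# Flag criteria scoring below a chosen quality bar (threshold is illustrative).
THRESHOLD = 0.7
weak = {label: s for label, s in scores.question_scores.items() if s < THRESHOLD}
print('Needs attention:', weak)  # {'Length Check': 0.6523}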

Python Criteria

If a criterion can be implemented programmatically, you can implement it as a Python function that runs alongside Pi Scorer. This is useful for algorithmic criteria, such as testing whether a response is valid JSON. For example:
from withpi import PiClient

python_code = """
import json

def score(response_text: str, input_text: str, kwargs: dict) -> dict:
  # Return 1.0 if the response parses as JSON, 0.0 otherwise.
  try:
    json.loads(response_text)
    return { "score": 1.0 }
  except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    return { "score": 0.0 }
"""

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output='{ "locations": ["Lake Como", "Bergamo", "Bernina Express", "Turin", "Verona"] }',
  scoring_spec=[
    {
      "question": "Is the response valid JSON?",
      "python_code": python_code,
    },
  ]
)
print('Score:', scores.total_score)
The score function’s type signature must match the one shown above exactly, and the returned dictionary must contain the single key “score”.
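Programmatic checks can also replace model-scored criteria when the logic is fully deterministic. As an illustration, the earlier Length Check might be written as a Python criterion like the sketch below; the naive sentence splitter is an assumption for demonstration, not part of the Pi API:
from withpi import PiClient

length_check_code = """
import re

def score(response_text: str, input_text: str, kwargs: dict) -> dict:
  # Naive sentence count: split on runs of terminal punctuation.
  sentences = [s for s in re.split(r'[.!?]+', response_text) if s.strip()]
  return { "score": 1.0 if len(sentences) < 6 else 0.0 }
"""

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Lake Como and Bergamo are both short train rides from Milan.",
  scoring_spec=[
    {
      "label": "Length Check",
      "question": "Is the response shorter than 6 sentences?",
      "python_code": length_check_code,
    },
  ]
)
print('Score:', scores.total_score)
Because Python criteria return scores through the same interface, they can sit in the same scoring_spec as model-scored questions.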

Aggregation

By default, the total score is the geometric mean of the question scores. You can select a different aggregation method with the aggregation_method parameter; the method determines how heavily a single low question score drags down the total.
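To build intuition for how the choice of mean penalizes one weak criterion, compare the standard formulas locally (plain Python, not a Pi API call):
import math

scores = [0.95, 0.90, 0.30]  # one weak criterion among strong ones

arithmetic = sum(scores) / len(scores)              # 0.7167: mildest penalty
geometric = math.prod(scores) ** (1 / len(scores))  # 0.6354: stronger penalty
harmonic = len(scores) / sum(1 / s for s in scores) # 0.5457: strongest penalty
Currently, the supported aggregation methods are: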

Next Steps

Now that you’ve seen how to define rubrics and call Pi Scorer, you can learn more about tuning your rubric to your data, or try generating a rubric from your application data in Pi Studio:

Tuning

Learn how to tune your rubric to your data.

Pi Studio

Build a rubric from your application data in Pi Studio.