Rubrics
Criteria
Pi Scorer accepts a rubric of criteria to evaluate data. Each criterion is a yes/no question that captures some essential aspect of what it means for the application to perform well. For example, a rubric can score the performance of an LLM that performs a search function.
You can include a label with each criterion to use as a short identifier for the question.
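As a minimal sketch of such a rubric, the snippet below represents each criterion as a dictionary holding a yes/no question and a short label. The key names "question" and "label", the example questions, and the check_rubric helper are assumptions made for illustration; they are not taken from the Pi Scorer API reference.

```python
# Hypothetical rubric for an LLM that performs a search function.
# The "question" and "label" key names are assumed for illustration only.
rubric = [
    {
        "label": "relevance",
        "question": "Do the returned results address the user's query?",
    },
    {
        "label": "ranking",
        "question": "Is the most useful result listed first?",
    },
    {
        "label": "no_duplicates",
        "question": "Are the results free of duplicate entries?",
    },
]


def check_rubric(criteria):
    """Sanity-check a rubric: each criterion is a question with a unique label."""
    labels = [criterion["label"] for criterion in criteria]
    if len(labels) != len(set(labels)):
        raise ValueError("labels must be unique")
    for criterion in criteria:
        if not criterion["question"].strip().endswith("?"):
            raise ValueError(f"not phrased as a question: {criterion['question']!r}")
    return labels


print(check_rubric(rubric))
```

Keeping labels short and unique makes it easier to refer to individual question scores later.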
Python Criteria
If a criterion can be evaluated programmatically, you can implement it as a Python function that runs alongside Pi Scorer. This is useful for algorithmic criteria, such as testing whether a response is valid JSON.
The signature of the score function should match exactly what Pi Scorer expects, with the dictionary containing the single key “score”.
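The valid-JSON criterion mentioned above could be sketched as follows. The parameter names (response_text, input_text, kwargs) are assumptions made for illustration; the exact signature Pi Scorer requires should be taken from its reference documentation. The one detail the text does confirm is the single “score” key in the dictionary.

```python
import json


# Assumed signature: the parameter names here are illustrative, not confirmed.
def score(response_text: str, input_text: str, kwargs: dict) -> dict:
    """Return 1.0 if the response parses as JSON, otherwise 0.0."""
    try:
        json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or a response that is not a string at all.
        return {"score": 0.0}
    return {"score": 1.0}


# Usage: a well-formed JSON object passes, free text does not.
print(score('{"results": []}', "find papers on rubrics", {}))
print(score("Here are your results:", "find papers on rubrics", {}))
```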
Aggregation
By default, the total score is the geometric mean of the question scores. You can select a different aggregation method with the aggregation_method parameter.
Currently, the supported aggregation methods are:
arithmetic_mean: the least sensitive to low scores.
geometric_mean (default): more sensitive to low scores.
harmonic_mean: the most sensitive to low scores.
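To see how the three methods respond to a single low score, the sketch below applies Python's standard statistics module to a made-up set of question scores. It illustrates the mathematics only; the sample scores are invented, and this is not how Pi Scorer is invoked.

```python
import statistics

# Invented question scores: two strong criteria and one weak one.
# Scores are kept strictly positive so all three means are defined.
question_scores = [0.9, 0.9, 0.1]

aggregates = {
    "arithmetic_mean": statistics.fmean(question_scores),
    "geometric_mean": statistics.geometric_mean(question_scores),
    "harmonic_mean": statistics.harmonic_mean(question_scores),
}

for name, value in aggregates.items():
    print(f"{name}: {value:.3f}")
```

For any set of positive scores that are not all equal, the arithmetic mean is the largest of the three and the harmonic mean the smallest, which is why a single failing criterion drags the total down hardest under harmonic_mean.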
Next Steps
Now that you’ve seen how to define rubrics and call Pi Scorer, you can learn more about tuning your rubric to your data or try generating a rubric with your application data in Pi Studio:
Tuning: Learn how to tune your rubric to your data.
Pi Studio: Build a rubric in Pi Studio.