Tuning Thresholds
While individual scores indicate how well an LLM performed, it is often useful to evaluate generations on a strict pass-fail basis. By applying a threshold to the scores, we can interpret each criterion as a binary test that either passes or fails. For the above example, with a threshold of 0.5 (scores of 0.5 and above count as passing), each criterion of the generation is marked as passing or failing.
Tuning Weights
By default, all criteria are weighted equally when computing the total score. You can increase or decrease the importance of certain criteria by assigning them weights in the rubric.
Tuning Parameters
Each criterion can be tuned with a set of parameters which alter how the question is scored. Parameters are expressed as a list of numbers between 0 and 1 which define a piecewise linear function mapping raw scores to new ranges. For example, the parameters [0.5, 0.8, 0.9] transform the raw score as follows: raw scores of [0.5, 0.8, 0.9] are mapped to [0.25, 0.5, 0.75]. Since each transformed score is lower than the corresponding parameter, this transformation increases the strictness of the scorer.
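The mapping described above can be sketched in a few lines of Python. This is an illustration, not the scorer's actual implementation: the function name `transform_score` is hypothetical, and it assumes that the curve is anchored at (0, 0) and (1, 1) and that the n parameters map to evenly spaced targets 1/(n+1), 2/(n+1), and so on, which matches the three-parameter example but is not otherwise confirmed here.

```python
from bisect import bisect_right


def transform_score(raw: float, params: list[float]) -> float:
    """Map a raw score in [0, 1] through a piecewise linear function.

    Assumptions (not confirmed by the documentation):
      * the curve passes through (0, 0) and (1, 1);
      * the i-th of n parameters maps to (i + 1) / (n + 1),
        so [0.5, 0.8, 0.9] maps to [0.25, 0.5, 0.75].
    `params` must be sorted in ascending order.
    """
    n = len(params)
    xs = [0.0, *params, 1.0]                   # knot positions on the raw axis
    ys = [i / (n + 1) for i in range(n + 2)]   # evenly spaced target values

    raw = min(max(raw, 0.0), 1.0)              # clamp to the valid range
    # Index of the right-hand knot of the segment containing `raw`.
    i = min(bisect_right(xs, raw), len(xs) - 1)
    x0, x1 = xs[i - 1], xs[i]
    if x1 == x0:                               # degenerate segment (repeated knot)
        return ys[i]
    # Linear interpolation between the two surrounding knots.
    return ys[i - 1] + (raw - x0) / (x1 - x0) * (ys[i] - ys[i - 1])


params = [0.5, 0.8, 0.9]
for raw in (0.5, 0.8, 0.9):
    print(raw, "->", transform_score(raw, params))
```

Under these assumptions, a raw score halfway between two parameters lands halfway between their targets, so a raw 0.65 would be pulled down to roughly 0.375.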
In practice, these parameters are not determined manually. Instead, you can use the calibration endpoint to fit parameters to your data automatically.
Next steps
Pi Studio
Tune a rubric in Pi Studio
Calibration Colab
Try out the Calibration example notebook in Google Colab