Tuning thresholds

While individual scores indicate how well an LLM performed, it is often useful to evaluate generations on a strict pass-fail basis. By applying a threshold to the scores, we can interpret each criterion as a binary test that either passes or fails. For the example above, if we choose a threshold of 0.5, treating 0.5 and above as passing, the generation is scored like so:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    { "question": "Is the response formatted as a bulleted list?" },
    { "question": "Did the response fulfill the intent of the user's query?" },
    { "question": "Did the response only present data relevant to the user's query?" },
    { "question": "Is the response shorter than 6 sentences?" },
  ],
)

# Treat scores at or above the 0.5 threshold as passing
pass_fail = lambda score: f"{score} (PASS)" if score >= 0.5 else f"{score} (FAIL)"

print("Total Score:", pass_fail(scores.total_score))
print(
  "Question Scores:",
  { question: pass_fail(score) for question, score in scores.question_scores.items() },
)
Total Score: 0.7368 (PASS)
Question Scores: {
  "Did the response fulfill the intent of the user's query?": '0.7695 (PASS)',
  "Did the response only present data relevant to the user's query?": '0.7969 (PASS)',
  'Is the response shorter than 6 sentences?': '0.6523 (PASS)'
}
Since all tests pass, let's try a question whose answer is false for this generation:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    { "question": "Is the response formatted as a bulleted list?" }
  ]
)
print('Score:', scores.total_score)
Score: 0.0928
With our threshold of 0.5, the question counts as a failing test. Let's see its impact on the total score when we include it in the original rubric, marking each score according to the threshold:
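The snippet below is a sketch that reuses the earlier score call, appending the bulleted-list question to the rubric and redefining the pass_fail helper so the example is self-contained:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    { "question": "Did the response fulfill the intent of the user's query?" },
    { "question": "Did the response only present data relevant to the user's query?" },
    { "question": "Is the response shorter than 6 sentences?" },
    { "question": "Is the response formatted as a bulleted list?" },
  ],
)

# Treat scores at or above the 0.5 threshold as passing
pass_fail = lambda score: f"{score} (PASS)" if score >= 0.5 else f"{score} (FAIL)"

print("Total Score:", pass_fail(scores.total_score))
print(
  "Question Scores:",
  { question: pass_fail(score) for question, score in scores.question_scores.items() },
)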
Total Score: 0.4336 (FAIL)
Question Scores: {
  "Did the response fulfill the intent of the user's query?": 0.7695 (PASS),
  "Did the response only present data relevant to the user's query?": 0.8008 (PASS),
  "Is the response formatted as a bulleted list?": 0.0874 (FAIL),
  "Is the response shorter than 6 sentences?": 0.6562 (PASS)
}
In this case, the failing question lowered the total score enough for the whole generation to be considered a failure. If enough criteria fail, the total score will fall below your threshold, indicating that the generation is not acceptable. While this is a useful application of criteria, in practice this kind of binary scoring requires more sophisticated control over score aggregation than a simple average. To control the influence of each criterion on the total score, you can tune the rubric using the strategies described below.

Tuning weights

By default, all criteria are weighted equally when computing the total score. You can increase or decrease the importance of certain criteria by assigning them weights in the rubric:
[
  {
    "question": "Is the response formatted as a bulleted list?",
    "weight": 3
  },
  {
    "question": "Did the response fulfill the intent of the user's query?",
    "weight": 1
  },
  {
    "question": "Did the response only present data relevant to the user's query?",
    "weight": 1
  },
  {
    "question": "Is the response shorter than 6 sentences?",
    "weight": 1
  }
]
Weights are relative; by default, each criterion is assigned a weight of 1. In the example above, the first criterion is three times as important as each of the other criteria, and as important as the other three criteria combined. Scoring our example with this rubric:
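As a sketch, the weighted criteria can be passed as the scoring_spec in the same way as the unweighted rubric above, with the weight key included in each entry:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    # Formatting is weighted three times as heavily as each other criterion
    { "question": "Is the response formatted as a bulleted list?", "weight": 3 },
    { "question": "Did the response fulfill the intent of the user's query?", "weight": 1 },
    { "question": "Did the response only present data relevant to the user's query?", "weight": 1 },
    { "question": "Is the response shorter than 6 sentences?", "weight": 1 },
  ],
)

print("Total Score:", scores.total_score)
print("Question Scores:", scores.question_scores)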
Total Score: 0.2542
Question Scores: {
  "Did the response fulfill the intent of the user's query?": 0.7695,
  "Did the response only present data relevant to the user's query?": 0.8008,
  "Is the response formatted as a bulleted list?": 0.0874,
  "Is the response shorter than 6 sentences?": 0.6562
}
The total score is now much lower, reflecting the increased importance of the failing question. By tuning weights, you can ensure that the total score reflects your quality priorities. Sometimes, however, you may want to alter the way an individual criterion is scored, making it stricter or more lenient. To achieve this, you can tune your criteria using a second strategy: parameters.

Tuning parameters

Each criterion can be tuned with a set of parameters that alter how the question is scored. Parameters are expressed as a list of numbers between 0 and 1, defining a piecewise linear function that maps raw scores onto new ranges. For example, the parameters [0.5, 0.8, 0.9] transform the raw score as follows:

[Figure: Parameter Transformation]

The parameters are mapped to evenly spaced values between 0 and 1, with scores in between mapped along the line segments connecting them; 0 and 1 remain fixed at their original values. In this case, [0.5, 0.8, 0.9] are mapped to [0.25, 0.5, 0.75]; since each parameter is mapped to a value lower than itself, this transformation makes the scorer stricter. In practice, these parameters are not determined manually. Instead, you can use the calibration endpoint to fit parameters to your data automatically.
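To make the mapping concrete, here is a minimal sketch of the piecewise linear transformation described above. It is only an illustration of the math, not the library's implementation, and it assumes the parameters are sorted in increasing order:
import numpy as np

def transform_score(raw_score: float, parameters: list[float]) -> float:
    """Map a raw score through the piecewise linear function defined by the parameters."""
    n = len(parameters)
    # The i-th parameter is mapped to the evenly spaced value (i + 1) / (n + 1);
    # 0 and 1 remain fixed at their original values.
    xs = [0.0] + list(parameters) + [1.0]
    ys = [0.0] + [(i + 1) / (n + 1) for i in range(n)] + [1.0]
    # Scores between the anchor points are interpolated linearly.
    return float(np.interp(raw_score, xs, ys))

params = [0.5, 0.8, 0.9]
print(transform_score(0.5, params))   # 0.25
print(transform_score(0.8, params))   # 0.5
print(transform_score(0.9, params))   # 0.75
print(transform_score(0.95, params))  # 0.875, interpolated between 0.75 and 1.0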

Next steps

Pi Studio: Tune a rubric in Pi Studio.

Calibration Colab: Try out the Calibration example notebook in Google Colab.