Tuning thresholds

While individual scores indicate how well an LLM performed, it is often useful to evaluate generations on a strict pass-fail basis. By applying a threshold to the scores, we can interpret each criterion as a binary test that either passes or fails. For the example above, if we choose a threshold of 0.5, treating 0.5 and above as passing, the generation is scored like so:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    { "question": "Is the response formatted as a bulleted list?" },
    { "question": "Did the response fulfill the intent of the user's query?" },
    { "question": "Did the response only present data relevant to the user's query?" },
    { "question": "Is the response shorter than 6 sentences?" },
  ],
)

# Treat scores at or above the 0.5 threshold as passing
pass_fail = lambda score: f"{score} (PASS)" if score >= 0.5 else f"{score} (FAIL)"

print("Total Score:", pass_fail(scores.total_score))
print(
  "Question Scores:",
  { question: pass_fail(score) for question, score in scores.question_scores.items() },
)
Total Score: 0.7368 (PASS)
Question Scores: {
  "Did the response fulfill the intent of the user's query?": '0.7695 (PASS)',
  "Did the response only present data relevant to the user's query?": '0.7969 (PASS)',
  'Is the response shorter than 6 sentences?': '0.6523 (PASS)'
}
Since all tests pass, let's try a question whose answer is false for this generation:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    { "question": "Is the response formatted as a bulleted list?" }
  ]
)
print('Score:', scores.total_score)
Score: 0.0928
With our threshold of 0.5, the question counts as a failing test. Let's see its impact on the total score when we include it in the original rubric, marking each score according to the threshold:
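The snippet below is a sketch that reuses the earlier score call, appending the bulleted-list question to the rubric and redefining the pass_fail helper so the example is self-contained:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    { "question": "Did the response fulfill the intent of the user's query?" },
    { "question": "Did the response only present data relevant to the user's query?" },
    { "question": "Is the response shorter than 6 sentences?" },
    { "question": "Is the response formatted as a bulleted list?" },
  ],
)

# Treat scores at or above the 0.5 threshold as passing
pass_fail = lambda score: f"{score} (PASS)" if score >= 0.5 else f"{score} (FAIL)"

print("Total Score:", pass_fail(scores.total_score))
print(
  "Question Scores:",
  { question: pass_fail(score) for question, score in scores.question_scores.items() },
)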
Total Score: 0.4336 (FAIL)
Question Scores: {
  "Did the response fulfill the intent of the user's query?": 0.7695 (PASS),
  "Did the response only present data relevant to the user's query?": 0.8008 (PASS),
  "Is the response formatted as a bulleted list?": 0.0874 (FAIL),
  "Is the response shorter than 6 sentences?": 0.6562 (PASS)
}
In this case, the failing question lowered the total score enough for the whole generation to be considered a failure. If enough criteria fail, the total score will fall below your threshold, indicating that the generation is not acceptable. While this is a useful application of criteria, in practice this kind of binary scoring requires more sophisticated control over score aggregation than a simple average. To control the influence of each criterion on the total score, you can tune the rubric using the strategies described below.

Tuning weights

By default, all criteria are weighted equally when computing the total score. You can increase or decrease the importance of certain criteria by assigning them weights in the rubric:
[
  {
    "question": "Is the response formatted as a bulleted list?",
    "weight": 3
  },
  {
    "question": "Did the response fulfill the intent of the user's query?",
    "weight": 1
  },
  {
    "question": "Did the response only present data relevant to the user's query?",
    "weight": 1
  },
  {
    "question": "Is the response shorter than 6 sentences?",
    "weight": 1
  }
]
Weights are relative; by default, each criterion is assigned a weight of 1. In the example above, the first criterion is three times as important as each of the other criteria, and as important as the other three criteria combined. Scoring our example with this rubric:
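As a sketch, the weighted criteria can be passed as the scoring_spec in the same way as the unweighted rubric above, with the weight key included in each entry:
from withpi import PiClient

pi = PiClient()
scores = pi.scoring_system.score(
  llm_input="What are some good day trips from Milan by train?",
  llm_output="Milan is an excellent hub for day trips by train. You can take a short ride to the stunning shores of Lake Como to explore picturesque towns like Bellagio and Varenna. Alternatively, the historic hilltop city of Bergamo is another charming and easily accessible option. For a spectacular alpine adventure, consider the Bernina Express scenic train journey through the Swiss Alps. Cities like Turin and Verona are also just an hour or two away via high-speed train.",
  scoring_spec=[
    # Formatting is weighted three times as heavily as each other criterion
    { "question": "Is the response formatted as a bulleted list?", "weight": 3 },
    { "question": "Did the response fulfill the intent of the user's query?", "weight": 1 },
    { "question": "Did the response only present data relevant to the user's query?", "weight": 1 },
    { "question": "Is the response shorter than 6 sentences?", "weight": 1 },
  ],
)

print("Total Score:", scores.total_score)
print("Question Scores:", scores.question_scores)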
Total Score: 0.2542
Question Scores: {
  "Did the response fulfill the intent of the user's query?": 0.7695,
  "Did the response only present data relevant to the user's query?": 0.8008,
  "Is the response formatted as a bulleted list?": 0.0874,
  "Is the response shorter than 6 sentences?": 0.6562
}
The total score is now much lower, reflecting the increased importance of the failing question. By tuning weights, you can ensure that the total score reflects your quality priorities. Sometimes, however, you may want to alter the way an individual criterion is scored, making it stricter or more lenient. To achieve this, you can tune your criteria using a second strategy: parameters.

Tuning parameters

Each criterion can be tuned with a set of parameters that alter how the question is scored. Parameters are expressed as a list of numbers between 0 and 1, defining a piecewise linear function that maps raw scores onto new ranges. For example, the parameters [0.5, 0.8, 0.9] transform the raw score as follows:

[Figure: Parameter Transformation]

The parameters are mapped to evenly spaced values between 0 and 1, with scores in between mapped along the line segments connecting them; 0 and 1 remain fixed at their original values. In this case, [0.5, 0.8, 0.9] are mapped to [0.25, 0.5, 0.75]; since each parameter is mapped to a value lower than itself, this transformation makes the scorer stricter. In practice, these parameters are not determined manually. Instead, you can use the calibration endpoint to fit parameters to your data automatically.
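To make the mapping concrete, here is a minimal sketch of the piecewise linear transformation described above. It is only an illustration of the math, not the library's implementation, and it assumes the parameters are sorted in increasing order:
import numpy as np

def transform_score(raw_score: float, parameters: list[float]) -> float:
    """Map a raw score through the piecewise linear function defined by the parameters."""
    n = len(parameters)
    # The i-th parameter is mapped to the evenly spaced value (i + 1) / (n + 1);
    # 0 and 1 remain fixed at their original values.
    xs = [0.0] + list(parameters) + [1.0]
    ys = [0.0] + [(i + 1) / (n + 1) for i in range(n)] + [1.0]
    # Scores between the anchor points are interpolated linearly.
    return float(np.interp(raw_score, xs, ys))

params = [0.5, 0.8, 0.9]
print(transform_score(0.5, params))   # 0.25
print(transform_score(0.8, params))   # 0.5
print(transform_score(0.9, params))   # 0.75
print(transform_score(0.95, params))  # 0.875, interpolated between 0.75 and 1.0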

Next steps

Pi Studio: Tune a rubric in Pi Studio.

Calibration Colab: Try out the Calibration example notebook in Google Colab.