POST /scoring_system/generate

JavaScript
import PiClient from 'withpi';

const client = new PiClient({
  apiKey: 'My API Key',
});

const response = await client.scoringSystem.generate.startJob({
  application_description: "Write a children's story communicating a simple life lesson.",
  examples: [
    { llm_input: 'good input', llm_output: 'good response' },
    { llm_input: 'neutral input', llm_output: 'neutral response' },
  ],
  preference_examples: [
    { chosen: 'chosen response', llm_input: 'some input', rejected: 'rejected response' },
  ],
});

console.log(response.job_id);

Example response:
{
  "balanced_accuracy": 123,
  "detailed_status": [
    "Downloading model",
    "Tuning prompt"
  ],
  "f1": 123,
  "job_id": "1234abcd",
  "num_labeled_examples_used": 123,
  "num_preference_examples_used": 123,
  "precision": 123,
  "recall": 123,
  "scoring_spec": [
    {
      "custom_model_id": "your-model-id",
      "label": "Relevance to Prompt",
      "parameters": [
        0.14285714285714285,
        0.2857142857142857,
        0.42857142857142855,
        0.5714285714285714,
        0.7142857142857143,
        0.8571428571428571
      ],
      "python_code": "\ndef score(response_text: str, input_text: str, kwargs: dict) -> dict:\n    word_count = len(response_text.split())\n    if word_count > 10:\n        return {\"score\": 0.2, \"explanation\": \"Response has more than 10 words\"}\n    elif word_count > 5:\n        return {\"score\": 0.6, \"explanation\": \"Response has more than 5 words\"}\n    else:\n        return {\"score\": 1, \"explanation\": \"Response has 5 or fewer words\"}\n",
      "question": "Is the response relevant to the prompt?",
      "scoring_type": "PI_SCORER",
      "tag": "Legal Formatting",
      "weight": 1
    }
  ],
  "state": "RUNNING",
  "threshold": 123
}

Authorizations

x-api-key
string
header
required
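
Under the hood, every request authenticates with the `x-api-key` header described above. The sketch below shows, as an illustration only, how the raw HTTP request to this endpoint might be assembled; the base URL is a placeholder assumption, not the real API host.

```javascript
// Sketch of the raw HTTP request behind the SDK call. BASE_URL is a
// hypothetical placeholder -- substitute your actual Pi API host.
const BASE_URL = 'https://api.example.com';

function buildGenerateRequest(apiKey, payload) {
  // The x-api-key header carries the API key, per the Authorizations section.
  return {
    method: 'POST',
    url: `${BASE_URL}/scoring_system/generate`,
    headers: {
      'x-api-key': apiKey,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  };
}

const req = buildGenerateRequest('My API Key', {
  application_description:
    "Write a children's story communicating a simple life lesson.",
});
```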

Body

application/json
application_description
string
required

The application description to generate a scoring spec for.

Examples:

"Write a children's story communicating a simple life lesson."

examples
SDKLabeledExample · object[]
required

Rated examples to use for generating the discriminating questions. Scores can be class labels or actual scores, but must be between 0 and 1.

Examples:
[
  {
    "llm_input": "good input",
    "llm_output": "good response",
    "score": 0.9
  },
  {
    "llm_input": "neutral input",
    "llm_output": "neutral response",
    "score": 0.5
  }
]
preference_examples
SDKPreferenceExample · object[]
required

Preference examples to use for generating the discriminating questions. You must specify either examples or preference_examples.

Examples:
[
  {
    "chosen": "chosen response",
    "llm_input": "some input",
    "rejected": "rejected response"
  }
]
batch_size
integer
default:10

Number of examples to use in one batch to generate the questions.

Examples:

10

existing_questions
Question · object[]

Existing questions for the application; these may or may not be retained in the output, depending on their performance.

Examples:
[
  {
    "label": "some input",
    "question": "Is the output relevant to input?",
    "weight": 1
  }
]
num_questions
integer
default:-1

The maximum number of new questions that the generated scoring system should contain. If <= 0, the number is auto-selected.

Examples:

10

retain_existing_questions
boolean
default:true

If true, only generate new questions that improve the accuracy.

Examples:

false

try_auto_generating_python_code
boolean
default:true

If true, try to generate Python code for the generated questions.

Examples:

false

Response

Successful Response

detailed_status
string[]
required

Detailed status of the job

Examples:
["Downloading model", "Tuning prompt"]
job_id
string
required

The job ID

Examples:

"1234abcd"

state
enum<string>
required

Current state of the job

Available options:
QUEUED,
RUNNING,
DONE,
ERROR,
CANCELLED
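A job ends in one of the terminal states (DONE, ERROR, or CANCELLED), while QUEUED and RUNNING mean the job is still in progress. The sketch below isolates that terminal-state logic; a real polling loop would re-fetch the job from the API between iterations (the exact SDK method for that is not shown here), whereas this sketch just walks a precomputed sequence of observed states.

```javascript
// Terminal states end polling; QUEUED and RUNNING mean "check again".
const TERMINAL_STATES = new Set(['DONE', 'ERROR', 'CANCELLED']);

// Given successive observed states, return how many polls it takes to
// reach a terminal state, or -1 if none is reached.
function pollsUntilTerminal(states) {
  for (let i = 0; i < states.length; i++) {
    if (TERMINAL_STATES.has(states[i])) return i + 1;
  }
  return -1;
}
```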
balanced_accuracy
number | null

Weighted combination of the average accuracy per class for the labeled data and the overall accuracy for the preference data.
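
The labeled-data half of this metric is the average of per-class accuracies. The weighting against preference-data accuracy is not specified here, so the sketch below covers only the per-class average as an illustration:

```javascript
// Average of per-class accuracies for labeled data. The weighting against
// preference-data accuracy is not documented here, so this sketch covers
// only the labeled-data component.
function perClassAverageAccuracy(labels, predictions) {
  const totals = new Map(); // label -> [correct, count]
  labels.forEach((label, i) => {
    const [correct, count] = totals.get(label) || [0, 0];
    totals.set(label, [correct + (predictions[i] === label ? 1 : 0), count + 1]);
  });
  let sum = 0;
  for (const [, [correct, count]] of totals) sum += correct / count;
  return sum / totals.size;
}
```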

f1
number | null

F1 for the labeled data.

num_labeled_examples_used
integer | null

Number of labeled examples used for spec generation.

num_preference_examples_used
integer | null

Number of preference examples used for spec generation.

precision
number | null

Precision for the labeled data.

recall
number | null

Recall for the labeled data.

scoring_spec
Question · object[] | null

The generated scoring spec
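
Each question in the spec carries a weight. One plausible way to combine per-question scores into an overall score is a weight-normalized average; the actual aggregation the scoring system uses is not documented here, so treat this sketch as an assumption:

```javascript
// Weight-normalized average of per-question scores. This aggregation
// scheme is an illustrative assumption, not the documented behavior.
function aggregateScores(questions, scores) {
  let weighted = 0;
  let totalWeight = 0;
  questions.forEach((q, i) => {
    weighted += (q.weight ?? 1) * scores[i];
    totalWeight += q.weight ?? 1;
  });
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```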

threshold
number | null

Threshold to use to separate 0 and 1 labels in the case of classification.
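
For classification, the threshold turns a continuous score into a 0/1 label. Whether a score exactly at the threshold maps to 1 or 0 is not specified here, so the `>=` comparison below is an assumption:

```javascript
// Map a continuous score to a binary label using the job's threshold.
// Treating a score equal to the threshold as 1 is an assumption.
function toLabel(score, threshold) {
  return score >= threshold ? 1 : 0;
}
```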