# Metric: L3Score
## Description
L3Score evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a language model as a judge with the following template:
```text
You are given a question, ground-truth answer, and a candidate answer.
Question: {{question}}
Ground-truth answer: {{gt}}
Candidate answer: {{answer}}
Is the semantic meaning of the ground-truth and candidate answers similar?
Answer in one word - Yes or No.
```
The judge model's log-probabilities for the "Yes" and "No" tokens, read from its top-5 next-token candidates, are then used to compute the score.
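The metric queries the judge for these log-probabilities internally. For readers who want to see what such a request looks like, below is a minimal sketch using the OpenAI chat completions API with a hard-coded example prompt; the variable names and request options are illustrative and not taken from the metric's implementation.

```python
from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")

prompt = (
    "You are given a question, ground-truth answer, and a candidate answer.\n"
    "Question: What is the capital of France?\n"
    "Ground-truth answer: Paris\n"
    "Candidate answer: Paris\n"
    "Is the semantic meaning of the ground-truth and candidate answers similar?\n"
    "Answer in one word - Yes or No."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,       # the judge only needs to emit "Yes" or "No"
    logprobs=True,      # return log-probabilities for the generated token
    top_logprobs=5,     # ...along with the 5 most likely alternatives
)

# Top-5 candidate tokens (and log-probabilities) for the single generated position.
top5 = {
    entry.token: entry.logprob
    for entry in response.choices[0].logprobs.content[0].top_logprobs
}
print(top5)  # e.g. {"Yes": -0.01, "No": -4.8, ...}
```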
## Scoring Logic
Let $l_{\text{yes}}$ and $l_{\text{no}}$ be the log-probabilities of the 'Yes' and 'No' tokens, respectively, among the judge model's top-5 next-token candidates.
- If neither token is in the top-5:
$$ \text{L3Score} = 0 $$
- If both are present:
$$ \text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})} $$
- If only one is present, the missing token's probability is estimated as the minimum of:
  - the remaining probability mass outside the top-5 tokens
  - the probability of the least likely top-5 token

  The score is then computed with the same formula as above.
The score ranges from 0 to 1, where 1 indicates the judge LLM's highest confidence that the predicted and reference answers are semantically equivalent. See the SPIQA paper for details.
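As a worked illustration, the sketch below restates this scoring rule in plain Python, taking a mapping from the judge's top-5 tokens to their log-probabilities. It mirrors the description above rather than the metric's actual source, so details such as token normalization and numerical clamping may differ.

```python
import math


def l3score_from_top5(top5: dict[str, float]) -> float:
    """Score a single example from the judge's top-5 token log-probabilities."""
    l_yes = top5.get("Yes")
    l_no = top5.get("No")

    # Case 1: neither "Yes" nor "No" appears among the top-5 tokens.
    if l_yes is None and l_no is None:
        return 0.0

    # Case 3: only one of the two appears; estimate the missing probability as
    # the minimum of (a) the probability mass outside the top-5 tokens and
    # (b) the probability of the least likely top-5 token.
    if l_yes is None or l_no is None:
        remaining_mass = 1.0 - sum(math.exp(lp) for lp in top5.values())
        least_likely = math.exp(min(top5.values()))
        missing = math.log(max(min(remaining_mass, least_likely), 1e-12))
        l_yes = missing if l_yes is None else l_yes
        l_no = missing if l_no is None else l_no

    # Case 2 (and the end of case 3): softmax over the two log-probabilities.
    p_yes, p_no = math.exp(l_yes), math.exp(l_no)
    return p_yes / (p_yes + p_no)
```

For example, `l3score_from_top5({"Yes": -0.02, "No": -4.0})` returns roughly 0.98.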
## How to Use
```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

questions = ["What is the capital of France?", "What is the capital of Germany?"]
predictions = ["Paris", "Moscow"]
references = ["Paris", "Berlin"]

score = l3score.compute(
    questions=questions,
    predictions=predictions,
    references=references,
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini",
)
print(score)
# {'L3Score': 0.49...}
```
## Inputs
| Name | Type | Description |
|---|---|---|
| `questions` | `list[str]` | The list of input questions. |
| `predictions` | `list[str]` | Answers generated by the model being evaluated. |
| `references` | `list[str]` | Ground-truth (reference) answers. |
| `api_key` | `str` | API key for the selected LLM provider. |
| `provider` | `str` | LLM provider; must support top-n token log-probabilities. Default: `openai`. |
| `model` | `str` | Name of the judge LLM. Default: `gpt-4o-mini`. |
## Output
Calling the `compute` method returns a dictionary containing the L3Score:

```python
{"L3Score": float}
```
The value is the average score over all (question, prediction, reference) triplets.
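`compute` returns only this corpus-level average. If per-example scores are needed, one workaround (a sketch, not a built-in option of the metric) is to call `compute` once per triplet:

```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

questions = ["What is the capital of France?", "What is the capital of Germany?"]
predictions = ["Paris", "Moscow"]
references = ["Paris", "Berlin"]

# One compute call per (question, prediction, reference) triplet.
per_example = [
    l3score.compute(
        questions=[q],
        predictions=[p],
        references=[r],
        api_key="your-openai-api-key",
        provider="openai",
        model="gpt-4o-mini",
    )["L3Score"]
    for q, p, r in zip(questions, predictions, references)
]
print(per_example)  # e.g. [0.99..., 0.00...]
```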
## Examples
```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

# Semantically equivalent answers receive a score close to 1.
score = l3score.compute(
    questions=["What is the capital of France?"],
    predictions=["Paris"],
    references=["Paris"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini",
)
# {'L3Score': 0.99...}

# Semantically different answers receive a score close to 0.
score = l3score.compute(
    questions=["What is the capital of Germany?"],
    predictions=["Moscow"],
    references=["Berlin"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini",
)
# {'L3Score': 0.00...}
```
## Limitations and Bias
- Requires a judge model from a provider that exposes top-n token log-probabilities (e.g., OpenAI, DeepSeek, Groq).
- Scores are only comparable when using the same judge model.
## Citation
```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={arXiv preprint arXiv:2407.09413},
  year={2024}
}
```