Evaluation

AI agent applications can be evaluated quantitatively across a variety of dimensions and performance indicators.

RAG Application Evaluation

Process

1. Prepare Test Data

A typical Retrieval-Augmented Generation (RAG) application produces a set of data as it runs. When creating an evaluation task for a RAG application, the following items should be prepared in the specified format and stored in a CSV file:

  • Question (list[str]): The questions posed by users to the application.
  • Contexts (list[list[str]]): The knowledge entries that the retriever looks up in the knowledge base for each question, which together form the context provided to the large model. Multiple context entries are separated by a semicolon ;.
  • Answer (list[str]): The answers generated by the large model based on the question, contexts, and prompt templates.

In addition, for alignment with the actual content during evaluation, the following is typically required:

  • Ground_truths (list[list[str]]): The true and correct answers corresponding to the user's question. If a question has multiple correct answers (e.g. the answer involves multiple parts of the source text), the entries are separated by semicolons ;.

In the evaluation task, the preparation process consists of two parts:

  1. Generating QA pairs from the knowledge base data; these serve as the question set and ground truths in the test data.
  2. Presenting the generated question set to the RAG application for answering, and collecting the response content and the retrieved knowledge base context; these serve as the answers and contexts in the test data.
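For orientation, the following is a minimal sketch of assembling such a test-data CSV with pandas. The rows are made-up placeholders rather than output of any real application, and multiple contexts or ground truths are joined with ; as described above.

import pandas as pd

# Hypothetical example row; real data comes from QA generation and the RAG application.
rows = [
    {
        "question": "What is the refund policy?",
        # multiple retrieved context entries joined with ";"
        "contexts": "Refunds are accepted within 30 days.;Items must be unused.",
        "answer": "You can request a refund within 30 days if the item is unused.",
        # multiple ground-truth entries joined with ";"
        "ground_truths": "Refunds are accepted within 30 days for unused items.",
    },
]

pd.DataFrame(rows).to_csv("dataset.csv", index=False)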

2. Prepare the “referee”

The evaluation task requires specifying a large model or a large model service as the evaluator (also referred to as the "referee" or "examiner"). By default, GPT-3.5 is used as the “referee”.

If using a locally running large model as the evaluator, it is recommended to:

  • Use models with a parameter size of 7B or higher. Larger models generally perform better on tasks.
  • Use "instruct fine-tuned" models. These fine-tuned models are often better at returning results in a specific format according to instructions.
  • When dealing with domain-specific knowledge (e.g. medical, financial), it is better to use large models fine-tuned on that particular domain for improved performance.

Additionally, some metrics require vectorizing the data with an Embedding model. The default option is the OpenAI Embedding API, but you can also use local models available through the Hugging Face ecosystem, such as the bge series.
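For orientation, here is a minimal langchain sketch of wiring up a locally served "referee" through an OpenAI-compatible endpoint together with a local bge embedding model. The endpoint URL, API key, and judge model name are placeholders, and this only illustrates the setup, not the evaluation code itself.

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings

# Placeholder endpoint and model name: point these at your own
# OpenAI-compatible service (e.g. vLLM or FastChat) and local instruct-tuned model.
judge = ChatOpenAI(
    model_name="my-local-instruct-model",
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="not-needed-for-local",
)

# Local embedding model from the bge series; requires `pip install sentence-transformers`.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")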

3. Perform Evaluation

Based on the given parameters, the data to be evaluated, and the referee model, execute the ragas_once CLI tool. The evaluation produces a summary.csv file containing the aggregated scores and a result.csv file containing the per-question scores.

4. Return Results

Read the evaluation result and return the score and whether it meets the threshold requirement.
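A minimal sketch of this step, assuming summary.csv holds one aggregated score per metric (the actual column layout of the file may differ) and using a hypothetical threshold of 0.7:

import pandas as pd

THRESHOLD = 0.7  # hypothetical tolerance threshold

# Column names are assumptions; check the actual summary.csv produced by the evaluation.
summary = pd.read_csv("summary.csv")
scores = summary.iloc[0].to_dict()

for metric, score in scores.items():
    if not isinstance(score, (int, float)):
        continue  # skip any non-numeric columns
    status = "pass" if score >= THRESHOLD else "below threshold"
    print(f"{metric}: {score:.3f} -> {status}")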

Metrics

For Retrieval-Augmented Generation (RAG) applications that follow the typical structure of "knowledge base + large model", the following metrics can be evaluated:

  • Faithfulness

    Does the large model only use the information provided in the contexts to answer the questions? Does it generate any "hallucinated" information? This metric assesses whether the large model faithfully follows the information provided in the contexts when generating answers.

    • Evaluation components: Question, Contexts, Answer
  • Answer Relevancy

    Does the large model provide a complete and relevant answer to the question? Does the answer contain content that fails to address the question? This metric measures how well the answer generated by the large model corresponds to the question.

    • This metric does not consider the correctness of the answer.
    • Evaluating this metric requires the use of an Embedding model.
    • Evaluation components: Question, Answer
  • Context Precision

    When the retriever retrieves multiple context entries from the knowledge base, does it prioritize the contexts that are more relevant to the ground truth by placing them at the top of the search results? This metric assesses the retriever's ability to match more accurate contexts with the question.

    • Evaluation components: Question, Contexts
  • Context Relevancy

    What proportion of the retrieved contexts contribute to answering the question? This metric measures the retriever's ability to accurately search for entries in the context that are relevant to the question.

    • Evaluation components: Question, Contexts
  • Context Recall

    Does the retrieved context completely cover the correct answer? This metric assesses the retriever's ability to comprehensively search for context that is relevant to the correct answer.

    • Evaluation components: Ground Truths, Contexts
  • Answer Similarity

    This metric calculates the semantic similarity between the generated answer and the correct answer using a cross-encoder model.

    • Evaluating this metric requires the use of an Embedding model.
    • Evaluation components: Answer, Ground Truths
  • Answer Correctness

    This metric evaluates whether the large model's answer aligns with the correct answer, combining semantic similarity and factual correctness according to configurable weights.

    • The default weights for semantic similarity and factual correctness are [0.5, 0.5], but they can be customized.
    • Evaluating this metric requires the use of an Embedding model.
    • Evaluation components: Answer, Ground Truths

When creating an evaluation task, the first five metrics are evaluated by default.
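As a concrete illustration of the Answer Correctness weighting described above, the small function below is illustrative only and is not the underlying ragas implementation.

# Illustrative only: combining semantic similarity and factual correctness with weights.
def answer_correctness(semantic_similarity: float, factual_correctness: float,
                       weights=(0.5, 0.5)) -> float:
    w_sim, w_fact = weights
    return w_sim * semantic_similarity + w_fact * factual_correctness

# e.g. similarity 0.9 and factual correctness 0.6 give 0.5*0.9 + 0.5*0.6 = 0.75
print(answer_correctness(0.9, 0.6))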

RAG Evaluation CR Definition Example

// RAGSpec defines the desired state of RAG
type RAGSpec struct {
    // CommonSpec
    basev1alpha1.CommonSpec `json:",inline"`

    // Application(required) defines the target of this RAG evaluation
    Application *basev1alpha1.TypedObjectReference `json:"application"`

    // Datasets defines the datasets which will be used to generate test datasets
    Datasets []Dataset `json:"datasets"`

    // JudgeLLM(required) defines the judge, an LLM used to evaluate the RAG application against the test dataset
    JudgeLLM *basev1alpha1.TypedObjectReference `json:"judge_llm"`

    // Metrics that this RAG evaluation will measure
    Metrics []Metric `json:"metrics"`

    // Report defines the evaluation report configurations
    Report Report `json:"report,omitempty"`

    // Storage must be provided; data needs to be saved throughout the evaluation phase
    Storage *corev1.PersistentVolumeClaimSpec `json:"storage"`

    // ServiceAccountName defines the user the job runs as
    // +kubebuilder:default=default
    ServiceAccountName string `json:"serviceAccountName,omitempty"`

    // Suspend pauses the evaluation process
    // +kubebuilder:default=false
    Suspend bool `json:"suspend,omitempty"`
}

// RAGStatus defines the observed state of RAG
type RAGStatus struct {
    // CompletionTime is the evaluation completion time
    CompletionTime *metav1.Time `json:"completionTime,omitempty"`

    // Phase is the current stage of the evaluation:
    // init, download, generate, judge, upload, complete
    Phase RAGPhase `json:"phase,omitempty"`

    // Conditions show the status of the job in the current stage
    Conditions []v1.JobCondition `json:"conditions,omitempty"`
}

Explanation of Spec Fields

  • CommonSpec: Basic descriptive information
  • Application: The RAG application to be evaluated in this assessment task
  • Datasets: QA dataset objects used to generate test data, including
    • Source: The source of the dataset
    • Files: The dataset files
  • JudgeLLM: The referee large model used for evaluating the test data
  • Metrics: A list of metrics to be evaluated. Each metric object includes:
    • Kind: The type of metric
    • Parameters: Parameters stored as key-value pairs, such as weights
    • ToleranceThreshold: The tolerance threshold for the metric. A score below this threshold indicates poor performance of the RAG application for that metric
  • Report: Evaluation report
  • Storage: Requested persistent storage for storing various data files
  • ServiceAccountName: The service account under which the evaluation job runs
  • Suspend: Indicates whether the evaluation task has been suspended

Explanation of Status Fields

  • CompletionTime: The time at which the evaluation task completed
  • Phase: The current stage of the evaluation task
    • QA generation; test data generation; evaluation; completion
  • Conditions: The status of the running subtasks
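To make the field layout concrete, below is a heavily hedged sketch of creating such a RAG resource with the Kubernetes Python client. The API group, version, resource plural, metric kind, and all nested reference fields are placeholders inferred from the json tags in the struct above; check the CRD actually installed in your cluster before relying on any of them.

from kubernetes import client, config

config.load_kube_config()

# Top-level spec keys follow the json tags in RAGSpec above; everything marked
# "placeholder" must be verified against the actual CRD schema.
rag = {
    "apiVersion": "evaluation.kubeagi.k8s.com.cn/v1alpha1",  # placeholder group/version
    "kind": "RAG",
    "metadata": {"name": "rag-eval-demo", "namespace": "default"},
    "spec": {
        "application": {"kind": "Application", "name": "my-rag-app"},  # placeholder reference
        "judge_llm": {"kind": "LLM", "name": "judge-model"},           # placeholder reference
        "datasets": [],            # dataset sources and files omitted in this sketch
        "metrics": [{"kind": "faithfulness"}],                         # placeholder metric kind
        "storage": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": "1Gi"}},
        },
        "serviceAccountName": "default",
        "suspend": False,
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="evaluation.kubeagi.k8s.com.cn",  # placeholder, must match the CRD
    version="v1alpha1",
    namespace="default",
    plural="rags",                          # placeholder, must match the CRD
    body=rag,
)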

ragas_once CLI Tool

Preparing the ragas and langchain environment:

pip install ragas==0.0.22 langchain==0.0.354

Source code installation:

git clone https://github.com/kubeagi/arcadia.git
cd arcadia/pypi/ragas_once
pip install -e .

Run the fiqa dataset demo using an OpenAI API key:

ro --apikey YOUR_API_KEY

Evaluate a specified dataset CSV using another OpenAI-compatible API and a Hugging Face embedding model:

ro --model MODEL_NAME --apibase API_BASE_URL --embeddings BAAI/bge-small-en --dataset path/to/dataset.csv

  • To run the Embedding model, you need the sentence-transformers library: pip install sentence-transformers

Parameters Description:

  • --model: Specifies the model to use for evaluation.
    • Default value is "gpt-3.5-turbo". Must be compatible with LangChain.
  • --apibase: Specifies the base URL for the API.
  • --apikey: Specifies the API key to authenticate requests.
    • Not required if using a pseudo-OpenAI API server, e.g. vLLM, FastChat, etc.
  • --embeddings: Specifies the Huggingface embeddings model to use for evaluation.
    • Embeddings will run locally.
    • Will use OpenAI embeddings if not set.
    • Recommended when using a pseudo-OpenAI API server.
  • --metrics: Specifies the metrics to use for evaluation.
    • Will use Ragas default metrics if not set.
    • Default metrics: ["answer_relevancy", "context_precision", "faithfulness", "context_recall", "context_relevancy"]
    • Other metrics: "answer_similarity", "answer_correctness"
  • --dataset: Specifies the path to the dataset for evaluation.
    • Dataset format must meet RAGAS requirements.
    • Will use fiqa dataset as demo if not set.
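Independently of the ro command, the same pinned ragas release can score such a dataset CSV directly. The sketch below assumes the column names and semicolon convention described in the test-data section, a placeholder file path, and that OPENAI_API_KEY is set so the default judge and embeddings can be used.

import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    faithfulness,
    context_recall,
    context_relevancy,
)

# Load the test data and split the semicolon-joined fields back into lists.
df = pd.read_csv("path/to/dataset.csv")
df["contexts"] = df["contexts"].apply(lambda s: s.split(";"))
df["ground_truths"] = df["ground_truths"].apply(lambda s: s.split(";"))

# Score with the five default metrics listed above.
result = evaluate(
    Dataset.from_pandas(df),
    metrics=[answer_relevancy, context_precision, faithfulness, context_recall, context_relevancy],
)
print(result)  # aggregated score per metric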