AI & Machine Learning Engineering

RAG Evaluation Pipeline: Implementing Ragas and TruLens for LLM Output Quality Metrics

MatterAI
12 min read

Evaluating RAG systems requires measuring retrieval and generation quality without ground truth labels. Ragas and TruLens provide complementary approaches: Ragas offers model-based batch evaluation metrics, while TruLens provides real-time feedback functions with observability tracking.

Prerequisites and Version Pinning

Both libraries have undergone breaking API changes. Pin versions strictly for reproducibility:

pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 trulens-providers-openai==1.0.0 langchain-openai==0.2.0 langchain-chroma==0.1.4

The RAG Triad Framework

Both tools converge on three core metrics:

| Metric | Definition | Target |
| --- | --- | --- |
| Context Relevance | Retrieved context's relevance to the query | Retrieval quality |
| Groundedness / Faithfulness | Answer's factual alignment with retrieved context | Hallucination detection |
| Answer Relevance | Response's relevance to the original question | Generation quality |

Ragas Implementation (v0.2+)

Ragas uses LLM-as-a-judge to compute metrics on evaluation datasets.

Dataset Structure

Ragas v0.2+ uses SingleTurnSample objects wrapped in EvaluationDataset:

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import Faithfulness, ResponseRelevancy, ContextPrecision

# Create individual samples
samples = [
    SingleTurnSample(
        user_input="What is RAG?",
        retrieved_contexts=["Retrieval-Augmented Generation combines retrieval with generation..."],
        response="RAG is a technique that retrieves documents to augment LLM context.",
        reference="RAG retrieves documents to augment LLM context."
    ),
    SingleTurnSample(
        user_input="How does vector search work?",
        retrieved_contexts=["Vector search uses embeddings to find similar documents..."],
        response="Vector search finds similar documents using embedding similarity.",
        reference="Vector search compares embedding similarity."
    )
]

# Build evaluation dataset
eval_dataset = EvaluationDataset(samples=samples)

Running Evaluation

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure evaluator LLM (Ragas v0.2 expects a wrapped LangChain LLM)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

# Run evaluation with error handling
try:
    results = evaluate(
        dataset=eval_dataset,
        metrics=[Faithfulness(), ResponseRelevancy(), ContextPrecision()],
        llm=evaluator_llm
    )
    print(results.to_pandas())
except Exception as e:
    print(f"Evaluation failed: {e}")
    # Consider fallback or retry logic

Tuning Concurrency for Large Datasets

evaluate() is synchronous (awaiting it raises a TypeError), but Ragas issues its judge LLM calls concurrently under the hood. Throughput on large datasets is tuned through RunConfig:

from ragas import evaluate
from ragas.run_config import RunConfig

# Raise max_workers to run more judge calls in parallel; keep timeouts generous
run_config = RunConfig(max_workers=16, timeout=120)

results = evaluate(
    dataset=eval_dataset,
    metrics=[Faithfulness(), ResponseRelevancy()],
    llm=evaluator_llm,
    run_config=run_config
)

Higher concurrency typically cuts wall-clock time substantially on datasets of 100+ samples, bounded by your provider's rate limits.

Key Ragas Metrics

| Metric | Description | Ground Truth Required |
| --- | --- | --- |
| Faithfulness | Measures whether claims in the answer are inferable from context | No |
| Response Relevancy | Scores how well the answer addresses the question | No |
| Context Precision | Signal-to-noise ratio: are relevant chunks ranked higher? | No |
| Context Recall | Completeness of retrieved context vs. the reference | Yes |

Context Precision vs Context Relevancy

These metrics measure different aspects:

  • Context Precision: Evaluates whether relevant contexts appear at the top of the retrieval list. Uses the formula: precision at each relevant chunk position, averaged. High scores indicate good ranking.

  • Context Relevancy (deprecated in v0.2+): Previously measured overall relevance percentage. Use Context Precision instead for current versions.
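
The averaged-precision formula above is easy to state in code. A minimal sketch, assuming per-chunk boolean relevance judgments (the `context_precision` helper is illustrative, not the Ragas implementation):

```python
def context_precision(relevance_flags: list) -> float:
    """Average precision@k over the positions of relevant chunks.

    relevance_flags[i] is True when the chunk at rank i (0-based)
    is judged relevant to the query.
    """
    precisions = []
    relevant_seen = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / rank)  # precision at this position
    if not precisions:
        return 0.0
    return sum(precisions) / len(precisions)

# Identical chunk sets, different ranking: relevant-first scores higher
assert context_precision([True, True, False]) == 1.0
assert round(context_precision([False, True, True]), 3) == 0.583
```

Ranking sensitivity is the point: reordering retrieved chunks (for example with a reranker) can raise Context Precision even when the retrieved set is unchanged.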

TruLens Implementation (v1.0+)

TruLens provides feedback functions with built-in tracking and a dashboard.

Feedback Functions Setup

from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

provider = OpenAI(model_engine="gpt-4o")

# Define feedback functions with proper chaining
f_relevance = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input_output()
)

f_groundedness = (
    Feedback(provider.groundedness_measure_with_reasons)
    # select_context needs the app so TruLens can locate the retriever;
    # chain is the LCEL app built in the next section
    .on(TruChain.select_context(chain))
    .on_output()
)

f_answer_relevance = (
    Feedback(provider.relevance)
    .on_input()
    .on_output()
)

Recording RAG Application (LCEL with LangChain 0.2+)

RetrievalQA is deprecated in LangChain 0.2+. Use LCEL with create_retrieval_chain:

from trulens.apps.langchain import TruChain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain (replaces deprecated RetrievalQA)
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context:
{context}

Question: {input}
""")

document_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, document_chain)

# Wrap with TruLens
try:
    tru_recorder = TruChain(
        chain,
        app_name="rag_pipeline",
        app_version="v1",
        feedbacks=[f_relevance, f_groundedness, f_answer_relevance]
    )

    # Record a query
    with tru_recorder as recording:
        response = chain.invoke({"input": "What is RAG?"})

    print(response["answer"])
except Exception as e:
    print(f"TruLens recording failed: {e}")

Launching the Dashboard

from trulens.dashboard import run_dashboard

# Launch dashboard on default port (requires TruLens 1.0+)
run_dashboard()

# For custom port or remote access:
# run_dashboard(port=8501, force=True)

Custom Feedback Functions

For production groundedness evaluation, use a cross-encoder NLI model:

from trulens.core import Feedback
from trulens.apps.langchain import TruChain
from sentence_transformers import CrossEncoder
import re

# Load NLI model for entailment detection
nli_model = CrossEncoder('cross-encoder/nli-deberta-v3-base')

def extract_claims(text: str) -> list:
    """Extract factual claims from text using sentence splitting."""
    sentences = re.split(r'[.!?]+', text)
    return [s.strip() for s in sentences if s.strip()]

def verify_claim_with_nli(claim: str, context: str) -> float:
    """
    Use the NLI cross-encoder to check whether the context entails the claim.
    Returns the softmax probability of entailment (0-1).
    """
    import numpy as np

    # predict returns one row of logits per pair:
    # [contradiction, entailment, neutral] for this model
    logits = nli_model.predict([(context, claim)])[0]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entailment_idx = 1  # index of the entailment label
    return float(probs[entailment_idx])

def production_groundedness(context: str, answer: str) -> float:
    """
    Production-ready groundedness using NLI entailment.
    Returns average entailment score across all claims.
    """
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    
    entailment_scores = [
        verify_claim_with_nli(claim, context)
        for claim in claims
    ]
    return sum(entailment_scores) / len(entailment_scores)

# Register as TruLens feedback
f_custom_groundedness = (
    Feedback(production_groundedness)
    .on(TruChain.select_context())
    .on_output()
)

Evaluation Dataset Sizing

Statistical significance requires sufficient sample sizes:

| Dataset Size | Statistical Confidence | Use Case |
| --- | --- | --- |
| < 30 samples | Low (high variance) | Quick smoke tests only |
| 50-100 samples | Moderate | Development iteration |
| 100-200 samples | Good | Pre-production validation |
| 500+ samples | High | Production benchmarking, CI/CD |

Recommendations:

  • Minimum 50 samples for meaningful metric averages
  • 100+ samples for detecting 5% performance changes with 80% confidence
  • Stratify samples across query types and difficulty levels
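
To see why small datasets give noisy averages, you can bootstrap a confidence interval on a metric's mean and watch it tighten as the sample grows. A minimal stdlib sketch (the `bootstrap_ci` helper and the score lists are illustrative assumptions):

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=42):
    """Bootstrap a (1 - alpha) confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Same score distribution at 20 vs. 200 samples: the interval tightens
small = [0.7, 0.9, 0.6, 0.8, 0.75] * 4
large = small * 10
lo_s, hi_s = bootstrap_ci(small)
lo_l, hi_l = bootstrap_ci(large)
assert (hi_l - lo_l) < (hi_s - lo_s)
```

If the interval around a faithfulness mean is wider than the regression you want to detect, the comparison is noise; that is the rationale behind the 100+ sample tier.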

Metric Score Interpretation

| Score Range | Interpretation | Action |
| --- | --- | --- |
| 0.8 - 1.0 | Excellent | Production ready |
| 0.6 - 0.8 | Acceptable | Monitor for degradation |
| 0.4 - 0.6 | Poor | Investigate retrieval or generation |
| < 0.4 | Critical | Block deployment, debug immediately |

Recommended thresholds for alerting:

  • Groundedness < 0.7: Potential hallucination risk
  • Context Precision < 0.6: Retrieval needs improvement
  • Answer Relevance < 0.7: Generation misaligned with query
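
Those thresholds can be enforced with a small helper that compares per-metric scores against the alert levels above. A sketch (the `check_thresholds` function and metric key names are assumptions, not part of either library's API):

```python
ALERT_THRESHOLDS = {
    "groundedness": 0.7,
    "context_precision": 0.6,
    "answer_relevance": 0.7,
}

def check_thresholds(metric_scores: dict) -> list:
    """Return an alert message for every metric below its threshold."""
    alerts = []
    for name, threshold in ALERT_THRESHOLDS.items():
        score = metric_scores.get(name)
        if score is not None and score < threshold:
            alerts.append(f"{name}={score:.2f} below threshold {threshold}")
    return alerts

assert check_thresholds({"groundedness": 0.82, "context_precision": 0.71}) == []
assert check_thresholds({"groundedness": 0.65}) == ["groundedness=0.65 below threshold 0.7"]
```

Wiring this into CI (fail the build when the list is non-empty) turns the table above into an automated deployment gate.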

Evaluation Cost Considerations

LLM-as-a-judge evaluation incurs API costs per metric per sample:

| Metric | LLM Calls per Sample | Approximate Tokens |
| --- | --- | --- |
| Faithfulness | 2-3 | 500-1500 |
| Response Relevancy | 1 | 200-500 |
| Context Precision | 1 per context chunk | 300-800 |

Cost estimation formula: samples × metrics × avg_tokens_per_metric × (price_per_1K_tokens / 1,000)

Example calculation (GPT-4o-mini at $0.15/1K input tokens):

  • 100 samples × 3 metrics × 800 avg tokens = 240,000 tokens
  • 240,000 tokens × $0.15 / 1,000 = $36.00

For GPT-4o at $2.50/1K input tokens, the same workload costs $600.00.
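
The formula and the worked example above can be wrapped in a small helper for budgeting runs. A sketch (the `estimate_eval_cost` function is illustrative; prices follow the per-1K figures quoted above):

```python
def estimate_eval_cost(samples: int, metrics: int, avg_tokens: int, price_per_1k: float) -> float:
    """Cost in dollars: samples × metrics × avg tokens per metric × price per 1K tokens."""
    total_tokens = samples * metrics * avg_tokens
    return total_tokens * price_per_1k / 1000

# Reproduces the article's figures
assert abs(estimate_eval_cost(100, 3, 800, 0.15) - 36.0) < 1e-6   # GPT-4o-mini
assert abs(estimate_eval_cost(100, 3, 800, 2.50) - 600.0) < 1e-6  # GPT-4o
```

Running the estimate before a large batch makes the judge-model choice explicit: here the cheaper judge is roughly 17× less expensive for the same workload.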

Comparative Analysis

| Aspect | Ragas | TruLens |
| --- | --- | --- |
| Primary Use | Batch evaluation, CI/CD pipelines | Real-time observability, production monitoring |
| Output | Metric scores per sample | Dashboard with traces and feedback |
| Integration | Post-hoc evaluation | Inline recording during inference |
| Ground Truth | Optional (required for Context Recall) | Not required |
| Best For | Dataset benchmarking, model comparison | Production monitoring, debugging |

Implementation Pipeline

Combined Architecture

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain
prompt = ChatPromptTemplate.from_template("Answer: {context}\n\nQuestion: {input}")
document_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, document_chain)

# Define TruLens feedback functions
provider = OpenAI(model_engine="gpt-4o")
f_groundedness = (
    Feedback(provider.groundedness_measure_with_reasons)
    .on(TruChain.select_context(rag_chain))
    .on_output()
)
f_relevance = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input_output()
)

def build_ragas_dataset(qa_pairs: list, retriever) -> EvaluationDataset:
    """Build Ragas EvaluationDataset from QA pairs with retrieved contexts."""
    samples = []
    for pair in qa_pairs:
        docs = retriever.invoke(pair["question"])
        contexts = [doc.page_content for doc in docs]
        
        samples.append(SingleTurnSample(
            user_input=pair["question"],
            retrieved_contexts=contexts,
            response=pair["answer"],
            reference=pair.get("ground_truth")
        ))
    return EvaluationDataset(samples=samples)

def evaluate_rag_offline(qa_pairs: list, retriever, llm) -> dict:
    """Run offline evaluation with Ragas."""
    eval_dataset = build_ragas_dataset(qa_pairs, retriever)
    try:
        results = evaluate(
            dataset=eval_dataset,
            metrics=[Faithfulness(), ResponseRelevancy()],
            llm=llm
        )
        return results.to_pandas().to_dict()
    except Exception as e:
        print(f"Offline evaluation error: {e}")
        return {}

def create_monitored_chain(base_chain, feedbacks: list):
    """Wrap chain with TruLens monitoring."""
    return TruChain(
        base_chain,
        app_name="production_rag",
        app_version="v1",
        feedbacks=feedbacks
    )

# Example usage
test_set = [
    {"question": "What is RAG?", "answer": "RAG retrieves documents...", "ground_truth": "RAG is..."},
    {"question": "How does vector search work?", "answer": "Vector search...", "ground_truth": "Vector search..."}
]

results = evaluate_rag_offline(test_set, retriever, llm)
monitored_chain = create_monitored_chain(rag_chain, [f_groundedness, f_relevance])

Getting Started

  1. Install dependencies with pinned versions: pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 langchain-openai==0.2.0

  2. Prepare evaluation dataset (minimum 50 samples) with questions, retrieved contexts, and generated answers using SingleTurnSample objects

  3. Run Ragas evaluation for baseline metrics; raise RunConfig concurrency for datasets > 100 samples

  4. Build LCEL chain with create_retrieval_chain (not deprecated RetrievalQA)

  5. Wrap production chain with TruChain recorder for ongoing monitoring

  6. Set alert thresholds: groundedness < 0.7, context precision < 0.6

  7. Monitor costs: GPT-4o-mini ~$36 per 100 samples; GPT-4o ~$600 per 100 samples

  8. Iterate on retrieval and generation based on metric patterns
