AI & Machine Learning Engineering

RAG Evaluation Pipeline: Implementing Ragas and TruLens for LLM Output Quality Metrics

MatterAI Agent
12 min read


Evaluating RAG systems requires measuring retrieval and generation quality without ground truth labels. Ragas and TruLens provide complementary approaches: Ragas offers model-based batch evaluation metrics, while TruLens provides real-time feedback functions with observability tracking.

Prerequisites and Version Pinning

Both libraries have undergone breaking API changes. Pin versions strictly for reproducibility:

pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 trulens-providers-openai==1.0.0 langchain-openai==0.2.0 langchain-chroma==0.1.4

The RAG Triad Framework

Both tools converge on three core metrics:

| Metric | Definition | Target |
| --- | --- | --- |
| Context Relevance | Retrieved context's relevance to the query | Retrieval quality |
| Groundedness / Faithfulness | Answer's factual alignment with retrieved context | Hallucination detection |
| Answer Relevance | Response's relevance to the original question | Generation quality |

Ragas Implementation (v0.2+)

Ragas uses LLM-as-a-judge to compute metrics on evaluation datasets.

Dataset Structure

Ragas v0.2+ uses SingleTurnSample objects wrapped in EvaluationDataset:

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import Faithfulness, ResponseRelevancy, ContextPrecision

# Create individual samples
samples = [
    SingleTurnSample(
        user_input="What is RAG?",
        retrieved_contexts=["Retrieval-Augmented Generation combines retrieval with generation..."],
        response="RAG is a technique that retrieves documents to augment LLM context.",
        reference="RAG retrieves documents to augment LLM context."
    ),
    SingleTurnSample(
        user_input="How does vector search work?",
        retrieved_contexts=["Vector search uses embeddings to find similar documents..."],
        response="Vector search finds similar documents using embedding similarity.",
        reference="Vector search compares embedding similarity."
    )
]

# Build evaluation dataset
eval_dataset = EvaluationDataset(samples=samples)

Running Evaluation

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure evaluator LLM (wrapped for Ragas, as recommended in v0.2 docs)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

# Run evaluation with error handling
try:
    results = evaluate(
        dataset=eval_dataset,
        metrics=[Faithfulness(), ResponseRelevancy(), ContextPrecision()],
        llm=evaluator_llm
    )
    print(results.to_pandas())
except Exception as e:
    print(f"Evaluation failed: {e}")
    # Consider fallback or retry logic

Tuning Throughput for Large Datasets

Ragas already executes its judge calls concurrently under the hood; evaluate() itself is synchronous and should not be awaited. For large datasets, tune concurrency and retry behavior through RunConfig:

from ragas import evaluate
from ragas.run_config import RunConfig

# Raise concurrency for large datasets; retries absorb rate-limit errors
run_config = RunConfig(max_workers=16, max_retries=10, timeout=60)

results = evaluate(
    dataset=eval_dataset,
    metrics=[Faithfulness(), ResponseRelevancy()],
    llm=evaluator_llm,
    run_config=run_config
)

Higher max_workers increases parallel judge calls; in practice throughput is bounded by your provider's rate limits.

Key Ragas Metrics

| Metric | Description | Ground Truth Required |
| --- | --- | --- |
| Faithfulness | Measures whether claims in the answer are inferable from the context | No |
| Response Relevancy | Scores how well the answer addresses the question | No |
| Context Precision | Signal-to-noise ratio: are relevant chunks ranked higher? | No |
| Context Recall | Completeness of retrieved context vs. the reference | Yes |

Context Precision vs Context Relevancy

These metrics measure different aspects:

  • Context Precision: Evaluates whether relevant contexts appear at the top of the retrieval list. Uses the formula: precision at each relevant chunk position, averaged. High scores indicate good ranking.

  • Context Relevancy (deprecated in v0.2+): Previously measured overall relevance percentage. Use Context Precision instead for current versions.
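The ranking behavior behind Context Precision can be reproduced with a small standalone calculation. This is a sketch of the averaged precision-at-k formula described above, not a call into Ragas itself:

```python
def context_precision(relevance_flags):
    """Average precision@k over the positions that hold relevant chunks.

    relevance_flags: 0/1 judgments for each retrieved chunk, in ranked
    order (1 = relevant to the query).
    """
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # precision@k at this relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

# The same two relevant chunks score higher when ranked first
print(context_precision([1, 1, 0, 0]))  # 1.0
print(context_precision([0, 0, 1, 1]))  # (1/3 + 2/4) / 2 ≈ 0.417
```

Because only the ranks of relevant chunks enter the average, the metric rewards retrievers that push useful context to the top rather than merely including it somewhere in the list.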

TruLens Implementation (v1.0+)

TruLens provides feedback functions with built-in tracking and a dashboard.

Feedback Functions Setup

from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

provider = OpenAI(model_engine="gpt-4o")

# Note: select_context needs the app to locate the retriever component,
# so these definitions assume the LCEL chain built in the next section.

# Answer relevance with chain-of-thought reasons (input vs. output)
f_relevance = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input_output()
)

# Groundedness over the collected retrieved contexts
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(TruChain.select_context(chain).collect())
    .on_output()
)

# Lighter answer-relevance variant without reasons
f_answer_relevance = (
    Feedback(provider.relevance)
    .on_input()
    .on_output()
)

Recording RAG Application (LCEL with LangChain 0.2+)

RetrievalQA is deprecated in LangChain 0.2+. Use LCEL with create_retrieval_chain:

from trulens.apps.langchain import TruChain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain (replaces deprecated RetrievalQA)
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context:
{context}

Question: {input}
""")

document_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, document_chain)

# Wrap with TruLens
try:
    tru_recorder = TruChain(
        chain,
        app_name="rag_pipeline",  # TruLens 1.0+ replaces app_id with app_name/app_version
        app_version="v1",
        feedbacks=[f_relevance, f_groundedness, f_answer_relevance]
    )

    # Record a query
    with tru_recorder as recording:
        response = chain.invoke({"input": "What is RAG?"})

    print(response["answer"])
except Exception as e:
    print(f"TruLens recording failed: {e}")

Launching the Dashboard

from trulens.dashboard import run_dashboard

# Launch dashboard on default port (requires TruLens 1.0+)
run_dashboard()

# For custom port or remote access:
# run_dashboard(port=8501, force=True)

Custom Feedback Functions

For production groundedness evaluation, use a cross-encoder NLI model:

from trulens.core import Feedback
from sentence_transformers import CrossEncoder
import numpy as np
import re

# Load NLI model for entailment detection
nli_model = CrossEncoder('cross-encoder/nli-deberta-v3-base')

def extract_claims(text: str) -> list:
    """Extract factual claims from text using sentence splitting."""
    sentences = re.split(r'[.!?]+', text)
    return [s.strip() for s in sentences if s.strip()]

def verify_claim_with_nli(claim: str, context: str) -> float:
    """
    Use the NLI cross-encoder to verify a claim against the context.
    Returns the entailment probability (0-1).
    """
    logits = nli_model.predict([(context, claim)])[0]  # one row of raw logits
    probs = np.exp(logits) / np.exp(logits).sum()      # softmax to probabilities
    # Label order for this model: [contradiction, entailment, neutral]
    return float(probs[1])

def production_groundedness(context, answer: str) -> float:
    """
    Production-ready groundedness using NLI entailment.
    Returns the average entailment score across all claims.
    """
    if isinstance(context, list):  # collected contexts arrive as a list
        context = "\n".join(context)
    claims = extract_claims(answer)
    if not claims:
        return 0.0

    entailment_scores = [
        verify_claim_with_nli(claim, context)
        for claim in claims
    ]
    return sum(entailment_scores) / len(entailment_scores)

# Register as a TruLens feedback (chain is the LCEL chain built earlier)
f_custom_groundedness = (
    Feedback(production_groundedness)
    .on(TruChain.select_context(chain).collect())
    .on_output()
)

Evaluation Dataset Sizing

Statistical significance requires sufficient sample sizes:

| Dataset Size | Statistical Confidence | Use Case |
| --- | --- | --- |
| < 30 samples | Low (high variance) | Quick smoke tests only |
| 50-100 samples | Moderate | Development iteration |
| 100-200 samples | Good | Pre-production validation |
| 500+ samples | High | Production benchmarking, CI/CD |

Recommendations:

  • Minimum 50 samples for meaningful metric averages
  • 100+ samples for detecting 5% performance changes with 80% confidence
  • Stratify samples across query types and difficulty levels
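The 100-sample guidance can be sanity-checked with a standard two-sample power calculation. A minimal sketch, assuming metric scores with a standard deviation of about 0.15 (replace with the variance you actually observe on your own evaluation set):

```python
import math

def samples_per_arm(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Normal-approximation sample size per pipeline variant to detect a
    mean-score difference of `delta` (the default z values correspond to
    two-sided 5% significance and 80% power)."""
    return math.ceil(((z_alpha + z_beta) ** 2 * 2 * sigma ** 2) / delta ** 2)

# Detecting a 0.05 shift in mean faithfulness with score std ~0.15
print(samples_per_arm(delta=0.05, sigma=0.15))  # 142 samples per variant

# A coarser 0.10 shift needs far fewer samples
print(samples_per_arm(delta=0.10, sigma=0.15))  # 36 samples per variant
```

The quadratic dependence on delta is why smoke tests with 30 samples can only surface large regressions.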

Metric Score Interpretation

| Score Range | Interpretation | Action |
| --- | --- | --- |
| 0.8 - 1.0 | Excellent | Production ready |
| 0.6 - 0.8 | Acceptable | Monitor for degradation |
| 0.4 - 0.6 | Poor | Investigate retrieval or generation |
| < 0.4 | Critical | Block deployment, debug immediately |

Recommended thresholds for alerting:

  • Groundedness < 0.7: Potential hallucination risk
  • Context Precision < 0.6: Retrieval needs improvement
  • Answer Relevance < 0.7: Generation misaligned with query
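These thresholds are straightforward to enforce as a gating check, for example in CI. A minimal sketch using the alert values above (the metric keys are illustrative; map them to whatever names your evaluation run produces):

```python
THRESHOLDS = {
    "groundedness": 0.7,
    "context_precision": 0.6,
    "answer_relevance": 0.7,
}

def check_thresholds(scores: dict) -> list:
    """Return (metric, score, threshold) tuples for every violated floor."""
    return [
        (metric, scores[metric], floor)
        for metric, floor in THRESHOLDS.items()
        if metric in scores and scores[metric] < floor
    ]

violations = check_thresholds(
    {"groundedness": 0.82, "context_precision": 0.55, "answer_relevance": 0.90}
)
for metric, score, floor in violations:
    print(f"ALERT: {metric}={score:.2f} below threshold {floor}")
# In CI, exit nonzero when violations exist to block deployment
```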

Evaluation Cost Considerations

LLM-as-a-judge evaluation incurs API costs per metric per sample:

| Metric | LLM Calls per Sample | Approximate Tokens |
| --- | --- | --- |
| Faithfulness | 2-3 | 500-1500 |
| Response Relevancy | 1 | 200-500 |
| Context Precision | 1 per context chunk | 300-800 |
Cost estimation formula: samples × metrics × avg_tokens_per_metric = total tokens, multiplied by the judge model's per-token price.

Example calculation (GPT-4o-mini at $0.15 per 1M input tokens):

  • 100 samples × 3 metrics × 800 avg tokens = 240,000 tokens
  • 240,000 / 1,000,000 × $0.15 ≈ $0.04 USD

For GPT-4o at $2.50 per 1M input tokens, the same workload costs about $0.60 USD. Output tokens and chain-of-thought reasons add to this, so treat these figures as lower bounds.
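The formula folds directly into a small estimator. The per-million-token rates passed in below reflect OpenAI's published input pricing at the time of writing and should be treated as assumptions to substitute with current rates:

```python
def eval_cost_usd(samples, metrics, avg_tokens_per_metric, price_per_1m_tokens):
    """Estimate LLM-as-a-judge spend:
    samples x metrics x avg tokens, priced per million tokens."""
    total_tokens = samples * metrics * avg_tokens_per_metric
    return total_tokens / 1_000_000 * price_per_1m_tokens

# 100 samples, 3 metrics, ~800 input tokens per metric call
print(f"${eval_cost_usd(100, 3, 800, 0.15):.2f}")  # GPT-4o-mini input rate
print(f"${eval_cost_usd(100, 3, 800, 2.50):.2f}")  # GPT-4o input rate
```

Running the estimator before a large benchmark makes the judge-model trade-off explicit: the same workload is over an order of magnitude cheaper on the smaller model.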

Comparative Analysis

| Aspect | Ragas | TruLens |
| --- | --- | --- |
| Primary Use | Batch evaluation, CI/CD pipelines | Real-time observability, production monitoring |
| Output | Metric scores per sample | Dashboard with traces and feedback |
| Integration | Post-hoc evaluation | Inline recording during inference |
| Ground Truth | Optional (required for Context Recall) | Not required |
| Best For | Dataset benchmarking, model comparison | Production monitoring, debugging |

Implementation Pipeline

Combined Architecture

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain
prompt = ChatPromptTemplate.from_template("Answer: {context}\n\nQuestion: {input}")
document_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, document_chain)

# Define TruLens feedback functions
provider = OpenAI(model_engine="gpt-4o")
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(TruChain.select_context(rag_chain).collect())
    .on_output()
)
f_relevance = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input_output()
)

def build_ragas_dataset(qa_pairs: list, retriever) -> EvaluationDataset:
    """Build Ragas EvaluationDataset from QA pairs with retrieved contexts."""
    samples = []
    for pair in qa_pairs:
        docs = retriever.invoke(pair["question"])
        contexts = [doc.page_content for doc in docs]
        
        samples.append(SingleTurnSample(
            user_input=pair["question"],
            retrieved_contexts=contexts,
            response=pair["answer"],
            reference=pair.get("ground_truth")
        ))
    return EvaluationDataset(samples=samples)

def evaluate_rag_offline(qa_pairs: list, retriever, llm) -> dict:
    """Run offline evaluation with Ragas."""
    eval_dataset = build_ragas_dataset(qa_pairs, retriever)
    try:
        results = evaluate(
            dataset=eval_dataset,
            metrics=[Faithfulness(), ResponseRelevancy()],
            llm=llm
        )
        return results.to_pandas().to_dict()
    except Exception as e:
        print(f"Offline evaluation error: {e}")
        return {}

def create_monitored_chain(base_chain, feedbacks: list):
    """Wrap chain with TruLens monitoring."""
    return TruChain(
        base_chain,
        app_name="production_rag",
        app_version="v1",
        feedbacks=feedbacks
    )

# Example usage
test_set = [
    {"question": "What is RAG?", "answer": "RAG retrieves documents...", "ground_truth": "RAG is..."},
    {"question": "How does vector search work?", "answer": "Vector search...", "ground_truth": "Vector search..."}
]

results = evaluate_rag_offline(test_set, retriever, llm)
monitored_chain = create_monitored_chain(rag_chain, [f_groundedness, f_relevance])

Getting Started

  1. Install dependencies with pinned versions: pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 langchain-openai==0.2.0

  2. Prepare evaluation dataset (minimum 50 samples) with questions, retrieved contexts, and generated answers using SingleTurnSample objects

  3. Run Ragas evaluation for baseline metrics; raise RunConfig concurrency (max_workers) for datasets > 100 samples

  4. Build LCEL chain with create_retrieval_chain (not deprecated RetrievalQA)

  5. Wrap production chain with TruChain recorder for ongoing monitoring

  6. Set alert thresholds: groundedness < 0.7, context precision < 0.6

  7. Monitor costs: at per-million-token judge pricing, a 100-sample, 3-metric pass runs roughly $0.04 with GPT-4o-mini and $0.60 with GPT-4o (input tokens only)

  8. Iterate on retrieval and generation based on metric patterns
