AI & Machine Learning Engineering

RAG Evaluation Pipeline: Implementing Ragas and TruLens for LLM Output Quality Metrics

MatterAI
12 min read

Evaluating RAG systems requires measuring retrieval and generation quality without ground truth labels. Ragas and TruLens provide complementary approaches: Ragas offers model-based batch evaluation metrics, while TruLens provides real-time feedback functions with observability tracking.

Prerequisites and Version Pinning

Both libraries have undergone breaking API changes. Pin versions strictly for reproducibility:

pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 trulens-providers-openai==1.0.0 langchain-openai==0.2.0 langchain-chroma==0.1.4

The RAG Triad Framework

Both tools converge on three core metrics:

| Metric | Definition | Target |
| --- | --- | --- |
| Context Relevance | Retrieved context's relevance to the query | Retrieval quality |
| Groundedness / Faithfulness | Answer's factual alignment with retrieved context | Hallucination detection |
| Answer Relevance | Response's relevance to the original question | Generation quality |

Ragas Implementation (v0.2+)

Ragas uses LLM-as-a-judge to compute metrics on evaluation datasets.

Dataset Structure

Ragas v0.2+ uses SingleTurnSample objects wrapped in EvaluationDataset:

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import Faithfulness, ResponseRelevancy, ContextPrecision

# Create individual samples
samples = [
    SingleTurnSample(
        user_input="What is RAG?",
        retrieved_contexts=["Retrieval-Augmented Generation combines retrieval with generation..."],
        response="RAG is a technique that retrieves documents to augment LLM context.",
        reference="RAG retrieves documents to augment LLM context."
    ),
    SingleTurnSample(
        user_input="How does vector search work?",
        retrieved_contexts=["Vector search uses embeddings to find similar documents..."],
        response="Vector search finds similar documents using embedding similarity.",
        reference="Vector search compares embedding similarity."
    )
]

# Build evaluation dataset
eval_dataset = EvaluationDataset(samples=samples)

Running Evaluation

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure evaluator LLM (Ragas v0.2 expects a wrapped LangChain LLM)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

# Run evaluation with error handling
try:
    results = evaluate(
        dataset=eval_dataset,
        metrics=[Faithfulness(), ResponseRelevancy(), ContextPrecision()],
        llm=evaluator_llm
    )
    print(results.to_pandas())
except Exception as e:
    print(f"Evaluation failed: {e}")
    # Consider fallback or retry logic

Tuning Concurrency for Large Datasets

evaluate() is synchronous (awaiting it raises a TypeError), but Ragas issues its judge LLM calls concurrently under the hood. Throughput on large datasets is tuned through RunConfig:

from ragas import evaluate
from ragas.run_config import RunConfig

# Raise max_workers to run more judge calls in parallel; keep timeouts generous
run_config = RunConfig(max_workers=16, timeout=120)

results = evaluate(
    dataset=eval_dataset,
    metrics=[Faithfulness(), ResponseRelevancy()],
    llm=evaluator_llm,
    run_config=run_config
)

Higher concurrency typically cuts wall-clock time substantially on datasets of 100+ samples, bounded by your provider's rate limits.

Key Ragas Metrics

| Metric | Description | Ground Truth Required |
| --- | --- | --- |
| Faithfulness | Measures whether claims in the answer are inferable from context | No |
| Response Relevancy | Scores how well the answer addresses the question | No |
| Context Precision | Signal-to-noise ratio: are relevant chunks ranked higher? | No |
| Context Recall | Completeness of retrieved context vs. the reference | Yes |

Context Precision vs Context Relevancy

These metrics measure different aspects:

  • Context Precision: Evaluates whether relevant contexts appear at the top of the retrieval list. Uses the formula: precision at each relevant chunk position, averaged. High scores indicate good ranking.

  • Context Relevancy (deprecated in v0.2+): Previously measured overall relevance percentage. Use Context Precision instead for current versions.
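
The averaged-precision formula above is easy to state in code. A minimal sketch, assuming per-chunk boolean relevance judgments (the `context_precision` helper is illustrative, not the Ragas implementation):

```python
def context_precision(relevance_flags: list) -> float:
    """Average precision@k over the positions of relevant chunks.

    relevance_flags[i] is True when the chunk at rank i (0-based)
    is judged relevant to the query.
    """
    precisions = []
    relevant_seen = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / rank)  # precision at this position
    if not precisions:
        return 0.0
    return sum(precisions) / len(precisions)

# Identical chunk sets, different ranking: relevant-first scores higher
assert context_precision([True, True, False]) == 1.0
assert round(context_precision([False, True, True]), 3) == 0.583
```

Ranking sensitivity is the point: reordering retrieved chunks (for example with a reranker) can raise Context Precision even when the retrieved set is unchanged.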

TruLens Implementation (v1.0+)

TruLens provides feedback functions with built-in tracking and a dashboard.

Feedback Functions Setup

from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

provider = OpenAI(model_engine="gpt-4o")

# Define feedback functions with proper chaining
f_relevance = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input_output()
)

f_groundedness = (
    Feedback(provider.groundedness_measure_with_reasons)
    # select_context needs the app so TruLens can locate the retriever;
    # chain is the LCEL app built in the next section
    .on(TruChain.select_context(chain))
    .on_output()
)

f_answer_relevance = (
    Feedback(provider.relevance)
    .on_input()
    .on_output()
)

Recording RAG Application (LCEL with LangChain 0.2+)

RetrievalQA is deprecated in LangChain 0.2+. Use LCEL with create_retrieval_chain:

from trulens.apps.langchain import TruChain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain (replaces deprecated RetrievalQA)
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context:
{context}

Question: {input}
""")

document_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, document_chain)

# Wrap with TruLens
try:
    tru_recorder = TruChain(
        chain,
        app_name="rag_pipeline",
        app_version="v1",
        feedbacks=[f_relevance, f_groundedness, f_answer_relevance]
    )

    # Record a query
    with tru_recorder as recording:
        response = chain.invoke({"input": "What is RAG?"})

    print(response["answer"])
except Exception as e:
    print(f"TruLens recording failed: {e}")

Launching the Dashboard

from trulens.dashboard import run_dashboard

# Launch dashboard on default port (requires TruLens 1.0+)
run_dashboard()

# For custom port or remote access:
# run_dashboard(port=8501, force=True)

Custom Feedback Functions

For production groundedness evaluation, use a cross-encoder NLI model:

from trulens.core import Feedback
from trulens.apps.langchain import TruChain
from sentence_transformers import CrossEncoder
import re

# Load NLI model for entailment detection
nli_model = CrossEncoder('cross-encoder/nli-deberta-v3-base')

def extract_claims(text: str) -> list:
    """Extract factual claims from text using sentence splitting."""
    sentences = re.split(r'[.!?]+', text)
    return [s.strip() for s in sentences if s.strip()]

def verify_claim_with_nli(claim: str, context: str) -> float:
    """
    Use the NLI cross-encoder to check whether the context entails the claim.
    Returns the softmax probability of entailment (0-1).
    """
    import numpy as np

    # predict returns one row of logits per pair:
    # [contradiction, entailment, neutral] for this model
    logits = nli_model.predict([(context, claim)])[0]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entailment_idx = 1  # index of the entailment label
    return float(probs[entailment_idx])

def production_groundedness(context: str, answer: str) -> float:
    """
    Production-ready groundedness using NLI entailment.
    Returns average entailment score across all claims.
    """
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    
    entailment_scores = [
        verify_claim_with_nli(claim, context)
        for claim in claims
    ]
    return sum(entailment_scores) / len(entailment_scores)

# Register as TruLens feedback
f_custom_groundedness = (
    Feedback(production_groundedness)
    .on(TruChain.select_context())
    .on_output()
)

Evaluation Dataset Sizing

Statistical significance requires sufficient sample sizes:

| Dataset Size | Statistical Confidence | Use Case |
| --- | --- | --- |
| < 30 samples | Low (high variance) | Quick smoke tests only |
| 50-100 samples | Moderate | Development iteration |
| 100-200 samples | Good | Pre-production validation |
| 500+ samples | High | Production benchmarking, CI/CD |

Recommendations:

  • Minimum 50 samples for meaningful metric averages
  • 100+ samples for detecting 5% performance changes with 80% confidence
  • Stratify samples across query types and difficulty levels
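
To see why small datasets give noisy averages, you can bootstrap a confidence interval on a metric's mean and watch it tighten as the sample grows. A minimal stdlib sketch (the `bootstrap_ci` helper and the score lists are illustrative assumptions):

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=42):
    """Bootstrap a (1 - alpha) confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Same score distribution at 20 vs. 200 samples: the interval tightens
small = [0.7, 0.9, 0.6, 0.8, 0.75] * 4
large = small * 10
lo_s, hi_s = bootstrap_ci(small)
lo_l, hi_l = bootstrap_ci(large)
assert (hi_l - lo_l) < (hi_s - lo_s)
```

If the interval around a faithfulness mean is wider than the regression you want to detect, the comparison is noise; that is the rationale behind the 100+ sample tier.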

Metric Score Interpretation

| Score Range | Interpretation | Action |
| --- | --- | --- |
| 0.8 - 1.0 | Excellent | Production ready |
| 0.6 - 0.8 | Acceptable | Monitor for degradation |
| 0.4 - 0.6 | Poor | Investigate retrieval or generation |
| < 0.4 | Critical | Block deployment, debug immediately |

Recommended thresholds for alerting:

  • Groundedness < 0.7: Potential hallucination risk
  • Context Precision < 0.6: Retrieval needs improvement
  • Answer Relevance < 0.7: Generation misaligned with query
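
Those thresholds can be enforced with a small helper that compares per-metric scores against the alert levels above. A sketch (the `check_thresholds` function and metric key names are assumptions, not part of either library's API):

```python
ALERT_THRESHOLDS = {
    "groundedness": 0.7,
    "context_precision": 0.6,
    "answer_relevance": 0.7,
}

def check_thresholds(metric_scores: dict) -> list:
    """Return an alert message for every metric below its threshold."""
    alerts = []
    for name, threshold in ALERT_THRESHOLDS.items():
        score = metric_scores.get(name)
        if score is not None and score < threshold:
            alerts.append(f"{name}={score:.2f} below threshold {threshold}")
    return alerts

assert check_thresholds({"groundedness": 0.82, "context_precision": 0.71}) == []
assert check_thresholds({"groundedness": 0.65}) == ["groundedness=0.65 below threshold 0.7"]
```

Wiring this into CI (fail the build when the list is non-empty) turns the table above into an automated deployment gate.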

Evaluation Cost Considerations

LLM-as-a-judge evaluation incurs API costs per metric per sample:

| Metric | LLM Calls per Sample | Approximate Tokens |
| --- | --- | --- |
| Faithfulness | 2-3 | 500-1500 |
| Response Relevancy | 1 | 200-500 |
| Context Precision | 1 per context chunk | 300-800 |

Cost estimation formula: samples × metrics × avg_tokens_per_metric × (price_per_1K_tokens / 1,000)

Example calculation (GPT-4o-mini at $0.15/1K input tokens):

  • 100 samples × 3 metrics × 800 avg tokens = 240,000 tokens
  • 240,000 tokens × $0.15 / 1,000 = $36.00

For GPT-4o at $2.50/1K input tokens, the same workload costs $600.00.
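
The formula and the worked example above can be wrapped in a small helper for budgeting runs. A sketch (the `estimate_eval_cost` function is illustrative; prices follow the per-1K figures quoted above):

```python
def estimate_eval_cost(samples: int, metrics: int, avg_tokens: int, price_per_1k: float) -> float:
    """Cost in dollars: samples × metrics × avg tokens per metric × price per 1K tokens."""
    total_tokens = samples * metrics * avg_tokens
    return total_tokens * price_per_1k / 1000

# Reproduces the article's figures
assert abs(estimate_eval_cost(100, 3, 800, 0.15) - 36.0) < 1e-6   # GPT-4o-mini
assert abs(estimate_eval_cost(100, 3, 800, 2.50) - 600.0) < 1e-6  # GPT-4o
```

Running the estimate before a large batch makes the judge-model choice explicit: here the cheaper judge is roughly 17× less expensive for the same workload.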

Comparative Analysis

| Aspect | Ragas | TruLens |
| --- | --- | --- |
| Primary Use | Batch evaluation, CI/CD pipelines | Real-time observability, production monitoring |
| Output | Metric scores per sample | Dashboard with traces and feedback |
| Integration | Post-hoc evaluation | Inline recording during inference |
| Ground Truth | Optional (required for Context Recall) | Not required |
| Best For | Dataset benchmarking, model comparison | Production monitoring, debugging |

Implementation Pipeline

Combined Architecture

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain
prompt = ChatPromptTemplate.from_template("Answer: {context}\n\nQuestion: {input}")
document_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, document_chain)

# Define TruLens feedback functions
provider = OpenAI(model_engine="gpt-4o")
f_groundedness = (
    Feedback(provider.groundedness_measure_with_reasons)
    .on(TruChain.select_context(rag_chain))
    .on_output()
)
f_relevance = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input_output()
)

def build_ragas_dataset(qa_pairs: list, retriever) -> EvaluationDataset:
    """Build Ragas EvaluationDataset from QA pairs with retrieved contexts."""
    samples = []
    for pair in qa_pairs:
        docs = retriever.invoke(pair["question"])
        contexts = [doc.page_content for doc in docs]
        
        samples.append(SingleTurnSample(
            user_input=pair["question"],
            retrieved_contexts=contexts,
            response=pair["answer"],
            reference=pair.get("ground_truth")
        ))
    return EvaluationDataset(samples=samples)

def evaluate_rag_offline(qa_pairs: list, retriever, llm) -> dict:
    """Run offline evaluation with Ragas."""
    eval_dataset = build_ragas_dataset(qa_pairs, retriever)
    try:
        results = evaluate(
            dataset=eval_dataset,
            metrics=[Faithfulness(), ResponseRelevancy()],
            llm=llm
        )
        return results.to_pandas().to_dict()
    except Exception as e:
        print(f"Offline evaluation error: {e}")
        return {}

def create_monitored_chain(base_chain, feedbacks: list):
    """Wrap chain with TruLens monitoring."""
    return TruChain(
        base_chain,
        app_name="production_rag",
        app_version="v1",
        feedbacks=feedbacks
    )

# Example usage
test_set = [
    {"question": "What is RAG?", "answer": "RAG retrieves documents...", "ground_truth": "RAG is..."},
    {"question": "How does vector search work?", "answer": "Vector search...", "ground_truth": "Vector search..."}
]

results = evaluate_rag_offline(test_set, retriever, llm)
monitored_chain = create_monitored_chain(rag_chain, [f_groundedness, f_relevance])

Getting Started

  1. Install dependencies with pinned versions: pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 langchain-openai==0.2.0

  2. Prepare evaluation dataset (minimum 50 samples) with questions, retrieved contexts, and generated answers using SingleTurnSample objects

  3. Run Ragas evaluation for baseline metrics; raise RunConfig concurrency for datasets > 100 samples

  4. Build LCEL chain with create_retrieval_chain (not deprecated RetrievalQA)

  5. Wrap production chain with TruChain recorder for ongoing monitoring

  6. Set alert thresholds: groundedness < 0.7, context precision < 0.6

  7. Monitor costs: GPT-4o-mini ~$36 per 100 samples; GPT-4o ~$600 per 100 samples

  8. Iterate on retrieval and generation based on metric patterns
