RAG Evaluation Pipeline: Implementing Ragas and TruLens for LLM Output Quality Metrics
Evaluating RAG systems requires measuring retrieval and generation quality without ground truth labels. Ragas and TruLens provide complementary approaches: Ragas offers model-based batch evaluation metrics, while TruLens provides real-time feedback functions with observability tracking.
Prerequisites and Version Pinning
Both libraries have undergone breaking API changes. Pin versions strictly for reproducibility:
```bash
pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 trulens-providers-openai==1.0.0 langchain-openai==0.2.0 langchain-chroma==0.1.4
```
The RAG Triad Framework
Both tools converge on three core metrics:
| Metric | Definition | Evaluates |
|---|---|---|
| Context Relevance | Retrieved context's relevance to the query | Retrieval quality |
| Groundedness/Faithfulness | Answer's factual alignment with retrieved context | Hallucination detection |
| Answer Relevance | Response's relevance to the original question | Generation quality |
Ragas Implementation (v0.2+)
Ragas uses LLM-as-a-judge to compute metrics on evaluation datasets.
Dataset Structure
Ragas v0.2+ uses SingleTurnSample objects wrapped in EvaluationDataset:
```python
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import Faithfulness, ResponseRelevancy, ContextPrecision

# Create individual samples
samples = [
    SingleTurnSample(
        user_input="What is RAG?",
        retrieved_contexts=["Retrieval-Augmented Generation combines retrieval with generation..."],
        response="RAG is a technique that retrieves documents to augment LLM context.",
        reference="RAG retrieves documents to augment LLM context."
    ),
    SingleTurnSample(
        user_input="How does vector search work?",
        retrieved_contexts=["Vector search uses embeddings to find similar documents..."],
        response="Vector search finds similar documents using embedding similarity.",
        reference="Vector search compares embedding similarity."
    )
]

# Build evaluation dataset
eval_dataset = EvaluationDataset(samples=samples)
```
Running Evaluation
```python
from ragas import evaluate
from langchain_openai import ChatOpenAI

# Configure evaluator LLM
evaluator_llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Run evaluation with error handling
try:
    results = evaluate(
        dataset=eval_dataset,
        metrics=[Faithfulness(), ResponseRelevancy(), ContextPrecision()],
        llm=evaluator_llm
    )
    print(results.to_pandas())
except Exception as e:
    print(f"Evaluation failed: {e}")
    # Consider fallback or retry logic
```
Tuning Concurrency for Large Datasets
`evaluate()` is synchronous: Ragas already dispatches the underlying judge-LLM calls asynchronously internally, so awaiting `evaluate()` directly does not work. For large datasets, raise the concurrency limit through `RunConfig` instead:
```python
from ragas import evaluate
from ragas.run_config import RunConfig

# Allow more concurrent judge-LLM calls; tune against provider rate limits
run_config = RunConfig(max_workers=16, timeout=120)

results = evaluate(
    dataset=eval_dataset,
    metrics=[Faithfulness(), ResponseRelevancy()],
    llm=evaluator_llm,
    run_config=run_config
)
```
Higher concurrency can substantially shorten wall-clock time on datasets larger than roughly 100 samples, at the cost of a greater chance of hitting provider rate limits.
Key Ragas Metrics
| Metric | Description | Ground Truth Required |
|---|---|---|
| Faithfulness | Fraction of claims in the answer that can be inferred from the retrieved context | No |
| Response Relevancy | Scores how well the answer addresses the question | No |
| Context Precision | Signal-to-noise ratio: are relevant chunks ranked higher? | No |
| Context Recall | Completeness of retrieved context vs reference | Yes |
Context Precision vs Context Relevancy
These metrics measure different aspects:
- Context Precision: evaluates whether relevant contexts appear at the top of the retrieval list. Computed as the precision at each relevant chunk's rank, averaged over the relevant chunks. High scores indicate good ranking.
- Context Relevancy (deprecated in v0.2+): previously measured an overall relevance percentage. Use Context Precision in current versions.
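To build intuition for the averaged-precision formula, here is a toy re-implementation over a boolean relevance mask. This is an illustrative sketch, not Ragas's internal code; in practice the per-chunk relevance judgments come from an LLM judge:

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the positions that hold relevant chunks.

    relevance[i] is True if the chunk at rank i (0-based) is relevant
    to the query.
    """
    if not any(relevance):
        return 0.0
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # precision@k at each relevant position
    return sum(precisions) / len(precisions)

# Relevant chunks ranked first score higher than the same chunks ranked last
print(context_precision([True, True, False]))   # 1.0
print(context_precision([False, False, True]))  # ~0.33
```

The same set of retrieved chunks scores differently depending on ordering, which is exactly the ranking sensitivity the metric is designed to capture.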
TruLens Implementation (v1.0+)
TruLens provides feedback functions with built-in tracking and a dashboard.
Feedback Functions Setup
Define one feedback function per leg of the RAG triad:
```python
import numpy as np

from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

provider = OpenAI(model_engine="gpt-4o")

# Context relevance: query vs. each retrieved chunk, averaged
f_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruChain.select_context())
    .aggregate(np.mean)
)

# Groundedness: answer claims vs. the collected retrieved context
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruChain.select_context().collect())
    .on_output()
)

# Answer relevance: query vs. answer
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()
    .on_output()
)
```
Recording RAG Application (LCEL with LangChain 0.2+)
RetrievalQA is deprecated in LangChain 0.2+. Use LCEL with create_retrieval_chain:
```python
from trulens.apps.langchain import TruChain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain (replaces deprecated RetrievalQA)
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context:

{context}

Question: {input}
""")
document_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, document_chain)

# Wrap with TruLens (v1.0+ uses app_name/app_version instead of app_id)
try:
    tru_recorder = TruChain(
        chain,
        app_name="rag_pipeline",
        app_version="v1",
        feedbacks=[f_relevance, f_groundedness, f_answer_relevance]
    )
    # Record a query
    with tru_recorder as recording:
        response = chain.invoke({"input": "What is RAG?"})
        print(response["answer"])
except Exception as e:
    print(f"TruLens recording failed: {e}")
```
Launching the Dashboard
```python
from trulens.dashboard import run_dashboard

# Launch dashboard on default port (requires TruLens 1.0+)
run_dashboard()

# For custom port or remote access:
# run_dashboard(port=8501, force=True)
```
Custom Feedback Functions
For production groundedness evaluation, use a cross-encoder NLI model:
```python
import re

import numpy as np
from sentence_transformers import CrossEncoder
from trulens.core import Feedback
from trulens.apps.langchain import TruChain

# Load NLI model for entailment detection
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def extract_claims(text: str) -> list:
    """Extract factual claims from text using naive sentence splitting."""
    sentences = re.split(r"[.!?]+", text)
    return [s.strip() for s in sentences if s.strip()]

def verify_claim_with_nli(claim: str, context: str) -> float:
    """
    Use the NLI cross-encoder to verify a claim against context.
    Returns the entailment probability (0-1). The model emits raw
    logits in label order [contradiction, entailment, neutral],
    so apply softmax before indexing.
    """
    logits = nli_model.predict([(context, claim)])[0]
    probs = np.exp(logits) / np.sum(np.exp(logits))
    entailment_idx = 1  # label order: contradiction, entailment, neutral
    return float(probs[entailment_idx])

def production_groundedness(context: str, answer: str) -> float:
    """
    Production-ready groundedness using NLI entailment.
    Returns the average entailment score across all claims.
    """
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    entailment_scores = [
        verify_claim_with_nli(claim, context)
        for claim in claims
    ]
    return sum(entailment_scores) / len(entailment_scores)

# Register as a TruLens feedback function
f_custom_groundedness = (
    Feedback(production_groundedness)
    .on(TruChain.select_context())
    .on_output()
)
```
Evaluation Dataset Sizing
Statistical significance requires sufficient sample sizes:
| Dataset Size | Statistical Confidence | Use Case |
|---|---|---|
| < 30 samples | Low (high variance) | Quick smoke tests only |
| 50-100 samples | Moderate | Development iteration |
| 100-200 samples | Good | Pre-production validation |
| 500+ samples | High | Production benchmarking, CI/CD |
Recommendations:
- Minimum 50 samples for meaningful metric averages
- 100+ samples to detect roughly 5% metric shifts with reasonable statistical power (a rule of thumb; actual power depends on score variance)
- Stratify samples across query types and difficulty levels
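Stratification can be as simple as proportional sampling per query type. The sketch below assumes each sample dict carries a `query_type` label such as "factual" or "multi-hop"; that field is a labeling convention for this example, not a Ragas or TruLens API:

```python
import random
from collections import defaultdict

def stratified_sample(samples: list, size: int,
                      key: str = "query_type", seed: int = 42) -> list:
    """Draw an evaluation set with proportional representation per stratum.

    Each stratum gets at least one sample so rare query types are
    never silently dropped.
    """
    rng = random.Random(seed)  # fixed seed for reproducible eval sets
    strata = defaultdict(list)
    for s in samples:
        strata[s[key]].append(s)
    picked = []
    for group in strata.values():
        n = max(1, round(size * len(group) / len(samples)))  # proportional share
        picked.extend(rng.sample(group, min(n, len(group))))
    return picked[:size]
```

For a pool of 90 factual and 10 multi-hop questions, a 10-sample draw yields roughly 9 factual and 1 multi-hop question, preserving the pool's distribution in the evaluation set.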
Metric Score Interpretation
| Score Range | Interpretation | Action |
|---|---|---|
| 0.8 - 1.0 | Excellent | Production ready |
| 0.6 - 0.8 | Acceptable | Monitor for degradation |
| 0.4 - 0.6 | Poor | Investigate retrieval or generation |
| < 0.4 | Critical | Block deployment, debug immediately |
Recommended thresholds for alerting:
- Groundedness < 0.7: Potential hallucination risk
- Context Precision < 0.6: Retrieval needs improvement
- Answer Relevance < 0.7: Generation misaligned with query
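These thresholds are straightforward to wire into a CI gate or monitoring job. The helper below is a minimal sketch using the threshold values listed above; metric names and score dicts are assumptions about your pipeline's output shape:

```python
THRESHOLDS = {
    "groundedness": 0.70,
    "context_precision": 0.60,
    "answer_relevance": 0.70,
}

def check_alerts(scores: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return a human-readable alert for each metric below its threshold."""
    return [
        f"{metric}: {scores[metric]:.2f} < {minimum:.2f}"
        for metric, minimum in thresholds.items()
        if metric in scores and scores[metric] < minimum
    ]

# Only groundedness trips its 0.70 threshold here
alerts = check_alerts({"groundedness": 0.62,
                       "context_precision": 0.81,
                       "answer_relevance": 0.75})
```

In CI, a non-empty alert list can fail the build; in production, route it to your paging or logging system instead.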
Evaluation Cost Considerations
LLM-as-a-judge evaluation incurs API costs per metric per sample:
| Metric | LLM Calls per Sample | Approximate Tokens |
|---|---|---|
| Faithfulness | 2-3 | 500-1500 |
| Response Relevancy | 1 | 200-500 |
| Context Precision | 1 per context chunk | 300-800 |
Cost estimation formula: samples × metrics × avg_tokens_per_metric × price_per_token
Example calculation (GPT-4o-mini at $0.15 per 1M input tokens):
- 100 samples × 3 metrics × 800 avg tokens = 240,000 tokens
- 240,000 × $0.15 / 1,000,000 ≈ $0.04 USD
For GPT-4o at $2.50 per 1M input tokens: the same workload ≈ $0.60 USD. Output tokens (the judge's reasoning and verdicts) add to this, so treat these figures as lower bounds.
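The formula translates directly into a small helper. Prices are illustrative and change over time, so verify current rates before budgeting:

```python
def estimate_eval_cost(
    samples: int,
    metrics: int,
    avg_tokens_per_metric: int,
    price_per_million_tokens: float,
) -> float:
    """Rough input-token cost of an LLM-as-a-judge evaluation run, in USD."""
    total_tokens = samples * metrics * avg_tokens_per_metric
    return total_tokens * price_per_million_tokens / 1_000_000

# 100 samples, 3 metrics, 800 tokens each, GPT-4o-mini input pricing
print(round(estimate_eval_cost(100, 3, 800, 0.15), 3))  # 0.036
```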
Comparative Analysis
| Aspect | Ragas | TruLens |
|---|---|---|
| Primary Use | Batch evaluation, CI/CD pipelines | Real-time observability, production monitoring |
| Output | Metric scores per sample | Dashboard with traces and feedback |
| Integration | Post-hoc evaluation | Inline recording during inference |
| Ground Truth | Optional (required for Context Recall) | Not required |
| Best For | Dataset benchmarking, model comparison | Production monitoring, debugging |
Implementation Pipeline
Combined Architecture
```python
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# Build LCEL chain
prompt = ChatPromptTemplate.from_template("Answer: {context}\n\nQuestion: {input}")
document_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, document_chain)

# Define TruLens feedback functions
provider = OpenAI(model_engine="gpt-4o")

f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(TruChain.select_context().collect())
    .on_output()
)
f_relevance = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input_output()
)

def build_ragas_dataset(qa_pairs: list, retriever) -> EvaluationDataset:
    """Build a Ragas EvaluationDataset from QA pairs with retrieved contexts."""
    samples = []
    for pair in qa_pairs:
        docs = retriever.invoke(pair["question"])
        contexts = [doc.page_content for doc in docs]
        samples.append(SingleTurnSample(
            user_input=pair["question"],
            retrieved_contexts=contexts,
            response=pair["answer"],
            reference=pair.get("ground_truth")
        ))
    return EvaluationDataset(samples=samples)

def evaluate_rag_offline(qa_pairs: list, retriever, llm) -> dict:
    """Run offline evaluation with Ragas."""
    eval_dataset = build_ragas_dataset(qa_pairs, retriever)
    try:
        results = evaluate(
            dataset=eval_dataset,
            metrics=[Faithfulness(), ResponseRelevancy()],
            llm=llm
        )
        return results.to_pandas().to_dict()
    except Exception as e:
        print(f"Offline evaluation error: {e}")
        return {}

def create_monitored_chain(base_chain, feedbacks: list):
    """Wrap a chain with TruLens monitoring."""
    return TruChain(
        base_chain,
        app_name="production_rag",
        app_version="v1",
        feedbacks=feedbacks
    )

# Example usage
test_set = [
    {"question": "What is RAG?", "answer": "RAG retrieves documents...", "ground_truth": "RAG is..."},
    {"question": "How does vector search work?", "answer": "Vector search...", "ground_truth": "Vector search..."}
]
results = evaluate_rag_offline(test_set, retriever, llm)
monitored_chain = create_monitored_chain(rag_chain, [f_groundedness, f_relevance])
```
Getting Started
- Install dependencies with pinned versions: `pip install ragas==0.2.0 trulens-apps-langchain==1.0.0 langchain-openai==0.2.0`
- Prepare an evaluation dataset (minimum 50 samples) with questions, retrieved contexts, and generated answers using `SingleTurnSample` objects
- Run Ragas evaluation for baseline metrics; raise evaluation concurrency for datasets > 100 samples
- Build the LCEL chain with `create_retrieval_chain` (not the deprecated `RetrievalQA`)
- Wrap the production chain with a TruChain recorder for ongoing monitoring
- Set alert thresholds: groundedness < 0.7, context precision < 0.6
- Monitor costs: for 100 samples, roughly $0.04 in input tokens with GPT-4o-mini and $0.60 with GPT-4o
- Iterate on retrieval and generation based on metric patterns