LLM Evaluation & Testing
Compare 24 LLM evaluation & testing tools to find the right one for your needs
🔧 Tools
Langfuse
An open-source platform for tracing, debugging, and evaluating LLM applications, helping teams build better products faster.
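Tracing with Langfuse's Python SDK is commonly done through a decorator; the minimal sketch below assumes the v2-style `langfuse.decorators` imports and Langfuse credentials in the environment (LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY).

```python
# Minimal tracing sketch, assuming Langfuse's v2-style Python SDK.
# Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY env vars.
from langfuse.decorators import observe, langfuse_context


@observe()  # recorded as a nested span inside the current trace
def retrieve_context(query: str) -> str:
    return "...retrieved documents..."


@observe()  # the outermost decorated call becomes the trace root
def answer(query: str) -> str:
    context = retrieve_context(query)
    return f"Answer based on: {context}"


if __name__ == "__main__":
    print(answer("What does Langfuse do?"))
    langfuse_context.flush()  # send buffered events before the process exits
```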
Portkey
An AI gateway that provides observability, routing, caching, and security for LLM applications, helping teams ship reliable products faster.
Deepchecks
An open-source and enterprise platform for testing and validating machine learning models and data, with a focus on LLM applications.
Weights & Biases
A platform for tracking experiments, versioning data, and managing models, with growing support for LLM evaluation and observability.
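Experiment tracking with the `wandb` client follows an init/log/finish pattern; in this sketch the project name, config values, and metrics are illustrative placeholders.

```python
# Minimal experiment-tracking sketch with the wandb client.
# Project name, config, and metric values are illustrative placeholders.
import wandb

run = wandb.init(
    project="llm-eval-demo",
    config={"model": "gpt-4o-mini", "temperature": 0.2},
)

for step in range(3):
    # In practice these numbers would come from your evaluation harness.
    wandb.log({"accuracy": 0.70 + 0.05 * step, "avg_latency_s": 1.2 - 0.1 * step}, step=step)

run.finish()
```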
Helicone
An open-source platform for monitoring LLM usage, managing costs, and improving performance through a simple proxy integration.
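The proxy integration amounts to pointing an existing OpenAI client at Helicone's gateway and passing your Helicone key as a header; a sketch assuming the documented `oai.helicone.ai` endpoint and API keys in the environment.

```python
# Routing OpenAI traffic through Helicone's proxy for usage and cost tracking.
# Assumes OPENAI_API_KEY and HELICONE_API_KEY are set in the environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # Helicone proxy instead of api.openai.com
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```

Because only the base URL and a header change, existing application code keeps working while requests are logged in Helicone.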
Braintrust
An enterprise-grade platform for evaluating and monitoring LLM applications, helping teams build reliable AI products.
Arize AI
An end-to-end platform for ML observability and evaluation, helping teams monitor, troubleshoot, and improve AI models in production.
LangSmith
A platform from the creators of LangChain for debugging, testing, evaluating, and monitoring LLM applications.
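Tracing is typically switched on with environment variables plus the `traceable` decorator from the `langsmith` package; a hedged sketch (the API key value is a placeholder, and older SDK versions use the LANGCHAIN_* variable names).

```python
# Minimal LangSmith tracing sketch: env vars enable tracing, @traceable records
# the decorated function as a run. The API key value is a placeholder.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"

from langsmith import traceable


@traceable  # each call appears as a traced run in the LangSmith project
def summarize(text: str) -> str:
    return text[:50] + "..."


print(summarize("LangSmith traces, evaluates, and monitors LLM applications."))
```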
Comet ML
An MLOps platform for experiment tracking, model management, and LLM observability, helping teams build and deploy AI faster.
Galileo
An enterprise-grade platform for evaluating, monitoring, and optimizing LLM applications, with a focus on production readiness.
WhyLabs
An AI observability platform that helps prevent AI failures by monitoring data pipelines and machine learning models in production.
Fiddler AI
A platform for ML model monitoring, explainable AI, and fairness analysis, helping organizations build responsible and trustworthy AI.
MLflow
An open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
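The core tracking API is a handful of calls inside a run context; a minimal sketch where the experiment name, parameters, and metrics are made up for illustration.

```python
# Minimal MLflow tracking sketch: log parameters and metrics inside a run.
# Results land in the local ./mlruns directory unless a tracking server is configured.
import mlflow

mlflow.set_experiment("llm-eval-demo")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("prompt_version", "v3")
    mlflow.log_metric("answer_accuracy", 0.82)
    mlflow.log_metric("avg_latency_s", 1.4)
```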
CalypsoAI
An enterprise platform for securing and managing the use of large language models, ensuring safe and compliant adoption of AI.
Datadog LLM Observability
An extension of the Datadog platform providing end-to-end visibility into LLM applications, from infrastructure to model performance.
UpTrain
An open-source framework for evaluating and improving LLM applications by providing pre-built checks and refinement capabilities.
RAGAs
An open-source framework specialized in evaluating Retrieval-Augmented Generation (RAG) pipelines by scoring their individual components, such as retrieval and generation.
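Evaluation takes a dataset of questions, retrieved contexts, answers, and ground truths and scores each component-level metric; the sketch below assumes a ragas 0.1-style API, and the sample row is illustrative. Scoring uses an LLM judge, so an OpenAI key is expected by default.

```python
# Component-wise RAG evaluation sketch, assuming a ragas 0.1-style API.
# The sample row is illustrative; real usage passes your pipeline's outputs.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

data = Dataset.from_dict({
    "question": ["What is Langfuse?"],
    "contexts": [["Langfuse is an open-source LLM observability platform."]],
    "answer": ["Langfuse is an open-source platform for tracing LLM apps."],
    "ground_truth": ["Langfuse is an open-source LLM observability platform."],
})

# faithfulness and answer_relevancy score the generator;
# context_precision scores the retriever.
result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```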
DeepEval
An open-source Python framework for unit testing LLM applications, offering a wide range of research-backed evaluation metrics.
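Tests are written pytest-style around `LLMTestCase` objects and metric thresholds; a hedged sketch using one of DeepEval's LLM-as-a-judge metrics (an OpenAI key is expected by default, and the test input and output are illustrative).

```python
# Pytest-style unit test sketch for an LLM output using DeepEval.
# The input and actual_output values are illustrative placeholders.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does Helicone do?",
        # In a real test this would be your application's actual response.
        actual_output="Helicone monitors LLM usage and cost via a proxy.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Such tests are normally run through DeepEval's pytest integration (for example, `deepeval test run`).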
Lakera Guard
A developer-first security tool to protect LLM applications from prompt injections, data leakage, and other security threats.
Garak
An open-source tool for red teaming and vulnerability scanning of large language models, designed to find security and ethical weaknesses.
TruLens
An open-source tool for evaluating and tracking LLM-based applications, with a focus on RAG and agent evaluation.
OpenAI Evals
An open-source framework by OpenAI for creating and running evaluations to benchmark the performance of large language models.
Guardrails AI
An open-source framework for ensuring the reliability of LLM applications by specifying and enforcing structure and type guarantees.
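Structure and type guarantees are typically expressed as a Pydantic model that a `Guard` validates LLM output against; the sketch below assumes the `Guard.from_pydantic` constructor from recent Guardrails releases, and the model and raw output are illustrative.

```python
# Structured-output validation sketch, assuming Guardrails' Guard.from_pydantic API.
# The Pydantic model and the raw LLM output below are illustrative.
from guardrails import Guard
from pydantic import BaseModel, Field


class TicketTriage(BaseModel):
    category: str = Field(description="One of: billing, bug, feature_request")
    urgency: int = Field(description="1 (low) to 5 (critical)")


guard = Guard.from_pydantic(output_class=TicketTriage)

# Validate a raw model response against the declared structure and types.
raw_llm_output = '{"category": "bug", "urgency": 4}'
outcome = guard.parse(raw_llm_output)
print(outcome.validated_output)
```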
Lasso Security
A comprehensive security platform for generative AI, providing visibility, data protection, and governance for LLM usage.