LLM Evaluation & Testing
Compare 24 LLM evaluation & testing tools to find the right one for your needs
🔧 Tools
Langfuse
An open-source platform for tracing, debugging, and evaluating LLM applications, helping teams build better products faster.
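Tracing with Langfuse's Python SDK is commonly done through a decorator; the minimal sketch below assumes the v2-style `langfuse.decorators` imports and Langfuse credentials in the environment (LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY).

```python
# Minimal tracing sketch, assuming Langfuse's v2-style Python SDK.
# Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY env vars.
from langfuse.decorators import observe, langfuse_context


@observe()  # recorded as a nested span inside the current trace
def retrieve_context(query: str) -> str:
    return "...retrieved documents..."


@observe()  # the outermost decorated call becomes the trace root
def answer(query: str) -> str:
    context = retrieve_context(query)
    return f"Answer based on: {context}"


if __name__ == "__main__":
    print(answer("What does Langfuse do?"))
    langfuse_context.flush()  # send buffered events before the process exits
```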
Portkey
An AI gateway that provides observability, routing, caching, and security for LLM applications, helping teams ship reliable products faster.
Deepchecks
An open-source and enterprise platform for testing and validating machine learning models and data, with a focus on LLM applications.
Weights & Biases
A platform for tracking experiments, versioning data, and managing models, with growing support for LLM evaluation and observability.
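Experiment tracking with the `wandb` client follows an init/log/finish pattern; in this sketch the project name, config values, and metrics are illustrative placeholders.

```python
# Minimal experiment-tracking sketch with the wandb client.
# Project name, config, and metric values are illustrative placeholders.
import wandb

run = wandb.init(
    project="llm-eval-demo",
    config={"model": "gpt-4o-mini", "temperature": 0.2},
)

for step in range(3):
    # In practice these numbers would come from your evaluation harness.
    wandb.log({"accuracy": 0.70 + 0.05 * step, "avg_latency_s": 1.2 - 0.1 * step}, step=step)

run.finish()
```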
Helicone
An open-source platform for monitoring LLM usage, managing costs, and improving performance through a simple proxy integration.
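The proxy integration amounts to pointing an existing OpenAI client at Helicone's gateway and passing your Helicone key as a header; a sketch assuming the documented `oai.helicone.ai` endpoint and API keys in the environment.

```python
# Routing OpenAI traffic through Helicone's proxy for usage and cost tracking.
# Assumes OPENAI_API_KEY and HELICONE_API_KEY are set in the environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # Helicone proxy instead of api.openai.com
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```

Because only the base URL and a header change, existing application code keeps working while requests are logged in Helicone.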
Braintrust
An enterprise-grade platform for evaluating and monitoring LLM applications, helping teams build reliable AI products.
Arize AI
An end-to-end platform for ML observability and evaluation, helping teams monitor, troubleshoot, and improve AI models in production.
LangSmith
A platform from the creators of LangChain for debugging, testing, evaluating, and monitoring LLM applications.
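Tracing is typically switched on with environment variables plus the `traceable` decorator from the `langsmith` package; a hedged sketch (the API key value is a placeholder, and older SDK versions use the LANGCHAIN_* variable names).

```python
# Minimal LangSmith tracing sketch: env vars enable tracing, @traceable records
# the decorated function as a run. The API key value is a placeholder.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"

from langsmith import traceable


@traceable  # each call appears as a traced run in the LangSmith project
def summarize(text: str) -> str:
    return text[:50] + "..."


print(summarize("LangSmith traces, evaluates, and monitors LLM applications."))
```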
Comet ML
An MLOps platform for experiment tracking, model management, and LLM observability, helping teams build and deploy AI faster.
Galileo
An enterprise-grade platform for evaluating, monitoring, and optimizing LLM applications, with a focus on production readiness.
WhyLabs
An AI observability platform that helps prevent AI failures by monitoring data pipelines and machine learning models in production.
Fiddler AI
A platform for ML model monitoring, explainable AI, and fairness analysis, helping organizations build responsible and trustworthy AI.
MLflow
An open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
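The core tracking API is a handful of calls inside a run context; a minimal sketch where the experiment name, parameters, and metrics are made up for illustration.

```python
# Minimal MLflow tracking sketch: log parameters and metrics inside a run.
# Results land in the local ./mlruns directory unless a tracking server is configured.
import mlflow

mlflow.set_experiment("llm-eval-demo")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("prompt_version", "v3")
    mlflow.log_metric("answer_accuracy", 0.82)
    mlflow.log_metric("avg_latency_s", 1.4)
```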
CalypsoAI
An enterprise platform for securing and managing the use of large language models, ensuring safe and compliant adoption of AI.
Datadog LLM Observability
An extension of the Datadog platform providing end-to-end visibility into LLM applications, from infrastructure to model performance.
UpTrain
An open-source framework for evaluating and improving LLM applications by providing pre-built checks and refinement capabilities.
RAGAs
An open-source framework specialized in evaluating Retrieval-Augmented Generation (RAG) pipelines by scoring their individual components, such as retrieval and generation.
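Evaluation takes a dataset of questions, retrieved contexts, answers, and ground truths and scores each component-level metric; the sketch below assumes a ragas 0.1-style API, and the sample row is illustrative. Scoring uses an LLM judge, so an OpenAI key is expected by default.

```python
# Component-wise RAG evaluation sketch, assuming a ragas 0.1-style API.
# The sample row is illustrative; real usage passes your pipeline's outputs.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

data = Dataset.from_dict({
    "question": ["What is Langfuse?"],
    "contexts": [["Langfuse is an open-source LLM observability platform."]],
    "answer": ["Langfuse is an open-source platform for tracing LLM apps."],
    "ground_truth": ["Langfuse is an open-source LLM observability platform."],
})

# faithfulness and answer_relevancy score the generator;
# context_precision scores the retriever.
result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```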
DeepEval
An open-source Python framework for unit testing LLM applications, offering a wide range of research-backed evaluation metrics.
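Tests are written pytest-style around `LLMTestCase` objects and metric thresholds; a hedged sketch using one of DeepEval's LLM-as-a-judge metrics (an OpenAI key is expected by default, and the test input and output are illustrative).

```python
# Pytest-style unit test sketch for an LLM output using DeepEval.
# The input and actual_output values are illustrative placeholders.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does Helicone do?",
        # In a real test this would be your application's actual response.
        actual_output="Helicone monitors LLM usage and cost via a proxy.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Such tests are normally run through DeepEval's pytest integration (for example, `deepeval test run`).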
Lakera Guard
A developer-first security tool to protect LLM applications from prompt injections, data leakage, and other security threats.
Garak
An open-source tool for red teaming and vulnerability scanning of large language models, designed to find security and ethical weaknesses.
TruLens
An open-source tool for evaluating and tracking LLM-based applications, with a focus on RAG and agent evaluation.
OpenAI Evals
An open-source framework by OpenAI for creating and running evaluations to benchmark the performance of large language models.
Guardrails AI
An open-source framework for ensuring the reliability of LLM applications by specifying and enforcing structure and type guarantees.
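Structure and type guarantees are typically expressed as a Pydantic model that a `Guard` validates LLM output against; the sketch below assumes the `Guard.from_pydantic` constructor from recent Guardrails releases, and the model and raw output are illustrative.

```python
# Structured-output validation sketch, assuming Guardrails' Guard.from_pydantic API.
# The Pydantic model and the raw LLM output below are illustrative.
from guardrails import Guard
from pydantic import BaseModel, Field


class TicketTriage(BaseModel):
    category: str = Field(description="One of: billing, bug, feature_request")
    urgency: int = Field(description="1 (low) to 5 (critical)")


guard = Guard.from_pydantic(output_class=TicketTriage)

# Validate a raw model response against the declared structure and types.
raw_llm_output = '{"category": "bug", "urgency": 4}'
outcome = guard.parse(raw_llm_output)
print(outcome.validated_output)
```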
Lasso Security
A comprehensive security platform for generative AI, providing visibility, data protection, and governance for LLM usage.