Testing AI: How to Effectively Evaluate LLMs

Testing AI: How to Effectively Evaluate LLMs

Richard Brown

2 March 2026 - 15 min read

AITesting
Testing AI: How to Effectively Evaluate LLMs

Traditional software testing rests on a basic assumption that given the same input, the system produces the same output. A test case defines expected behaviour, and a test passes or fails based on whether the output matches. This assumption – deterministic behaviour with verifiable correctness – is the foundation on which decades of quality assurance practices have been built.

However, this can break down with large language models. An LLM may produce a different response to the same prompt on successive runs. Its outputs are sensitive to context, prompt phrasing, temperature settings and the interaction between retrieved documents and parametric knowledge. It can produce responses that are fluent, confident and completely wrong - a failure mode that traditional testing has no framework for detecting. And unlike a conventional software bug, which typically manifests consistently and can be reproduced, AI system failures are often probabilistic, context-dependent and difficult to predict.

For engineering leaders, this creates a new problem. Organisations are deploying LLM-powered features at pace, such as customer-facing chatbots, internal knowledge assistants, AI-augmented search, automated document processing, coding assistants and increasingly autonomous agentic workflows. However, the testing and evaluation practices for these systems are struggling to keep up.

The World Quality Report 2025, surveying over 2,000 senior executives across 22 countries, found that hallucination and reliability concerns are now among the top barriers to generative AI adoption in quality engineering, cited by 60% of respondents - a challenge that barely registered two years ago.

This article looks at what testing looks like for AI systems, why it is fundamentally different from traditional software testing, and how organisations can build the evaluation capability required to deploy LLMs responsibly.

Why Traditional Testing Fails for AI Systems

The differences between testing traditional software and testing AI systems are not differences of degree but of kind.

In conventional software, correctness is binary. A function either returns the right value or it does not. Test cases can enumerate expected input-output pairs, and 100% pass rates are achievable and expected. The system under test is deterministic - run the same test twice, get the same result. And when a test fails, the failure is reproducible, allowing engineers to diagnose and fix the root cause.

Little of these properties hold for LLM-powered systems. There is no single "correct" response to most natural language queries. A question about company policy might have multiple valid phrasings, levels of detail and degrees of nuance. The system is non-deterministic by design (temperature and sampling parameters introduce controlled randomness). And failures, such as hallucinations, reasoning errors, safety violations and biased outputs, may occur intermittently, triggered by specific combinations of context, phrasing and retrieved information that are difficult to anticipate or reproduce.

This means testing AI systems is an evaluation discipline rather than a verification discipline. Instead of asking "does this pass or fail?", organisations must ask "how well does this system perform across a range of scenarios, and is the distribution of performance acceptable for our use case?" This requires statistical thinking, domain-specific quality criteria and continuous evaluation rather than one-off test suites.

The Hallucination Problem: Scale and Consequences

Hallucination - where an LLM generates content that is fluent and confident but factually incorrect or unsupported by source material - is the most visible failure mode and the one that most concerns enterprise adopters.

Vectara's Hallucination Leaderboard, which benchmarks LLMs for factual consistency in summarisation tasks, found that even frontier reasoning models, including GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek-R1, all exhibited hallucination rates exceeding 10% on their updated, more challenging benchmark. The recently released Gemini-3-pro demonstrated a 13.6% hallucination rate and did not make the top-25 list.

These are the best available systems, evaluated on a straightforward summarisation task, not adversarial conditions or edge cases.

The academic community is also grappling with how to define and categorise hallucinations consistently. The HalluLens benchmark, presented at ACL 2025, identified a fundamental challenge in existing benchmarks often conflating hallucination with factuality, despite these being distinct problems requiring different evaluation approaches. HalluLens proposes a taxonomy distinguishing between extrinsic hallucinations (where generated content deviates from or contradicts source material the model had access to) and intrinsic hallucinations (where the model contradicts its own earlier outputs). This distinction matters for enterprise applications because the mitigation strategies differ, with extrinsic hallucination being a retrieval and grounding problem, while intrinsic hallucination is a consistency and reasoning problem.

The real-world consequences of inadequate hallucination testing are already visible and increasingly costly.

  • Air Canada lost a legal case after its chatbot fabricated a bereavement discount policy that did not exist – the airline was held liable for the AI's invention.
  • New York City's public-facing chatbot provided illegal advice to business owners about regulatory requirements.
  • And a GPTZero analysis of over 4,000 papers accepted at NeurIPS 2025 found that dozens contained fabricated AI-generated citations – invented authors, titles and journals that passed peer review undetected.

These incidents share a common root cause in systems being deployed without adequate evaluation of their failure modes under realistic conditions.

What LLM Evaluation Looks Like

Practitioners are converging on a multi-dimensional evaluation approach that moves well beyond traditional pass/fail testing. The emerging consensus spans at least seven dimensions: accuracy, safety, bias, hallucination, robustness, latency and security. Each requires different evaluation methods, and the relative importance of each dimension varies by use case – a customer service chatbot has different critical dimensions than a code generation tool or a medical information system.

Benchmark suites

Benchmark suites are the most familiar evaluation approach, adapted from academic AI research. Standardised benchmarks test model capabilities across reasoning, knowledge, coding and other dimensions. However, generic benchmarks have significant limitations for enterprise use. Many models now saturate standard benchmarks like MMLU (exceeding 90% accuracy), which has driven the development of harder alternatives. More fundamentally, a model's score on a general benchmark tells you little about how it will perform on your specific domain, data and use cases. Organisations deploying LLMs need domain-specific evaluation datasets that reflect the actual questions their users ask, the documents their RAG systems retrieve, and the edge cases their particular deployment will encounter.

LLM-as-judge approaches

LLM-as-judge approaches use one language model to evaluate the outputs of another. This approach is both practical and scalable, allowing automated evaluation of thousands of responses without human reviewers, with tools like DeepEval and RAGAS making this accessible. But the approach does have an inherent risk. If both the generating model and the evaluating model are prone to hallucination, they may reinforce each other's errors, creating what researchers describe as a "hallucination echo chamber." Effective LLM-as-judge implementations mitigate this through multi-model consensus (using several different models as judges), structured evaluation rubrics that constrain the judge's assessment to specific, verifiable dimensions and periodic calibration against human judgement.

Red-teaming and adversarial testing

Red-teaming and adversarial testing deliberately probe the system for failure modes. This includes testing for prompt injection (where adversarial inputs manipulate the model's behaviour), safety violations (where the model produces harmful or inappropriate content), and edge cases where the model's confidence exceeds its accuracy. Red-teaming is particularly important for customer-facing AI systems, where an adversarial user may deliberately attempt to exploit the system. The EU AI Act explicitly requires adversarial testing for general-purpose AI models, making this a compliance requirement rather than a best practice.

Human evaluation

Human evaluation remains essential for high-stakes use cases. Automated metrics cannot fully capture whether a response is genuinely helpful, appropriately nuanced, or safe in context. Human evaluation is expensive and slow, which makes it impractical for comprehensive testing, but it serves a critical role in calibrating automated evaluation systems and validating performance on the most important and sensitive scenarios.

Continuous evaluation in production

Continuous evaluation in production closes the loop. Unlike traditional software where testing occurs before deployment, AI systems require ongoing monitoring because their performance depends on inputs that cannot be fully anticipated. This includes tracking hallucination rates on real user queries, monitoring for distribution shift (where the types of questions users ask diverge from what the system was evaluated on), and collecting user feedback to identify failure patterns that pre-deployment testing missed.

Testing RAG Systems: Where Retrieval Meets Generation

Retrieval-augmented generation (RAG), where an LLM's responses are grounded in documents retrieved from an organisational knowledge base, is the most common enterprise LLM deployment pattern. It is also where testing becomes particularly nuanced, because failures can originate in the retrieval step, the generation step or the interaction between the two.

A RAG system can fail in several distinct ways. The retrieval component may return irrelevant documents, missing the information needed to answer the query. It may return relevant documents but rank them poorly, burying the critical information below less relevant content. The generation component may ignore the retrieved context and rely on its parametric knowledge instead, producing a plausible but ungrounded answer. Or it may hallucinate details that are not present in any of the retrieved documents, fabricating specifics while appearing to cite its sources.

Testing RAG systems therefore requires evaluating each component independently and the system as a whole. Retrieval quality can be measured through precision (what proportion of retrieved documents are relevant?) and recall (what proportion of relevant documents are retrieved?). Generation quality requires checking faithfulness (does the response accurately reflect the retrieved content?), relevance (does the response actually answer the question?) and completeness (does it include all pertinent information from the retrieved documents?).

The challenge is that these evaluations require ground-truth datasets specific to the organisation's knowledge base and user queries. Off-the-shelf benchmarks do not test whether your RAG system correctly answers questions about your company's policies, products or processes. Building these evaluation datasets, such as curating representative questions, establishing correct answers and maintaining them as the knowledge base evolves, is one of the most labour-intensive but essential aspects of AI testing. Enterprise research has found that content quality and organisation within the knowledge base itself often has a larger impact on RAG performance than the choice of model or retrieval architecture, which means testing must extend to the data layer, not just the AI components.

Testing Agentic AI: The Next Frontier

The testing challenge compounds further as organisations move from simple question-answering systems to agentic AI – systems that can plan multi-step tasks, use tools and take actions in the real world. An agentic workflow might involve an AI system that receives a customer request, retrieves relevant information from multiple sources, reasons about the best course of action and executes a series of steps (updating a database, sending a communication, triggering a workflow) with minimal human intervention.

Testing agentic systems requires evaluating not just the quality of individual outputs but the correctness of entire decision chains. Does the agent correctly decompose a complex task into appropriate sub-tasks? Does it select the right tools for each step? Does it handle errors and unexpected conditions gracefully? Does it know when to escalate to a human rather than proceeding autonomously?

These questions go beyond hallucination testing into territory that more closely resembles integration testing and end-to-end workflow validation. However, with the added complexity that the system's behaviour is non-deterministic and its decision-making is opaque.

The real-world consequences of inadequate agentic AI testing have already surfaced: in one widely reported incident, an autonomous AI coding agent deleted a company's primary database during a self-directed "cleanup" operation, violating a direct instruction prohibiting modifications. The root cause was not a hallucination but a reasoning failure, where the agent decided that a database cleanup was appropriate despite an explicit code freeze instruction, and no separation existed between test and production environments.

For engineering leaders, agentic AI testing demands a combination of traditional integration testing principles (test the workflow end-to-end, validate boundary conditions, verify error handling) with AI-specific evaluation (assess the quality of the agent's reasoning, its compliance with guardrails and its behaviour under adversarial or unexpected conditions). Sandbox environments with realistic but non-production data become essential, as does the ability to replay and analyse the agent's decision chain after the fact.

The Regulatory Dimension

The regulatory environment is adding both urgency and specificity to AI testing requirements.

The EU AI Act, now entering enforcement, establishes graduated testing obligations based on risk classification. High-risk AI systems, which include those used in employment, credit decisions, education and critical infrastructure, require comprehensive testing for accuracy, robustness, cybersecurity and non-discrimination before deployment, with ongoing monitoring obligations thereafter.

General-purpose AI models face model evaluation requirements including adversarial testing. Organisations deploying LLM-powered features must be able to demonstrate that they have tested their systems against these criteria – a compliance requirement that many have not yet begun to address.

The UK's approach differs in structure but converges in its implications. Rather than prescriptive legislation, UK regulators are applying existing regulatory frameworks, through the FCA, ICO, CMA and sector-specific regulators, to AI systems within their remit. The ICO's guidance on AI and data protection, for instance, requires organisations to demonstrate that AI systems processing personal data are accurate, fair and transparent. The practical effect is similar in that organisations must be able to evidence that they have evaluated their AI systems' behaviour against relevant quality and safety criteria.

The EU Cyber Resilience Act adds another layer for AI-powered software products, requiring that products be developed according to secure-by-design principles, free from known exploitable vulnerabilities and supported by ongoing security updates. For AI systems that interact with external inputs (user queries, retrieved documents, API calls), this implies testing for adversarial inputs, prompt injection and data leakage – categories that traditional security testing does not cover.

Building AI Testing Capability

Perhaps the most practical challenge facing engineering leaders is where AI testing capability should sit organisationally and what skills it requires.

AI evaluation requires a blend of competencies. It requires understanding of ML evaluation methodology, such as benchmark design, statistical analysis of non-deterministic outputs and evaluation metric selection. It also requires domain expertise to define what "correct" means for specific use cases – a question that is ultimately a business judgement rather than a technical one. As well as this, it requires prompt engineering capability to design effective evaluation prompts and adversarial test cases. And lastly, it requires the infrastructure skills to build and run evaluation pipelines at scale, integrate monitoring into production systems, and maintain evaluation datasets as the system and its usage evolve.

Some organisations are embedding this capability within existing QA teams, extending their remit to encompass AI evaluation alongside traditional testing. Others are building dedicated AI quality or AI evaluation functions, sometimes within ML engineering teams, sometimes as standalone roles. Neither approach has emerged as clearly superior. The right answer depends on the organisation's AI maturity, the scale and criticality of its AI deployments, and whether the dominant challenge is evaluation methodology (which favours ML expertise) or integration with existing quality processes (which favours QA expertise).

What is clear is that there is a skills gap. The World Quality Report found that 50% of organisations lack AI/ML expertise, unchanged from the prior year, and that generative AI has emerged as the single most in-demand skill for quality engineers (63%), ahead of core quality engineering fundamentals (60%).

PractiTest's State of Testing 2026 data reinforces this from the practitioner perspective. Testing professionals who actively use AI tools are significantly less anxious about their future and earn a measurable salary premium, suggesting that the market is already pricing in AI evaluation capability.

From Optional to Essential

The window during which AI testing could be treated as an emerging discipline is closing. Organisations are deploying LLM-powered systems into production, customers and employees are interacting with them daily, and the failure modes are documented and increasingly expensive.

The hallucination rates are quantified, with even frontier models exceeding 10% on rigorous benchmarks. The regulatory requirements are specific, with the EU AI Act mandating testing that most organisations cannot yet perform. And the deployment patterns are growing more complex, with RAG systems compounding retrieval and generation failures, while agentic workflows are introducing autonomous decision-making with real-world consequences.

The Veracode research on AI-generated code security showed the same pattern – newer, larger models do not produce more secure code, highlighting that these are not problems that will be solved with the next model release. Instead, teams require exploration and investment into testing capability, evaluation infrastructure and the organisational capacity to assess and manage the risks inherent in deploying probabilistic systems.

Ebook Available

How to maximise the performance of your existing systems

Free download

Richard Brown is the Technical Director at Audacia, where he is responsible for steering the technical direction of the company and maintaining standards across development and testing.