Testing an AI WhatsApp Chatbot


Emily O'Connor

29 May 2024 - 9 min read


Audacia partnered with Northern Trains to create a WhatsApp chatbot that uses Natural Language Processing (NLP), a branch of Artificial Intelligence (AI), to handle customer messages about live train data and onward taxi information.

Akin to a live departure board found at a train station, the first-to-market chatbot is tailored to display relevant train schedules, including any disruptions and rail replacement services.

Being a public-facing AI WhatsApp chatbot, testing was a critical part of delivery. Here, we take you through how automated testing was executed on the project.

But first, some terminology. 



Intent

What the user is trying to achieve: the goal behind their chatbot input.

‘Thank you’, ‘cheers’ and ‘ta’ are all examples of the intent to express gratitude.

‘How can I get from Leeds to London?’ and ‘When is the next train from London to Leeds?’ both have the intent of getting journey information, even though the answers to those questions will provide different information.


Entity

An entity is a piece of data used by the customer to modify their query. In this project, the entities are departure and arrival stations. An entity is applied on top of an intent and does not change it.

Confidence score

A confidence score is a value between zero and one that the NLP model uses to express how confident it is that it has correctly identified the intent and entities: zero denotes no confidence, one complete confidence.

CRS code

CRS codes (computer reservation system codes) are unique, three-letter identifiers assigned to every train station in the UK, typically shortened versions of the station name. Manchester Piccadilly has the CRS code ‘MAN’, for instance.

CRS codes were particularly tricky to test because many of them are three-letter words or popular names!

  { "crs": "AMY", "name": "Amberley" },

  { "crs": "AND", "name": "Anderston" },

  { "crs": "ASK", "name": "Askam" },

  { "crs": "BIN", "name": "Bingham" },

  { "crs": "BOT", "name": "Bootle Oriel Road" },

  { "crs": "BUG", "name": "Burgess Hill" },

  { "crs": "BUY", "name": "Burley Park" },

  { "crs": "BUT", "name": "Burton-on-Trent" },

  { "crs": "FIN", "name": "Finstock" },

  { "crs": "FOR", "name": "Forres" },

  { "crs": "HAD", "name": "Haddiscoe" },

  { "crs": "HOW", "name": "Howden" },

  { "crs": "NOT", "name": "Nottingham" },

  { "crs": "PUT", "name": "Putney" },

  { "crs": "RAN", "name": "Rannoch" },

  { "crs": "SAM", "name": "Saltmarshe" },

  { "crs": "SIT", "name": "Sittingbourne" },

  { "crs": "SAT", "name": "South Acton" },

  { "crs": "SAY", "name": "Swanley" },

  { "crs": "THE", "name": "Theale" },

  { "crs": "TIL", "name": "Tilbury Town" },

  { "crs": "TOO", "name": "Tooting" },

  { "crs": "WAS", "name": "Watton-at-Stone" },

  { "crs": "WHY", "name": "Whyteleafe" },

What did we test?

The Northern Trains chatbot understands six specific intents that need to be tested:


Greeting

When a customer starts with a greeting message: “Hello”, “Hi”, “Yo”.

The chatbot provides a hard-coded response to a greeting from the user.

Journey information 

“When is the next train from Leeds to London?” 

The chatbot provides the relevant response answering the customer’s query regarding journey information.

Station selection 

“When is the next train to Manch?” 

When the station information supplied by the user is incomplete or is not an exact station match, the chatbot presents a pop-up selection box of possible matches.

Onward travel information 

“Can I get a taxi from Leeds station?” 

Customers can also ask about onward taxi connections at any station in the UK, and the chatbot provides relevant onward travel information in response.


Gratitude

When a customer expresses gratitude: “Thank you”, “Thanks”, “Cheers”.

The chatbot provides a hard-coded response to a statement of thanks from the user.


Unrecognised input

“I left my bag on the train, can you help me find it?”

If the user’s input is not recognised or the intent is assigned a low confidence score, the chatbot signposts the user to other information using a Help menu, or prompts them to try their query again.

What can be automated?

To test the chatbot, the deterministic aspects of its functionality were automated. This included ensuring that the correct intent was identified, that train station names or CRS codes were correctly extracted, and that messages received a reply.

Thinking about the journey information intent, there are likely a hundred different ways a customer can ask the chatbot “How can I get from A to B?”, and on top of that there is the capitalisation of words and station names, plus punctuation. To address this, the project team came up with the concept of “signed-off phrases”: those which must work for every model version. These phrases were considered the most common ways for people to type each intent.
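The signed-off phrase idea can be sketched as a phrase list plus automatic capitalisation and punctuation variants. The phrases and variant rules below are illustrative assumptions, not the project's actual set:

```javascript
// Illustrative "signed-off phrases" for the journey information intent.
const signedOffPhrases = [
  'When is the next train from Leeds to London?',
  'How can I get from Leeds to London?',
  'Next train Leeds to London',
];

// Each signed-off phrase is also exercised with capitalisation and
// punctuation variants, since users type the same intent many ways.
function variants(phrase) {
  return [...new Set([
    phrase,
    phrase.toLowerCase(),
    phrase.toUpperCase(),
    phrase.replace(/[?.!,]/g, ''), // punctuation stripped
  ])];
}

console.log(variants(signedOffPhrases[0]));
```

Every variant of every signed-off phrase would then be sent to the model and checked for the expected intent.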

Testing station names

Firstly, to test the 2,579 stations in the UK, tests were written to ensure the name of each station could be detected by the chatbot.

To test this, code was written to accept a list of train stations and map them into departure–arrival pairs. This code outputs a JourneysArray containing pairs such as A→B, B→C, C→D, D→E and so on, eventually looping round to connect Z→A. This means every station is tested as both a departure station and an arrival station.
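The pairing step can be sketched as below; the station sample and function name are illustrative stand-ins for the full list:

```javascript
// Map a station list into departure–arrival pairs (A→B, B→C, …),
// wrapping the final station back to the first so every station
// appears in both roles.
function buildJourneysArray(stations) {
  return stations.map((station, i) => ({
    departure: station,
    arrival: stations[(i + 1) % stations.length], // last pair wraps round
  }));
}

const sample = ['Leeds', 'London Kings Cross', 'Manchester Piccadilly', 'York'];
console.log(buildJourneysArray(sample));
```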

When testing the journey intent, this enables assertions against the full station name and the CRS station code. When run, the test loops through every combination of departure and arrival, making sure that each pair achieves a confidence score that exceeds our pre-defined threshold of 0.7.
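The per-pair check might look like the following sketch. `recogniseJourney` is a stub standing in for the real NLP call; its name and result shape are assumptions:

```javascript
// Pre-defined confidence threshold from the article.
const CONFIDENCE_THRESHOLD = 0.7;

function recogniseJourney(message) {
  // Stub: the real test sends `message` to the NLP model and reads its result.
  return { intent: 'journeyInformation', score: 0.92 };
}

// Assert that one departure–arrival pair is recognised with the right
// intent and a confidence score above the threshold.
function checkPair(departure, arrival) {
  const result = recogniseJourney(
    `When is the next train from ${departure} to ${arrival}?`
  );
  if (result.intent !== 'journeyInformation') throw new Error('wrong intent');
  if (result.score < CONFIDENCE_THRESHOLD) throw new Error('confidence below threshold');
  return result;
}

console.log(checkPair('Leeds', 'London Kings Cross'));
```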

This test allows us to confirm that every potential combination can be recognised accurately, and it identifies any anomalies early. It also highlights any three-letter words a user might input that could be confused with matching CRS codes by the chatbot.

Next, we test the exact, full text station names. This is important because it allows us to assess how the chatbot performs when handling station names that include spaces, ampersands (&) and other punctuation marks. Again, this test is executed against all station names to ensure they all work correctly. We also test using upper and lower-case input to check this does not introduce any unexpected behaviour.

As the testing revealed issues, new specific tests were written to document regression bugs - often in the case of places with cardinal directions or shared first words.

To ensure the journey information intent was identified, phrases such as “Whens the next train leaving {{{ departureStn }}} and arriving at {{{ arrivalStn }}}?” were sent to the chatbot. The test was marked as successful when the correct intent and entities were identified, with a high confidence score.
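Filling those triple-brace placeholders before a phrase is sent could be done as in this sketch (placeholder names are taken from the article; the helper itself is an assumption):

```javascript
// Replace {{{ placeholder }}} tokens in a phrase template with real values.
function renderPhrase(template, values) {
  return template.replace(/\{\{\{\s*(\w+)\s*\}\}\}/g, (_, key) => values[key]);
}

const phrase = renderPhrase(
  'Whens the next train leaving {{{ departureStn }}} and arriving at {{{ arrivalStn }}}?',
  { departureStn: 'Leeds', arrivalStn: 'London Kings Cross' }
);
console.log(phrase);
```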

All testing around the specifics of journey information and rail replacement services was not automated, as that data is not deterministic in nature.

Testing onward journeys

The chatbot is also able to provide details of taxi firms available at each of the stations (where known). When a user searches for onward journey information, the chatbot will provide details of up to four minicab operators. 

Behind the scenes, the chatbot must detect the station for which the customer would like taxi information, look up that station's CRS code and then fetch any taxi firms for that location.

Because this information does not get updated with the same frequency as real-time journey information, an automated test was written to look up the required data and ensure that returned messages exactly match the supplied data. This test was repeated for all of Northern's 513 stations and also asserted that only one station name had been identified.
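A minimal sketch of the onward-travel check, using an in-memory lookup as a stand-in for the real taxi-firm data source (the station code, firm names and reply wording are all illustrative assumptions):

```javascript
// Illustrative taxi-firm data keyed by CRS code; not real listings.
const taxiFirmsByCrs = {
  LDS: ['Firm A Taxis', 'Firm B Cars', 'Firm C Cabs', 'Firm D Minicabs', 'Firm E'],
};

// Build the reply the chatbot would send; up to four operators are shown.
function onwardTravelReply(crs) {
  const firms = (taxiFirmsByCrs[crs] || []).slice(0, 4);
  return firms.length
    ? `Taxi firms at this station: ${firms.join(', ')}`
    : 'No taxi information is held for this station.';
}

console.log(onwardTravelReply('LDS'));
```

The automated test would assert the reply text exactly matches the supplied data for each station.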

Testing greetings and gratitude

The first thing to test for these conversational elements is whether a CRS code is detected. To avoid problems, a confidence score of zero is assigned so that the model does not treat this input as a journey information request.

We execute these tests in Mocha which allows us to add the text ‘greeting’ or ‘gratitude’ to each. This makes it much easier to debug each hard-coded response as we iterate through them.
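The labelling idea can be sketched as below; the recogniser is stubbed and the case phrases are examples, not the project's actual suite:

```javascript
// Each conversational case carries its suite label ('greeting' or
// 'gratitude') so a failing hard-coded response is easy to trace.
const conversationalCases = [
  { label: 'greeting', phrase: 'Hello' },
  { label: 'greeting', phrase: 'Yo' },
  { label: 'gratitude', phrase: 'Thanks' },
  { label: 'gratitude', phrase: 'Cheers' },
];

function checkNoJourneySignal(phrase) {
  // Stub: the real test asserts no CRS code is detected and that the
  // journey intent is given a confidence score of zero for this input.
  return { phrase, crsDetected: false, journeyScore: 0 };
}

for (const { label, phrase } of conversationalCases) {
  const result = checkNoJourneySignal(phrase);
  console.log(`${label}: "${phrase}" -> crsDetected=${result.crsDetected}`);
}
```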

Testing inexact matches

Up to now, the tests have assumed that the user provides the exact station name as detailed by National Rail, or a valid CRS code. But this is not how the majority of users will interact with the chatbot – so what happens when a user provides an inexact station name?

Say the chatbot is supplied with “Manchester”: there are still multiple possible matches – Manchester Piccadilly, Manchester Victoria, Manchester Airport and so on.

Rather than guess, the chatbot should provide the user with a pop-up of potential matches, allowing them to select the correct option. We repeat this test with one and two inexact station names to confirm the pop-up selection works and that the “option” stations relate to the inexact station names provided.
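A minimal sketch of the assertion: given a partial name, every offered option should relate to it. Simple substring matching is used here as a simplification; the station sample is illustrative:

```javascript
// Illustrative station sample, not the full National Rail list.
const knownStations = [
  'Manchester Piccadilly',
  'Manchester Victoria',
  'Manchester Airport',
  'Leeds',
];

// Return the pop-up options offered for a partial station name.
function stationOptions(input) {
  const needle = input.toLowerCase();
  return knownStations.filter((name) => name.toLowerCase().includes(needle));
}

console.log(stationOptions('Manchester'));
```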

Crucially, the options a user can select from are based not on character matching but on the place associated with the string provided.

In the example where a user mistypes Leeds as “Leds”, the options shown to the user are those most associated with Leeds. If you move on to the next page of results, places like Woodlesford, Headingley and Morley are shown – although these places are in and around Leeds, they appear lower down the results because people often search for them separately from Leeds.

Testing non-functionals

Up to now, all the testing that has been described is functional testing – relating specifically to the chatbot's features.

Seeing as the chatbot exists entirely within WhatsApp, the app's security and performance were mostly out of our control. Accessibility is also out of our control, as screen readers and text-to-speech apps both operate on top of WhatsApp, which is true regardless of the chatbot.

Out into the real world

These automated tests have helped us to identify and fix the vast majority of potential user issues, when people use the chatbot as intended.

The outcome - the chatbot is able to provide real-time journey information for people travelling to the platform.

For Northern Trains, the chatbot is invaluable, helping customers to find the information they need quickly, easily and reliably – without having to contact customer services.

Your turn

If you want to use the chatbot yourself – message the number 07870606060 on WhatsApp.

Feedback related to the functionality should be directed to enquiries@northernrailway.co.uk


Emily O'Connor is a Principal Test Engineer at Audacia and has worked across various industries including banking, manufacturing, gambling and transportation.