The Botium Coach Dashboard visualizes NLP performance metrics and suggests steps for improving them. It shows any pieces of test data that did not return the expected intent, returned the expected intent but with a low confidence score, or returned the expected intent but with a confidence score close to another intent’s.
IMPORTANT: By default, Botium Coach penalizes all user examples not predicted as expected by assigning them a confidence score of 0.0. When getting started, you should skip this penalty by activating the corresponding switch in the Botium Coach Dashboard Settings.
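The effect of the penalty can be illustrated with a minimal Python sketch. This is not Botium Coach code: the function name `apply_penalty`, the `skip_penalty` flag, and the tuple layout `(expected, predicted, confidence)` are assumptions made for illustration only.

```python
def apply_penalty(results, skip_penalty=False):
    """Return results with mispredicted examples forced to confidence 0.0.

    results: list of (expected_intent, predicted_intent, confidence) tuples
    (a hypothetical layout, not the Botium Coach data model).
    """
    adjusted = []
    for expected, predicted, confidence in results:
        # The penalty only touches examples whose prediction is wrong.
        if not skip_penalty and predicted != expected:
            confidence = 0.0
        adjusted.append((expected, predicted, confidence))
    return adjusted


sample = [("greeting", "goodbye", 0.9), ("greeting", "greeting", 0.8)]
penalized = apply_penalty(sample)
unpenalized = apply_penalty(sample, skip_penalty=True)
```

With the penalty active, the wrong prediction loses its 0.9 score. With the penalty skipped, the original scores stay visible, which makes it easier to see how confident the engine was when it was wrong.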
1st Glance: The Attention Box
The Attention Box shows any alarming events Botium Coach was able to identify:
Predicted intent doesn’t match the expected intent
Entities have not been recognized
Test data is not suitable for analyzing with Botium Coach
Clicking on the message shows the detailed records Botium Coach identified as the source of the problems.
Issues with the CORRECTNESS of the test results will be visualized here.
4th Glance: Confusion Matrix and Confidence Threshold Chart
A Confusion Matrix shows an overview of the predicted intent vs. the expected intent. It answers questions like: "When sending user example X, I expect the NLU to predict intent Y; what did it actually predict?" The expected intents are shown as rows, the predicted intents as columns. For each user example sent to the NLU engine, the cell at the expected intent's row and the predicted intent's column is increased by 1. Whenever the predicted and expected intents match, a cell on the diagonal is increased; these are the successful test cases. All counts in cells off the diagonal are failed test cases.
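The counting procedure can be sketched in a few lines of Python. This is an illustration under assumptions, not how Botium Coach implements it: the intent names and the `(expected, predicted)` pair layout are invented sample data.

```python
from collections import Counter


def build_confusion_matrix(results):
    """Count how often each (expected, predicted) intent pair occurs.

    The key (row, column) = (expected_intent, predicted_intent).
    """
    matrix = Counter()
    for expected, predicted in results:
        matrix[(expected, predicted)] += 1
    return matrix


# Hypothetical test results: (expected intent, predicted intent).
results = [
    ("greeting", "greeting"),
    ("greeting", "goodbye"),
    ("goodbye", "goodbye"),
    ("order_pizza", "order_pizza"),
    ("order_pizza", "greeting"),
    ("order_pizza", "order_pizza"),
]

matrix = build_confusion_matrix(results)

# Diagonal cells (expected == predicted) are the successful test cases.
successes = sum(n for (expected, predicted), n in matrix.items() if expected == predicted)
failures = sum(matrix.values()) - successes
```

For this sample data, four test cases land on the diagonal and two land off it.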
The most commonly used statistical measures of NLU performance are precision and recall:
The question answered by the precision score is: Of all predictions of an intent, how many are correct?
The question answered by the recall rate is: Of all user examples belonging to an intent, how many are predicted correctly?
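Both measures can be read directly off a confusion matrix: precision divides the diagonal cell by its column total, recall divides it by its row total. The following Python sketch shows this per intent, using invented sample data and a hypothetical helper named `precision_recall`.

```python
from collections import Counter


def precision_recall(results):
    """Return {intent: (precision, recall)} from (expected, predicted) pairs.

    A score is None when its denominator is zero (intent never predicted
    or never expected).
    """
    matrix = Counter(results)
    intents = sorted({intent for pair in matrix for intent in pair})
    scores = {}
    for intent in intents:
        true_positives = matrix[(intent, intent)]
        # Column total: every time this intent was predicted.
        predicted_total = sum(n for (_, pred), n in matrix.items() if pred == intent)
        # Row total: every time this intent was expected.
        expected_total = sum(n for (exp, _), n in matrix.items() if exp == intent)
        precision = true_positives / predicted_total if predicted_total else None
        recall = true_positives / expected_total if expected_total else None
        scores[intent] = (precision, recall)
    return scores


# Hypothetical test results: (expected intent, predicted intent).
results = [
    ("greeting", "greeting"),
    ("greeting", "goodbye"),
    ("goodbye", "goodbye"),
    ("order_pizza", "order_pizza"),
    ("order_pizza", "greeting"),
    ("order_pizza", "order_pizza"),
]

scores = precision_recall(results)
```

Here `goodbye` has a recall of 1.0 (its only example was recognized) but a precision of 0.5, because one of the two `goodbye` predictions actually belonged to `greeting`.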
The confidence threshold is the lowest accepted confidence score. If the NLP engine is not sure enough when classifying an intent (its confidence score is below the confidence threshold), it answers with an incomprehension intent to show that it doesn’t understand. This chart helps in finding the best confidence threshold for your use case: it visualizes the balance between precision and recall, and depending on your use case one or the other may have priority.
Botium Coach will detect any issues with the test results and suggest actions that will improve the overall NLU performance. It will tell you which intents require more training data, and whether the test data is unsuitable for NLU testing.
What else?
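The trade-off can be explored by sweeping over candidate thresholds. The sketch below is a simplified illustration, not the calculation Botium Coach performs: it assumes overall (not per-intent) scores, treats every prediction below the threshold as an incomprehension answer, and uses invented sample data.

```python
def sweep_thresholds(results, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold.

    results: list of (expected_intent, predicted_intent, confidence).
    Predictions below the threshold count as "not understood".
    """
    rows = []
    total = len(results)
    for threshold in thresholds:
        answered = [(exp, pred) for exp, pred, conf in results if conf >= threshold]
        correct = sum(1 for exp, pred in answered if exp == pred)
        # Precision: share of accepted predictions that are correct.
        precision = correct / len(answered) if answered else None
        # Recall: share of all user examples that were answered correctly.
        recall = correct / total if total else None
        rows.append((threshold, precision, recall))
    return rows


# Hypothetical test results with confidence scores.
results = [
    ("greeting", "greeting", 0.95),
    ("greeting", "goodbye", 0.40),
    ("goodbye", "goodbye", 0.80),
    ("order_pizza", "order_pizza", 0.60),
    ("order_pizza", "greeting", 0.55),
    ("order_pizza", "order_pizza", 0.90),
]

rows = sweep_thresholds(results, [0.0, 0.5, 0.7])
for threshold, precision, recall in rows:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

In this sample, raising the threshold from 0.5 to 0.7 removes the last wrong answer (precision rises to 1.0) but also rejects one correct prediction (recall drops to 0.5).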
Botium Coach visualizes some more useful metrics. Explore on your own!
Mismatch Probability Risks
This section shows charts visualizing the risk that some intents will be mismatched, meaning that the NLU engine predicts the correct intent, but with a confidence score very close to that of another intent. In real life, a chatbot in this situation often responds with something like "I am not sure what you mean. Do you mean X or Y?" (In IBM Watson, this is called disambiguation.)
Issues with the CLARITY of the test results will be visualized here.
Clicking on the radar chart shows the list of intents with the confidence scores predicted by the NLU engine. This only works if the NLU engine actually returns an alternate intents list.
There are also charts showing the similarity of two intents based on the alternate intents lists returned by all user examples.
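One simple way to think about mismatch risk is the margin between the top-ranked intent and the runner-up in the alternate intents list. The Python sketch below flags user examples whose margin falls below a limit. The function name `mismatch_risks`, the `min_margin` value of 0.1, and the sample data are assumptions for illustration, not the measure Botium Coach uses.

```python
def mismatch_risks(predictions, min_margin=0.1):
    """Flag user examples whose top two intents have close confidence scores.

    predictions: list of (utterance, [(intent, confidence), ...]).
    Returns (utterance, top_intent, runner_up_intent, margin) tuples.
    """
    risky = []
    for utterance, intents in predictions:
        ranked = sorted(intents, key=lambda item: item[1], reverse=True)
        if len(ranked) < 2:
            # No alternate intents returned, so no margin can be computed.
            continue
        (top, top_conf), (runner_up, runner_conf) = ranked[0], ranked[1]
        margin = top_conf - runner_conf
        if margin < min_margin:
            risky.append((utterance, top, runner_up, round(margin, 2)))
    return risky


# Hypothetical NLU responses including alternate intents.
predictions = [
    ("I want to cancel", [("cancel_order", 0.52), ("cancel_subscription", 0.48), ("greeting", 0.01)]),
    ("hello there", [("greeting", 0.97), ("goodbye", 0.02)]),
]

risky = mismatch_risks(predictions)
```

Only the first example is flagged: `cancel_order` wins, but `cancel_subscription` trails by just 0.04, which is the kind of situation where a chatbot would ask a disambiguation question.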