It feels like Artificial Intelligence is ubiquitous, but it’s not. While almost every consumer-facing organization has introduced some form of AI into its processes, most instances are rudimentary and rules-based, and don’t require a trained Deep Learning model in order to work.
Let’s take a wine pairing app as an example. Lidl’s Margot the Winebot app leverages conversational AI but it’s not using any model behind the scenes—there is a finite set of queries linked by rules to a finite set of responses.
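To make the idea of "a finite set of queries linked by rules to a finite set of responses" concrete, here is a minimal sketch of a rules-based responder. It is an illustration only and does not reflect how Margot is actually built; the patterns, the responses, and the `respond` function are all made up for this example.

```python
import re

# Hypothetical rules: each pattern is linked to exactly one canned response.
RULES = [
    (re.compile(r"\b(steak|beef|lamb)\b", re.IGNORECASE),
     "Try a full-bodied red such as a Malbec."),
    (re.compile(r"\b(fish|seafood|salad)\b", re.IGNORECASE),
     "Try a crisp white such as a Sauvignon Blanc."),
    (re.compile(r"\b(dessert|chocolate)\b", re.IGNORECASE),
     "Try a sweet dessert wine."),
]

# Anything the rules do not cover gets the same fallback answer.
FALLBACK = "Sorry, I don't have a pairing for that yet."


def respond(utterance: str) -> str:
    """Return the response of the first rule whose pattern matches."""
    for pattern, response in RULES:
        if pattern.search(utterance):
            return response
    return FALLBACK
```

Every new question type needs a new hand-written rule, which is why this approach stops being practical once the number of questions and responses grows into the thousands.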
Deep Learning is different. When you have thousands of questions, thousands of requests, and hundreds of thousands of responses, rules-based AI is no longer feasible.
While tech behemoths like Tesla, Google, and Amazon make it seem easy, taking Deep Learning models into production is one of the hardest tasks faced by the data science community today. It’s relatively easy to get models up and running locally, but moving them into large-scale production is difficult. In fact, it has been estimated that fewer than 5% of commercial data science projects actually make it to production.
Finn AI customers are part of that elite minority. In this post, we share insights into some of the challenges that make deployments of AI-powered chatbots at scale so complex.
Labeling is a fundamental stage of data preprocessing when working with supervised learning algorithms. It’s essential…and incredibly time-consuming. It involves taking a set of unlabeled data and augmenting each piece with informative labels. For example, labels might indicate whether a video contains a person singing, the name of the song, the language it’s being sung in, the musical instruments being played, the color of the singer’s hair, and so on. Labels may also capture things like the overall sentiment of an utterance.
As you can imagine, it’s a detail-laden, arduous task that must be performed by humans to ensure accuracy. Bad labeling will affect the quality of your dataset and the overall performance of a predictive model, and thus your chatbot.
There are many approaches to labeling, including internal labeling, outsourcing to specialized agencies, data programming, and synthetic labeling. We found the best approach to be a combination of internal and specialized outsourcing (with advanced controls in place to maintain quality).
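The post does not spell out which quality controls are used, so the following is only a sketch of one common control: have several people label the same utterance and accept a label only when enough of them agree. The `consolidate` function and the two-thirds threshold are assumptions chosen for illustration.

```python
from collections import Counter


def consolidate(annotations, min_agreement=2 / 3):
    """Majority-vote a list of labels given by different annotators.

    Returns (label, agreement). The label is None when the most common
    answer does not reach the agreement threshold, which signals that
    the item should go back for review.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(annotations)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement
```

Items that come back with `None` are the ones worth a second look, since disagreement between annotators often points to an ambiguous utterance or unclear labeling guidelines.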
Validating that your model works
Once your data is in order, it’s time to test your model. You can test on a subset of your data and measure performance, but without an extensive beta test, it’s difficult to predict how real people will use the chatbot.
That being said, there are some techniques to test the overall performance of the model. For first-time deployments, you can test your dataset to ensure it is balanced, representative of real-world utterances, and comprehensive enough to capture all parts of the functionality. This can be an early indicator of performance to monitor on each new training run.
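As a rough sketch of the balance check described above, the snippet below counts how many training utterances exist per intent and flags the dataset when the largest intent has far more examples than the smallest. The `intent_balance` name and the 5:1 threshold are assumptions for illustration; a suitable threshold depends on your data.

```python
from collections import Counter


def intent_balance(labels, max_ratio=5.0):
    """Return (counts, ratio, is_balanced) for a list of intent labels.

    ratio is the size of the largest intent divided by the smallest.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    ratio = largest / smallest
    return counts, ratio, ratio <= max_ratio
```

Tracking this ratio on every training run gives a cheap early warning when new data skews the dataset toward a handful of intents. It says nothing about whether the utterances are representative of real users, which still needs human review.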
Some utterances simply have to work despite the non-deterministic nature of the models themselves. In these scenarios—often required for regulatory and compliance reasons—additional test cases can be written and validated on each new training cycle.
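One way to express such must-pass cases is a small check that runs against every newly trained model. This is a sketch under assumptions: `predict` stands in for whatever function maps an utterance to an intent in your system, and the example utterances and intent names in the docstring are hypothetical.

```python
def check_required_utterances(predict, required_cases):
    """Run every must-pass utterance through the model.

    predict:        callable taking an utterance and returning an intent name.
    required_cases: list of (utterance, expected_intent) pairs, for example
                    [("I lost my card", "report_lost_card")].

    Returns a list of (utterance, expected, actual) for each failure.
    An empty list means every required utterance still works.
    """
    failures = []
    for utterance, expected_intent in required_cases:
        actual = predict(utterance)
        if actual != expected_intent:
            failures.append((utterance, expected_intent, actual))
    return failures
```

A training pipeline could refuse to publish a model whenever this list is non-empty, so that a required utterance never regresses silently.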
Forking out for the infrastructure
Training and using Deep Learning models requires a lot of computing power, particularly at scale, because you must ensure there are no latency issues for the end user. CPUs are too slow for Deep Learning models, so GPUs are a must-have. However, GPUs are scarce and costly (as of 2019, the cost can run anywhere up to $750 USD).
Consider hyperparameter optimization as an example of how costs can escalate, even before deployment. Hyperparameter optimization is one of the best ways to validate that your model works, but it involves changing the hyperparameters of the model and experimenting with as many different combinations as you can across multiple training runs of multiple models. You choose the one that performs best and use its parameters as the starting point for the next model. You may have to train 3,000 or 4,000 models to arrive at a usable set of numbers. As you can imagine, this requires a lot of expensive GPU time. For most organizations, this cost is prohibitive unless there is a significant return on investment.
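To show why the number of training runs grows so quickly, here is a minimal random-search sketch. It is not the procedure used at Finn AI: the search space, the `random_search` function, and especially `train_and_score`, which returns a made-up score instead of training a real model, are placeholders for illustration.

```python
import random

# Hypothetical search space; real projects usually tune many more values.
SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001],
    "hidden_units": [64, 128, 256],
    "dropout": [0.0, 0.2, 0.5],
}


def train_and_score(params):
    """Stand-in for a full training run that returns a validation score.

    In practice this is the expensive step: every call means training
    a model on GPUs. Here it is a toy formula so the sketch runs instantly.
    """
    score = 1.0
    score -= abs(params["learning_rate"] - 0.01) * 5
    score -= abs(params["hidden_units"] - 128) / 1000
    score -= abs(params["dropout"] - 0.2)
    return score


def random_search(n_trials, seed=0):
    """Try n_trials random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Each trial corresponds to one complete training run, so the total cost is roughly the number of trials multiplied by the GPU time of a single run.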
Solutions like Amazon SageMaker allow you to build, train, and test your models without requiring a full GPU. You can rent by the hour to reduce costs in the early stages and move to a GPU when you’re ready to deploy (and hopefully start seeing some of that ROI).
Other cost-effective ways to train and deploy include serverless platforms and containerization. Serverless platforms (like AWS Lambda) allow you to scale up when you need the computing power and scale down when you don’t. Containerization lets you deploy multiple individually packaged instances of your software on the same machine.
Read more about How Finn AI uses Amazon MXNet to train our banking chatbot models deployed into production.
Deploying into production
When you’re ready to deploy your chatbot, there are many steps to consider:
- Where is it hosted? On-premise or in the cloud?
- How will you set up a pipeline from the training model to the deployment model?
- How does it interact with existing systems? Is it standalone? Is it integrating with other business applications? Is it replacing something else?
- What processes will it change?
- Which teams does it affect? What’s the internal communications/training plan?
- How will you manage updates to the model? How frequently will you release new models with additional data? How will you validate that the new models are safe to deploy?
Every organization will have a different set of considerations to work through to ensure the deployment of the chatbot is as smooth as possible.
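For the question of how to validate that a new model is safe to deploy, one possible gate is to compare the candidate against the model currently in production on the same held-out examples. The sketch below assumes each model is exposed as a function from utterance to intent; the function names and the one-point tolerance are illustrative choices, not a recommendation.

```python
def accuracy(predict, examples):
    """Fraction of (utterance, intent) examples the model gets right."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for text, intent in examples if predict(text) == intent)
    return correct / len(examples)


def safe_to_deploy(candidate, production, holdout, tolerance=0.01):
    """Allow deployment only if the candidate is not noticeably worse.

    Returns (ok, candidate_accuracy, production_accuracy).
    """
    candidate_acc = accuracy(candidate, holdout)
    production_acc = accuracy(production, holdout)
    return candidate_acc >= production_acc - tolerance, candidate_acc, production_acc
```

A gate like this fits naturally into the pipeline between training and deployment, and the release cadence then depends on how often new data produces a model that passes it.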
Measuring the live performance
Once your model is in the wild, it’s time to watch it grow, measure its performance, and tailor it as required. Figure out what metrics you’ll use. Basic metrics include:
- How many entities and intents match successfully?
- How many errors are there?
- What are the most common questions people ask?
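The basic metrics above can be computed from conversation logs with very little code. The sketch below assumes a hypothetical log format in which each entry is a dictionary with `text`, `intent` (`None` when nothing matched), and `error` fields; it covers intents only, and entity matching would need an additional field.

```python
from collections import Counter


def summarize(logs, top_n=3):
    """Compute match rate, error count, and most common questions."""
    total = len(logs)
    matched = sum(1 for entry in logs if entry["intent"] is not None)
    errors = sum(1 for entry in logs if entry["error"])
    # Normalize case and whitespace so trivially different spellings group together.
    questions = Counter(entry["text"].strip().lower() for entry in logs)
    return {
        "match_rate": matched / total if total else 0.0,
        "errors": errors,
        "top_questions": questions.most_common(top_n),
    }
```

Utterances that show up often among the unmatched entries are natural candidates for new training data.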
As metrics continue to evolve, we’ll be able to surface greater insights to further prove the ROI of Deep Learning models.
Deploying Deep Learning models at scale is a long and complex process. This makes the barrier to entry very high for all but an elite few.
If you’re interested in this topic, follow the work of Geoffrey Hinton at the Vector Institute. Many call him the ‘Godfather of Deep Learning’ as he was researching this field long before we had the hardware and data to make it work. There are also some useful articles that go into more detail on the technical aspects—I recommend Algorithmia’s blog post and Mahesh Kumar’s Medium post.