In this episode, host Byron Reese speaks with Ron Green of KUNGFU.AI about how companies integrate AI and machine learning into their business models.
Ron is a serial tech entrepreneur and expert in machine learning. He’s built several successful companies and worked in telecommunications, biotechnology, e-commerce, social media, and healthcare.
Ron is currently CTO of KUNGFU.AI, an AI consultancy that helps companies build their strategy and operationalize and deploy artificial intelligence solutions. Prior to KUNGFU.AI, Ron was CEO and founder of Thrive Technologies (acquired by CLOUD) and ran software development at Ziften Technologies, Powered (acquired by Dachis Group), and Visible Genetics (acquired by Bayer).
Ron holds an MSc with Distinction in Evolutionary and Adaptive Systems from the University of Sussex, and a degree in Computer Science from the University of Texas.
Byron Reese: This is Voices in AI, brought to you by Gigaom, and I'm Byron Reese. Today my guest is Ron Green. Ron is the CTO over at KUNGFU.AI. He holds a BA in Computer Science from the University of Texas at Austin, and he holds a Master of Science from the University of Sussex in Evolutionary and Adaptive Systems. His company, KUNGFU.AI, is a professional services company that helps companies start and accelerate artificial intelligence projects. I asked him [to be] on the show today because I wanted to do an episode that was a little more ‘hands-on’ about how an enterprise today can apply this technology to their business. Welcome to the show, Ron.
Ron Green: Thank you for having me.
So let's start with that: what sizes of organizations do you see starting [to do] machine learning kinds of projects?
Yeah, we're seeing companies really at all sizes and all stages, meaning... really large companies, you'd be surprised, Fortune 500 level companies that may have some data science experience, but are really looking to come up to speed and take advantage of a lot of recent advances in machine learning. So we work typically with what we call mid-tier companies, [those with] 100 million to maybe 2 billion of revenue, and we're seeing really pretty much across the spectrum everybody moving into machine learning and AI.
But there has to be a lower end. I don't think my dry cleaner is spinning up any projects with you?
What size of company should not even, I mean, they'll use tools that have been built using it, but in terms of like... amassing their own data and doing their own development on it, what would be a small company [for which] probably, it doesn't make sense for this one?
Yeah. Well, I mean at the lowest level, if you're talking about some guy starting a company on their own, with... the open-source machine learning libraries that are out there, if you are trying to do something to let's say use natural language processing as a part of your project, there really is no lower bound now. I mean, a couple of guys in a garage could take advantage of these techniques in an affordable way and integrate them into the product. I really don't think there's a lower bound.
So let's walk through the life cycle of a project. I'm an enterprise with—let's say a development department of programmers that maybe has 200 people. And I get the edict from the CEO that we need to do some of ‘that AI stuff’ that he/she is hearing a lot about. How do you start and identify places that the technology can be applied to?
We really have a methodology for that, and it's pretty straightforward. You need a combination of a few things. You've really got to understand the business and the business objectives. If you just walk in and you start building technology for technology's sake, you're not helping anybody, but you've also got to marry that with an understanding of what data is available.
So you may have a really high-level important strategic initiative that you want to solve, but you don't have the data. And we see that occasionally, and in those instances, it's not the best outcome, but knowing sooner [rather] than later that you need to start collecting data, or you need to start augmenting or brokering your data, that's a good thing to learn sooner than later. But lastly, you really need to understand what the state of the art is.
So you marry the business objective with the available data and the feasibility. Things are moving pretty quickly in a couple of fields, especially computer vision and natural language processing. So something that wasn't even possible a couple of years ago may be possible now. And you look at the intersection, and we very much are – we consider ourselves sort of practical AI, in that we're not interested in taking your two-year research projects.
We're very much about building solutions today with the tools that are available today and getting them into production. And so we find an intersection of those three areas and try to typically move quickly, you know, have things completed within a quarter and live, so that companies can get a quick win under their belt and get confidence about moving into this space.
So critique an idea for me. Let me just throw a couple of random ones at you. So I'm a large company that has 10,000 employees and a long operating history. And I've hired a bunch of people, and some people work out and some people don't. And I have performance reviews that actually quantify how people are doing. And I say to you, here's what I want to do: I want to take every resume that's ever been submitted to us, [where] we've hired the person, and I want you to figure out – can you just help me predict the success of any given resume based on all that hiring data and all that success or failure data?
Oh yeah, that's a great one. We've never worked on a project exactly like that, but we've done things that are pretty similar. So with a problem like that, you're dealing with a bunch of different kinds of data: you're dealing with textual data; you may be dealing with categorical data, meaning information about where they graduated or what their degree was in; and you might even have numerical data, like how long they worked at a job or what their GPA was, things like that.
So a multi-modal problem like that is really ideal for a couple of different techniques. I would actually say that there's a little bit of a secret in AI right now: a couple of techniques are dominating the field, and they're really boosted trees and deep learning models. And so trees are a little bit easier to work with initially, so I would take a stab at saying, “Let's collect that data,” and let me back up and say that to solve a problem like this you would need a sufficient dataset, you would need thousands of instances of resumes and then the resulting hiring decision and performance information. But then you could quite readily build a system that would take in that text, take in the categorical data, take in the continuous numeric values, and essentially build a prediction system around those resumes to predict performance.
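Editor's note: the following is an illustrative sketch added to this transcript, not something described in the interview. It shows one simple way the three kinds of resume data Ron mentions (text, categorical, numeric) might be combined into a single numeric feature vector that a model such as boosted trees could consume. The field names (`text`, `degree`, `years_at_last_job`, `gpa`), the vocabulary, and the degree categories are all hypothetical.

```python
from collections import Counter


def encode_resume(resume, vocabulary, degree_categories):
    """Turn one resume record (a dict) into a flat list of numbers."""
    # Text: count how often each vocabulary word appears (bag of words).
    word_counts = Counter(resume["text"].lower().split())
    text_features = [float(word_counts[word]) for word in vocabulary]

    # Categorical: one-hot encode the degree field.
    degree_features = [
        1.0 if resume["degree"] == degree else 0.0
        for degree in degree_categories
    ]

    # Numeric: pass the continuous values through unchanged.
    numeric_features = [float(resume["years_at_last_job"]), float(resume["gpa"])]

    return text_features + degree_features + numeric_features


# Hypothetical vocabulary and categories, chosen only for illustration.
VOCABULARY = ["python", "sales", "managed"]
DEGREES = ["computer science", "business", "other"]

example = {
    "text": "Managed a sales team and managed budgets",
    "degree": "business",
    "years_at_last_job": 4,
    "gpa": 3.5,
}

features = encode_resume(example, VOCABULARY, DEGREES)
print(features)
```

In a real project, each such vector would be paired with the performance outcome for that hire, and thousands of these pairs would form the training set.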
I've heard it said, and these kinds of generalizations vary widely in the real world, but I'm curious if you would agree with this, if you think it's close. I've heard it said that 80% of all development efforts on AI projects, 80% of the effort and time and money and energy is just cleaning up the data.
Yeah, I don't think that's wildly off, and I wouldn't say no – it's not even so much cleaning up the data. That's a huge part of it definitely, but what gets lumped into ‘cleaning up the data’ is often data selection or data assessment and things like feature engineering—trying to take the data and get it in a format that can be put into some type of machine learning model. And all of those different tasks typically kind of get lumped into data cleanup or the data phase.
And if you view it through that lens, I think it's very likely that it's somewhere between 50 and 80% of the work, because selecting the features and trying to understand the data and get your head around some problem domain is really, really important. If you don't understand the dataset, if you don't understand the problem domain very well, it doesn't matter how great your algorithms or models are, you're not going to get good results. So you really have to focus on that initial phase.
We went through this era of ‘big data’—that phrase, about 10 years, no—longer than that. I remember 12 or so years ago, and we had a lot of data scientists. You started hearing about data scientists and data analytics became a thing, and now we hear about AI and machine learning. How would you say those two technologies are different or is it all just kind of packaging and marketing and it's really the same thing... when it comes to the enterprise?
No, I really do think there's a difference. So the way I think about it is that if you're doing analytics, it's more... you're looking backwards. And I don't mean that in a negative sense. It’s that you're taking the data, you're trying to understand patterns or find correlations or you're trying to explain things that happened in the past, and that's a little bit different than what we're seeing more and more now, especially in machine learning, which is: using data to make predictions about the future.
And the dominant techniques that are really seeing a lot of attention these days are the supervised learning techniques, where you take a set of data, where you know what the right answer is, and you can train a model just by giving it examples of the data with the right answer. And the reason that’s happened is because it used to be the only way we could get a computer to do anything was by telling it step-by-step exactly what it needed to do.
But the funny thing is: we don't actually know very often how we do things. For example, our sight, our vision, is so deeply embedded into our abilities that we can't even introspect it; we don't understand how we do it. And so if we're working on a computer vision problem, we literally can solve those now using these new techniques by showing the algorithms examples of the input with the associated right answer, and it can then, if it's trained well, learn to generalize.
Well, I hear what you're saying, but if somebody asked you in layman's terms to explain what machine learning is, you might say, “We take a bunch of data about the past, we study that data and we look for patterns and we use those patterns to make projections into the future.” All data is historical data, right? So that sounded like your distinction on the data science thing, that that was kind of dealing with historical data, so maybe address it?
Yeah, I see what you're saying there. It's not really so much about timeframe, about looking in the past or looking in the future, it's more about... to me, this is the way I define machine learning.
But there is a stochastic element, there is a random element to it during training, and what you're trying to do is you're saying “I'm going to give you a set of inputs,” and the best, simplest example I always find to understand is like a computer vision example. “I'm going to show you a picture and you tell me ‘yes’ or ‘no’ if it's a picture of a cat.” And if you were going to design an algorithm by hand, it would be really hard. You'd have to think: what am I going to look for? Ears, triangle-shaped ears?
And once you start thinking about where the pixels could be or the angle of the cat or how zoomed in it is, it gets pretty complicated. And instead, we can use these techniques where we literally just show the photos, as they are, to algorithms, and it makes a prediction. In the beginning the prediction is wrong. It's going to be basically random. But then we can update the algorithm, we can update the variables, the parameters of the algorithm, to get it closer to getting the right answer. And if you give it enough examples of photos of cats that are representative of reality out there, the model we train will generalize, and then in the end, you end up with a model that can be shown a picture of a cat it's never seen before and it will correctly understand, or not understand, but will be able to give you the correct answer that it's a cat.
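Editor's note: the following is an illustrative sketch added to this transcript, not code from the interview. It shows the training loop Ron describes in miniature: a model starts with random parameters, makes a prediction for each labeled example, and nudges its parameters toward the right answer. For brevity it uses a single-layer logistic model on made-up four-number "photos" rather than a deep network on real images; the data, the `predict` and `train` function names, and the learning rate are all assumptions chosen for illustration.

```python
import math
import random


def predict(weights, bias, pixels):
    """Return the model's estimated probability that the input is a cat."""
    score = bias + sum(w * x for w, x in zip(weights, pixels))
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, learning_rate=0.5, seed=0):
    """Fit weights by repeatedly correcting the model on labeled examples."""
    rng = random.Random(seed)
    # Start from small random parameters, so early predictions are near chance.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(len(examples[0][0]))]
    bias = 0.0
    for _ in range(epochs):
        for pixels, label in examples:
            # Error is positive when the model is too confident it is a cat.
            error = predict(weights, bias, pixels) - label
            # Move each parameter a small step in the direction that reduces the error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, pixels)]
            bias -= learning_rate * error
    return weights, bias


# Made-up stand-ins for photos: label 1 means "cat", label 0 means "not a cat".
TRAINING = [
    ([0.9, 0.8, 0.1, 0.2], 1),
    ([0.8, 0.9, 0.2, 0.1], 1),
    ([0.7, 0.9, 0.1, 0.3], 1),
    ([0.1, 0.2, 0.9, 0.8], 0),
    ([0.2, 0.1, 0.8, 0.9], 0),
    ([0.3, 0.2, 0.9, 0.7], 0),
]

weights, bias = train(TRAINING)

# Inputs the model never saw during training.
unseen_cat = [0.85, 0.75, 0.15, 0.25]
unseen_other = [0.15, 0.25, 0.85, 0.75]
print(predict(weights, bias, unseen_cat) > 0.5)
print(predict(weights, bias, unseen_other) > 0.5)
```

The last two lines are the "generalization" step: the model is asked about inputs that resemble the training data but were never part of it.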
Tell me the kinds of mistakes you see, maybe not your own clients and your own experience, but what are the kinds of mistakes that people are making, enterprises are making, when they spin up AI projects?
Every now and then I hear about people going into projects with maybe too limited a dataset. We're getting better about this as a discipline, but in general, you still need quite a bit of data. So sticking with the cat example, if you're trying to build some type of prediction system, you need more than just a few dozen photos, you need something that is representative. And so, if you're trying to build a system to, let's say, predict inventory supplies or revenue for the future, you need enough historical data that you have access to the variance over seasonality, over product releases, etc. So really, having a limited dataset, I would say, is the biggest hindrance right now to deploying a successful machine learning initiative.
Do you see people primarily using their own data or are they using publicly available data sources or purchased data?
Oh mostly, I would say, probably 80% of the time, it's proprietary data. That's actually a really good question, because there's a misconception I think by some folks that like only Google or only Amazon, they have all the data and only they can be successful in machine learning. And that's really not the case at all because everybody has their own sales data and inventory and their proprietary dataset, that's probably more important to their business than anything else.
And so, I would say, 80% of our data comes from the client themselves. They have access to the data, it may be distributed across multiple datasets, but they have it. And about 20% of the time we will either augment that with some sort of public dataset or with a purchased dataset. For example, sometimes you need to augment geo-data. You may have addresses but you don't have more fine-grained information associated with that, and it's really pretty amazing how much good quality third-party data is available out there that you can then utilize.
And one more thing, which is: social media data is actually really interesting too. There's so much information out there that's available in social graphs that that's really interesting and useful.
You're in Austin or KUNGFU.AI is an Austin company, as is data.world. Are you familiar with those guys, and do they have like lots of datasets people should... ?
Yeah, we love the data.world guys and they have a tremendous number of datasets that are just fantastic. I would recommend anybody to go check them out.
Yeah, the nice thing about it is they have done a lot of the work of normalizing them and sanity checking them and they are kind of ready in a lot of cases.
Ron: That's right, yeah, and it's...
What are some of the kinds of misconceptions? When you go into meetings with some company that wants to do a big AI project, what are some of the myths that you have to walk them away from, that a lot of people have?
That's a good question. I would say it's kind of on a total spectrum, the far ends of the spectrum. Every now and then we will speak with somebody who will say, “it's all hype, there's nothing really there.” And I get that. Tech cycles come and go, and things can get overhyped.
But I've been doing this a long time, I've been in this space long enough to have lived through the AI winter of the ‘90s. And what we can do now compared to 20 years ago, it's just breathtaking. So I get it: every now and then there's this conception that it's just magic, and it's not real. Conversely, we occasionally get people who are too terrified to start. They think that you need to be a Nobel laureate to take that first step, and these are just algorithms, and like any software, it's fully deterministic, it's complicated, but it's not magic, and it's very much real.
If I were a company that did any normal kind of AI project, [and] I want to use artificial intelligence to streamline my supply chain, or manage the inventory I send to stores, or route my delivery trucks more efficiently, or order ingredients for my restaurant or any of the thousand different things, what kinds of improvements are you seeing?
Is it the kind of thing where it's like, look, everything in your enterprise can probably be made better by 10%, but don't think it's going to go up threefold or something? What kind of wins do you – I know that's a really open question, but just to kind of set people's expectations, what would be a nice win for a normal project?
That's a great question. So it's kind of two things. One is I would say, like you said, it's pretty open-ended. We have seen deployments where we have made incremental, let's say, 15-20% improvement on some metric, but that was considered unbelievably substantial, especially on the predictive side. For example, if you can more accurately predict sales or revenue, even single-digit improvements in estimates can sometimes save companies millions, so it's very contextually relevant and relative.
On the other side, one of the areas that I'm most excited about is just being able to do things that you couldn't do before. So it's not about relative improvement; there was no way to automate it at all. So on the example you gave earlier, it's really interesting, if you had some way to predict or automate or streamline, let's say, just smooth out the hiring process, that's tremendously useful. And so, natural language processing and computer vision, like we talked about, have seen revolutions recently, and now we can do things in those spaces that weren't even possible before to automate and make predictions in datasets that were entirely manual a few years ago.
How good are chatbots now? If I wanted to build a chatbot for my site to answer commonly asked questions or to schedule appointments or all of that, are those now kind of a super simple thing, [or] are those still big projects?
Chatbots are difficult, because I think that was an area where a small win, or some small examples, some relatively small input domains, gave people a sense that maybe too much could be accomplished. So I would say, as long as your chatbot needs are relatively constrained, and they're not open-ended, and the types of requests that you would be receiving and answering are limited in scope, reasonably so, then they can be successful.
But we are definitely not at the stage yet where you can deploy some sort of chatbot system in the helpdesk, let's say, and just deploy it and forget about it, and it's going to work perfectly. There's just a little too much variance in the type of questions that people ask to really hit that level yet.
Talk about all the platforms that are out there. What does that ecosystem look like and are they all designed for different purposes? Are there a bunch of direct competitors that largely do the same thing and how vertical are the platforms?
So we obviously, as a services company, are platform agnostic. So we get a chance to work really with all of them. So from a platform perspective, I look at it in a couple of different ways. One is the cloud they are using, the infrastructure, and Google and Microsoft and Amazon are probably the top three, and Amazon is still the leading vendor out there. You're going to see most enterprises deploy on AWS.
And then from a tooling perspective it's really, really a great environment. The open source machine learning libraries that are available, things like TensorFlow and Keras and PyTorch, these are really, really well-maintained open source libraries that scale as large as you want them to go, and they have baked into them state-of-the-art implementations of a wide array of machine learning algorithms.
And so, typically when we go and engage with the client, there's very, very little off-the-shelf upfront investment that needs to be made either in software or hardware to get a machine learning initiative off the ground. It's really a great bargain.
So give us a case study or two that you have personal first-hand experience with, like the problem area, the approach, how it worked or didn't work.
Absolutely. So one of our biggest clients is Keller Williams, and they are making a very strong move into technology and embracing AI and machine learning in a really aggressive way. We're helping them on several fronts. One project we did last year that I think is just super fascinating is a computer vision project: to go back and essentially ingest decades' worth of real estate offers that were stored in data stores, either as PDFs or images, and really were just sitting there on the file system and had never been accessed.
And so, these files have literally, you know, millions of – that's really high-value data about historic real estate trends over the last three decades. But the problem was how do you get that off there, right? And we built a pretty sophisticated computer vision solution that allows brokers now to email in property listing offers, and the pages can literally be from any state, any city, any version, and the pages in fact can actually be out of order. And there's a page classifier that will go through and correctly identify every page of the contract; and then a completely separate computer vision model that goes through and identifies each part of the contract that we might want to extract, like, let's say, the earnest money or the buyer's address, and then extracts that information, using a combination of computer vision and then OCR techniques, and puts that into the database.
And now that we have that information, we're able to go and build predictive models looking at 30 years' worth of real estate information, and previously all that information was just locked up and stored on some file system.
That's fascinating. Well, it looks like we're at the bottom of the hour now. So I want to thank you for your time, Ron. The company is KUNGFU.AI, that is the URL, and thanks for sharing some of what you've learned actually doing all of this in the field.
My pleasure. Thanks for having me.