In this episode, Byron and Peter discuss AI use in consumer and retail businesses.
Peter Cahill is the CEO at Voysis. He holds an undergraduate degree in computer science from the Dublin Institute of Technology and a PhD in the field of computer science text-to-speech from University College, Dublin.
Byron Reese: This is Voices in AI, brought to you by GigaOm. I’m Byron Reese, and today, our guest is Peter Cahill. He is the CEO over at Voysis. He holds an undergraduate degree in computer science from the Dublin Institute of Technology and a PhD in the field of computer science text-to-speech from University College, Dublin. Welcome to the show, Peter.
Peter Cahill: Thanks. Looking forward to it.
Well, I always like to start with the question, what is artificial intelligence?
It’s a tough question. I think, as time passes, it’s getting increasingly difficult to define it. I think some years ago, people would use ‘artificial intelligence’ and ‘pattern matching’ as if they meant essentially the same thing. I think in more recent years, technologies have progressed, data sets are many times bigger, and computing power is obviously a whole lot better as well, so these days it can be really hard to draw that line. Some time ago, maybe a year ago, I think, I was chairing a panel on speech synthesis. One of the questions I had for the panelists in general was, in theory, could computers ever speak in a more human way, or in a way better than humans? We’ve seen many of these.
Over time, we’ve seen that computers can do computer vision better than people. Computers can do speech recognition better than people, and it’s always in a certain context and on a certain data set. But still, we’re starting to see computers outperforming people in various cases. I asked this question to the panel, could computers speak better than people? I think one of the panelists, as far as I recall, said that he believed they could, and what would be realized would be that, if a computer could not just sound perfectly human but also could be more convincing than your average person would be, then the computer would speak better than the person. I think on the back of that, to ask what’s the artificial part of artificial intelligence, it does seem that, as time passes and these technologies continue to progress, that really having a good definition on that just becomes increasingly difficult. I’m afraid I don’t have a good definition for you for it. But I think eventually people will just start referring to it as intelligence.
You know, it’s interesting. When Turing put out the Turing test, he was trying to answer the question, “Can a machine think?” Everybody knows what the Turing test is, can you tell whether you’re talking to a person or a computer? He said something interesting. He said that “if the computer can ever get you to pick it 30% or 40% of the time, you have to say it’s thinking.” You have to ask why wasn’t that 50-50? Of course, the question he was asking is not whether or not a computer could think better than a person, but whether they can think at all. But the interesting question is what you just touched on, which is if the computer ever gets picked 51% of the time, then the conclusion is what you just alluded to. It’s better at seeming human than we are. So, do you think in the context of artificial intelligence – and I don’t want to belabor it. But do you think it’s artificial like artificial turf isn’t grass? Is it really intelligent, or is it able to fake it so well that it seems intelligent? Or, do you find anything meaningful in that distinction?
I think there is a chance, as our understanding of how the human brain works develops, in addition to what people currently call artificial intelligence – as that develops, eventually there may be some overlap. I think even myself and a lot of others don’t really like the term “artificial neural networks,” or neural networks, because they’re quite different to the human brain, even though they may be inspired by how the human brain works. But I wouldn’t be surprised if eventually we ended up at a point of understanding how the human brain works, to the extent that it no longer seems as magically intelligent as it does to us today. I think probably what we will see happening is, as machines get better and better at artificial intelligence, that it may become almost like if something seems too natural or too good, then people would assume maybe that it came from a machine and not a person. Probably a really good example is if you consider video games today, that we have this artificial intelligence in video games, which is really not intelligent at all. For example, if you take a random first-person shooter type of game, where the artificial intelligence is trying to seem very – they make lots of mistakes, they move very slowly. If you really tried to power a modern video game with really state of the art artificial intelligence, the human player wouldn’t stand a chance, just because the AI would be so accurate and so much faster and so much more strategic in what it was doing. I think we’ll see stuff like that across the spectrum of AI, where machines can be really, really good at what they’re doing and, as time passes, they’ll just continuously get better, whereas people are always starting from scratch.
So, working up the chain from the brain – which you said we may get to a point where we understand it well enough that our intelligence looks like artificial intelligence, if I’m understanding you correctly. There’s a notion above it, which is the mind, and then consciousness. But just talking about the mind for a minute, the mind, there’s all this stuff your brain can do that doesn’t seem like something an organ should be able to do. You have a sense of humor, but your liver does not have a sense of humor. Where does that come from? What do you think? Where do you think these amazing abilities of the brain – and I’m not even talking about consciousness. I’m just talking about things we can do. Where do you think they come from, and do you have even a gut instinct? Are they emergent? What are they?
Yes, obviously it would just be a guess, really. But I would think that, if we end up with AIs that are as complex or even more complex and more capable than the human brain, then we’re going to probably see various artifacts on the side of that, which may resemble these types of things you’re talking about right now. I think maybe to some extent, right now, people draw this distinction between AI and intelligence, because the human brain still has so many unknowns about it. It appears to be almost magic in that way, whereas AI is very well-understood, exactly what it’s doing and why. Even if, say, models are too big to really be able to understand exactly why they’re making certain decisions, the algorithms of them are very well understood.
Let me ask a different question. You know, a lot of people I have on the show – there’s a lot of disagreement about how soon we’re going to get a general intelligence. So, let me just ask a really straightforward question, which is some people think we’re going to get a general intelligence soon – 5/10/15 years. Some people think an AGI is as far out as 500 years. Do you have an opinion on that?
Yes, I think as soon as we can put a time on it, it’ll happen incredibly quickly. Right now, today’s technologies are not sufficient to be generally intelligent. But what we’ve seen in AI in general in recent years is that pretty much every company out there is now trying to develop its AI strategy, building out AI teams or working with a lot of other companies that work in AI. I think the number of people working in AI as a field has increased dramatically, and that will cause progress to happen far quicker than it would have otherwise happened.
So, let me ask a different variant of the question which is, do you think we’re on an evolutionary path to build… is the technology evolving where it gets a little better, a little better, a little better, and then one day it’s an AGI? Or like the guest I had on the show yesterday said, “No, what we’re doing today isn’t really anything like an AGI. That’s a whole different piece of technology. We haven’t even started working on that yet?”
Yes, I’d say that’s correct. But the leap – it’s not going to be an iteration of what we currently have. But it may just be a very small piece of technology that we don’t currently have that, when combined with everything that we do currently have, makes it possible.
Let’s talk about that. People who think that we’re going to get an AGI relatively soon often think that there is a master algorithm, that there is a generalized unsupervised learner we can build. We can just point it at the internet and it’s going to know all there is to know. Then other people say, “No, intelligence is a kludge. Our brains are only intelligent because we do a thousand different things and they’re all cognitive biased. All this messy spaghetti code is all we really are.” You have an opinion on that?
I think currently there’s no algorithms out there that even suggest it could be generally intelligent. I think as it is, even if there was one minor breakthrough in that space, it would have a very dramatic knock-on effect in the world. Then people would start believing it was only a number of years away. As it is right now, if it happened in 5 years, I honestly would not be surprised. If it happened in 15, I wouldn’t be surprised, or if it happened in 50. Right now, we’re at least one major breakthrough away from that happening. But that could happen at any point.
Could it never happen?
In theory, yes. But in practice, I would guess that it will.
There’s one argument that says that it may be, just like you’re suggesting, one straightforward breakthrough away. It says that the human genome, which is the formula for building a general intelligence – and it does a whole lot of other stuff – is, say, 700MB. But the part that is different than, say, a chimp, is just one percent of that, 7MB-ish. The logical leap is that there might just be a small little thing that’s a small amount of code, because even in that 7MB, a bunch of it’s not expressing proteins and all of that. It might just be something really simple. But do you think that that is anything more than an analogy? Is that actually a proof point?
I would expect it to be something along that line. Even today, I think you could take the vast majority of deep learning algorithms and you could represent them all in less than a MB of data. Many of these algorithms are fairly straightforward formulas, when they’re implemented in the right way. They do what we currently call deep learning or whatever. I don’t think we’re that many major leaps away from having an artificial general intelligence. Right now, we’re just missing the first step on that path, and once something does emerge, there’s going to be thousands, tens of thousands of people all around the globe who will start working on it immediately, so we’ll see a very quick rate of progress as a result, in addition to it just learning by itself, anyway.
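As a rough check on the genome figures quoted in the question, the arithmetic can be sketched as follows. The base-pair count (about 3.1 billion) and the 2-bits-per-base encoding are assumptions added here, not figures from the conversation; the one percent difference is the round number used in the question.

```python
# Back-of-the-envelope genome size arithmetic.
# Assumptions (not from the conversation): roughly 3.1 billion base pairs,
# with each base (A, C, G, T) stored in 2 bits and no compression.
BASE_PAIRS = 3_100_000_000
BITS_PER_BASE = 2


def genome_megabytes(base_pairs: int = BASE_PAIRS,
                     bits_per_base: int = BITS_PER_BASE) -> float:
    """Uncompressed size in megabytes (1 MB = 1,000,000 bytes)."""
    return base_pairs * bits_per_base / 8 / 1_000_000


def differing_megabytes(total_mb: float, fraction: float = 0.01) -> float:
    """Size of the portion that differs, given a fractional difference."""
    return total_mb * fraction


total = genome_megabytes()
print(f"whole genome: about {total:.0f} MB")
print(f"1% of that:   about {differing_megabytes(total):.2f} MB")
```

Under these assumptions the total lands in the same range as the "say, 700MB" figure, and one percent of it is a single-digit number of megabytes.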
Okay, so just a couple more questions along these lines and then we’ll get back to the here and now. There’s a group of people – and you know all the names – high profile individuals who say that such a thing is a scary prospect, an existential threat, summoning the demon, the last invention. You know all of it. Then you get the other people, Andrew Ng, where it’s worrying about overpopulation on Mars, Zuckerberg who says flat out it’s not a threat. Two questions. Where are you on the fear spectrum, and two, why do you think these people – all very intelligent people – have such wildly different opinions about whether this is a good or bad thing?
I think eventually it will get to a point where it has to become – or at least certain applications of it will have to become a threat or dangerous in some way. There’s nothing on the horizon that – that’s really, again, the path of general intelligence, which nobody has right now. I think eventually it will go that way, as many technologies do. No one really knows how to manage it or handle it. There have been calls by some people to regulate AI in some way, but realistically, AI is a technology. It’s not an industry, and it’s not a product. You can regulate an industry, but it’s very hard to regulate a technology, especially when it’s outside of your own country’s borders. Other countries don’t need to regulate it, and so there’s a very good chance, if it’s going to be developed, it’s probably going to be developed by many countries, not just one, especially within a few years of each other. Even if everybody unanimously agreed that in 100 years’ time it was going to become a threat, I’m not too sure that it could be stopped even now, because there’s so many people working on it across different countries all over the world. There’s no regulation in any single country that could stop it. Even right now, regulation isn’t required. The technologies don’t even exist to do it, to begin with.
Let’s talk about you for a minute. Can you bring us up to date? How did Voysis come about? How did you decide to enter into this field? Why did you specialize in text-to-speech? Can you just talk a little bit about your journey?
Sure. I started working in text-to-speech in 2002, so 15 years ago. I think at the time, what really attracted me to it was that it was a very difficult problem. Many people had worked on it for decades, especially back then. Computer voices sounded incredibly robotic, and then even when I looked into it in more detail, what made it even more interesting is many machine learning problems tend to be kind of classification problems, where they input a large amount of data, and then output a small amount of data. For example, if you’re doing image classification today, the size of the data you have in images is far greater than the final results you get out of the model, which may just tell you this is a picture of a car or something like this. You input huge amounts of data and output something that’s very small.
Text-to-speech is the extreme opposite of that, where the amount of input is just a few characters. From that, the system has to generate this human-sounding waveform. In the case of the human-sounding waveform, if even a small amount of that data is slightly off, the human ear will notice it very, very easily, because we’re completely used to listening to human voices, and we’re not used to listening to distorted signals generated by machines. I guess it’s the opposite of the traditional machine learning problem, where it’s kind of being creative: it’s given a very small amount of data and it needs to create a whole lot more. That’s kind of where I started off originally, working on my PhD. After it, I became faculty at the university I was in, and remained faculty for several years.
Then eventually, I resigned as faculty to start Voysis, where I think at the time, I had always said I’d like to open a company at some point. I think at that time in particular, we saw the likes of Google, Apple, Microsoft and so on – all of them went on an acquisition spree, and they acquired many of the smaller companies that had this technology, regardless of what country they were from. I think the knock-on side effect of that was that there were pretty much no independent providers anymore. Even then, what those companies were going to use these platforms for was very consumer-facing applications like we have today, with Google Home and Amazon Echo. But for other businesses out there who want to have a voice interface in their products, where their users can speak directly to their product and interact with them, pretty much the companies who could have provided that were all acquired by these big platform companies.
That’s really what motivated me to start Voysis. Since then, we’ve built out Voysis as a complete voice AI platform, which normally when we say that, what we mean is that all of the technologies to power these systems – the speech recognition, the text-to-speech, the natural language understanding, the dialogue management and so on – all of the technologies were built in-house, here in Voysis. What we do is we partner with companies and select partners that we feel are both ready for voice, and consumers within that space that will benefit greatly from having a voice interface. When we build out products, we tend to do a lot of user studies on how consumers want to interact with these devices, and build out the whole user experience to deliver really high-quality voice interactions, integrated directly in third-party business products.
Looking at your website, I noticed you have linguists, you have a wide range of specialists in your company, and then watching your demo stuff, it just seems to me that what you’re trying to do, or what the field breaks down into are four things. I think you just ran through them. One of them is emulating human speech. One of them is simply recognizing the word that I’m saying. The third one is understanding those words, and then the fourth one is managing the dialogue of what pronouns are standing for what thing and all of that. Did I miss any of it?
No, I’d say that’s it in a nutshell, although in practice, we don’t really draw a line between recognizing words and understanding. In the case of the Voysis platform, what we do is audio would go in, and after it’s passed through several models, the understanding components come out. We never transcribe it into text first, because it’s an approach that I think many companies are moving away from. If you transcribe it into text first, you tend to accumulate error from speech recognition. When you try to understand it, there’s errors in the transcription and you can never really recover from it.
Got you. But just as underlying technology, I would love to just look at each one of them in isolation. Let’s do that second one first, which is just understanding what I am saying. I call my airline of choice, and I say my frequent flier number, which unfortunately has an A, an H, and an 8 in it.
AAHH88 – you know, that’s not it, and it never gets it. I shouldn’t say that, but if everything’s really quiet, it eventually gets it. Why is it so bad?
There’s probably multiple things at play there. If you’re talking to them over a phoneline, phone signals are generally quite distorted and it makes it much more difficult for speech recognition to work well. But there’s also a very good chance that the speech recognition engine they’re using behind that was a general speech recognition engine built for any random use case, as opposed to one that was designed to work on telephone calls, maybe even with some knowledge of the use cases around where it was going to be used.
Because it only needs to recognize 36 things, right? 26 letters and ten numbers.
Sure, but that speech recognition engine may not have been built to recognize those things, which is probably why it struggles with it. Historically, most companies – not Voysis, but many others – tend to build a single speech recognition engine that they try to use in many different situations, and that’s generally where accuracy tends to really suffer. Because if you don’t build a system with any context on exactly how it’s going to be used, it’s a much more difficult task to do 100 things well than it is to do one. That’s essentially the Achilles’ heel of it.
I guess also, unlike dialogue, it doesn’t get any clues about what the next letter or number should be from anything prior to it, right?
There is that, but I think in that case, if you’re just listing letters and numbers, there’s not that many of them. That should work quite well, I think.
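The point about building for a known use case can be illustrated with a toy sketch: if the system knows the answer must be one of 36 symbols, it can discard every other hypothesis before choosing the best one. The scores and the `pick_symbol` helper below are hypothetical and do not reflect any real engine’s API.

```python
import string

# The 36 symbols a frequent flier number can contain: 26 letters and ten digits.
ALLOWED = set(string.ascii_uppercase) | set(string.digits)


def pick_symbol(scores: dict[str, float], allowed: set[str] = ALLOWED) -> str:
    """Return the highest-scoring hypothesis that belongs to the allowed set."""
    candidates = {hyp: score for hyp, score in scores.items()
                  if hyp.upper() in allowed}
    if not candidates:
        raise ValueError("no hypothesis matches the allowed vocabulary")
    return max(candidates, key=candidates.get).upper()


# Made-up scores for one spoken symbol. A general-purpose engine might rank
# ordinary words such as "ate" or "hey" above the symbol the caller meant.
general_scores = {"ate": 0.40, "hey": 0.25, "8": 0.20, "A": 0.10, "H": 0.05}
print(pick_symbol(general_scores))
```

Restricting the vocabulary does not fix a noisy phone line, but it removes a whole class of errors that a one-size-fits-all recognizer can make.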
In the sentence, “The cat ran up the…,” there’s a finite number of things the cat can run up. What I don’t get, as an aside, is I call from the same number every time. You would think they would have mastered caller ID by now. Let’s talk a little bit about understanding. Any time I come across a Turing test, like a chat bot, I always ask the same question, which is, “What’s larger: a nickel or the sun?” I haven’t found any system that can answer that question. Why is that?
Generally, the modern technologies that are used for chat bots, I think it’s still relatively immature in comparison to the technologies behind speech recognition and text-to-speech and so on. Chat bots really only work well when they’re custom-designed and custom-built for a particular use case. If you ask them general questions like that, it won’t align closely to what they were trained on or built on. As a result of that, you’ll get random answers, essentially, from it, or it’ll struggle to work. I think the chat bot-type technology is still very immature, because it requires a deeper understanding and a deeper intelligence. Whereas if you designed a chat bot, say, for e-commerce in particular, and if people ask it e-commerce-related queries, modern technologies can handle that extremely well. But once you go outside of the domain it was designed for, it will really struggle, because these technologies are not at that level yet, where they could handle switching like that.
When they do contests to try to evaluate things that might someday pass the Turing test, they’re always highly constrained. Like you’re saying, they always say you can’t ask about all of these different things. Do you believe that, to get a system that I could ask it any kind of question I want and it will answer it, does that require a general intelligence or not? Are we going to be able to kludge that up just with existing techniques on enough data?
I think the problem isn’t really about data. It’s that modern techniques aren’t good enough to handle any kind of completely random query a user might say to it. Data helps in certain ways, as do newer technologies that are emerging. That is kind of a general intelligence you’re talking about, where it can understand language, regardless of the use case or context.
You don’t think we are in the process of building that now, to hearken back to the earlier part of our conversation, and that we shouldn’t hold our breath for anything like Jarvis, anything like C3PO, anything like that anytime, maybe for decades?
I would say that I’ve seen nothing that would suggest to me that that’s going to happen any time in the next few years. Normally I do keep up on literature and academic journals and so on. I still review many of them, and there’s nothing on the horizon that I’ve seen that would suggest that. I do think modern technologies are still improving in a more iterative way, where you wouldn’t say something completely random to it. But they’re becoming less rigid. If you think of a way you may interact with a Google Home or Echo or Siri, currently it’s in a very prescribed way, where you need to know what words you can say to it in what order, to make it do what you want it to do. Technologies are getting better at being a bit more fuzzy about that, so people can talk to them in a more natural way. But still, they’re still being designed around certain use cases as opposed to being completely general and being able to handle any kind of request.
Talk to me about dialogue management, that whole thing. Where are we with that? Once you understand the words that the person has said, is that a relatively easy problem to solve, or is that also another one that’s particularly tricky?
I think dialogue is probably the most tricky problem there is right now. What makes dialogue really, really difficult is context. You can collect a very large data set of how people interact with the system, but in all cases, the context could go back several turns. Somebody could have said something ten commands ago or ten sentences ago that’s now become relevant again. I think that general context around dialogue is what makes it quite difficult, whereas for example, with speech recognition, people would generally just consider all sentences are independent. That way, it’s very easy, even if you’re collecting data, it’s very easy to collect the large data set where people are saying loads of sentences. Whereas in the case of dialogue, if you need to have full context prescribed in your data set, everything that happened before, everything that happened after, it just means the task of even collecting data is far more difficult.
Understanding the data is far more difficult. Technologies are developing on that front. I think reinforcement learning, which you’re probably familiar with, looks really promising there. It seems to be developing at a fairly quick pace in that use case. But I think the real key with dialogue and making dialogue systems work well will be people need to talk to them in a more natural way than they currently do, whereas many companies’ current approach to dialogue is about collecting data, train the system, deploy it. For dialogue to work well, I think you need to have dialogue systems that can learn on the fly. As people interact with them, the dialogue systems will learn how to be a better dialogue system, and then maybe after enough interactions, which may initially be bad interactions, but after enough of them, the system will learn and do a much better job.
I think modern technologies can do that. We tend not to see many of those systems deployed publicly, so again, if you speak to your Amazon Echo or something, it’s really built around having independent instructions that are not really connected to something you said a few sentences ago. You couldn’t have a chat with it. You can just give it a command and tell it what to do. But it doesn’t really come back and interact with you in any meaningful way.
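The difference between independent commands and dialogue with context can be sketched with a minimal state tracker. This is an illustrative sketch only: the slot names are hypothetical, it assumes an understanding component has already turned each utterance into slots, and it is not a description of how Voysis or any other product works.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Accumulates constraints across turns so later turns can build on earlier ones."""
    slots: dict[str, str] = field(default_factory=dict)
    history: list[dict[str, str]] = field(default_factory=list)

    def update(self, turn_slots: dict[str, str]) -> dict[str, str]:
        """Merge one turn's slots into the running state and return the full query."""
        self.history.append(dict(turn_slots))
        self.slots.update(turn_slots)  # new values override, older ones persist
        return dict(self.slots)


state = DialogueState()
state.update({"product": "shoes", "color": "red"})  # "Show me red shoes"
state.update({"max_price": "50"})                   # "Only ones under fifty dollars"
query = state.update({"color": "blue"})             # "Actually, make them blue"
print(query)
```

This covers only the easy case, where each turn refines the previous one. Deciding when something said ten turns ago has become relevant again, or learning from interactions on the fly, is the hard part described above and is not attempted here.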
I’ve had a couple of guests on the show from China, who have both said variants of the same thing, which is in China, because you have a much bigger character set to deal with, they’ve had to do voice recognition earlier and put a lot more energy into it. Therefore, they’re ahead in it, compared to other languages. First of all, is that your experience? Are there languages that we do better at it than others? Second, how generalizable is the technology across multiple languages? Like, once you master it in Russian, can you apply that to another language easily?
There’s a few approaches to this. I think the barrier for languages generally tends to be about acquiring good data. Acquiring loads of data is very easy, but you need to have good data. If you’re building a speech recognition system, where you’re expecting people to speak to it via cellphones, you want to record a data set of people speaking in a very similar way, as they would in a deployed application, but speaking through cellphones.
Generally doing that, that’s generally a big manual process that many companies do, where they record maybe tens of thousands of people, maybe more, saying commands through various different cellphone models and they’ll collect all the data, then train off that. That’s generally the barrier. The technologies themselves that are used in the Chinese systems – in Voysis, we do some stuff with Chinese as well. We are quite familiar with it, and the core technologies are all the same.
For speech synthesis, Chinese is a little bit different because it’s a tonal language. The larger character set as well brings in some of its own challenges, as well as in Chinese, they don’t have space characters between words. When you get a string of Chinese text, the first thing you need to figure out is: what are the words here? Where do you insert the spaces? For speech recognition, the technology stack is essentially the same.
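The word-boundary step can be sketched with greedy forward maximum matching, one simple dictionary-based way to insert the missing spaces. The tiny lexicon is a made-up assumption, and this is not presented as the method any particular system uses; practical segmenters generally rely on statistical or neural models.

```python
def segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy forward maximum matching.

    At each position, take the longest lexicon word that starts there;
    fall back to a single character when nothing in the lexicon matches.
    """
    max_len = max([1, *map(len, lexicon)])
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Toy lexicon: "I", "like", "speech", "recognition", "speech recognition".
LEXICON = {"我", "喜欢", "语音", "识别", "语音识别"}
print(" ".join(segment("我喜欢语音识别", LEXICON)))
```

Because the matcher is greedy, it prefers the longer entry 语音识别 over splitting it into 语音 and 识别, which is the kind of ambiguity a real segmenter has to resolve from context.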
We acquired language 100,000 years ago. Just talking about English, you know the whole path and how it got to where it is. What are things about English that make it uniquely difficult? Is it homophones? Is it…?
I think the biggest challenge with English is that the written form of English and how it’s pronounced aren’t really as well connected as many native speakers think they are. Whereas for many languages, if you see how a word is spelled, it’s very easy to predict how it’s going to be pronounced, in English that’s not really the case at all. There’s quite a lot of words that come from influences of different languages, be it from French or wherever else. I think as it is in English today, even calculating how to pronounce words remains still quite a big academic problem. People try to fight it with large data sets, where how every word is pronounced is still kind of specified manually, when the system’s being built. Whereas for many other languages, including Chinese, once you have the written form, you can generally quite easily calculate how that would be pronounced.
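A manually specified pronunciation lexicon of the kind described here can be sketched as a simple lookup table. The three entries use ARPAbet-style phoneme symbols with stress marks omitted; the table and the `pronounce` helper are illustrative assumptions, not part of any specific system.

```python
# Toy pronunciation lexicon: spelling -> phoneme sequence, specified by hand.
LEXICON: dict[str, list[str]] = {
    "though": ["DH", "OW"],
    "through": ["TH", "R", "UW"],
    "tough": ["T", "AH", "F"],
}


def pronounce(word: str) -> list[str]:
    """Look up a hand-specified pronunciation; fail loudly for unknown words."""
    try:
        return LEXICON[word.lower()]
    except KeyError:
        raise KeyError(f"no pronunciation specified for {word!r}") from None


# All three words end in the letters "ough", yet each ends in a different sound,
# which is why spelling alone is a poor predictor of English pronunciation.
final_sounds = {word: pronounce(word)[-1] for word in LEXICON}
print(final_sounds)
```

Any word missing from the table raises an error instead of being guessed, which mirrors why such lexicons end up being large and manually maintained.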
Then, talk to me about the fourth leg on this table, which is voice emulation. You had said that there’s kind of an uncanny valley effect, that if it’s just a little bit off, it sounds wrong.
Yes, these systems generate audio and they do it where their intention is that the audio will contain a speech signal and nothing else. But in practice, they’re generating audio. Any errors in that generation may result in random noises in the audio, glitches or other things. It may be distortions, maybe it’ll mispronounce a word, where in certain cases, changing a single sound in a word can change the meaning of a sentence very dramatically.
Also, for them to do a good job, they really need to understand the meaning of the words they’re saying, whereas if you’re just pronouncing words on their own without any understanding of the meaning, it will result in a speech signal that could sound very humanlike. But at the same time, native speakers will notice that something sounds just a bit off about it. It’s not delivered in a very natural way.
How do you solve that problem long-term? What are the best practices?
Currently, the best way to approach it is if you’ve got a good understanding of where that system is going to be used – again, not a one size fits all system, but you know maybe in a certain case, you want to be able to generate computer voices that will say things similar to what a store assistant may say. Generally, in that case, it makes a lot of sense to record a data set of things a store assistant would say, maybe even record a store assistant while they’re working so you can see what kind of prose it is they use.
Then from that, you build your AI with the knowledge of this is how a human in this situation would speak to someone, whereas traditionally, even now, for many of the computer voices we hear today, many of them are kind of close to being pre-recorded where they would have tens of thousands of audio clips recorded in advance, and they’re kind of stitching the words together. But even when they record the audio, they’re recording it with the use case in mind. If it’s a voice on a GPS system, like a sat nav, the audio it speaks to you with, that was trained off audio recordings of people reading sat nav-type instructions. But in that case, it can sound quite natural and it can sound quite good.
With those four technologies, the ability to recognize words, to understand them, to manage the dialogue, and to emulate voice, let’s say we get really good at all of them. I can think of probably three cases off the top of my head, or three ways that can be terribly misused. I’m sure you can think of more. But if we can go through each of them, I would appreciate getting your thoughts on them. The first is of course privacy. When you think about all the cellphone traffic in the world, most of us are lucky because there’s so much data that nobody can listen to all the conversations. Now, somebody can listen to all the conversations, understand them, interpret them, and so forth. I assume you agree that that is a potential misuse. What are your thoughts on it?
Yes, absolutely. I mean, I think even going back 20, 30 years, government agencies did tend to fund a lot of the university research in speech recognition. I assume use cases like that may have been what they had in mind. I think it also touches on this point of many cases where AI adds real value is that it can just scale far more than people, where you could have an AI that can transcribe all the content of all calls that are happening right now. Again, I imagine in certain parts of the world, that type of system is probably in place. I guess I don’t think there’s much that we can really do about it. It’s kind of inevitable, I think. At some point, it’s going to just become normal, if it isn’t already.
Then the second one is, I came across a site where you could type in dialogue and you could pick – in this particular case, it could be said in Hillary Clinton’s voice, or Donald Trump’s voice. You knew it wasn’t them, clearly. But it was kind of interesting, because all you have to do is say there’s a Moore’s law and it’ll be twice as good, twice as good, twice as good, twice as good. Then all of a sudden, hearing isn’t believing anymore. The whole fake news aspect of it, what do you think about that?
Yes, to really do that well, current technologies can’t do that well. There’s only two companies in the world that have that capability, as far as I know. One of them is Google, and the other is Voysis. It uses a technology called WaveNet. I’m not sure if you’re familiar with it already, but if you search for it and you come across some great examples of it, it will sound very, very convincingly human, particularly if you’re just reading a sentence. If you need to read longer amounts of text, then you hit this odd moment I mentioned earlier, where it sounds like the system doesn’t really understand what it’s saying.
But it will sound very convincingly human and far better than the samples you were referring to, of the Hillary Clinton voices and so on. That technology does exist today, and naturally there’s security concerns with that type of technology. Obviously if it fell into the wrong hands, people could make phone calls with the identity of somebody else, which could obviously have a dramatic impact on various things, be it at corporate level or government level. Again, I think this is a side effect of AI in general, that we’re going to see machines being better or as good as people at doing various tasks.
You think that’s also inevitable?
I think it’s already there.
When my Dad calls me and asks me my PIN number or whatever, I’ll be like, “I don’t know, what did you get me for my ninth birthday?” Let me ask of you, if somebody gave you a piece of audio that they recorded and said, “Can you figure out if this is a human or a computer,” could you figure it out? Or, could you imagine a tool that, no matter how good it gets, could still tell that it was not real audio, not a human?
I was having a chat with some professors about this exact question about two weeks ago. Everyone at the table unanimously agreed that that’s not possible, in our opinions. I know there’s a very big voice biometric industry right now, but I don’t really believe that computers can generate signals that will successfully bypass human systems.
I’m just going to let that sink in for a minute.
Do a Google search for WaveNet, if you’re not familiar with it. You’ll see some audio samples from both Google and Voysis, and the Voysis audio samples do sound very convincingly human. They can be used to mimic people’s voices as well.
Well, the interesting thing is, if you ask it about an image, we can do a pretty good job of… you take a photograph and can you tell if this was generated entirely by a machine or if it’s actually photographed? There’s all kinds of nuance in it, and gradients. There’s so many clues internal to it. Are you saying that there isn’t an equivalent richness to speech, you just don’t have as many dimensions of light and color and shadow and all of that, or are you saying no, even with video and images?
I mean, image is a lot easier than video. I would think, if you got one of the stronger AI teams in the world today and asked them to build a system that would produce convincing images in that sense, certainly there’s several teams out there that could do it. Video tends to be a lot more difficult, just because of the complexity of it, where video is essentially hundreds or thousands of images. I’d say the challenge or the barrier there is probably more about computing power than any lack of technology.
My third question, my third area of concern is a topic I bring up a lot on the show, which is Weizenbaum and ELIZA. Back in the 60s, Weizenbaum made this program called ELIZA that was a really simple chat bot. You would tell it you were having problems, and it would ask you very rudimentary questions. Weizenbaum saw these people get emotionally attached to it, and he pulled the plug on it. He said, “Yeah, that’s wrong. That’s just wrong.” He said, “When the computer says, ‘I understand,’ it’s just a lie because there’s no ‘I’ and there’s nothing that understands anything.” Do you think it’s a concern, that when you can understand perfectly, you can engage in complex dialogue the way you’re talking about, and it can sound exactly like a human, that Weizenbaum’s worst fears have kind of come about? We haven’t really ennobled the machines, because it’s just still a lie. Do you have any concerns about that or not?
The way I look at it is I think when the day comes where, when these systems can speak and understand and interact with people in many languages in a very human and natural way, it will improve the lives of billions of people on the globe. Some people, particularly people who don’t need the technologies, may say they’d rather not use it or may not like speaking to a piece of plastic, essentially, as if it’s a person. But right now, for many people in the world, access to information is still a huge problem, much more so if you look in many developing countries.
I think even in India, they have over 1,100 languages. Even if certain people go to a doctor, they may not speak the language that the doctor speaks. There’s many communication problems globally. These technologies will dramatically improve the lives of so many people. People who don’t want to speak to these devices, as if they’re human, don’t have to. I think there’s probably more benefits than cons, on that front.
Well, just taking a minute with that, obviously I’m not talking anything about, “Oh, we don’t want people in India to understand other…,” nothing like that. If you look to science fiction, you have three levels. You have C-3PO, and he just talked like a person. It was just Anthony Daniels talking. Then you get Star Trek, with Commander Data. It’s Brent Spiner, but he deliberately acted in a way where Data didn’t use contractions. He didn’t have emotion in his voice, but it was still human. Then you think of innumerable examples, like Twiki in Buck Rogers in the 25th Century, and it was clearly a mechanical voice. All three of them would solve your use case of understanding. The question is twofold. Do you have a feeling on which one of those, long-term, people will want? Will people want to always know they’re speaking to a machine?
Yes, I think so. In my opinion, people want communication to be frictionless or effortless. It shouldn’t feel that you need to concentrate hard on what’s the machine trying to say to me. Did it understand me or not? These types of things. If you have a machine that speaks in a very natural and almost humanlike way, I think many people would like it to have some artifact there that makes them aware that it’s actually a machine.
Where does that leave you with the technology that you’re building, that you said is trying to get that last one percent to sound like a human? What’s the use case for that? What’s the commercial demand for it?
I think right now, many of the computer voices we hear from various products that are out there are incredibly robotic. Those take quite a lot of effort to listen to, especially if you try to listen to something like an audiobook. They tend to be very monotonous and it’s almost tiring to listen to them. That’s really what this technology addresses. It’s not that it has to be deployed in a way where it sounds convincingly human. It just can be deployed that way. If people have a preference to listen to it in a way where it has something in the signal that marks it as a machine, but it still isn’t tiring to listen to, they can do that. There’s no technology barrier to doing that, even today.
I mentioned the uncanny valley earlier, which is you don’t want your drawings of people to look just one notch below perfect, otherwise they look grotesque, that you definitely want to dial them down several notches. Is there an equivalent in audio in your mind that, if it’s just a little bit too close – or do you think it ought to go as far as it can, if that’s what people want right now, or that it should get to 95% and stop, if that’s more what people want right now?
I think the way machines will speak will always be different. But it doesn’t mean they shouldn’t sound natural. For example, when you and I talk, there’s plenty of times with, say, fillers like, “mmm,” “uh,” these different noises that also makes our speech natural, whereas for machines, there’s no need for them to do that. They can control even the speaking rate and various things that would make them not be speaking naturally in the human sense.
But they could still be speaking in a way that’s very easy for anyone to follow, very easy to understand, engaging. They don’t need to always sound – like today, many of these systems, especially the older generation ones, many of them do sound incredibly robotic. They’re tiring to listen to, or it takes quite a lot of effort. People listen to them when they have no other choice, really, whereas with the new wave of technology with WaveNet, it’s enabling these systems just to sound just much nicer to listen to.
If you take something like a soliloquy from Shakespeare, something like, “Friends, Romans, countrymen, lend me your ears. I have come to bury Caesar, not to praise him. The evil that men do lives after them; the good is oft interred with their bones.” When I say that as a human, I’m emphasizing words. I’m stretching words out. I’m making other words fast, I’m inserting pauses. Is that what you’re talking about? Do you think you’ll get to a point where you could feed it that passage and it would do an equivalent reading, and not even worrying about if the tonal quality’s perfect? But, could it do all of that other stuff I just did?
Forbes published an article with some audio samples from our WaveNet system that did exactly that, although it was reading Black Beauty, just reading maybe the first 20, 30 sentences of Black Beauty. These systems can sound quite natural, but the system that did that, which did sound very natural, it was trained off audio of somebody reading audiobooks. It wasn’t a general system that could be used for different cases. I think current systems still need the training data to be quite close to the application. Otherwise, the level of naturalism diminishes very, very quickly.
How do you do that? I mean, I know we can’t understand it, especially in the context of this. But how do you do it? Is it word pairs that you’re looking for, or are words classified by their definition, whether they’re angry words? How does it work?
So, in practice, I think we’ve found out, within a certain domain – if you take audiobooks, for example, the way a single person would read a book, there’s various patterns around how they express certain things. The system itself needs to consider more than the sentence. It can’t just be reading individual sentences, which again is what many of these modern systems do. It needs to really think in terms of paragraph or in terms of the overall context with which it’s working within. Certainly predicting pauses or breath-type sounds, these systems will do that quite naturally as-is.
I think in the case of books, it’s probably more about timing than anything else. Pitch, or at least certain aspects of pitch, is probably easier in books than in other domains, I think. If you haven’t heard it, I’d highly recommend you have a listen to the sample we published, or Forbes published, of our WaveNet system. It is reading a book, so it is really the exact use case you’re talking about here. We got great feedback from people, saying how eerily human it sounded, I think was the term Forbes used.
I assume eerily in the sense that the technology is eerie, not that it sounded eerie.
There are audio clips of J. R. R. Tolkien reading from his writings. I think there’s Hemingway reading some part of what he wrote. I think it would be great to hear Hemingway read The Old Man and the Sea. How much Hemingway reading something he wrote would you need to make a convincing Hemingway reading The Old Man and the Sea? Is it a minute? Is it an hour?
Oh, it’s a lot more. To do it really well with current technologies, you need a lot more data. The one we published used 10 hours of one person reading, which I think was maybe a bit over two audiobooks.
So, if somebody had an unabridged recording of The Fellowship of the Ring, then The Two Towers, and they died, you could make a passable Return of the King?
Oh, absolutely yes. I think that data requirement will just go down over time, but currently 10 hours is the entry level, I think.
Legally speaking, who owns that? Right now, what would be the state of the art, either in Ireland where you are, or anywhere you know of?
It’s a good question. I think voice talents constantly encounter this, where in many ways, even if you pay a voice talent to record an audiobook, for example, the audio recordings do contain that person’s identity, to some extent. I think it’s very hard to classify who actually owns audio in that sense, when the audio is the other person speaking and it does contain their identity, just like if someone takes a photo of you or I. We can probably have some kind of entitlement of claiming ownership over it, if it is a photo of us, regardless of whatever payments were made. There’s some legal grey area, but that’s not got to do with AI technologies. That’s even for if you’re recording a radio commercial. It’s a legal grey area, too, as to how much of the audio recordings can you own, when it’s clearly someone’s identity? You can’t really own someone’s identity.
Right. I guess the question at law, which somebody will have to decide at some point, is if you pay somebody for a recording and then you own that recording, presumably you own all of the derivative things you – I mean, like you said, it’s a grey area. We don’t know, and I’m sure regulators and case law will eventually sort it out.
I wouldn’t be surprised if we ended up in a world where maybe celebrities could do endorsements of audio clips for radio or for various other things, where the audio is completely generated by a machine, where the celebrity didn’t need to go to a recording studio for a day to record that audio. I think that day is not that long away.
You know, it’s a world with lots of questions. I was just reading about this company that takes old syndicated TV shows and figures out ways to insert modern product placement in them. Then they can go sell that. Isn’t that something? All of a sudden – this isn’t a real example, but you could have Lucy drinking a Red Bull in I Love Lucy, or something like that, right? It’s all ones and zeroes, at some level. I gave three areas this technology could be misused that just came to me. There’s the fake news, there’s invasion of privacy, and there’s this dehumanizing Weizenbaum ELIZA aspect. What did I miss?
I think the general concern I’ve heard in academic circles always tends to be about privacy. You kind of covered that one. Nothing springs to mind.
Talk to us a minute. You have a platform that people can use. What I’ve noticed you emphasizing over and over is the platform needs to be trained to a purpose. If you’re a tennis shoe company, it needs to be taught with tennis shoe content, about tennis shoe-related issues. They’re all highly verticalized, or they have to be customized. I assume that’s the case. If so, what does that process look like and then, where are you on your product trajectory? What are you going to do next? How are you going to wow us in a year, when you come back on the show?
Yes, on the website we’re talking about this new product that we launched two weeks ago now, in New York, called Voysis Commerce. The way it really works is we’ve built out the whole commerce use case, through user studies and building up an understanding of what the consumers actually want to say to a retailer or a brand while they’re looking at their website or mobile app. We build out that use case in a way where today, any retailer or brand can just take their product catalog, which is the names of their products, whatever descriptions they have from the product pages on the products and they upload it to us.
Then, fully automatically, in a matter of hours, a voice AI is created, which knows what products they sell. It’s learned from the natural language descriptions on the product pages, about how their products are described. When a user comes along and says, “I want red tennis shoes with certain features on them,” the user can just say that using completely natural language and get relevant search results.
Then I think where it gets really interesting is when the user does get relevant searches on the screen in front of them, they can do a refinement query. They can just do maybe a follow-up query where they’re adding more details about what they’re looking for. Maybe they do their initial search for tennis shoes or whatever they’re looking for. When they see the search results on the screen, then they can say, “Actually, I only want to spend about $50. What have you got around that price?”
Again, the search results will be updated and they can continuously just provide more and more details, maybe change their mind on certain details. They could say, “What if I was to increase my budget by $50? What would the products be then?” They can just interact with it in this far more powerful way than what people are used to, with keyword-based search.
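To make the refinement idea concrete, here is a minimal Python sketch of a search that carries its constraints from one query to the next. It is an illustration only, not a description of how Voysis Commerce is built: the `Product`, `SearchState`, `refine`, and `search` names are hypothetical, the catalog is made up, and the step that turns spoken language into keywords and a budget is assumed to have already happened.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Product:
    name: str
    description: str
    price: float


@dataclass(frozen=True)
class SearchState:
    """Constraints accumulated over the conversation so far."""
    keywords: frozenset = frozenset()
    budget: Optional[float] = None
    tolerance: float = 0.2  # assumption: "around $50" allows 20% over


def refine(state: SearchState, keywords: Iterable[str] = (),
           budget: Optional[float] = None, budget_delta: float = 0.0) -> SearchState:
    """Return a new state with a follow-up query's details merged in."""
    new_budget = budget if budget is not None else state.budget
    if new_budget is not None:
        new_budget += budget_delta
    merged = state.keywords | frozenset(k.lower() for k in keywords)
    return replace(state, keywords=merged, budget=new_budget)


def search(catalog: List[Product], state: SearchState) -> List[Product]:
    """Naive substring match on name and description, then a price cap."""
    results = []
    for p in catalog:
        text = f"{p.name} {p.description}".lower()
        if not all(k in text for k in state.keywords):
            continue
        if state.budget is not None and p.price > state.budget * (1 + state.tolerance):
            continue
        results.append(p)
    return sorted(results, key=lambda p: p.price)


catalog = [
    Product("Court Runner", "red tennis shoes with cushioned sole", 45.0),
    Product("Ace Pro", "red tennis shoes, leather", 95.0),
    Product("Trail Boot", "brown hiking boot", 80.0),
]

s1 = refine(SearchState(), keywords=["red", "tennis", "shoes"])  # "I want red tennis shoes"
s2 = refine(s1, budget=50.0)         # "I only want to spend about $50"
s3 = refine(s2, budget_delta=50.0)   # "What if I increased my budget by $50?"

for state in (s1, s2, s3):
    print([p.name for p in search(catalog, state)])
```

Each follow-up only states what changed, and the earlier constraints stay in force, which is the difference from typing a fresh keyword query each time.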
I think one of the side effects then, for the retailers, is that they get a much better understanding of what their customers are actually looking for, what their customers want. Currently, many retailers and e-commerce brands are doing a lot of data analytics. But really, what they’re analyzing is what keywords people have typed into a box or what buttons they have clicked on, whereas natural language obviously is not constrained. They can get a lot of value out of understanding their customers better and, in turn, provide a much better experience to the customers as well.
Fantastic. I’m going to assume I really am speaking to the real Peter Cahill, that it’s not somebody else at the company using the mimic thing, and this will be in the next Forbes article.
That’s a good idea. We should do that at some point.
Somebody can do all these for you. If people want to keep up with you personally and what your company is doing, can you just run down that?
Yes. Both me and the company are quite active on Twitter, so it’s @Voysis on Twitter, or @PeterCahill, on Twitter. Obviously if anyone ever wants to drop me a mail, please do. You can reach me at email@example.com.
Voysis is V-O-Y-S-I-S?
All right Peter, I want to thank you so much for taking the time to chat with us about this very fascinating topic.
Yes, thank you. I enjoyed it.
Byron explores issues around artificial intelligence and conscious computers in his new book The Fourth Age: Smart Robots, Conscious Computers, and the Future of Humanity.