Episode 83: A Conversation with Margaret Mitchell

In this episode Byron and Margaret Mitchell of Google discuss the nature of language and its impact on machine learning and intelligence.



Margaret Mitchell is a Senior Research Scientist in Google's Research & Machine Intelligence group, working on artificial intelligence.

Her research generally involves vision-language and grounded language generation, focusing on how to evolve artificial intelligence towards positive goals. This includes research on helping computers to communicate based on what they can process, as well as projects to create assistive and clinical technology from the state of the art in AI.

Her work combines computer vision, natural language processing, social media, many statistical methods, and insights from cognitive science.

In a nutshell, she's worked on:

  • deep learning, structured learning, shallow learning, and probabilistic systems (Math)
  • natural language generation, referring expression generation, reference to visible objects, conversation, image captioning, visual question answering, and storytelling (Grounded Language)
  • dialogue assistance for people who are non-verbal (Cerebral Palsy and Autism), visual descriptions for people who are blind, automatic diagnosis/monitoring of Mild Cognitive Impairment (a precursor to Alzheimer's), Parkinson's, Apraxia, Autism, Depression, Post-Traumatic Stress Disorder, Suicide Risk, and Schizophrenia (Assistive and Clinical Technology)


Byron Reese: This is Voices in AI brought to you by GigaOm and I'm Byron Reese. Today my guest is Margaret Mitchell. She is a senior research scientist at Google doing amazing work. And she studied linguistics at Reed College and Computational Linguistics at the University of Washington. Welcome to the show!

Margaret Mitchell: Thank you. Thank you for having me.

I'm always intrigued by how people make their way to the AI world, because a lot of times what they study in University [is so varied]. I've seen neuroscientists, I see physicists, I see all kinds of backgrounds. [It’s] like all roads lead to Rome. What was the path that got you from linguistics to computational linguistics and to artificial intelligence?

So I followed a path similar to I think some other people who've had sort of linguistics training and then go into natural language processing which is sort of [the] applied field of AI, focusing specifically on processing and understanding text as well as generating. And so I had been kind of fascinated by noun phrases when I was an undergrad. So that’s things that refer to people, places, objects in the world and things like that.

I wanted to figure out: is there a way that I could like analyze things in the world and then generate a noun phrase? So I was kind of playing around with just this idea of ‘How could I generate noun phrases that are humanlike?’ And that was before I knew about natural language processing, that was before this new wave of AI interest. I was just kind of playing around with trying to do something that was humanlike, from my understanding of how language worked. Then I found myself having to code and stuff to get that to work—like mock up some basic examples of how that could work if you had a different knowledge about the kind of things that you're trying to talk about.

And once I started doing that, I realized that I was doing essentially what's called natural language generation. So generating phrases and things like that based on some input data or input knowledge base, something like that. And so once I started getting into the natural language generation world, it was a slippery slope to get into machine learning and then what we're now calling artificial intelligence because those kinds of things end up being the methods that you use in order to process language.

So my question is: I always hear these things that say "computers have a x-ty-9% point whatever accuracy in transcription" and I fly a lot. My frequent flyer number of choice has an A, an H and an 8 in it.

Oh no.

And I would say it never gets it right.


And it's only got 36 choices.


Why is it so awful?

Right. So that's speech processing. And that has to do with a bunch of different things including just how well the speech stream is being analyzed and the sort of frequencies that are picked up are going to be different depending on what kind of device you're using. And a lot of times the higher frequencies are cut off. And so words that when [spoken] face to face or sounds that we hear face to face really easily are sort of muddled more when we're using different kinds of devices. And so that ends up especially on things like telephones cutting off a lot of these higher frequencies that really help those distinctions. And then there's like just general training issues, so depending on who you've trained on and what the data represents, you're going to have different kinds of strengths and weaknesses.

Well I also find that in a way, our ability to process linguistics is ahead of our ability in many cases to do something with it. I can't say the names out loud because I have two of these popular devices on my desk and they'll answer me if I mention them, but they always understand what I'm saying. But the degree to which they get it right, like if I say "what's bigger, a nickel or the sun?" They never get it. And yet they usually understand the sentence.

So I don't really know where I'm going with that other than, do you feel like you could say your area of practice is one of the more mature, like hey, we're doing our bit, the rest of you common sense people over there and you models of the world over there and you transfer learning people, y'all are falling behind, but the computational linguistics people—we have it all together?

I don't think that's true. And the things you're mentioning aren't actually mutually exclusive either, so in natural language processing you often use common sense databases or you're actually helping to do information extraction in order to fill out those databases. And you can also use transfer learning as a general technique that is pretty powerful in deep learning models right now.

Deep learning models are used in natural language processing as well as image processing as well as a ton of other stuff.

So… everything you're mentioning is relevant to this task of saying something and having your device on your desktop understand what you're talking about. And that whole process isn't just simply recognizing the words, but it's taking those words and then mapping them to some sort of user intent and then being able to act on that intent. That whole pipeline, that whole process involves a ton of different models and requires being able to make queries about the world and extract information based on… usually it's going to be the content words of the phrase: so nouns, verbs, things that are conveying the main sort of ideas in your utterance and using those in order to find relevant information to that.
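[Editor's note: the step described above, going from recognized words to content words to a user intent, can be loosely sketched in code. This is a toy illustration only, not how any real assistant works. The stop-word list, the intent names, and the keyword table are all hypothetical and were invented for this example.]

```python
import re

# Hypothetical stop-word list and intent table, invented for illustration only.
STOP_WORDS = {"what", "whats", "is", "the", "a", "an", "or", "in", "to", "me", "of", "please"}
INTENT_KEYWORDS = {
    "compare_size": {"bigger", "larger", "smaller"},
    "get_weather": {"weather", "forecast", "rain"},
}

def content_words(utterance):
    """Lowercase, tokenize, and drop function words, keeping rough 'content words'."""
    tokens = re.findall(r"[a-z]+", utterance.lower().replace("'", ""))
    return [t for t in tokens if t not in STOP_WORDS]

def guess_intent(utterance):
    """Return the first intent whose keyword set overlaps the content words, else 'unknown'."""
    words = set(content_words(utterance))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"

if __name__ == "__main__":
    question = "What's bigger, a nickel or the sun?"
    print(content_words(question))
    print(guess_intent(question))
```

[Recognizing that the question is a size comparison is the easy part. Answering it still requires knowledge about nickels and the sun, which is the "queries about the world" step mentioned above and is not attempted here.]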

So the Turing test… if I can't tell if I'm talking to a person or a machine, you got to say the machine is doing a pretty good job. It's thinking according to Turing. Do you think passing the Turing test would actually be a watershed event? Or do you think that's more like marketing and hype, and it's not the kind of thing you even care about one way or the other?

Right. So the Turing Test as was originally construed has this basic notion that the person who is judging can't tell whether or not it's human-generated or machine-generated. And there's lots of ways to do that. That's not exactly what we mean by human level performance. So, for example, you could trivially pass the Turing test if you were pretending to be a machine that doesn't understand English well, right? So you could say, “Oh, this is a person behind this, they're just learning English for the first time. They might get some things mixed up.”

As long as you can use even template-based approaches, keyword-based approaches, you could generate things that do seem like they are a human. And so [they] would in that way sort of pass the Turing test. And I think that that's not maybe the spirit of the Turing test. But I think that there's a lot of additional factors that should be taken into account when actually trying to think about what intelligence is. So for example, being able to make inferences about related kinds of events or related kind of activities, things that require more complex reasoning and that has more to do with the ability to analyze, the ability to create new content or to generate new content, given a bunch of inputs, and less to do with convincing a listener or a reader that this is a human or not.

I think that it's maybe a bit of the wrong direction to try and pass the Turing test in order to say you have intelligence, because it could be the case that the ideal intelligence or the intelligence that we want to have with our machines doesn't express itself in exactly humanlike ways. And so you'll find that when we have generated utterances for example with our various devices, they don't stutter, they don't use “ums” and “ahs” and “like” and things like that. And that's sort of fine and I don't think that means that these things aren't intelligent, it just means that they're working with a different kind of intelligence.

So the Turing test is really a fascinating way to think about uncovering what intelligence is. But I don't see it as the end goal as originally construed. I think we could probably pass it and there may have recently been some sort of competitions that trivially pass it using template-based approaches. And that's not exactly getting at what we're talking about when we're talking about artificial intelligence.

Yeah, I personally think that they have to rig… that’s not the right word because everybody knows it's being done, but they confine the kinds of questions you can ask so narrowly that you don't really have anything that is making a compelling… Like there's not a single system I've ever seen that I can't decipher in one question. I mean there's a hundred questions that none of them can even come close to answering.

So let me give you a Turing test kind of question. Doctor Smith is eating at her favorite restaurant when suddenly her phone rings. After speaking for a moment, she looks worried, runs out of the restaurant, forgetting to pay her bill. Is management likely to call the police? So with that question you've got to know a lot about culture, you've got to know: oh it's her favorite restaurant so they probably know her, oh she's a doctor, looking worried, probably an emergency call, [so] no they're not going to call the police. They're just going to... right? So, how far away, how many years, decades or centuries is it before a computer would be like ‘oh no, they're not going to call the cops?’

Oh yeah. So reasoning about what a human would do given a sequence of events...

With a lot of cultural context, right?

Yeah, each aspect of that sequence of events points to a bunch of different sort of cultural knowledge or different data points that entail a bunch of...


I would imagine that something like that would be possible or is possible in the relatively near term. It all comes down to what you've been able to define as the common knowledge or the cultural knowledge, so: what it means to be a regular at a restaurant, that has to be stored somewhere, that has to be understood somewhere. That could be meaning like "will come back again to restaurant," as long as you're able to extract that kind of information from general text in the world. So it's something that you might pick up by scraping the web, things like that, then that kind of reasoning chain should be possible.

The thing that sort of tricks up the systems when you do these kinds of long sequences of reasoning events is connecting entities to one another. So knowing that the woman mentioned at the start is still the same woman mentioned at the end—in order to do that you have to do what's called a co-reference chain, meaning that you follow the original proper noun that mentions the person and then the pronouns that refer back to the person and sort of ascertain that there's been no introduction of any new proper nouns that the pronouns could be referring to.

So at each point in that sequence, that story, it's reasonable to assume that technology could be able to figure out some of the basics. Where the trickiness is… is just making sure that all of the people and all of the events are properly ascribed to their roles and who they are in the real world. That's a bit of the trickier part.
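[Editor's note: the co-reference chain idea described above can be illustrated with a deliberately naive sketch. It links every pronoun to the most recently mentioned name, which is an assumption made for brevity and not a description of any production system. The pronoun list and the function name are hypothetical.]

```python
import re

# Toy pronoun list; a real system would also check gender, number, and syntax.
PRONOUNS = {"she", "her", "hers", "he", "him", "his"}

def naive_coreference_chains(text, names):
    """Link each pronoun to the most recently mentioned name from `names`.

    Returns a dict mapping each name to the list of token positions in its chain.
    """
    chains = {name: [] for name in names}
    current = None
    for position, token in enumerate(re.findall(r"[A-Za-z]+", text)):
        if token in names:
            # A proper noun starts (or continues) the chain for that entity.
            current = token
            chains[current].append(position)
        elif token.lower() in PRONOUNS and current is not None:
            # A pronoun is attached to whichever name was seen last.
            chains[current].append(position)
    return chains

if __name__ == "__main__":
    story = "Smith is eating at her favorite restaurant when her phone rings. She looks worried."
    print(naive_coreference_chains(story, {"Smith"}))
```

[With a single named person the chain is easy to follow. As soon as a second name appears, this "most recent name" heuristic can attach a pronoun to the wrong person, which is the trickiness described above.]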

Yeah because sometimes if I said “The giant chased the man, each of his footfalls shook the earth,” you know ‘his’ isn't actually the most recent noun, right?

Right, exactly, yeah and that requires a lot of sort of previous understanding about how the world works.

So it's funny because I find practitioners as a general rule… well I’ve had them on the show. They have to deal with all that stuff and so they're always the ones who say "This stuff is far away. I can’t even get it to tell the difference between A, H, and 8" and you're telling me about social norms and all that, and they're like, “Who knows, how can we ever do that?” So but you are a practitioner and you're like "Oh yeah we'll do that, we'll figure that out."

Yeah so that's possible. For one thing the restaurant domain is relatively well studied. So there's a lot of dialogue systems [that] focus in on restaurants and movies. So if there's any sort of domain that a system might [do] particularly well on, [it's] restaurants: knowing things about restaurants and what it means to be a regular at a restaurant, as well as relevant things for movies, are probably two domains that are more within grasp than some other domains.

But that speaks to I think this larger issue which is that we can do things relatively well given a specific domain but ‘open world’ problems, where we have no idea what the context will be or what the event, what the possible events will be, what the possible actors and players in the event will be that ends up being a lot more difficult. So maybe in a restaurant domain there would be enough knowledge collected over time to have sort of the relevant background to do reasoning. But if you were to talk about something that was less studied, like I don't know, something about exploring space or like how to do large inference over a series of actions—circling Mars or something like that, where there isn't a ton of customers asking for this kind of information, that's probably going to be a lot trickier.

And that's also because we can do well in domains when they're closed because we can also leverage people to write out lots of information. When we're trying to do anything in open domain we have to be able to extract it all from the web, train on data that we're able to get, and without a human ‘in the loop’ as it's said, to filter that, to clean that up, it's often difficult to know exactly what the right entities are, what the right relationships are.

So I'm going to ask you an unfair question, because it's not answerable really, but how do we do that then?

How do we—given a sequence or a story—come to conclusions about...?

Yeah, because if I said to you "Hey, imagine a fish, a one pound trout swimming in a river. Imagine a one pound trout in a jar of formaldehyde in a laboratory." I'm just guessing these aren't two things that you had everyday experience with. And I said, "Do they smell the same?" You would say "no." "Are they the same temperature?" "Definitely not." "Are they the same color?" "Probably.” I could ask you all of these things and it's effortless for you, so why… what are we doing that we haven't got machines to do?

So we're exposed to information from well before we're born, but very much after we're born, we're just sort of given this constant influx of multimodal information. So vision, smell, sound, and even people born with disabilities will use the other modalities and have to be very sensitive to all these kinds of things and that's constant. You basically never turn it off. So from a machine side that would be like getting tons and tons of data constantly for days, weeks, years, and being able to generalize from there.

The thing that humans are doing, that computers find a lot more difficult is taking information about previous situations and then generalizing them in a new kind of situation with very little information. Generally machines need a lot of information in order to make the connections more directly. But I believe our ability to do that is based on our constant interaction in the world. We pull on all that constant interaction when we make decisions about how to answer these kinds of questions. It seems effortless, but that's because our brains are amazing and have recorded and memorized and found patterns in tons of things over the course of our whole lifespan.

Given that same kind of constant information, a computer may be able to do that, but the problem is, computers aren't specified for the human modalities that we trivially take in. So the way that Computer Vision Systems see the world is fundamentally different than how we as humans see the world using our eyes as opposed to using cameras. And so really the difficulty there is taking the way that the computer can understand the world from these different modalities, from the vision modality or from the text modality and then making that be compartmentalized in a way that is somewhat humanlike, where you group things together in particular ways. You come up with generalizations that follow particular patterns that probably correspond to how the human brains have evolved. And so the ability to do that is something that modern computer systems pretty much break down on.

We have consciousness, so we experience the world as opposed to measuring it. Do you think consciousness… we all know what it is, nobody agrees on what brings it about, but we know what it is, it's the experience of being you. Do you think that model of the world that you're talking about, which is all that sensory input, is not only integrated but made sense of? Would that potentially be a prerequisite for a computer to be able to duplicate the versatility that humans show in that? Must we build conscious computers to do that?

OK so in order to get a versatility that is humanlike, I think it is important to be able to process the information in a humanlike way. Although whether or not that means consciousness, like the ability to self reflect is a completely different question. It could very much be the case that we have a system that is able to process vision in a way similar to how humans do, process sound, things like that and then come up with answers that are humanlike. But that doesn't mean that it can self reflect.

I think that's what we mean when we're talking about human consciousness—the ability to identify oneself in the world and have a sense of self in the world. And what gives rise to that in humans is not understood or… I'm sure you have this covered on your program before. There's lots of great work on this… philosophy. But unless we can figure out exactly the switch that goes from our sensory inputs to consciousness, it's gonna be tricky to say that a computer has consciousness even if it's behaving exactly like a human.

So I could talk about this forever. I'm fascinated by the topic and I'm chewing up my time with you, but I'd really like to close with you talking about some of your work, as a senior research scientist at Google, the ‘model cards’ for instance.

Oh awesome. Yeah. Thank you. So one of the things that is being noticed more and more in the news is that systems that are sort of at least publicly don't work equally well for everyone. So there's some really nice work for example by Joy Buolamwini and my colleague Timnit Gebru showing that gender detection doesn't work as well for black females as it does for white males. And part of why this is okay, or why this is sort of normal is because there's no standard of anyone ever reporting how well their systems work, much less reporting how well they work across a variety of sort of different subgroups which is called disaggregated evaluation.

So the idea with model cards is just to have it be that when an ML model or an AI model is made available, people have access to how well it works. They can know how well it works across different kinds of subgroups. They can know different information that went into building the systems or what the considerations were, information about the training data, limitations, how the system was benchmarked and things like that, basically opening up some transparency into what these systems are actually doing, so that people don't just use products thinking 'Oh it works or doesn't work,' and having a better sense of what the nuances are when it works and when it doesn't.
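[Editor's note: disaggregated evaluation, as described above, means reporting a metric separately for each subgroup instead of as one overall number. The sketch below is a minimal illustration with made-up records. The function name, subgroup labels, and data are hypothetical and are not taken from the model cards work.]

```python
from collections import defaultdict

def disaggregated_accuracy(records):
    """Compute accuracy per subgroup.

    `records` is an iterable of (subgroup, true_label, predicted_label) tuples.
    Returns a dict mapping subgroup -> fraction of correct predictions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, truth, prediction in records:
        total[subgroup] += 1
        if truth == prediction:
            correct[subgroup] += 1
    return {group: correct[group] / total[group] for group in total}

if __name__ == "__main__":
    # Made-up example data, for illustration only.
    records = [
        ("group_a", "x", "x"),
        ("group_a", "y", "y"),
        ("group_b", "x", "y"),
        ("group_b", "y", "y"),
    ]
    overall = sum(t == p for _, t, p in records) / len(records)
    print("overall:", overall)
    print("by subgroup:", disaggregated_accuracy(records))
```

[In this toy data the overall accuracy hides the fact that one subgroup is served noticeably worse than the other, which is the kind of gap a per-subgroup report is meant to surface.]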

And where are you with that and how can people find out more? Is that a standard you all are trying to release?

So Google and Microsoft have similar ideas in this space. Microsoft has Datasheets, which is for data sets and Google's put out this idea of model cards for model reporting. We're both in discussions with the Partnership on AI, on figuring out how we might put these forward as a sort of first pass at a standard, hoping to put out a little bit of an idea that regulators and other stakeholders might be interested in, could grab onto and further iterate on. And we have papers out, so you could look at “Model Cards for Model Reporting” or “Datasheets for Datasets.” So there's been some Google blogs and stuff about this thing, but we're hoping to have some more significant launches in the near future.

Well we will keep an eye on that. How can people follow you and what you're doing and keep up with that?

Yeah, you can follow me on Twitter. I'm @MMitchell_AI. I try and tweet interesting things so yeah feel free.

All right. Well it was a fascinating half hour, I want to thank you for making the time and we'd love to have you back.

Yeah. Thank you so much.