The following is a conversation with Rohit Prasad. He's the vice president and head scientist of Amazon Alexa and one of its original creators. The Alexa team embodies some of the most challenging, incredible, impactful, and inspiring work that is done in AI today. The team has to both solve problems at the cutting edge of natural language processing and provide a trustworthy, secure, and enjoyable experience to millions of people. This is where state-of-the-art methods in computer science meet the challenges of real-world engineering. In many ways, Alexa and the other voice assistants are the voices of artificial intelligence to millions of people and an introduction to AI for people who have only encountered it in science fiction. This is an important and exciting opportunity, so the work that Rohit and the Alexa team are doing is an inspiration to me and to many researchers and engineers in the AI community.

This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, give it five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter @lexfridman, spelled F-R-I-D-M-A-N. If you leave a review on Apple Podcasts especially, but also Castbox, or comment on YouTube, consider mentioning topics, people, ideas, questions, or quotes in science, tech, or philosophy that you find interesting, and I'll read them on this podcast. I won't call out names, but I love comments with kindness and thoughtfulness in them, so I thought I'd share them. Someone on YouTube highlighted a quote from the conversation with Ray Dalio where he said that you have to appreciate all the different ways that people can be A players. This connected with me. On teams of engineers, it's easy to think that raw productivity is the measure of excellence, but there are others. I've worked with people who brought a smile to my face every time I got to work in the morning. Their contribution to the team is immeasurable.

I recently started doing podcast ads at the end
of the introduction. I'll do one or two minutes after introducing the episode and never any ads in the middle that break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.

This show is presented by Cash App, the number one finance app in the App Store. I personally use Cash App to send money to friends, but you can also use it to buy, sell, and deposit Bitcoin in just seconds. Cash App also has a new investing feature. You can buy fractions of a stock, say $1 worth, no matter what the stock price is. Brokerage services are provided by Cash App Investing, a subsidiary of Square and member SIPC. I'm excited to be working with Cash App to support one of my favorite organizations called FIRST, best known for their FIRST Robotics and Lego competitions. They educate and inspire hundreds of thousands of students in over 110 countries and have a perfect rating at Charity Navigator, which means the donated money is used to maximum effectiveness. When you get Cash App from the App Store or Google Play and use code LexPodcast, you'll get $10 and Cash App will also donate $10 to FIRST, which again is an organization that I've personally seen inspire girls and boys to dream of engineering a better world.

This podcast is also supported by ZipRecruiter. Hiring great people is hard, and to me is one of the most important elements of a successful mission-driven team. I've been fortunate to be a part of and lead several great engineering teams. The hiring I've done in the past was mostly through tools we built ourselves, but reinventing the wheel was painful. ZipRecruiter is a tool that's already available for you. It seeks to make hiring simple, fast, and smart. For example, Kodable co-founder Gretchen Huebner used ZipRecruiter to find a new game artist to join her education tech company. By using ZipRecruiter's screening questions to filter candidates, Gretchen found it easier to focus on the best candidates and finally hired the perfect person for the role in less than
two weeks from start to finish. ZipRecruiter, the smartest way to hire. See why ZipRecruiter is effective for businesses of all sizes by signing up, as I did, for free at ziprecruiter.com/lexpod. That's ziprecruiter.com/lexpod.

And now, here's my conversation with Rohit Prasad.

In the movie Her, I'm not sure if you've ever seen it, a human falls in love with the voice of an AI system. Let's start at the highest philosophical level before we get to deep learning and some of the fun things. Do you think what the movie Her shows is within our reach?

I think, not specifically about Her, but what we are seeing is a massive increase in adoption of AI assistants, or AI, in all parts of our social fabric. And what I do believe is that the utility these AIs provide, some of the functionalities that are shown, are absolutely within reach.

So some of the functionality in terms of the interactive elements, but in terms of the deep connection that's purely voice based, do you think such a close connection is possible with voice alone?

It's been a while since I saw Her, but I would say, in terms of interactions which are both human-like and in these AI assistants, you have to value what is also superhuman. We as humans can be in only one place. AI assistants can be in multiple places at the same time: one with you on your mobile device, one at your home, one at work. So you have to respect these superhuman capabilities too. Plus, as humans we have certain attributes we are very good at. We are very good at reasoning; AI assistants are not yet there. But in the realm of AI assistants, what they're great at is computation and memory; it's infinite and pure. These are the attributes you have to start respecting. So I think the comparison of human-like versus the other aspect, which is also superhuman, has to be taken into consideration. I think we need to elevate the discussion to not just human-like.

So there's certainly elements. You just mentioned Alexa is everywhere, computationally speaking, so
this is a much bigger infrastructure than just the thing that sits there in the room with you. But it certainly feels to us mere humans that there's just another little creature there when you're interacting with it. You're not interacting with the entirety of the infrastructure; you're interacting with the device. The feeling is, okay, sure, we anthropomorphize things, but that feeling is still there. So what do you think we as humans, in the purity of the interaction with a smart assistant, look for in that interaction?

I think certain interactions will be very much where it does feel like a human, because it has a persona of its own, and in certain ones it wouldn't be. A simple example to think of is, if you're walking through the house and you just want to turn your lights on and off, and you're issuing a command, that's not very much a human-like interaction, and that's where the AI shouldn't come back and have a conversation with you. It should simply complete that command. So I think about the blend: we have to think about this as not human-human alone. It is a human-machine interaction, and certain aspects of humans are needed, and certain situations demand it to be like a machine.

So, I told you it's going to be philosophical in parts. What's the difference between human and machine in that interaction, when we interact with humans, especially those who are friends and loved ones, versus you and a machine that you also are close with?

I think you have to think about the roles the AI plays. And it differs from customer to customer, from situation to situation. Especially, I can speak from Alexa's perspective: it is a companion, a friend at times, an assistant, and an advisor down the line. So I think most AIs will have this kind of attributes, and it will be very situational in nature. So where is the boundary? I think the boundary depends on the exact context in which you are interacting with the AI.

So
the depth and the richness of natural language conversation has been used, by Alan Turing, to try to define what it means to be intelligent. There's a lot of criticism of that kind of test, but what do you think is a good test of intelligence, in your view, in the context of the Turing test and Alexa or the Alexa Prize, this whole realm? Do you think about this human intelligence, what it means to define it, what it means to reach that level?

I do think the ability to converse is a sign of an ultimate intelligence. I think there is no question about it. If you think about all aspects of humans, there are sensors we have, and those are basically a data collection mechanism, and based on that we make some decisions with our sensory brains. From that perspective, I think there are elements we have to talk about: how we sense the world and then how we act based on what we sense. Those elements clearly machines have. But then there are the other aspects of computation, which is way better. I also mentioned memory, again in terms of being near infinite, depending on the storage capacity you have, and the retrieval can be extremely fast and pure, in terms of there's no ambiguity of who did I see when. Machines can remember that quite well. So again, on a philosophical level, I do subscribe to the fact that to be able to converse, and as part of that to be able to reason based on the world knowledge you've acquired and the sensory knowledge that is there, is definitely very much the essence of intelligence. But intelligence can go beyond human-level intelligence, based on what machines are getting capable of.

So what do you think, maybe stepping outside of Alexa, broadly as an AI field, is a good test of intelligence? Put another way, outside of Alexa, because so much of Alexa is a product, is an experience for the customer: on the research side, what would impress the heck out of you if you saw it? What is the test where you'd say,
"Wow, this thing is now starting to encroach into the realm of what we loosely think of as human intelligence"?

Well, we think of it as AGI and human intelligence all together, in some sense, and I think we are quite far from that. An unbiased view I have is that Alexa's intelligence capability is a great test. There are many other proof points, like self-driving cars, or game playing like Go or chess. Let's take those two as an example. They clearly require a lot of data-driven learning and intelligence, but it's not as hard a problem as an AI conversing with humans to accomplish certain tasks, or open-domain chat, as you mentioned, like Alexa Prize. In those settings, the key difference is that the end goal is not defined, unlike game playing. You also do not know exactly what state you are in in a particular goal-completion scenario. Sometimes you can, if it is a simple goal, but take certain examples like planning a weekend: you can imagine how many things change along the way. You look at the weather, you may change your mind and change the destination, or you want to catch a particular event and then you decide, no, I want this other event I want to go to. So these dimensions of how many different steps are possible when you're conversing as a human with a machine make it an extremely daunting problem, and I think it is the ultimate test for intelligence.

And don't you think that natural language is enough to prove that? Conversation, just conversation?

From a scientific standpoint, natural language is a great test, but I would go beyond. I don't want to limit it to natural language as simply understanding an intent or parsing for entities and so forth. We are really talking about dialogue. So I would say human-machine dialogue is definitely one of the best tests of intelligence.

So can you briefly speak to the Alexa Prize for people who are not familiar with it, and also just maybe where things stand,
and what have you learned, what's surprising, what have you seen that's surprising from this incredible competition?

Absolutely. It's a very unique competition. Alexa Prize is essentially a grand challenge in conversational artificial intelligence, where we threw the gauntlet to the universities who do active research in the field, to say: can you build what we call a socialbot that can converse with you coherently and engagingly for 20 minutes? That is an extremely hard challenge. Talking to someone who you're meeting for the first time, or even someone you've met quite often, and speaking for 20 minutes on any topic, with an evolving nature of topics, is super hard. We have completed two successful years of the competition. The first was won by the University of Washington, the second by the University of California. We are in our third instance. We have an extremely strong team of 10 cohorts, and the third instance of the Alexa Prize is underway now. And we are seeing a constant evolution. The first year was definitely a learning experience. There were a lot of things to be put together. We had to build a lot of infrastructure to enable these universities to be able to build magical experiences and do high-quality research.

Just a few quick questions, sorry for the interruption. What does failure look like in the 20-minute session? What does it mean to fail, to not reach the 20-minute mark?

Awesome question. First of all, I forgot to mention one more detail: it's not just 20 minutes, but the quality of the conversation too that matters. And the beauty of this competition, before I answer that question on what failure means, is first that you actually converse with millions and millions of customers as these socialbots. So during the judging phases, there are multiple phases before we get to the finals, which is a very controlled judging situation where we bring in judges and we have interactors who interact with these socialbots. That is a much more controlled setting. But till the point we get to
the finals, all the judging is essentially by the customers of Alexa, and there you basically rate on a simple question: how good was your experience? So that's where we are not testing for a 20-minute boundary being crossed, because you do want a clear-cut winner to be chosen, and it's an absolute bar. So whether you really broke that 20-minute barrier is why we have to test it in a more controlled setting, with actors, essentially interactors, and see how the conversation goes. So there's a subtle difference between how it's being tested in the field with real customers versus in the lab to award the prize. On the latter one, what it means is that essentially there are three judges, and two of them have to say this conversation has stalled, essentially.

Got it. And the judges are human experts?

Judges are human experts.

Okay, great. So this is in the third year. What's been the evolution? How far along are we? In the DARPA challenge, in the first year of the autonomous vehicles, nobody finished. In the second year, a few more finished in the desert. So how far along within this, I would say much harder, challenge are we?

This challenge has come a long way, to the extent that, well, we're definitely not close to the 20-minute barrier being crossed with coherent and engaging conversation. I think we are still five to ten years away on that horizon to complete that. But the progress is immense. What you're finding is that the accuracy in what kind of responses these socialbots generate is getting better and better. What's even more amazing to see is that now there's humor coming in. The bots are quite, you know, you're talking about ultimate signs of intelligence. I think humor is a very high bar in terms of what it takes to create humor, and I don't mean just being goofy. I really mean that a good sense of humor is also a sign of intelligence in my mind, and something very hard to do. So these socialbots are now exploring not only what we think of as natural language abilities, but
also personality attributes, and aspects of when to inject an appropriate joke, or, when you don't know the domain, how you come back with something more intelligible so that you can continue the conversation. If you and I are talking about AI and we are domain experts, we can speak to it. But if you suddenly switch to a topic that I don't know, how do I change the conversation? So you're starting to notice these elements as well, and that's coming partly from the nature of the 20-minute challenge: people are getting quite clever on how to really converse and essentially mask some of the understanding defects, if they exist.

So some of this, this is not Alexa the product. This is somewhat for fun, for research, for innovation, and so on. I have a question, sort of in this modern era. If you look at Twitter and Facebook and so on, there's public discourse going on, and some things are a little bit too edgy, people get blocked, and so on. Just out of curiosity, are people in this context pushing the limits? Is anyone using the F-word? Is anyone sort of pushing back, sort of, you know, arguing, I guess I should say, as part of the dialogue, to really draw people in?

First of all, let me just back up a bit in terms of why we're doing this. You said it's fun. I think fun is more part of the engaging part for customers. It is one of the most used skills as well in our skill store. But that apart, the real goal was this: essentially what was happening is, with a lot of AI research moving to industry, we felt that academia has the risk of not being able to have the same resources at its disposal that we have, which is lots of data, massive computing power, and clear ways to test these AI advances with real customer benefits. So we brought all these three together in the Alexa Prize. That's why it's one of my favorite projects at Amazon. And with that, the secondary effect is, yes, it has become engaging for our customers as well. We're not there
in terms of where we want it to be, right? But it's huge progress. But coming back to your question on how the conversations evolve: yes, there are some natural attributes of what you said in terms of argument and some amount of swearing. The way we take care of that is that there is a sensitive filter we have built that looks at words. It's more than keywords. Of course there's a keyword-based part too, but there's more, in that these words can be very contextual, as you can see, and also the topic can be something that you don't want a conversation to happen on, because this is a communal device as well. A lot of people use these devices. So we have put a lot of guardrails in place for the conversation to be more useful for advancing AI, and not so much about these other issues you attributed to what's happening in the field as well.

Right, so this is actually a serious opportunity. I didn't use the right word, fun. I think it's an open opportunity to do some of the best innovation in conversational agents in the world.

Absolutely.

Why just universities? Why just students?

Because, as I said, I really felt, young minds. Young minds. It's also, if you think about the other aspect of where the whole industry is moving with AI, there's a dearth of talent given the demands. So you do want the universities to have a clear place where they can invent and research and not fall behind, such that they can't motivate students. Imagine if all grad students left for industry like us, or faculty members, which has happened too. So this is a way that, if you're so passionate about the field, where you feel industry and academia need to work well together, this is a great example and a great way for universities to participate.

So what do you think it takes to build a system that wins the Alexa Prize?

I think you have to start focusing on aspects of reasoning. Right now there are still more lookups of what intents the customer is asking for, and responding to those, rather
than really reasoning about the elements of the conversation. For instance, if the conversation is about games and it's about a recent sports event, there's so much context involved, and you have to understand the entities that are being mentioned so that the conversation is coherent, rather than suddenly just switching to knowing some fact about a sports entity and just relating that, rather than understanding the true context of the game. If you just said, "I learned this fun fact about Tom Brady," rather than really saying how he played the game the previous night, then the conversation is not really that intelligent. So you have to go to more reasoning elements of understanding the context of the dialogue and giving more appropriate responses, which tells you that we are still quite far, because a lot of times it's more facts being looked up, and something that's close enough as an answer, but not really the answer. So that is where the research needs to go, more toward actual true understanding and reasoning. And that's why I feel it's a great way to do it, because you have an engaged set of users working to help these AI advances happen.

In this case, you mentioned customers there quite a bit, and there's a skill. What is the experience for the user that is helping? Just to clarify, as far as I understand, this skill is standalone for the Alexa Prize. I mean, it's focused on the Alexa Prize. It's not you ordering certain things on Amazon.com, or checking the weather, or playing Spotify, right?

Those are separate skills, exactly.

And so you're focused on helping, or, well, I don't know. How do customers think of it? Are they having fun? Are they helping teach the system? What's the experience like?

I think it's both, actually. And let me tell you how you invoke this skill. All you have to say is, "Alexa, let's chat." And then the first time you say "Alexa, let's chat," it
comes back with a clear message that you're interacting with one of those university socialbots, and so you know exactly how you interact. That is why it's very transparent. You are being asked to help. And we have a lot of mechanisms: when we are in the first phase, the feedback phase, we send a lot of emails to our customers, and then they know that the team needs a lot of interactions to improve the accuracy of the system. So we know we have a lot of customers who really want to help these university bots, and they are conversing with them. And some are just having fun with just saying, "Alexa, let's chat." And there's also some adversarial behavior, to see how much you understand as a socialbot. So I think we have a good, healthy mix of all three situations.

So if we talk about solving the Alexa challenge, the Alexa Prize, what does the data set of really engaging, pleasant conversations look like? If we think of this as a supervised learning problem, I don't know if it has to be, but if it does, maybe you can comment on that. Do you think there needs to be a data set of what it means to be an engaging, successful, fulfilling conversation?

I think that's part of the research question here. I think we at least got the first part right, which is to have a way for universities to build and test in a real-world setting. Now you're asking in terms of the next phase of questions, which we are also asking, by the way: what does success look like from an optimization function? That's what you're asking. We as researchers are used to having a great corpus of annotated data and then tuning our algorithms on those. And fortunately and unfortunately, in this world of Alexa Prize, that is not the way we are going after it. So you have to focus more on learning based on live feedback. That is another element that's unique. I started with giving you how you ingress and
experience this capability as a customer. What happens when you're done? They ask you a simple question: on a scale of one to five, how likely are you to interact with this socialbot again? That is good feedback, and customers can also leave more open-ended feedback. And I think that, to me, is one part of the question you're asking, which I'm saying is a mental model shift: as researchers also, you have to change your mindset. This is not a DARPA evaluation or an NSF-funded study where you have a nice corpus. This is real world. You have real data. The scale is amazing.

It's this beautiful thing, and then the customer, the user, can quit the conversation.

Exactly, the user can. That is also a signal for how good you were at that point.

And then on a scale of one to five, one, two, three, do they say how likely are you, or is it just a binary?

One to five.

One to five. Wow, okay. That's such a beautifully constructed challenge. Okay. You said the only way to make a smart assistant really smart is to give it eyes and let it explore the world. I'm not sure, it might have been taken out of context, but can you comment on that? Can you elaborate on that idea? I personally also find that idea super exciting from a social robotics, personal robotics perspective.

Yeah, a lot of things do get taken out of context. This particular one was just a philosophical discussion we were having in terms of what intelligence looks like, and the context was in terms of learning. I think we said we as humans are empowered with many different sensory abilities. I do believe that eyes are an important aspect of it. If you think about how we as humans learn, it is quite complex, and it's also not unimodal, where you are fed a ton of text or audio and you just learn that way. No, you learn by experience, you learn by seeing, you're taught by humans, and we're very efficient in how we learn. Machines, on the contrary, are very inefficient in how they learn, especially
these AIs. I think the next wave of research is going to be with less data, not just with less labeled data, but also with a lot of weak supervision, and where you can increase the learning rate. I don't mean less data in terms of not having a lot of data to learn from; we are generating so much data. It is more about the aspect of how fast you can learn.

So improving the quality of the data, the quality of data and the learning process?

I think more on the learning process. We as humans learn with a lot of noisy data, right? And I think that's the part that shouldn't change. What should change is how we learn. You mentioned supervised learning. We are making transformative shifts from that, moving to more unsupervised learning, more weak supervision. Those are the key aspects of how to learn. And in that setting, I hope you agree with me that having other senses is very crucial in terms of how you learn.

Absolutely. And from a machine learning perspective, which I hope we get a chance to talk about, there are a few aspects that are fascinating there. But just to stick on the point of a body, you know, an embodiment. Alexa has a body. It has a very minimalistic, beautiful interface where there's a ring and so on. I mean, I'm not sure of all the flavors of the devices that Alexa lives on, but there's a minimalistic basic interface. And nevertheless, we humans, so I have a Roomba, I have all kinds of robots all over, everywhere. So what do you think the Alexa of the future looks like if it begins to shift what its body looks like? Maybe beyond Alexa, what do you think are the different devices in the home as they start to embody their intelligence more and more? What do you think that looks like, philosophically, a future? What do you think that looks like?

I think let's look at what's happening today. You mentioned, I think, our devices as Amazon devices, but I also wanted to point out that Alexa is already integrated into a lot of
third-party devices, which also come in lots of forms and shapes: some in robots, some in microwaves, some in appliances that you use in everyday life. So I think it's not just the shape Alexa takes in terms of form factors, but it's also where all it's available. It's getting in cars, it's getting in different appliances in homes, even toothbrushes. So I think you have to think of it as not a physical assistant. It will be in some embodiment, as you said; we already have these nice devices. But I think it's also important to think of it as a virtual assistant. It is superhuman in the sense that it is in multiple places at the same time. So I think the actual embodiment, in some sense, to me doesn't matter. I think you have to think of it as not human-like, and more in terms of what its capabilities are that derive a lot of benefit for customers, and how there are different ways to delight customers in different experiences. And I am a big fan of it not being just human-like. It should be human-like in certain situations. The Alexa Prize socialbot, in terms of conversation, is a great way to look at it. But there are other scenarios where human-like, I think, is underselling the abilities of this AI.

So if I could trivialize what we're talking about: if you look at the way Steve Jobs thought about the interaction with the devices that Apple produced, there was an extreme focus on controlling the experience by making sure there are only the Apple-produced devices. You see the voice of Alexa taking all kinds of forms, depending on what the customers want. And that means it could be anywhere, from the microwave to a vacuum cleaner to the home, and so on. The voice is the essential element of the interaction.

I think voice is an essence. It's not all, but it's a key aspect. I think, to your question, you should be able to recognize Alexa, and that's a huge problem, a huge scientific problem, I should say. Like, what are
the traits? What makes it look like Alexa, especially in different settings, and especially if it's primarily voice? But Alexa is not just voice either, right? I mean, we have devices with a screen. Now you're seeing just other behaviors of Alexa. So I think we're in very early stages of what that means, and this will be an important topic for the following years. But I do believe that being able to recognize and tell when it's Alexa versus when it's not is going to be important from an Alexa perspective. I'm not speaking for the entire AI community, but I think attribution, and as we go into more of understanding who did what, that identity of the AI is crucial in the coming world.

I think from the broad AI community perspective, that's also a fascinating problem. So basically, if I close my eyes and listen to the voice, what would it take for me to recognize that this is Alexa?

Exactly.

Or at least the Alexa that I've come to know from my personal experience in my home, through my interactions.

That could be. And the Alexa here in the US is very different from the Alexa in the UK and the Alexa in India, even though they are all speaking English, or the Australian version. So now think about when you go into a different culture, a different community, when you travel there: would you recognize Alexa? I think these are super hard questions, actually.

So there's a team that works on personality. So if we talk about those different flavors, or what it means culturally speaking, India, UK, US,
what does it mean to add, so the problem that we just stated, which is fascinating: how do we make it purely recognizable that it's Alexa, assuming that the qualities of the voice are not sufficient? It's also the content of what is being said. How do we do that? How does the personality come into play? What does that research look like? It's such a fascinating subject.

We have some very fascinating folks, from both the UX background and human factors, who are looking at these aspects and these exact questions. But I'll definitely say it's not just how it sounds. The choice of words, the tone, not just the voice identity of it, but the tone matters, the speed matters, how you speak, how you enunciate words, what choice of words you are using, how terse you are or how lengthy in your explanations you are. All of these are factors. And you also mentioned something crucial: you may have personalized Alexa to some extent in your home, or in the devices you are interacting with. So as an individual, how you prefer Alexa to sound can be different from how I prefer it. And the amount of customizability you want to give is also a key debate we always have. But I do want to point out it's more than the voice actor that recorded it and it sounding like that actor. It is more about the choices of words, the attributes of tonality, the volume, how you raise your pitch, and so forth. All of that matters.

This is a fascinating problem from a product perspective. I could see those debates just happening inside of the Alexa team, of how much personalization you do for the specific customer, because you're taking a risk if you over-personalize. If you create a personality for a million people, you can test that better. You can create a rich, fulfilling experience that will do well. But the more you personalize it, the less you can test it, the less you can know that it's a great experience. So how much personalization,
what's the right balance I think the right balance depends on the customer give them the control so I'd say I think the more control you give customers the better it is for everyone and I'll give you some key personalization features I think we have a feature called remember this which is where you can tell Alexa to remember something there you have an explicit sort of control in the customer's hand because they have to say Alexa remember XYZ what kind of things would that be used for so you can store something I have stored my tire specs for my car nice because it's so hard to go and find and see what it is right when you're having some issues I store my mileage plan numbers for all the frequent-flyer ones where sometimes I'm just looking for it and it's not handy so those are my own personal choices I made for Alexa to remember something on my behalf right so again I think the choice was to be explicit about how you provide that to a customer as a control so I think these are the aspects of what you do like think about where we can use speaker recognition capabilities so that if you taught Alexa that you are Lex and this person in your household is person two then you can personalize the experiences again in these the CX customer experience patterns are very clear and transparent about when a personalization action is happening and then you have other ways like you go through explicit control right now through your app that of your multiple service providers let's say for music which one is your preferred one so when you say play Sting depending on whether you have preferred Spotify or Amazon Music or Apple Music the decision is made where to play it from so what's Alexa's backstory from her perspective is there one I remember just asking as probably a lot of us do just the basic questions about love and so on of Alexa just to see what the answer would be it feels like there's a little bit of a backstory it feels like there's
a little bit of personality but not too much does Alexa have a metaphysical presence in this human universe we live in or is it something more ambiguous is there a past is there birth is there family kind of idea even for joking purposes and so on I think well it does tell you I think you should double-check this but if you said when were you born I think we do respond I need to double check that but I'm pretty positive about it I think you do because I think I've tested that but that's like I was born in your brand of champagne and whatever the year kind of thing yeah so in terms of the metaphysical I think it's early does it have the historic knowledge about herself to be able to do that maybe have we crossed that boundary not yet right in terms of thinking we have thought about it quite a bit but I wouldn't say that we have come to a clear decision in terms of what it should look like but you can imagine though and I bring this back to the Alexa Prize socialbots there you will start seeing some of that like these bots have their identity and in terms of that you may find you know this is such a great research topic that some academic team may think of these problems and start solving them too so let me ask a question it's kind of difficult I think but it feels fascinating to me because I'm fascinated with psychology it feels that the more personality you have the more dangerous it is in terms of a customer perspective of products if you want to create a product that's useful by dangerous I mean creating an experience that upsets me and so how do you get that right because if you look at the relationships maybe I'm just a screwed-up Russian but if you look at the real human to human relationship some of our deepest relationships have fights have tension have the push and pull have a little flavor in them do you want to have such flavor in an interaction with Alexa how do you think about that so there's one other common
thing that you didn't say but we think of it as paramount for any deep relationship that's trust trust yeah so I think if you have trust every attribute you said a fight some tension is all healthy but what is sort of non-negotiable in this instance is trust and I think the bar to earn customer trust for AI is very high in some sense more than a human it's not just about personal information or your data it's also about your actions on a daily basis how trustworthy are you in terms of consistency in terms of how accurate are you in understanding me like if you're talking to a person on the phone if you have a problem with your let's say your internet or something if the person is not understanding you lose trust right away you don't want to talk to that person that whole example gets amplified by a factor of 10 because when you're a human interacting with an AI you have a certain expectation either you expect it to be very intelligent and then you get upset why is it behaving this way or you expect it to be not so intelligent and when it surprises you you're like really you're trying to be too smart so I think we grapple with these hard questions as well but I think the key is actions need to be trustworthy from these AIs not just about data protection your personal information protection but also from how accurately it accomplishes all commands or all interactions well it's tough to hear because trust you're absolutely right but trust is such a high bar with AI systems because people and I see this because I work with autonomous vehicles I mean the bar that's placed on AI systems is unreasonably high yeah I agree with you and I think of it as it's a challenge and it also keeps my job so from that perspective I think of it from both sides as a customer and as a researcher I think as a researcher yes occasionally it will frustrate me that why is the bar so high for these AIs and as a customer then I say
absolutely it has to be that high right so I think that's the trade-off we have to balance but it doesn't change the fundamentals that trust has to be earned and the question then becomes are we holding the AIs to a different bar in accuracy and mistakes than we hold humans that's going to be a great societal question for years to come I think for us well one of the questions that we grapple with as a society now that I think about a lot I think a lot of people think about a lot and Alexa's taking on head-on is privacy the reality is us giving over data to any AI system can be used to enrich our lives in profound ways so basically any product that does anything awesome for you the more data it has the more awesome things it can do and yet on the other side people imagine the worst case possible scenario of what can you possibly do with that data it goes down to trust as you said there's a fundamental distrust in certain groups of governments and so on and depending on the government depending on who is in power depending on all these kinds of factors and so here's Alexa in the middle of all of it in the home trying to do good things for the customers so how do you think about privacy in this context the smart assistants in the home how do you maintain how do you earn trust absolutely so as you said trust is the key here so you start with trust and then privacy is a key aspect of it it has to be designed from the very beginning about that and we believe in two fundamental principles one is transparency and second is control so by transparency I mean when we built what is now called smart speaker or the first Echo we were quite judicious about making these right trade-offs on customers' behalf that it is pretty clear when the audio is being sent to the cloud the light ring comes on when it has heard you say the wake word and then the streaming happens right so the light ring comes up we also put a physical
mute button on it just so if you didn't want it to be listening even for the wake word then you turn the mute button on and that disables the microphones that's just the first decision on essentially transparency and control then even when we launched we gave the control in the hands of the customers that you can go and look at any of your individual utterances that are recorded and delete them anytime and we have kept true to that promise right so that is again a great instance of showing how you have the control then we made it even easier you can say Alexa delete what I said today so that is now making it even just more control in your hands with what's most convenient about this technology is voice you delete it with your voice now so these are the types of decisions we continually make we just recently launched this feature called what we think of it as if you wanted humans not to review your data because as you mentioned in supervised learning humans have to give some annotation and that also is now a feature where you can essentially if you selected that flag your data will not be reviewed by a human so these are the types of controls that we have to constantly offer to customers so why do you think it bothers people so much so everything you just said is really powerful the control the ability to delete because we collect we have studies here running at MIT that collect huge amounts of data and people consent and so on the ability to delete that data is really empowering and almost nobody ever asks to delete it but the ability to have that control is really powerful but still you know there's these popular anecdotes anecdotal evidence that people say they like to tell that they and a friend were talking about something I don't know sweaters for cats and all of a sudden they'll have advertisements for cat sweaters on Amazon that's a popular anecdote as if something
is always listening can you explain that anecdote that experience that people have what's the psychology of that what's that experience and you've answered it but let me just ask is Alexa listening no Alexa listens only for the wake word on the device right and the wake word is a word like Alexa Amazon Echo but you only choose one at a time so you choose one and it listens only for that on our devices so that's first from a listening perspective we have to be very clear that it's just the wake word so you said why is there this anxiety if you may yeah it's because there's a lot of confusion about what it really listens to right and I think it's partly on us to keep educating our customers and the general media more in terms of what really happens and we've done a lot of it and our pages on information are clear but still people have to have more there's always a hunger for information and clarity and we'll constantly look at how best to communicate if you go back and read everything yes it states exactly that and then people could still question it and I think that's absolutely okay to question what we have to make sure is that we are because our fundamental philosophy is customer first customer obsession is our leadership principle as researchers I put myself in the shoes of the customer and all decisions in Amazon are made with that in mind and trust has to be earned and we have to keep earning the trust of our customers in this setting and to your other point on like is there something showing up based on your conversations no I think the answer is a lot of times when those experiences happen you have to also know that okay maybe it's winter season people are looking for sweaters right and it shows up on your amazon.com because it is popular so there are many of these you mentioned that personality or personalization turns out we are not that unique either right so those things we as humans start
thinking oh must be because something was heard and that's why this other thing showed up the answer is no probably it is just the season for sweaters I'm not gonna ask you this question because people have so much paranoia but let me just say from my perspective I hope there's a day when a customer can ask Alexa to listen all the time to improve the experience because I personally don't see the negative because if you have the control and if you have the trust there's no reason why it shouldn't be listening all the time to the conversations to learn more about you because ultimately as long as you have control and trust every data you provide to the device that the device wants is going to be useful and to me as a machine learning person I think it worries me how sensitive people are about their data relative to how empowering it could be for the devices around them how enriching it could be for their own life to improve the product so it's something I think about sort of a lot how do we make those devices obviously Alexa thinks about it a lot as well I don't know if you want to comment on that sort of okay let me put it in the form of a question okay have you seen an evolution in the way people think about their private data in the previous several years so are we as a society more comfortable with the benefits we get by sharing more data first let me answer that part and then I want to go back to the other aspect you were mentioning so as a society in general we are getting more comfortable as a society doesn't mean that everyone is and I think we have to respect that I don't think one-size-fits-all is always gonna be the answer for all right by definition so I think that's something to keep in mind in these going back to your point on what more magical experiences can be launched in these kind of AI settings I think again if you give the control it's possible certain parts
of it so we have a feature called follow-up mode where if you turn it on Alexa after you've spoken to it will open the mics again thinking you'll answer something again yeah like if you're adding items to your shopping list right or a to-do list you're not done you want to keep going so in that setting it's awesome that it opens the mic for you to say eggs and milk and then bread right so these are the kind of things which you can empower and then another feature we have which is called Alexa Guard I said it only listens for the wake word right but let's say you leave your home and you want Alexa to listen for a couple of sound events like smoke alarm going off or someone breaking your glass right so it's like just to keep your peace of mind so you can say Alexa on guard or I'm away and then it can be listening for these sound events and when you're home you come out of that mode right so this is another one where you again gave controls in the hands of the user or the customer to enable some experience that is of higher utility and maybe even more delightful in certain settings like follow-up mode and so forth again this general principle is the same control in the hands of the customer so I know we kind of started with a lot of philosophy and a lot of interesting topics and we're just jumping all over the place but really some of the fascinating things that the Alexa team and Amazon are doing are on the algorithm side the data side the technology the deep learning machine learning and so on so can you give a brief history of Alexa from the perspective of just innovation the algorithms the data of how it was born how it came to be how it has grown where it is today yeah to start with in Amazon everything starts with the customer and we have a process called working backwards for Alexa and more specifically the product Echo there was a working backwards document essentially that
reflected what it would be it started with a very simple vision statement for instance that morphed into a full-fledged document along the way changed into what all it can do right but the inspiration was the Star Trek computer so when you think of it that way you know everything is possible but when you launch a product you have to start someplace and when I joined the product was already in conception and we started working on the far-field speech recognition because that was the first thing to solve by that we mean that you should be able to speak to the device from a distance and in those days that wasn't a common practice and even in the previous research world I was in it was considered an unsolvable problem then in terms of whether you can converse from a distance and here I'm still talking about the first part of the problem where you say get the attention of the device by saying what we call the wake word which means the word Alexa has to be detected with a very high accuracy because it is a very common word it has sound units that map with words like I like you or Alec Alex right so it's an undoubtedly hard problem to detect the right mentions of Alexa addressed to the device versus I like Alexa you have to pick up that signal when there's a lot of noise not only noise but conversation in the house remember on the device you are simply listening for the wake word Alexa and there's a lot of words being spoken in the house how do you know it's Alexa and directed at Alexa because I could say I love my Alexa I hate my Alexa I want Alexa to do this and in all these three sentences I said Alexa I didn't want it to wake up yeah so can I just pause for a second what would be your advice that I should probably in the introduction of this conversation give to people in terms of them turning off their Alexa device if they're listening to this podcast conversation out loud like what's the probability that an Alexa device
will go off because we mention Alexa like a million times so we have done a lot of different things where we can figure out whether the speech is coming from a human versus over the air also I mean think about ads so we have also launched a technology for watermarking kind of approaches in terms of filtering it out but yes if this kind of a podcast is happening it's possible your device will wake up a few times it's an unsolved problem but it is definitely something we care very much about but the idea is you wanna detect Alexa meant for the device or just even hearing Alexa versus I like yeah something I mean that's the fascinating part so that was the first really that's the first world's best detector of course yeah the world's best wake word detector yeah in the far-field setting not like something where the phone is sitting on the table this is like people have devices 40 feet away like in my house or 20 feet away and you still get an answer so that was the first part the next is you're speaking to the device of course you're gonna issue many different requests some may be simple some may be extremely hard but it's a large vocabulary speech recognition problem essentially where the audio is now not coming on to your phone or a handheld mic like this or close-talking mic but it's from 20 feet away where if you're in a busy household your son may be listening to music your daughter may be running around with something and asking your mom something and so forth right so this is like a common household setting where the words you're speaking to Alexa need to be recognized with very high accuracy yes right now we are still just in the recognition problem you haven't yet come to the understanding one right if I can pause once again what year was this is this before neural networks began to start to seriously prove themselves in audio space yeah this is around so I joined in 2013 in April right
so the early research in neural networks coming back and showing some promising results in speech recognition space had started happening but it was very early yeah but we built on that the very first thing we did when I joined with the team and remember it was very much of a start-up environment which is great about Amazon and we doubled down on deep learning right away and we knew we'll have to improve accuracy fast and the scale of data once you have a device like this if it is successful will increase big time like you'll suddenly have large volumes of data to learn from to make the customer experience better so how do you scale deep learning so we did one of the first works in training with distributed GPUs where the training time was linear in the amount of data so that was quite important work where it was algorithmic improvements as well as a lot of engineering improvements to be able to train on thousands and thousands of hours of speech and that was an important factor so if you ask me like back in 2013 and 2014 when we launched Echo the combination of large scale data deep learning progress near infinite GPUs we had available on AWS even then all came together for us to be able to solve the far-field speech recognition to the extent it could be useful to the customers it's still not solved like I mean it's not that we are perfect at recognizing speech but we are great at it in terms of the settings that are in homes right so that was important even in the early stages even I'm trying to look back at that time if I remember correctly it seems like the task would be pretty daunting so we kind of take it for granted that it works now yes right so you mentioned startup I wasn't familiar how big the team was because I know there's a lot of really smart people working on Alexa so now it's a very
very large team how big was the team how likely were you to fail in the eyes of everyone else I'll give you a very interesting anecdote on that when I joined the team the speech recognition team was six people by my first meeting we had hired a few more people it was 10 people 9 out of 10 people thought it can't be done who was the one the one was me and actually I should say one was semi-optimistic yeah and eight were trying to convince let's go to the management and say let's not work on this problem let's work on some other problem like either telephony speech for customer service calls and so forth but this was the kind of belief you must have and I had experience with far-field speech recognition and my eyes lit up when I saw a problem like that saying okay we have been in speech recognition always looking for that killer app and this was a killer use case to bring something delightful in the hands of customers you mentioned the way you kind of think of the product way in the future have a press release and an FAQ and you think backwards did the team have the Echo in mind so this far-field speech recognition actually putting a thing in the home that works that you're able to interact with was that the press release what was the very close I would say in terms of the as I said the vision was the Star Trek computer right or the inspiration and from there I can't divulge all the exact specifications but one of the first things that was magical on Alexa was music it brought me back to music because my taste is still from when I was an undergrad right so I still listen to those songs and it was too hard for me to be a music fan with a phone right and I hate things in my ears so from that perspective it was quite hard and music was part of the at least the documents I have seen right so from that perspective I think yes in terms of how far are we from the original vision I can't reveal that but
that's why I have a ton of fun at work because every day we go in thinking like these are the new set of challenges to solve yeah that's a great way to do great engineering is you think of the product press release I like that idea maybe we'll talk about it a bit later it's just a super nice way to have focus I'll tell you this you're a scientist and a lot of my scientists have adopted that they now love it as a process because as a scientist you're trained to write great papers but they are all after you've done the research or you've proven it and your PhD dissertation proposal is something that comes closest or a DARPA proposal or NSF proposal is the closest that comes to a press release but that process is now ingrained in our scientists which is like delightful for me to see you write the paper first then make it happen that's right in fact that's not state-of-the-art results or you leave the results section open well you have a thesis about here's what I expect right and here's what it will change yeah right so I think it is a great thing it works for researchers as well yeah so far-field recognition yeah what was the big leap what were the breakthroughs and yeah what was that journey like to today yeah I think as you said first there was a lot of skepticism on whether far-field speech recognition will ever work to be good enough right and what we first did was get a lot of training data in a far-field setting and that was extremely hard to get because none of it existed so how do you collect data in a far-field setup with no customer base there's no customer base right so that was the first innovation and once we had that the next thing was ok if you have the data first of all we didn't talk about like what would magical mean in this kind of a setting what is good enough for customers right that's always since you've never done this before what would be magical so it wasn't just a research problem you had to
put some in terms of accuracy and customer experience features some stakes in the ground saying here is where I think it should get to so you established a bar and then how do you measure progress towards it given you have no customers right now so from that perspective so first was the data without customers second was doubling down on deep learning as a way to learn and I can just tell you that the combination of the two cut our error rates by a factor of five from where we were when I started to within six months of having that data at that point I got the conviction that this will work right because that was magical in terms of when it started working and that came close to the magical bar right that we felt would be where people will use it that was critical because you really have one chance at this if we had launched in November 2014 is when we launched if it was below the bar I don't think this category exists if you don't meet the bar yeah and just having looked at voice-based interactions like in the car or earlier systems it's a source of huge frustration for people in fact we use voice-based interaction for collecting data on subjects to measure frustration so as a training set for computer vision for face data so we can get a data set of frustrated people that's the best way to get frustrated people is having them interact with a voice-based system in the car so this is that bar I imagine it's pretty high it was very high and we talked about how also errors are perceived from AIs versus errors by humans but we are not done with the problems that we ended up having to solve to get it to launch so do you want the next one so the next one was what I think of as multi-domain natural language understanding I wouldn't say easy but during those days solving understanding in one domain a narrow domain was doable but for these multiple domains like music like information
other kinds of household productivity alarms timers even though it wasn't as big as it is now in terms of the number of skills Alexa has and the confusion space has like grown by three orders of magnitude it was still daunting even those days and again no customer base here again no customer base so now you're looking at meaning understanding and intent understanding and taking actions on behalf of customers based on their request and that is the next hard problem even if you have gotten the words recognized how do you make sense of them in those days there was still a lot of emphasis on rule-based systems for writing grammar patterns to understand the intent but we had a statistical first approach even then where for language understanding we had even in those starting days an entity recognizer and an intent classifier which were all trained statistically in fact we had to build the deterministic matching as a follow-up to fix bugs that statistical models have right so it was just a different mindset where we focused on data-driven statistical understanding wins in the end if you have a huge dataset yes it is contingent on that and that's why it came back to how do you get the data before customers this is why data becomes crucial to get to the point that you have the understanding system built up and notice that here we were talking about human machine dialogue even in those early days it was very much transactional do one thing one-shot utterances in a great way there was a lot of debate on how much should Alexa talk back in terms of if it misunderstood you or you said play songs by the Stones and let's say it doesn't know you know early days knowledge can be sparse who are the Stones right the Rolling Stones right and you don't want the match to be Stone Temple Pilots or Rolling Stones right so you don't know which one it is so these kind of other signals to know there we had great assets right from Amazon in terms
of UX like what is it what kind of yeah how do you solve that problem in terms of what we think of it as an entity resolution problem right so which one is it right I mean even if you figured out the Stones is an entity you have to resolve it to whether it's the Stones or the Stone Temple Pilots or some other Stones maybe I misunderstood is the resolution the job of the algorithm or is it the job of UX communicating with the human to help there as well it is both right you want 90 percent or high 90s to be done without any further questioning or UX right but it's absolutely okay just like as humans we ask the question I didn't understand you Lex yeah it's fine for Alexa to occasionally say I did not understand you right and that's an important way to learn and I'll talk about where we have come with more self learning with these kind of feedback signals but in those days just solving the ability of understanding the intent and resolving to an action where action could be play a particular artist or a particular song was super hard again the bar was high as you're talking about right so while we launched it in sort of 13 big domains I would say in terms of we think of it as 13 big skills we had like music is a massive one when we launched it and now we have 90,000 plus skills on Alexa so what are the big skills can you just go the only thing I use it for is music weather and shopping haha so we think of it as music information right so weather is a part of information right so when we launched we didn't have smart home but by smart home I mean you connect your smart devices you control them with voice if you haven't done it it's worth it it will change your life turning on the lights yeah your lights to anything that's connected and has a what's your favorite smart device for you and now you have the smart plug and we also have this Echo Plug which is oh yeah and now you can turn that
one on and off this conversation is motivation to get one garage door you can check the status of your garage door and things like that and we have made Alexa more and more proactive where it even has hunches now hunches like you left your light on or let's say you've gone to your bed and you left the garage light on so yeah it will help you out in these settings right so that's smart devices right information smart devices you said music yeah so I don't remember everything we had but big ones like you know the timers were very popular right away music also like you could play song artist album everything and so that was like a clear win in terms of the customer experience so that's again this is language understanding now things have evolved right so where we want Alexa definitely to be more accurate competent and trustworthy based on how well it does these core things but we have evolved in many different dimensions first is what I think of as being more conversational for high utility not just for chat right and there at re:MARS this year which is our AI conference we launched what is called Alexa Conversations that is providing the ability for developers to author multi-turn experiences on Alexa with no code essentially in terms of the dialogue code initially it was like you know all these IVR systems you have to fully author if the customer says this do that right so the whole dialogue flow is hand-authored and with Alexa Conversations the way it works is that you just provide sample interaction data with your service or an API let's say you're Atom Tickets that provides a service for buying movie tickets you provide a few examples of how your customers will interact with your APIs and then the dialogue flow is automatically constructed using a recurrent neural network trained on that data so that simplifies the developer experience we just launched our preview for the developers to try this capability out and then the second
part of it which shows even increased utility for customers is you and I when we interact with Alexa or any customer as I'm coming back to our initial part of the conversation the goal is often unclear or unknown to the AI if I say Alexa what movies are playing nearby am I trying to just buy movie tickets am I actually even do you think I'm looking for just movies for curiosity whether the Avengers is still in theaters or maybe it's gone and maybe it will come on my mr. so I may watch it on Prime which happened to me so from that perspective now you're looking into what is my goal and let's say I now complete the movie ticket purchase maybe I would like to get dinner nearby so what is really the goal here is it a night out or is it movies as in just go watch a movie the answer is we don't know so can Alexa now figure out have the intelligence to think this meta goal is really a night out or at least say to the customer when you have completed the purchase of movie tickets from Atom Tickets or Fandango or pick your anyone then the next thing is do you want to get an Uber to the theater right or do you want to book a restaurant next to it and then not ask the same information over and over again what time how many people in your party right so this is where you shift the cognitive burden from the customer to the AI where it's thinking of what is your it anticipates your goal and takes the next best action to complete it now that's the machine learning problem but essentially the way we solve this first instance and we have a long way to go to make it scale to everything possible in the world but at least for this situation it is at every instance Alexa is making the determination whether it should stick with the experience with Atom Tickets or offer you based on what you say whether either you have completed the interaction or you said no get me an Uber now so it will shift context into another experience or skill on
another service so that's a dynamic decision-making that's making Alexa you can say more conversational for the benefit of the customer rather than simply completing transactions which are well thought through if you as a customer have fully specified what you want to be accomplished it's accomplishing that so it's kind of as I would do this with pedestrians like intent modeling is predicting what your possible goals are most likely going and switching that depending on the things you say so my question is there it seems maybe it's a dumb question but it would help a lot if Alexa remembered what I said previously right is it trying to use some memory for the customer it is using a lot of memory within that so right now not so much in terms of okay which restaurant do you prefer right that is a more long-term memory but within the short-term memory within the session it is remembering how many people did you so if you said buy four tickets now it has made an implicit assumption that you were gonna need at least four seats at a restaurant right so these are the kind of context it's preserving between these skills but within that session but you are asking the right question in terms of for it to be more and more useful it has to have more long-term memory and that's also an open question and again this is still early days so for me I mean everybody is different but yeah I'm definitely not representative of the general population in the sense that I do the same thing every day like I eat the same I do everything the same I wear the same thing clearly this or the black shirt so it's frustrating when Alexa doesn't get what I'm saying because I have to correct her every time the exact same way this has to do with certain songs like she doesn't know certain weird songs only and doesn't know I've complained to Spotify about this talked to the head of R&D at Spotify Stairway to Heaven I have to correct it every time it really
doesn't play Led Zeppelin correctly so you should send me next time it fails feel free to send it to me we'll take care of it okay well Led Zeppelin is one of my favorites it works for me so I'm like shocked it doesn't work for you this is an official bug report I'll put it I'll make it public retweet it we're gonna fix this Stairway to Heaven anyway but the point is you know I'm pretty boring and do the same thing but I'm sure most people do the same set of things do you see Alexa sort of utilizing that in the future for improving the experience yes and not only utilizing it's already doing some of it we call it where Alexa is becoming more self learning so Alexa is now auto correcting millions and millions of utterances in the US without any human supervision the way it does it is let's take an example of a particular song didn't work for you what do you do next you either it played the wrong song and you said Alexa no that's not the song I want or you say Alexa play that you try it again and that is a signal to Alexa that she may have done something wrong and from that perspective we can learn if there's that failure pattern or that action of song A was played when song B was requested yes it's very common with station names because play NPR you can have N be confused as an M and then for a certain accent like mine people confuse my N and M all the time and because I have an Indian accent they're confusable to humans it is for Alexa too and in that part it starts auto correcting and we correct a lot of these automatically without a human looking at the failures so one of the things that's for me missing in Alexa I don't know if I'm a representative customer but every time I correct it it would be nice to know that that made a difference yes you know I mean like that yeah sort of like I heard you like some acknowledgement of that we work a lot with Tesla we study the Autopilot and so on and a large amount of
the customers that use Tesla Autopilot they feel like they're always teaching the system they're almost excited by the possibility that they're teaching I don't know if Alexa customers generally think of it as they're teaching to improve the system I think and that's a really powerful thing again I would say it's a spectrum some customers do think that way and some would be annoyed by Alexa acknowledging that so there's again you know while there are certain patterns not everyone is the same in this way but we believe that again customers helping Alexa is a tenet for us in terms of improving it and being more self learning is by again this is like fully unsupervised right there is no human in the loop and no labeling happening and based on your actions as a customer Alexa becomes smarter again it's early days but I think this whole area of teachable AI is gonna get bigger and bigger in the whole space especially in the AI assistant space so that's the second part where I mentioned more conversational this is more self learning the third is more natural and the way I think of more natural is we talked about how Alexa sounds and we have done a lot of advances in our text to speech by using again neural network technology for it to sound very humanlike from the individual texture of the sound to the timing the tonality the tone of everything I would think in terms of there's a lot of controls in each of the places for how I mean the speed of the voice the prosodic patterns the actual smoothness of how it sounds all of those are factored and we do a ton of listening tests to make sure that naturalness how it sounds should be very natural how it understands requests is also very important like and in terms of like we have 95,000 skills and if we have imagined that in many of these skills you have to remember the skill name and say Alexa ask the tide skill to tell me X right now if you have to remember the skill name that means the
discovery and the interaction is unnatural and we're trying to solve that by what we think of as again you don't have to have the app metaphor here these are not individual apps right even though they're so you're not sort of opening one at a time and interacting so yeah it should be seamless because it's voice and when it's voice you have to be able to understand these requests independent of the specificity like a skill name and to do that what we have done is again built a deep learning based capability where we shortlist a bunch of skills when you say Alexa get me a car and then we figure it out okay it's meant for an Uber skill versus a Lyft or based on your preferences and then you can rank the responses from the skill and then choose the best response for the customer so that's on the more natural other examples of more natural is like we were talking about lists for instance and you don't want to say Alexa add milk Alexa add eggs Alexa add cookies no Alexa add cookies milk and eggs and that in one shot right so that works that helps with the naturalness we talked about memory like if you said you can say Alexa remember I have to go to Mom's house or you may have entered a calendar event through your calendar that's linked to Alexa you don't remember whether it's in my calendar or did I tell you to remember something or some other reminder right so you have to now independent of how customers create these events it should just say Alexa when do I have to go to Mom's house and it tells you when you have to go to Mom's house that's the fascinating problem whose problem is that so these people create skills who's tasked with integrating all of that knowledge together so the skills become seamless is it the creators of the skills or is it the infrastructure that Alexa provides problem it's both I think the large problem in terms of making sure your skill quality is high we that has to
be done by our tools because it's just so these skills just to put the context they are built through the Alexa Skills Kit which is a self-serve way of building an experience on Alexa this is like any developer in the world could go to the Alexa Skills Kit and build an experience on Alexa like if you're Domino's you can build a Domino's skill for instance that does pizza ordering when you've authored that you do want to now if people say Alexa open Domino's or Alexa ask Domino's to get a particular type of pizza that will work but the discovery is harder you can't just say Alexa get me a pizza and then Alexa figures out what to do that latter part is definitely our responsibility in terms of when the request is not fully specified how do you figure out what's the best skill or a service that can fulfill the customer's request and it can keep evolving imagine going to the situation I said which was the night out planning that the goal could be more than that individual request that came a pizza ordering could mean a night in event with your kids in the house and you're so this is welcome to the world of conversational AI this is super exciting because it's not the academic problem of NLP of natural language processing understanding dialogue this is like real world the stakes are high in a sense that customers get frustrated quickly people get frustrated quickly so you have to get it right you have to get that interaction right so I love it but so from that perspective what are the challenges today what are the problems that really need to be solved and yes here's I think first and foremost as I mentioned that get the basics right is still true basically even the one-shot requests which we think of as transactional requests need to work magically no question about that like if it doesn't turn your light on and off you'll be super frustrated even if I can complete the night out for you and not do that that is unacceptable for as a customer right so
that you have to get the foundational understanding going very well the second aspect when I said more conversational is as you imagine is more about reasoning it is really about figuring out what the latent goal is of the customer based on what I have the information now and the history and what's the next best thing to do so that's a complete reasoning and decision-making problem just like your self-driving car but the goal is still more finite here your environment is super hard in self-driving and the cost of a mistake is huge here but there are certain similarities but if you think about how many decisions Alexa is making or evaluating at any given time it's a huge hypothesis space and we've only talked so far about what I think of as reactive in terms of you asked for something and Alexa is reacting to it if you bring the proactive part which is Alexa having hunches so at any given instance then it's really a decision at any given point based on the information Alexa has to determine what's the best thing it needs to do so these are the ultimate AI problem about decisions based on the information you have do you think from my perspective I work a lot with sensing of the human face do you think and we touched this topic a little bit earlier but do you think it'll be a day soon when Alexa can also look at you to help improve the quality of the hunch it has or at least detect frustration or detect you know improve the quality of its perception of what you're trying to do I mean let me again bring back to what it already does we talked about how based on you barge in over Alexa clearly it's a very high probability it must have done something wrong that's why you barged in the next extension of whether frustration is a signal or not of course is a natural thought in terms of how that should be a signal to Alexa you can get that from voice you can get from voice but it's very hard like I mean frustration as a signal historically if
you think about emotions of different kinds you know there's a whole field of affective computing something that MIT has also done a lot of research in and it is super hard and you are now talking about a far field device as in you're talking from a distance in a noisy environment and in that environment it needs to have a good sense for your emotions this is a very very hard problem very hard problem but you haven't shied away from hard problems well you know so deep learning has been at the core of a lot of this technology are you optimistic about the current deep learning approaches to solving the hardest aspects of what we're talking about or do you think there will come a time where new ideas need to for this you know if you look at reasoning so OpenAI DeepMind a lot of folks are now starting to work in reasoning trying to see how to make neural networks reason do you see that new approaches need to be invented to take the next big leap absolutely I think there has to be a lot more investment and I think in many different ways and there are these I would say nuggets of research forming in a good way like learning with less data or like zero-shot learning one-shot learning and the active learning stuff you've talked about is incredible stuff so transfer learning is also super critical especially when you're thinking about applying knowledge from one task to another or one language to another right it's really ripe so these are great pieces deep learning has been useful too and now we are sort of marrying deep learning with transfer learning and active learning of course that's more straightforward in terms of applying deep learning in an active learning setup but I do think in terms of now looking into more reasoning based approaches is going to be key for our next wave of the technology but there is a good news the good news is that I think for keeping on to delight customers that a lot of it can be done by prediction tasks yes so and so we haven't
exhausted that so we don't need to give up on the deep learning approaches for that so that's just I wanted sort of for creating a rich fulfilling amazing experience that makes Amazon a lot of money and a lot of everybody a lot of money because it does awesome things deep learning is enough the point I don't think I would say deep learning is enough I think for the purposes of Alexa accomplishing the task for customers I'm saying there are still a lot of things we can do with prediction based approaches that do not reason right I'm not saying that and we haven't exhausted those but for the kind of high utility experiences that I'm personally passionate about of what Alexa needs to do reasoning has to be solved to the same extent as you can think of natural language understanding and speech recognition to the extent of understanding intents has been how accurate it has become but reasoning we are very very early days let me ask another way how hard of a problem do you think that is hardest of them I would say hardest of them because again the hypothesis space is really really large and when you go back in time like you were saying I want Alexa to remember more things that once you go beyond a session of interaction which is by session I mean a time span which is today versus remembering which restaurant I like and then when I'm planning a night out to say do you want to go to the same restaurant now you've upped the stakes big time and this is where the reasoning dimension also goes very very big so you think the space will be elaborating that a little bit just philosophically speaking do you think when you reason about trying to model what the goal of a person is in the context of interacting with Alexa you think that space is huge it's huge absolutely you think so like another a devil's advocate would be that we human beings are really simple and we all want like just a small set of things and they're so do you think you think it's
possible cuz we're not talking about a fulfilling general conversation perhaps actually the Alexa Prize is a little bit after that creating a customer like there's so many of the interactions it feels like are clustered in groups that don't require general reasoning I think you're right in terms of the head of the distribution of all the possible things customers may want to accomplish but the tail is long and it's diverse right so from many many long tails from that perspective I think you have to solve that problem otherwise and everyone's very different like I mean we see this already in terms of the skills right I mean if you're an avid surfer which I am not right but somebody is asking Alexa about surfing conditions right and there's a skill that is there for them to get to right that tells you that the tail is massive like in terms of like what kind of skills people have created it's humongous in terms of it and which means there are these diverse needs and when you start looking at the combinations of these right even if you take pairs of skills and 90,000 choose two it's still a big set of combinations so I'm saying there's a huge to do here now and I think customers are you know wonderfully frustrated with things and then I'm gonna keep getting to do better things for that so and they're not known to be super patient so you have to do it fast you have to do it fast yeah so you've mentioned the idea of a press release the research and development at Amazon Alexa and Amazon in general you kind of think of what the future product will look like and you kind of make it happen you work backwards so can you draft for me you probably have one but can you make up one for 10 20 30 40 years out that you see the Alexa team putting out just in broad strokes something that you dream about I think let's start with the five years first okay so and I'll get to the forty years too in broad strokes I think the five year is where I mean I think of in
these spaces it's hard especially if you're in the thick of things to think beyond the five year space because a lot of things change right I mean if you ask me five years back will Alexa be here I wouldn't have I think it has surpassed my imagination of that time right so I think then from the next five years perspective from an AI perspective what we're gonna see is that notion which you said goal-oriented dialogues and open domain like Alexa Prize I think that bridge is gonna get closed they won't be different and I'll give you why that's the case you mentioned shopping how do you shop do you shop in one shot sure your AA batteries paper towels yes how long does it take for you to buy a camera you do a ton of research yeah then you make a decision so is that a goal oriented dialogue when like somebody says Alexa find me a camera is it simply inquisitiveness right so even in this something that you think of as shopping which you said you yourself use a lot of if you go beyond where it's reorders or items where you're sort of not brand conscious and so forth yeah just to comment quickly I've never bought anything through Alexa that I haven't bought before on Amazon on a desktop after I clicked on a bunch read a bunch of reviews that kind of stuff so it's repurchase so now you think even for something that you felt like is a finite goal I think the space is huge because even products the attributes are many like and you want to look at reviews some on Amazon some outside some you want to look at what CNET is saying or another consumer forum is saying about even a product for instance right so that's just shopping where you could argue the ultimate goal is sort of known and we haven't talked about Alexa what's the weather in Cape Cod this weekend right so why am I asking that weather question right so I think of it as how do you complete goals with minimum steps for our
customers right and when you think of it that way the distinction between goal-oriented and conversations for open domain say goes away I may want to know what happened in the presidential debate right and is it I'm seeking just information or am I looking at who's winning the debates right so these are all quite hard problems so even the five-year horizon problem I'm like I sure hope we'll solve these you're optimistic because that's the hard problem which part the reasoning you know enough to be able to help explore complex goals that are beyond something simplistic that feels like it could be well five years is a nice it's a nice bar for it right I think you will it's a like nice ambition and do we have press releases for that absolutely can I tell you what specifically the roadmap will be no right and will we solve all of it in the five-year space no this is we will work on this forever actually this is the hardest of the AI problems and I don't see that being solved even in a 40 year horizon because even if you limit to the human intelligence we know we are quite far from that in fact every aspect from our sensing to neural processing to how the brain stores information and how it processes it we don't yet know how to represent knowledge right so we are still in those early stages so I wanted to start that's why at the five-year because the five-year success would look like that and solving these complex goals and the forty year would be where it's just natural to talk to these in terms of more of these complex goals right now we've already come to the point where these transactions you mentioned of asking for weather or reordering something or listening to your favorite tune it's natural for you to actually say it's now unnatural to pick up your phone right and that I think is the first five-year transformation the next five year transformation would be okay I can plan my weekend with Alexa or I can plan my
next meal with Alexa or my next night out with seamless effort so just to pause and look back at the big picture of it all you're part of a large team that's creating a system that's in the home that's not human that gets to interact with human beings so we human beings we these descendants of apes have created an artificial intelligence system that's able to have conversations I mean that to me the two most transformative robots of this century I think will be autonomous vehicles but they're a little bit transformative in a more boring way it's like a tool I think conversational agents in the home is like an experience how does that make you feel that you're at the center of creating that do you sit back in awe sometimes what is your feeling about the whole mess of it can you even believe that we're able to create something like this I think it's a privilege I'm so fortunate like where I ended up right and it's been a long journey like I've been in this space for a long time in Cambridge right and it's so heartwarming to see the kind of adoption conversational agents are having now five years back it was almost like should I move out of this because we are unable to find this killer application that customers would love that would not simply be a good to have thing in research labs and it's so fulfilling to see it make a difference to millions and billions of people worldwide the good thing is it's still very early so I have another 20 years of job security doing what I love like so I think from that perspective I feel I tell every researcher this that joins or every member of my team this is a unique privilege like I think and we have and I would say not just launching Alexa in 2014 which was first of its kind along the way we have when we launched the Alexa Skills Kit it became democratizing AI when before that there was no good evidence of an SDK for speech and language now we are coming to
this where you and I are having this conversation where I'm not saying oh Lex planning a night out with an AI agent impossible I'm saying it's in the realm of possibility and not only possible we will be launching this right so some elements of that and it will keep getting better we know that is a universal truth once you have these kind of agents out there being used they get better for your customers and I think that's where I think the amount of research topics we are throwing out at our budding researchers is just gonna be exponentially hard and the great thing is you can now get immense satisfaction by having customers use it not just a paper in NeurIPS or another conference I think everyone myself included are deeply excited about that future so I don't think there's a better place to end Rohit thank you thank you so much this was fun thank you same here thanks for listening to this conversation with Rohit Prasad and thank you to our presenting sponsor Cash App download it use code LexPodcast you'll get ten dollars and ten dollars will go to FIRST a STEM education nonprofit that inspires hundreds of thousands of young minds to learn and to dream of engineering our future if you enjoy this podcast subscribe on YouTube give it five stars on Apple Podcasts support it on Patreon or connect with me on Twitter and now let me leave you with some words of wisdom from the great Alan Turing sometimes it is the people no one can imagine anything of who do the things no one can imagine thank you for listening and hope to see you next time