The inspiration was the Star Trek computer, so when you think of it that way, everything is possible. But when you launch a product you have to start someplace, and when I joined, the product was already in conception. We started working on far-field speech recognition, because that was the first thing to solve. By that we mean you should be able to speak to the device from a distance; in those days that wasn't a common practice, and even in the previous research world I was in, it was considered an unsolvable problem.

Then, in terms of whether you can converse from a distance — and here I'm still talking about the first part of the problem, where you get the attention of the device by saying what we call the wake word — the word "Alexa" has to be detected with very high accuracy, because it is a very common word, with sound units that map to words like "I like you" or "Alec," "Alex." So it's an undoubtedly hard problem to detect the right mentions of "Alexa" addressed to the device versus "I like Alexa." You have to pick up that signal when there's a lot of noise, not only our conversation in the house.

Well, on the device you're simply listening for the wake word "Alexa," and there are a lot of words being spoken in the house. How do you know it's "Alexa," and directed at Alexa? Because I could say "I love my Alexa," "I hate my Alexa," "I want Alexa to do this," and in all three sentences I said "Alexa" but I didn't want it to wake up.

Yeah.

Can I just pause for a second — I should probably, in the introduction of this conversation, tell people to turn off their Alexa device if they're listening to this podcast conversation out loud. What's the probability that an Alexa device will go off, given we mention "Alexa" a million times?

So we have done a lot of different things where we can figure out whether the
speech is coming from a human versus a device over there. Also, think about ads: we have launched technology for filtering out that kind of marketing approach. But yes, if this kind of podcast is happening, it's possible your device will wake up a few times. It's an unsolved problem, but it is definitely something we care very much about.

But the idea is you want to detect "Alexa" meant for the device.

First, even hearing "Alexa" versus "I like..." something.

And that's the fascinating part. So that was the first release — the world's best wake word detector.

Yes, the world's best wake word detector, in the far-field setting — not something where the phone is sitting on the table. People have devices forty feet away, like in my house, or twenty feet away, and you still get an answer. So that was the first part. The next is: OK, you're speaking to the device, and of course you're going to issue many different requests, some simple, some extremely hard. It's a large-vocabulary speech recognition problem, essentially, where the audio is now not coming into your phone, or a handheld mic like this, or a close-talking mic, but from twenty feet away — where, in a busy household, your son may be listening to music, your daughter may be running around asking your mom something, and so forth. This is a common household setting where the words you're speaking to Alexa need to be recognized with very high accuracy.

And right now we are still just in the recognition problem; we haven't yet come to the understanding one.

Right.

If I can pause once again — what year was this? Is this before neural networks began to seriously prove themselves in the audio space?

Yeah — so I joined in 2013, in April. The early research on neural networks coming back and showing promising results in the speech recognition space had started happening, but it
was very early. But we built on that. The very first thing we did when I joined, with the team — and remember, it was very much a start-up environment, which is what's great about Amazon — was double down on deep learning right away. We knew we would have to improve accuracy fast, and there was the scale of data: once you have a device like this, if it is successful, you will suddenly have large volumes of data to learn from to make the customer experience better. So how do you scale deep learning? We did one of the first works in training with distributed GPUs, where the training time was linear in the amount of data. That was quite important work — algorithmic improvements as well as a lot of engineering improvements — to be able to train on thousands and thousands of hours of speech. So if you ask me, back in 2013 and 2014 when we launched Echo, the combination of large-scale data, deep learning progress, and the near-infinite GPUs we had available on AWS even then all came together for us to solve far-field speech recognition to the extent it could be useful to customers. It's still not solved — it's not that we are perfect at recognizing speech — but we are great in the settings that are in homes.

And that was important even in the early stages. Trying to look back at that time — if I remember correctly, the task seemed pretty daunting, and we kind of take it for granted that it works now. You mentioned "start-up"; I wasn't familiar with how big the team was. I know there are a lot of really smart people working on Alexa now, a very, very large team. How big was the team then? How likely were you to fail, in the eyes of everyone else?

I'll give you a very interesting anecdote
on that. When I joined, the speech recognition team was six people. By my first meeting we had hired a few more; it was ten people, and nine out of ten thought it couldn't be done.

Who was the one?

The one was me. Actually, I should say one was maybe optimistic, and eight were trying to convince me: let's go to the management and say let's not work on this problem, let's work on some other problem, like telephony speech for customer-service calls, and so forth. But this was the kind of belief you must have. I had experience with far-field speech recognition, and my eyes lit up when I saw a problem like that: in speech recognition we had always been looking for that killer app, and this was a killer use case to bring something delightful into the hands of customers.

You mentioned the way you think of a product: in the future you have a press release and an FAQ, and you think backwards. Did you — did the team — have the Echo in mind? This far-field speech recognition, actually putting a thing in the home that works, that you're able to interact with — was that the press release? What was the vision?

Close, I would say. As I said, the vision — or the inspiration — was the Star Trek computer, and from there, I can't divulge all the exact specifications. But one of the first things that was magical on Alexa was music. It brought me back to music, because my taste is still from when I was an undergrad, so I still listen to those songs, and it was too hard for me to be a music fan with a phone. And I hate things in my ear, so from that perspective it was quite hard, and music was part of at least the documents I have seen. So in terms of how far we are from the original vision, I can't reveal that, but that's why I have so much fun at work: every day we go in thinking, these are the new set of challenges to solve.

That's a great way to
do great engineering: you think of the product press release. I like that idea, actually — maybe we'll talk about it a bit later. It's just a super nice way to stay focused.

I'll tell you this: a lot of my scientists have adopted that, and they now love it as a process, because as a scientist you're trained to write great papers, but those come after you've done the research or proven it. Your PhD dissertation proposal is what comes closest — a DARPA proposal or an NSF proposal is the closest thing to a press release. That process is now ingrained in our scientists, which is delightful for me to see.

You write the paper first, then make it happen.

That's right.

Without state-of-the-art results — you leave the results section open.

Well, you have a thesis about here's what I expect, and here's what it will change. So I think it is a great thing; it works for researchers as well.

So, far-field recognition: what was the big leap? What were the breakthroughs, and what was that journey like, to today?

Yeah — as you said, first there was a lot of skepticism on whether far-field speech recognition would ever work well enough. What we first did was get a lot of training data in a far-field setting, and that was extremely hard to get because none of it existed.

So how do you collect data in a far-field setup, with no customer base?

There's no customer base, right. So that was the first innovation. Once we had that, the next thing was — OK, if you have the data — well, first of all, we didn't talk about what "magical" would mean in this kind of setting: what is good enough for customers? Since you've never done this before, what would be magical? So it wasn't just a research problem; you had to put some stakes in the ground, in terms of accuracy and customer-experience features, saying here's where I think it should get to. So you established a
bar — and then, how do you measure progress toward it, given you have no customers right now? From that perspective, first was the data without customers; second was doubling down on deep learning as a way to learn. And I can just tell you that the combination of the two cut our error rates by a factor of five from where we were when I started, within six months of having that data. At that point I got the conviction that this would work, because that was magical when it started working, and it came close to the bar we felt would be where people would use it. That was critical, because you really have one chance at this: we launched in November 2014, and if it had been below the bar, I don't think this category exists if you don't meet the bar.

Yeah. Just having looked at voice-based interactions, like in the car, or earlier systems — it's a source of huge frustration for people. In fact, we used voice-based interaction for collecting data on subjects to measure frustration, as a training set for computer vision on face data, so we could get a dataset of frustrated people. The best way to get frustrated people is having them interact with a voice-based system in the car. So that bar, I imagine, is pretty high.

It was very high, and we've talked about how errors are perceived from AIs versus errors by humans. But we were not done with the problems we had to solve to get it to launch.

So, do you want to go to the next one?

The next one was what I think of as multi-domain natural language understanding. I wouldn't say it's easy, but in those days, solving understanding in one narrow domain was doable. For multiple domains, though — music, information, other kinds of household productivity, alarms, timers — even though it wasn't as big as it is now in terms of the number of skills Alexa has, and the confusion space has
grown by three orders of magnitude, it was still daunting even in those days. And again, no customer base here.

Again, no customer base.

Now you're looking at meaning understanding and intent understanding, and taking actions on behalf of customers based on their requests. That is the next hard problem: even if you have gotten the words recognized, how do you make sense of them? In those days there was still a lot of emphasis on rule-based systems, on writing grammar patterns to understand the intent, but we took a statistical-first approach. Even in those starting days, our language understanding had an entity recognizer and an intent classifier, which were all trained statistically. In fact, we had to build the deterministic matching as a fallback to fix bugs that statistical models have. So it was just a different mindset, where we focused on data-driven statistical understanding.

Which wins in the end, if you have a huge dataset.

Yes, it is contingent on that, and that's why it came back to how you get the data before customers — that is why data becomes crucial — to get to the point where you have the understanding system built up. And notice that here we were talking about human-machine dialogue. Even in those early days it was very much transactional: do one thing, one-shot utterances. There was a lot of debate on how much Alexa should talk back if it misunderstood you. Say you said "play songs by the Stones," and let's say it doesn't know — in the early days, knowledge can be sparse — who "the Stones" are. The Rolling Stones, right? And you don't want the match to be Stone Temple Pilots when it's the Rolling Stones; you don't know which one it is. So you need these kinds of other signals — and there we had great assets from Amazon, in terms of UX.

What kind of — yeah, how do you solve that problem?

In terms of what we think of as an entity resolution problem. So, one — I mean,
even if you figured out "the Stones" is an entity, you have to resolve whether it's the Rolling Stones or Stone Temple Pilots or some other Stones.

Maybe I misunderstood — is the resolution the job of the algorithm, or the job of the UX, communicating with the human to help there as well?

There is both. You want ninety percent, or high nineties, to be done without any further questioning or UX. But it's absolutely OK — just as humans ask, "I didn't understand you" — it's fine for Alexa to occasionally say "I did not understand you," and that's an important way to learn. I'll talk later about where we have come with more self-learning from these kinds of feedback signals. But in those days, just solving the ability to understand the intent and resolve it to an action — where an action could be playing a particular artist or a particular song — was super hard. And again, the bar was high, as we were talking about. So we launched with thirteen big domains — or what we think of as the big skills; music was a massive one when we launched — and now we have 90,000-plus skills on Alexa.
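The wake-word gating discussed at the start of the conversation can be sketched roughly as follows. This is a toy illustration, not Amazon's actual detector: the per-frame scores, the smoothing window, and the thresholds are all invented. The idea is that a compact on-device model scores each audio frame for "Alexa"-likeness, and the device only wakes when a smoothed score clears a high bar, so incidental mentions with diffuse scores do not trigger it.

```python
# Toy wake-word gating: wake only when a smoothed per-frame score
# clears a high threshold. Scores and thresholds are invented.
from collections import deque

def detect_wake_word(frame_scores, threshold=0.85, window=3):
    """Return frame indices where the smoothed score crosses the threshold."""
    recent = deque(maxlen=window)
    detections = []
    armed = True                      # re-arm only after the score falls again
    for i, score in enumerate(frame_scores):
        recent.append(score)
        smoothed = sum(recent) / len(recent)
        if armed and smoothed >= threshold:
            detections.append(i)      # wake the device at this frame
            armed = False
        elif smoothed < threshold / 2:
            armed = True
    return detections

# An incidental mention in conversation: scores rise but stay diffuse.
background = [0.1, 0.3, 0.5, 0.4, 0.2, 0.1]
# A direct "Alexa": a sharp, sustained run of high scores.
direct = [0.2, 0.8, 0.95, 0.97, 0.96, 0.3]

print(detect_wake_word(background))   # []
print(detect_wake_word(direct))       # [3]
```

A real detector replaces the hand-written scores with a neural classifier and adds cloud-side verification, but the gating logic — smooth, threshold, re-arm — is the same shape.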
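The entity-resolution step at the end of the conversation — resolve "the Stones" against a catalog, and ask a clarifying question only when the match is ambiguous — can be sketched like this. The catalog, string-similarity scoring, and thresholds are invented for illustration; production systems use far richer signals (popularity, listening history, phonetics).

```python
# Toy entity resolution: rank catalog candidates by string similarity;
# act when confident, otherwise ask the user. All thresholds invented.
from difflib import SequenceMatcher

CATALOG = ["The Rolling Stones", "Stone Temple Pilots", "Stone Sour"]

def resolve_entity(mention, catalog=CATALOG, accept=0.6, margin=0.15):
    scored = sorted(
        ((SequenceMatcher(None, mention.lower(), name.lower()).ratio(), name)
         for name in catalog),
        reverse=True,
    )
    (best, name), (second, _) = scored[0], scored[1]
    if best >= accept and best - second >= margin:
        return ("play", name)                   # confident: act without asking
    return ("clarify", [n for _, n in scored[:2]])  # ambiguous: ask the user

print(resolve_entity("the rolling stones"))  # confident match -> play
print(resolve_entity("the stones"))          # close scores -> clarify
```

The margin test captures the point made in the interview: most requests should resolve silently, but when two candidates score too close together, saying "I did not understand you" (or offering the top choices) is both acceptable and a learning signal.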