The inspiration was the Star Trek computer, so when you think of it that way, everything is possible. But when you launch a product you have to start someplace, and when I joined, the product was already in conception. We started working on far-field speech recognition, because that was the first thing to solve. By that we mean you should be able to speak to the device from a distance; in those days that wasn't a common practice, and even in the previous research world I was in, it was considered an unsolvable problem.

Then, in terms of whether you can converse from a distance — and here I'm still talking about the first part of the problem, where you get the attention of the device by saying what we call the wake word — the word "Alexa" has to be detected with very high accuracy, because it is a very common word, with sound units that map to words like "I like you" or "Alec," "Alex." So it's an undoubtedly hard problem to detect the right mentions of "Alexa" addressed to the device versus "I like Alexa." You have to pick up that signal when there's a lot of noise, not only our conversation in the house.

Well, on the device you're simply listening for the wake word "Alexa," and there are a lot of words being spoken in the house. How do you know it's "Alexa," and directed at Alexa? Because I could say "I love my Alexa," "I hate my Alexa," "I want Alexa to do this," and in all three sentences I said "Alexa" but I didn't want it to wake up.

Yeah.

Can I just pause for a second — I should probably, in the introduction of this conversation, tell people to turn off their Alexa device if they're listening to this podcast conversation out loud. What's the probability that an Alexa device will go off, given we mention "Alexa" a million times?

So we have done a lot of different things where we can figure out whether the
speech is coming from a human versus a device over there. Also, think about ads: we have launched technology for filtering out that kind of marketing approach. But yes, if this kind of podcast is happening, it's possible your device will wake up a few times. It's an unsolved problem, but it is definitely something we care very much about.

But the idea is you want to detect "Alexa" meant for the device.

First, even hearing "Alexa" versus "I like..." something.

And that's the fascinating part. So that was the first release — the world's best wake word detector.

Yes, the world's best wake word detector, in the far-field setting — not something where the phone is sitting on the table. People have devices forty feet away, like in my house, or twenty feet away, and you still get an answer. So that was the first part. The next is: OK, you're speaking to the device, and of course you're going to issue many different requests, some simple, some extremely hard. It's a large-vocabulary speech recognition problem, essentially, where the audio is now not coming into your phone, or a handheld mic like this, or a close-talking mic, but from twenty feet away — where, in a busy household, your son may be listening to music, your daughter may be running around asking your mom something, and so forth. This is a common household setting where the words you're speaking to Alexa need to be recognized with very high accuracy.

And right now we are still just in the recognition problem; we haven't yet come to the understanding one.

Right.

If I can pause once again — what year was this? Is this before neural networks began to seriously prove themselves in the audio space?

Yeah — so I joined in 2013, in April. The early research on neural networks coming back and showing promising results in the speech recognition space had started happening, but it
was very early. But we built on that. The very first thing we did when I joined, with the team — and remember, it was very much a start-up environment, which is what's great about Amazon — was double down on deep learning right away. We knew we would have to improve accuracy fast, and there was the scale of data: once you have a device like this, if it is successful, you will suddenly have large volumes of data to learn from to make the customer experience better. So how do you scale deep learning? We did one of the first works in training with distributed GPUs, where the training time was linear in the amount of data. That was quite important work — algorithmic improvements as well as a lot of engineering improvements — to be able to train on thousands and thousands of hours of speech. So if you ask me, back in 2013 and 2014 when we launched Echo, the combination of large-scale data, deep learning progress, and the near-infinite GPUs we had available on AWS even then all came together for us to solve far-field speech recognition to the extent it could be useful to customers. It's still not solved — it's not that we are perfect at recognizing speech — but we are great in the settings that are in homes.

And that was important even in the early stages. Trying to look back at that time — if I remember correctly, the task seemed pretty daunting, and we kind of take it for granted that it works now. You mentioned "start-up"; I wasn't familiar with how big the team was. I know there are a lot of really smart people working on Alexa now, a very, very large team. How big was the team then? How likely were you to fail, in the eyes of everyone else?

I'll give you a very interesting anecdote
on that. When I joined, the speech recognition team was six people. By my first meeting we had hired a few more; it was ten people, and nine out of ten thought it couldn't be done.

Who was the one?

The one was me. Actually, I should say one was maybe optimistic, and eight were trying to convince me: let's go to the management and say let's not work on this problem, let's work on some other problem, like telephony speech for customer-service calls, and so forth. But this was the kind of belief you must have. I had experience with far-field speech recognition, and my eyes lit up when I saw a problem like that: in speech recognition we had always been looking for that killer app, and this was a killer use case to bring something delightful into the hands of customers.

You mentioned the way you think of a product: in the future you have a press release and an FAQ, and you think backwards. Did you — did the team — have the Echo in mind? This far-field speech recognition, actually putting a thing in the home that works, that you're able to interact with — was that the press release? What was the vision?

Close, I would say. As I said, the vision — or the inspiration — was the Star Trek computer, and from there, I can't divulge all the exact specifications. But one of the first things that was magical on Alexa was music. It brought me back to music, because my taste is still from when I was an undergrad, so I still listen to those songs, and it was too hard for me to be a music fan with a phone. And I hate things in my ear, so from that perspective it was quite hard, and music was part of at least the documents I have seen. So in terms of how far we are from the original vision, I can't reveal that, but that's why I have so much fun at work: every day we go in thinking, these are the new set of challenges to solve.

That's a great way to
do great engineering: you think of the product press release. I like that idea, actually — maybe we'll talk about it a bit later. It's just a super nice way to stay focused.

I'll tell you this: a lot of my scientists have adopted that, and they now love it as a process, because as a scientist you're trained to write great papers, but those come after you've done the research or proven it. Your PhD dissertation proposal is what comes closest — a DARPA proposal or an NSF proposal is the closest thing to a press release. That process is now ingrained in our scientists, which is delightful for me to see.

You write the paper first, then make it happen.

That's right.

Without state-of-the-art results — you leave the results section open.

Well, you have a thesis about here's what I expect, and here's what it will change. So I think it is a great thing; it works for researchers as well.

So, far-field recognition: what was the big leap? What were the breakthroughs, and what was that journey like, to today?

Yeah — as you said, first there was a lot of skepticism on whether far-field speech recognition would ever work well enough. What we first did was get a lot of training data in a far-field setting, and that was extremely hard to get because none of it existed.

So how do you collect data in a far-field setup, with no customer base?

There's no customer base, right. So that was the first innovation. Once we had that, the next thing was — OK, if you have the data — well, first of all, we didn't talk about what "magical" would mean in this kind of setting: what is good enough for customers? Since you've never done this before, what would be magical? So it wasn't just a research problem; you had to put some stakes in the ground, in terms of accuracy and customer-experience features, saying here's where I think it should get to. So you established a
bar — and then, how do you measure progress toward it, given you have no customers right now? From that perspective, first was the data without customers; second was doubling down on deep learning as a way to learn. And I can just tell you that the combination of the two cut our error rates by a factor of five from where we were when I started, within six months of having that data. At that point I got the conviction that this would work, because that was magical when it started working, and it came close to the bar we felt would be where people would use it. That was critical, because you really have one chance at this: we launched in November 2014, and if it had been below the bar, I don't think this category exists if you don't meet the bar.

Yeah. Just having looked at voice-based interactions, like in the car, or earlier systems — it's a source of huge frustration for people. In fact, we used voice-based interaction for collecting data on subjects to measure frustration, as a training set for computer vision on face data, so we could get a dataset of frustrated people. The best way to get frustrated people is having them interact with a voice-based system in the car. So that bar, I imagine, is pretty high.

It was very high, and we've talked about how errors are perceived from AIs versus errors by humans. But we were not done with the problems we had to solve to get it to launch.

So, do you want to go to the next one?

The next one was what I think of as multi-domain natural language understanding. I wouldn't say it's easy, but in those days, solving understanding in one narrow domain was doable. For multiple domains, though — music, information, other kinds of household productivity, alarms, timers — even though it wasn't as big as it is now in terms of the number of skills Alexa has, and the confusion space has
grown by three orders of magnitude, it was still daunting even in those days. And again, no customer base here.

Again, no customer base.

Now you're looking at meaning understanding and intent understanding, and taking actions on behalf of customers based on their requests. That is the next hard problem: even if you have gotten the words recognized, how do you make sense of them? In those days there was still a lot of emphasis on rule-based systems, on writing grammar patterns to understand the intent, but we took a statistical-first approach. Even in those starting days, our language understanding had an entity recognizer and an intent classifier, which were all trained statistically. In fact, we had to build the deterministic matching as a fallback to fix bugs that statistical models have. So it was just a different mindset, where we focused on data-driven statistical understanding.

Which wins in the end, if you have a huge dataset.

Yes, it is contingent on that, and that's why it came back to how you get the data before customers — that is why data becomes crucial — to get to the point where you have the understanding system built up. And notice that here we were talking about human-machine dialogue. Even in those early days it was very much transactional: do one thing, one-shot utterances. There was a lot of debate on how much Alexa should talk back if it misunderstood you. Say you said "play songs by the Stones," and let's say it doesn't know — in the early days, knowledge can be sparse — who "the Stones" are. The Rolling Stones, right? And you don't want the match to be Stone Temple Pilots when it's the Rolling Stones; you don't know which one it is. So you need these kinds of other signals — and there we had great assets from Amazon, in terms of UX.

What kind of — yeah, how do you solve that problem?

In terms of what we think of as an entity resolution problem. So, one — I mean,
even if you figured out "the Stones" is an entity, you have to resolve whether it's the Rolling Stones or Stone Temple Pilots or some other Stones.

Maybe I misunderstood — is the resolution the job of the algorithm, or the job of the UX, communicating with the human to help there as well?

There is both. You want ninety percent, or high nineties, to be done without any further questioning or UX. But it's absolutely OK — just as humans ask, "I didn't understand you" — it's fine for Alexa to occasionally say "I did not understand you," and that's an important way to learn. I'll talk later about where we have come with more self-learning from these kinds of feedback signals. But in those days, just solving the ability to understand the intent and resolve it to an action — where an action could be playing a particular artist or a particular song — was super hard. And again, the bar was high, as we were talking about. So we launched with thirteen big domains — or what we think of as the big skills; music was a massive one when we launched — and now we have 90,000-plus skills on Alexa.
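The wake-word gating discussed at the start of the conversation can be sketched roughly as follows. This is a toy illustration, not Amazon's actual detector: the per-frame scores, the smoothing window, and the thresholds are all invented. The idea is that a compact on-device model scores each audio frame for "Alexa"-likeness, and the device only wakes when a smoothed score clears a high bar, so incidental mentions with diffuse scores do not trigger it.

```python
# Toy wake-word gating: wake only when a smoothed per-frame score
# clears a high threshold. Scores and thresholds are invented.
from collections import deque

def detect_wake_word(frame_scores, threshold=0.85, window=3):
    """Return frame indices where the smoothed score crosses the threshold."""
    recent = deque(maxlen=window)
    detections = []
    armed = True                      # re-arm only after the score falls again
    for i, score in enumerate(frame_scores):
        recent.append(score)
        smoothed = sum(recent) / len(recent)
        if armed and smoothed >= threshold:
            detections.append(i)      # wake the device at this frame
            armed = False
        elif smoothed < threshold / 2:
            armed = True
    return detections

# An incidental mention in conversation: scores rise but stay diffuse.
background = [0.1, 0.3, 0.5, 0.4, 0.2, 0.1]
# A direct "Alexa": a sharp, sustained run of high scores.
direct = [0.2, 0.8, 0.95, 0.97, 0.96, 0.3]

print(detect_wake_word(background))   # []
print(detect_wake_word(direct))       # [3]
```

A real detector replaces the hand-written scores with a neural classifier and adds cloud-side verification, but the gating logic — smooth, threshold, re-arm — is the same shape.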
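The entity-resolution step at the end of the conversation — resolve "the Stones" against a catalog, and ask a clarifying question only when the match is ambiguous — can be sketched like this. The catalog, string-similarity scoring, and thresholds are invented for illustration; production systems use far richer signals (popularity, listening history, phonetics).

```python
# Toy entity resolution: rank catalog candidates by string similarity;
# act when confident, otherwise ask the user. All thresholds invented.
from difflib import SequenceMatcher

CATALOG = ["The Rolling Stones", "Stone Temple Pilots", "Stone Sour"]

def resolve_entity(mention, catalog=CATALOG, accept=0.6, margin=0.15):
    scored = sorted(
        ((SequenceMatcher(None, mention.lower(), name.lower()).ratio(), name)
         for name in catalog),
        reverse=True,
    )
    (best, name), (second, _) = scored[0], scored[1]
    if best >= accept and best - second >= margin:
        return ("play", name)                   # confident: act without asking
    return ("clarify", [n for _, n in scored[:2]])  # ambiguous: ask the user

print(resolve_entity("the rolling stones"))  # confident match -> play
print(resolve_entity("the stones"))          # close scores -> clarify
```

The margin test captures the point made in the interview: most requests should resolve silently, but when two candidates score too close together, saying "I did not understand you" (or offering the top choices) is both acceptable and a learning signal.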