MIT 6.S093: Introduction to Human-Centered Artificial Intelligence (AI)
bmjamLZ3v8A • 2019-04-24
Welcome to Human-Centered Artificial Intelligence. The last couple of decades of developments in deep learning have been exciting in the problems we've been able to automate, in the problems we've been able to crack with learning-based methods. One of the ideas underlying this lecture and the following lectures is that with the purely learning-based approach we have been using, there are certain aspects fundamental to our reality where we're going to hit a wall, and that we have to integrate, incorporate, the human being deeply into learning-based systems in order to make those systems learn well and operate in the real world.

The first underlying prediction behind the idea of human-centered AI in this century is that the learning-based approaches that have been successful over the past two decades, approaches like deep learning that learn from data, are going to continue to get better and to dominate real-world applications. As opposed to fine-tuned, optimization-based models that do not learn from data, more and more we're going to see learning-based methods dominate real-world applications. That's the underlying prediction we're working with.

Now, if that's the case, the corollary is this: if learning-based methods are the solution to many of these real-world problems, then the way we get smarter AI systems is by improving the machine learning and the machine teaching. Machine learning is the thing we've been talking about quite a bit: that's deep learning, the algorithms, the optimization of neural network parameters where you learn from data. That's the current focus of the community, the current focus of the research, and the thing behind the success of much of the development in deep learning. And then there's machine teaching.
That's the human-centered part. It's optimization too, but it's optimizing not the models, not the algorithms, but how you select the data from which the algorithms learn. It's making better teachers. Just like when you yourself are learning, as a student or as a child, how to operate in this world, the world and the parents and the teachers around you are informing you with very sparse information, but they provide the kind of information that is most useful for your learning process. The selection of data from which to learn is, I believe, the critical direction of research we have to solve in order to create truly intelligent systems, ones that are able to work in the real world. And I'll explain why, and the ways in which I'm referring to the implications of learning-based systems.
So when you have a learning system, a system that learns from data, neural networks, machine learning that learns from data, the fundamental reality is that the model is trying to generalize across the entirety of the reality it will be tasked with operating in, based on a very small subset of samples from that reality. That generalization means there's always going to be a degree of uncertainty, always a degree of incomplete information. And so, no matter how much we want them to be, these systems will not be provably safe: we can't put anything concrete down that guarantees safety in some specific way unless the setting is extremely constrained. Therefore we need human supervision of these systems. The systems will not be provably fair, from an ethics perspective, from a discrimination perspective, from all degrees of fairness. Therefore we need human supervision of these systems. And they will not be explainable at every step of the pipeline in which they made their decisions; AI systems will not be perfectly explainable to our satisfaction as human supervisors. So there again, human supervision will constantly be required. The solution to this is a whole set of techniques, a whole set of ideas, that we're putting under the flag of human-centered artificial intelligence, human-centered AI. The core idea is that we need to integrate the human being deeply into the annotation process and deeply into the human supervision of the real-world operation of the system: both in the training phase and in the testing phase, the execution, the operation of the system.
This is what deep learning looks like with the human out of the loop: the human contributes to a learning model by helping annotate some data, that data is then used to train a model that hopefully generalizes to the real world, and that model makes decisions. Deep learning is really exciting because, with a greater and greater degree of autonomy, it's able to form high-level representations of the raw data in a way that does quite well on certain kinds of tasks that were previously very difficult. But fundamentally, the human is out of the loop, both in the training and in the operation. First you build the dataset and annotate it, and then the systems run away with it: they train on that data, and the real-world operation does not involve the human except as the recipient of the service the system provides. Now, the human-in-the-loop version of that, the human-centered version, means that both the annotation and the operation of the system are aided by human beings in a deep way. What does that mean? We can look at human experts, at individuals, and at crowd intelligence: the wisdom of the crowd and the wisdom of the individual.
At the training phase, the first part is objective annotation. We need to significantly improve objective annotation, meaning annotation where human intelligence is sufficient to look at a sample and annotate it. This is what we think about with ImageNet and all the basic computer vision tasks, where a single human is enough to do a pretty damn good job of determining what's in a particular sample. And then there's subjective annotation: things that are difficult for a single human to determine but that a crowd can converge on. These are difficult questions. At the low level, these are questions of emotion, things that are a little bit fuzzy and require multiple people to annotate. At the high level are ethical questions about decisions that an AI system is tasked with making, or that we're tasked with making, that nobody really knows the right answer to, but that a crowd can kind of converge toward the right answer on. That's where crowd intelligence comes in at the data annotation step. Now, in the operation phase, once you've trained the model, the supervision of the system, and I'll give more concrete examples of this, based on the wisdom of the individual is, for example, operating an autonomous vehicle: a single driver is tasked with supervising the decisions of that AI system. That's a critical step for a learning-based system that's not guaranteed to be safe and not guaranteed to be explainable. And the subjective side, where crowd intelligence is required and a single person is not able to make the call: these are, again, ethical questions about the operation of autonomous systems, the supervision of autonomous vehicles, the supervision of systems in medical diagnosis, in medicine in general. This is AI operating in the real world, making ethical decisions that are fundamentally difficult for humans to make, and that's where crowd intelligence needs to come in.
And so we have to transform the machine learning problem by integrating the human being. First, up top, in the training process: on the left is the usual machine learning formulation of a human being doing brute-force annotation of some kind of dataset, cats and dogs in ImageNet, segmentation in the Cityscapes dataset, video action recognition in the YouTube dataset. Given the dataset, humans put in a lot of expensive labor to annotate what's going on in that data, and then the machine learns. The flip side of that, the machine teaching side, the human-centered side, is that the machine, the learning model, the learning algorithm, and we're talking mostly about neural networks here, is instead tasked with selecting the subset, the small, sparse subsets of the data, that are most useful for the human to annotate. So instead of the human doing the brute-force annotation task first, the machine queries the human. This is the field called machine teaching: the machine queries the human with questions, and the task, and this is a wide-open research field, is to reduce by several orders of magnitude the amount of data that needs to be annotated.
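The querying step described here is most often realized as uncertainty sampling from the active-learning literature: the model scores its unlabeled pool by predictive entropy and asks the human to annotate only the most ambiguous samples. A minimal sketch, with invented toy probabilities (the lecture doesn't prescribe a particular selection criterion):

```python
import math

def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, k):
    """Return indices of the k unlabeled samples the model is most
    uncertain about; these are the ones worth a human's time."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# The model is confident about samples 0 and 2, unsure about 1 and 3,
# so 1 and 3 are the queries sent to the annotator.
pool = [[0.98, 0.02], [0.55, 0.45], [0.01, 0.99], [0.60, 0.40]]
queries = select_queries(pool, k=2)
```

In a full loop the model would be retrained after each batch of human answers and the remaining pool re-ranked.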
On the real-world operation side, the integration of the human looks like this. On the left, the machine, now trained with a learning model, makes decisions, and the human living in this world receives the service provided by the machine, whether that's a medical diagnosis, an autonomous vehicle, or a system that determines whether you get a loan, and so on. In the human-centered version, the machine makes a decision but is able to provide a degree of uncertainty. That's one of the big requirements: being able to specify a degree of uncertainty for that decision, such that when confidence drops below a certain threshold, human supervision is sought. And again, in that decision, whether it's a costly decision financially or a costly decision in terms of human life, human supervision is sought, and the service is received by the humans, by the very same humans that are providing the supervision, or by another set of humans, but ultimately the decision is overseen by human beings.
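This confidence-gated supervision is, in its simplest form, a classifier with a reject option. A toy sketch (the 0.9 threshold is an invented placeholder; in practice it would be tuned to the financial or human cost of a wrong decision):

```python
def decide_or_defer(probs, threshold=0.9):
    """Act on the prediction only when the model is confident enough;
    otherwise hand the decision to a human supervisor."""
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), None   # machine decides
    return None, "human"                       # supervision is sought

auto = decide_or_defer([0.97, 0.03])       # confident: machine acts
deferred = decide_or_defer([0.55, 0.45])   # uncertain: ask a human
```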
This is what I believe is going to be the defining mode of operation for AI systems in the 21st century: as much as we'd like to create perfect AI systems, we won't be able to escape the need to work together with human beings at every step. There are five areas of research, grand challenges, that define human-centered AI. I'll focus on a few today, and on one very much so, and even with that degree of heavy pruning we have 120 slides, so I'll skip around. On human-centered AI during the learning phase, there are the methods, the research arm, of machine teaching: how do we improve supervised learning so that, as opposed to needing ten thousand, a hundred thousand, a million examples, we reduce that to where the algorithm queries only the essential elements and is able to learn effectively from very little information, from very few samples? Just like we do when we're students learning the fundamental aspects of math, language, and so on: we just need a few examples, but those examples are critical to our understanding. The second part is reward engineering: during the learning process, injecting the human being into the definition of the loss function, of what's good and what's bad. Systems that have to operate in the real world have to understand what our society deems good and bad, and we're not always good at injecting that at the very beginning, so there has to be a continuous process of adjusting the rewards, of reward re-engineering, by humans, so that we can encode human values into the learning process.
On the second part, human-centered AI during real-world operation, once the system is actually trained, there is the interactive element of robots and humans working together. The part I'll focus on quite a bit today, because there's been quite a lot of development and progress on the deep learning side, is human sensing: algorithms that understand the human being, algorithms that take raw information, whether it's video, audio, or text, and begin to get a context, a measure of the state of the human being, in the short term and in the long term over time, both the temporal understanding and the instantaneous understanding. Then there is the interaction aspect. Once you understand the human, that's the perception problem; then you have to interact with them, and interact in a way that is continuous, collaborative, and a rich, meaningful experience. We're in the very early days of creating anything like rich, meaningful experiences with AI systems, especially learning-based AI systems. And then there's safety in real-world operation, safety and ethics. The results of the rewards that were engineered during the learning process now come to fruition, and we need to make sure that the trained model does not result in things that are highly detrimental or catastrophic to our safety, or highly detrimental to what we deem good and bad in society: discrimination, ethical considerations, and all those kinds of things, the gray area, the line we all walk as a society. With crowd intelligence, we have to provide bounds on AI systems, and there's an entire body of work here; I'll mention what we're doing in that area.
So, first, on the machine teaching side, efficient supervised learning. I'd like to do one slide on each of these areas, to give you an idea, and to do two things for each area, which we will elaborate on in future lectures, and some of which I'll elaborate on today. First, the near-term directions of research, the things that are within our reach now; and second, a sort of thought experiment, a grand challenge that, if we can do it, will be damn impressive, a definition of real progress in the area.
The near-term directions of research for machine teaching, for improved supervised learning: integrating the human into the annotation process means that instead of annotating by brute force, we annotate by asking the human questions. We have to transform the way we do annotation so that the process is not defining the dataset and then going through the entire dataset; instead, a machine teaching system queries the user with questions to annotate. On the algorithm side, active learning, data augmentation, one-shot and zero-shot learning, and self-play are all areas of work where we can be more clever about the way we use data and select the data on which to train. Active learning is actively selecting, during the training process, which part of the data to train on and annotate. Data augmentation is taking things that have been supervised by a human and expanding them, modifying the data, warping the data in interesting ways, such that it expands, it multiplies, the human effort that was injected into helping understand what's in the data. One-shot learning and zero-shot learning are all in the transfer learning category. And self-play is in the reinforcement learning area, where the system constructs a model of the world, goes alone into a room, so to speak, and plays with that model to try to figure out the different constraints of the model: how do you achieve good things?
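Data augmentation as described, multiplying one human annotation into several training samples via label-preserving transforms, can be sketched on a toy 2x2 "image" like this (the transforms and names are illustrative, not a specific library's API):

```python
import random

def flip_horizontal(img):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in img]

def jitter(img, scale=0.1, seed=0):
    """Add small seeded random noise to every pixel."""
    rng = random.Random(seed)
    return [[p + rng.uniform(-scale, scale) for p in row] for row in img]

def augment(img, label):
    """Expand one annotated sample into four training samples,
    all inheriting the same human-provided label."""
    variants = [img,
                flip_horizontal(img),
                jitter(img),
                jitter(flip_horizontal(img), seed=1)]
    return [(v, label) for v in variants]

sample = [[0.0, 1.0],
          [1.0, 0.0]]
augmented = augment(sample, label="cat")   # one human label, four samples
```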
An example grand challenge here that would define serious progress in the field: take ImageNet or COCO, the ImageNet challenge or the COCO object detection challenge, and, training only on a totally different kind of data, achieve state-of-the-art results. So, training only on Wikipedia, with the text and images that are there on Wikipedia, be able to perform object detection on the state-of-the-art benchmark of COCO. COCO is a dataset of different objects with rich annotation of the localization of those objects. That, I believe, is exactly the kind of challenge for which all the problems in transfer learning, efficient data annotation, and machine teaching have to be solved. Another challenge, if we simplify it further: achieve 0.3 percent error on MNIST, the handwritten digit recognition task that everybody always uses as an example. So achieve very good, state-of-the-art accuracy by training on only a single example of each digit, as opposed to training on thousands. Training on one example is something that most of us humans can do: given one example of each character of a new language you haven't seen before, after studying them for a little bit, you're able to classify future characters at high accuracy.
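A common route to this kind of one-shot setup, a metric-learning approach rather than anything the lecture prescribes, is to learn an embedding on other alphabets and classify a new character by nearest neighbor against the single labeled example per class. The classification step, assuming the embeddings already exist:

```python
def one_shot_classify(query_emb, support):
    """support maps each class label to the embedding of its single
    labeled example; return the label of the nearest one."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(support, key=lambda label: sq_dist(query_emb, support[label]))

# Invented 2-D embeddings of one example character per class.
support = {"alpha": [1.0, 0.0], "beta": [0.0, 1.0]}
pred = one_shot_classify([0.9, 0.2], support)   # closest to "alpha"
```

The heavy lifting is in training the embedding network so that same-class characters land close together; this snippet only shows the one-example decision rule.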
The second part of the learning process where the human needs to be injected: the near-term direction of research there is reward engineering, and the continuous tuning of those rewards by human beings. OpenAI is doing quite a bit of work here. Here's a game played by a human and an AI; it's really my favorite example of this. On the left, a human is controlling a boat that's finishing a race. On the right is our RL agent, a reinforcement learning agent, controlling a boat that's trying not to finish the race, trying to maximize the reward defined initially by a human being. What it finds is that you can get much more reward by collecting the green turbos that appear than by finishing the race; it realizes that finishing the race actually gets in the way of maximizing reward. That's the unintended consequence of a reward function that was specified previously, and most human supervisors of this result would be able to re-engineer the reward function to get the robot, the AI system here, to finish the race. That kind of continuous monitoring of the performance of the system during the training process is a near-term direction of research that DeepMind, OpenAI, and we ourselves are taking on.
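The boat-race failure and its human fix can be caricatured as two reward functions. All the weights below are invented for illustration; the point is that the supervisor's re-engineering shifts weight from the proxy signal (turbo pickups) onto the intended objective (progress and finishing):

```python
def misaligned_reward(turbos, progress):
    """Reward as first specified: pickups dominate everything."""
    return 10.0 * turbos

def reengineered_reward(turbos, progress, finished):
    """Human-adjusted reward: progress and finishing dominate pickups."""
    return 0.5 * turbos + 1.0 * progress + (100.0 if finished else 0.0)

# Under the first reward, looping forever collecting turbos beats racing.
loop_forever = misaligned_reward(turbos=50, progress=0.2)
finish_race = misaligned_reward(turbos=5, progress=1.0)

# After re-engineering, finishing the race is worth more than looping.
loop_fixed = reengineered_reward(turbos=50, progress=0.2, finished=False)
finish_fixed = reengineered_reward(turbos=5, progress=1.0, finished=True)
```

The human in the loop observes the pathological behavior, adjusts the weights, and retrains: that's the continuous reward re-engineering the lecture describes.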
An example grand challenge is allowing an AI system to operate in a context where there's a lot of fuzziness for us humans, a lot of uncertainty, a lot of gray area, a lot of challenging aspects in terms of what is right and what is wrong that we're continuing to improve on. The example I provide here is one of the least popular things in the world: the US Congress. So, replacing the US Congress, the body of representatives of the people of the United States, which makes bills based on the beliefs of the people. That sounds a lot like what Netflix does in recommending what movie you should watch next, representing what people love to watch; that's just a recommender system. So it makes perfect sense that an AI system should be able to take on this challenge, and I see that as a grand challenge: replacing some of the fundamental representation of large crowds of people that make ethical decisions with a human-centered AI system. Okay, in real-world operation, the first thing we have to do, before we have a robot and a human working together, is that the robot has to perceive the human.
Question from the audience: do you want to change the way Congress works, to make it better, or do you want to just take the system that currently exists and automate it? The idea is to take the system as it is currently supposed to work and automate that, so the system can provide a lot more transparency about the inputs. The idea of Congress is that the only inputs are supposed to be the people and the beliefs of the people, and there's rich information there. For example, and I'm not saying anything about politics, there are certain issues I care a lot about and certain issues I don't care much about; putting that aside, there are also certain issues I know a lot about and certain issues I know very little about, and those don't intersect all that well: I'm very opinionated about things I don't know anything about. It's very common; all of us are. So the challenge is being able to put that representation of me into a system, bringing our entire nation together, and to make bills that represent the people. The catch is that it can't just be a training set, after which the system operates and AI is running the country. No, there has to be that human element, where we're constantly supervising, just like we're in theory supposed to be supervising our congressmen and congresswomen.

Human sensing is the first part. In order to have an AI system that works with a human being, it has to perceive, to understand the state of the human being, at the very simplest level and at the more complex, temporal, contextual, over-time level. So the near-term direction of research is purely the perception problem, where deep learning shines: taking data, whether it comes in as video, audio, text, and so on, and being able to classify the physical, mental, and social state, the social context, of the person. This is what I'll cover a little bit of today: everything from face detection, face recognition, emotion recognition, natural language processing, body pose estimation, those same recommender systems, speech recognition. All of those are conversions of raw data that captures something about the human being into actually meaningful, actionable information.
The grand challenge there is emotion recognition. There have been a lot of companies and claims that we've somehow cracked emotion recognition, that we're able to determine the mood of a person. But really, and for those who were here last year with Lisa Feldman Barrett, if you're very honest and you study emotional intelligence, emotion, and the expression of emotion, it's a fascinating area, and we're not even close to being able to build perceptual systems that detect emotion well. What we're doing, more so, is detecting very simple facial expressions that correspond to our storybook versions of emotions: smiling, crying, frowning in a caricatured way. So the grand challenge is building a system with high accuracy at real emotion recognition. You can think of it as stated here: an AI system that solves the binary classification problem of whether you want to be left alone or not, with 95 percent accuracy, after collecting data for 30 days. That, I see, is a really clean formulation of exactly the kind of human understanding we need to be able to build into our learning models, and we're very far away from that, especially the long temporal aspect, being able to integrate data over a long period of time.
Then the second part of human-robot interaction in real-world operation is the experience. This is where we're just beginning to consider the interactive experience: how do we create a rich, fulfilling experience? We have autonomous vehicles, for example, semi-autonomous vehicles, whether that's Tesla, Volvo, or Super Cruise from Cadillac, a bunch of systems that now have greater and greater degrees of automation in the car, and we get to have the human interact with that AI system while trying to figure out how to make that experience rich and fulfilling. Currently, in the Volvo system, the experience is more limited: there's a little icon, and it's a more traditional driving situation. In the Tesla, you have a much bigger display of what's going on. In the Cadillac Super Cruise system, there's a camera looking at your eyes, determining whether you're awake or not, paying attention or not. There's an experience there that we're trying to create, and in the Tesla case the miles are racking up; we have real data here at MIT, where we're studying this exact interaction. There are now over a billion miles driven in Teslas, and on the fully autonomous side, Waymo has now reached ten-plus million miles driven autonomously, and a lot of people are experimenting with this. But it's that collaborative interaction of going back and forth: the AI system being able to express its degree of uncertainty about the environment, being able to express when it needs help, being able to communicate its limitations and capabilities, and so on, trading off control, being able to seek human supervision. There's a dance there that takes into consideration everything from neurobiological research to psychology to deep learning to the pure robotics and HRI, human-robot interaction, aspects.
One grand challenge: Tesla has driven one billion miles under Autopilot, under the semi-autonomous mode. The grand challenge here is when we start getting to the kind of mileage we see in the United States every year, when you get into the hundreds of billions of miles driven semi-autonomously, when we get to see teenagers, 16, 17, 18, using these systems for the first time, and older folks, who don't necessarily drive much or use any kind of AI in their lives, get to use these systems, and we start to explore that aspect. That's the real challenge. And of course there's the Turing test, the old Turing test now reimagined with the Alexa Prize challenge of social bots. Natural language is such a beautiful thing to explore human-robot interaction with, both on the audio side and on just the text side, and passing the Turing test in a real way, where you want to have a conversation with the robot for prolonged periods of time, maybe even more than with some of your other friends, is the true grand challenge.
And on the other side of friends is the risk, the catastrophic risk that's possible when you have an AI system that's learning from data. The near-term direction of research is purely the human supervision of AI decisions in terms of safety and ethics. There are a lot of systems, with cars or medical diagnosis and so on, where there's some safety-critical aspect whose safety we want to be able to supervise, and there are ethical decisions, in terms of who gets a loan or not, who gets a certain criminal penalty or not; to whatever degree AI systems are incorporated into those, you have to consider ethical questions. Even with the crude, low-level perception systems like face recognition, you want to make sure your face recognition systems are not discriminating based on color or gender or age and so on. You want to make sure that at that basic, fundamental level of ethics, the systems are trained in a way that maintains our human values, or the better angels of our nature, the brighter aspects of our values. The other thing, beyond maintaining values, which is looking at the mean of the distribution, is that we also want to control the outliers, to keep the AI system from doing anything catastrophic: the unintended consequences. When something happens that you didn't anticipate, you want to be able to put boundaries on it. And the grand challenge there really all boils down to the ability of an AI system to say that it's uncertain about something. That measure of uncertainty has to be good; the system has to be able to make a prediction always accompanied by uncertainty, even on things it hasn't seen before. That's the real challenge: to be trained on cats and dogs and then, seeing a giraffe, to say, "I'm not sure what that is." We're quite far from that, because right now it would probably confidently say it's a dog, depending on the giraffe. But we want extremely high accuracy in the ability of AI systems to determine their own uncertainty, to know what they don't know, because from that comes the supervision, and from that comes the ability to stop, when the system is uncertain, before catastrophic events.
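One standard way to get a usable "I'm not sure what that is" signal, an illustration rather than the lecture's prescription, is an ensemble: train several models and treat their disagreement as uncertainty. An input far from the training data, the giraffe shown to cat-versus-dog models, tends to pull the members apart:

```python
def ensemble_uncertainty(member_probs):
    """Return the mean prediction of an ensemble and the members'
    disagreement (average squared deviation from the mean), which
    serves as a cheap uncertainty estimate."""
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[j] for m in member_probs) / n for j in range(k)]
    disagreement = sum((m[j] - mean[j]) ** 2
                       for m in member_probs for j in range(k)) / n
    return mean, disagreement

# In-distribution input (a dog): the members agree.
_, var_dog = ensemble_uncertainty([[0.90, 0.10], [0.88, 0.12], [0.92, 0.08]])

# Out-of-distribution input (a giraffe): the members disagree.
_, var_giraffe = ensemble_uncertainty([[0.90, 0.10], [0.20, 0.80], [0.50, 0.50]])
```

Thresholding the disagreement gives exactly the stop-and-ask-a-human behavior described above.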
The first aspect of real-world operation is understanding the human. One of the places where deep learning has really shined is the perception problem. It all begins with the ability to look at raw data and convert it into meaningful information; that's really where understanding the human comes in. Not the kind of understanding where, when you're in a relationship with somebody, when you're friends with somebody over a long period of time, you gain an understanding of their quirks, limitations, capabilities, and so on; that's really fascinating, but the first step is just to be able to, when you see them, recognize who they are, what's on their mind, what their body language is, what they're saying with their mouth, all those basic raw perception tasks. That's where deep learning really shines. I'd like to cover the state of the art in those various perception tasks.
So, first: face recognition. Now, there's a full slide presentation of this, and I'm skipping around. The full slide presentation has the following structure for each of these topics. First come the motivation, description, the excitement, the worry, the future impact. Then there are five papers: paper one defines the quote-unquote old-school seminal work that opened the field; paper two, the early progress in the field; paper three, the recent breakthrough, often associated with deep learning; paper four, the current state of the art; and paper five, the thing that defines the future direction, or a possible set of things that define the future direction. And then come the open problems in the field, where future research is very much needed. That's the structure of every topic; I'll cover each here as quickly as possible.
Face recognition: what is it? The first thing to know is that the face contains so much rich information about the state of the human being, so understanding the human being really starts at the face, and detecting the face is the first step: detecting the body, and then the head on top of that body. Then there is the task of face recognition, which has been an exceptionally active area of research because it has a lot of applications, and through that research we're now able to study a lot of aspects of how we perform perception on the face. Recognition, purely stated, is recognizing the identity of a human face: who is this? Detection is just detecting a face. Recognition means there's a database of identities, what is it, seven billion of them on Earth, and you're trying to determine which of them it is, which of the seven billion, or whatever the database is. The face verification problem is what your phone uses when you unlock it with your face. It's asking: is it you or not? Is it Lex or somebody else? It's a database of one person versus everybody else. There are a lot of applications here, obviously, from identification to all the security aspects of using the face as a sort of fingerprint of your identity, and all the interactive elements of AI systems, software-based systems, in this world.
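Verification systems in the DeepFace family reduce the problem to comparing learned face embeddings under a distance threshold. A toy sketch with invented three-dimensional embeddings (real systems use an embedding network, hundreds of dimensions, and a carefully tuned threshold):

```python
def same_person(emb_a, emb_b, threshold=1.0):
    """Verification: do two face embeddings fall within a squared
    distance threshold of each other?"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(emb_a, emb_b))
    return sq_dist <= threshold ** 2

enrolled = [0.1, 0.9, 0.3]                          # stored at enrollment
match = same_person(enrolled, [0.15, 0.85, 0.32])   # same face, small drift
imposter = same_person(enrolled, [0.9, 0.1, 0.8])   # different face
```

This is the one-versus-everybody-else phone-unlock setting: the database holds one enrolled embedding, and every unlock attempt is a single distance test against it.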
Okay, so why is it hard? All the usual computer vision problems come in: lighting variation, pose variation. Computer vision is really hard; you just get these raw numbers, and you have to infer so many things that we humans take for granted. So there's the basic computer vision stuff, but there's more on top of that. Faces are like cats versus dogs: there are thousands of breeds of dogs and thousands of breeds of cats, and in the same way faces can look very similar to each other, so the two classes you're trying to separate can be very, very close together and intermingled. Now, there's a lot of face data available, because of the applications, because of the financial benefits of such datasets, but for any one individual, unless you're Brad Pitt or Angelina Jolie or a celebrity, there are not many samples available; the individuals on which the classification is to be made often come with very little data. Then there's a lot of variation: in building a face recognition system you have to be invariant to all the hairstyles, to everything you change over time, the weight gain, the weight loss, the beard you decided to grow, the glasses you wear sometimes and not others, different styles of glasses and so on, makeup or no makeup. All of these things are still you, still the same identity, and you have to be able to classify that, and the accuracy required, especially for security applications, is extremely high.
that's required the reason it's an
exciting area is there's a lot of possibility, but there's also a lot of concern. So, the future impact: utopia, dystopia, and the more reasonable middle path. Face provides a very user-friendly way of letting your devices recognize you and say hello. Your voice is certainly one way, but one of the most powerful ways to really classify at a distance is the face. So what does that mean? The utopian view, the brightest possible future, is that you can use your face as a passport: you replace the license and all the security measures we put in place, from the passwords on our devices to the credit card and so on. Apple Pay will be face pay; you show up and it automatically connects to all your devices, all your banking information, and so on. Obviously the flip side of that, just rephrasing that sentence, can also be dystopian: complete violations of privacy, being watched at any time, being able to tie your face to your Facebook and social media and all your devices, being able to identify you, making it impossible for you to hide from society; the fundamental aspects of maintaining privacy that many of us value greatly. The middle path is really just a useful way to unlock your phone.
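That unlock decision is the 1:1 verification problem mentioned above: embed the presented face, compare it to the enrolled template, and threshold the distance. Here's a minimal sketch; the tiny 4-d vectors and the threshold are hypothetical stand-ins for the ~128-d output of a real face-embedding network:

```python
import numpy as np

def verify(enrolled_embedding, probe_embedding, threshold=0.8):
    """1:1 verification: same identity if embeddings are close in Euclidean distance."""
    distance = np.linalg.norm(enrolled_embedding - probe_embedding)
    return distance < threshold

def identify(probe_embedding, gallery):
    """1:N identification: return the closest identity in the gallery."""
    names = list(gallery)
    distances = [np.linalg.norm(gallery[n] - probe_embedding) for n in names]
    return names[int(np.argmin(distances))]

# Toy example with made-up 4-d "embeddings".
lex = np.array([0.9, 0.1, 0.0, 0.2])
gallery = {"lex": lex, "other": np.array([0.0, 0.8, 0.7, 0.1])}
probe = lex + 0.05  # a slightly different photo of the same person

print(verify(lex, probe))        # small distance -> accepted
print(identify(probe, gallery))  # nearest neighbor -> "lex"
```

Identification is just verification run against every identity in the gallery, which is why the embedding has to separate identities well in Euclidean space.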
The recent breakthroughs here started with DeepFace. The essential idea there is applying deep neural networks to the task of face recognition. As with a lot of the breakthroughs on the perception side, we're not covering the old-school papers and the historical context here; the biggest breakthroughs came with deep learning in 2006, 2007, 2008, the last ten years or so. The same is true for face recognition: DeepFace was the first big application that achieved near-human performance on one of the big benchmarks at the time, Labeled Faces in the Wild, using a very large dataset to form a good representation. The state of the art, or at least close to it, is FaceNet. The key idea there is using those same deep architectures to now optimize for the representation itself directly. The notebook we'll be putting out, shared with some of you for the assignment, describes face recognition. The challenge there is that it's not like the traditional classification problem: you have to form an embedding of the face into a small, compressed vector such that in that embedding, faces that are similar (identities that are close together) are close in the Euclidean sense, and people that are very different are far away. You then use that embedding to do the classification. That's really the only way to deal with datasets where you have so little information on any one individual person. And so FaceNet optimizes that embedding in a way that directly optimizes the Euclidean distance between non-matching identities. There's still a lot of excitement about face recognition; there are a lot of benchmark competitions and a lot of people working in this area, and really, bigger, badder networks and more data is one of the ways to crack this problem. So: a large public dataset with 672,000 identities and 4.7 million photos, that's in 2017, and that just keeps scaling up and up and up and up.
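The embedding objective described above is usually trained with a triplet loss: pull an anchor face toward a positive example of the same identity, and push it away from a negative example of a different identity by at least some margin. A minimal numpy sketch; the embeddings and margin here are made up for illustration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on embeddings:
    max(||a - p||^2 - ||a - n||^2 + margin, 0)."""
    d_pos = np.sum((anchor - positive) ** 2)  # anchor-to-positive squared distance
    d_neg = np.sum((anchor - negative) ** 2)  # anchor-to-negative squared distance
    return max(d_pos - d_neg + margin, 0.0)

# Toy embeddings: anchor and positive are the same person, negative is not.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # close to anchor -> small d_pos
negative = np.array([0.0, 1.0])   # far from anchor -> large d_neg

print(triplet_loss(anchor, positive, negative))  # well-separated triplet -> 0.0
```

When the positive is already closer than the negative by more than the margin, the loss is zero; otherwise the gradient pushes the embedding to separate them further.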
Now, we also have to be honest here about the possible future directions of work: even though the benchmarks are growing, that's still a tiny subset of the people in the world. We're still not quite there in being able to do general face recognition applicable to the entirety of the population, or a large swath of the population, in the world.
So, on this topic, the coverage here is brief. We're not covering all the aspects of the face, especially temporal ones, that are useful in face recognition, or useful in seeing a lot of things about the face: FACS, the Facial Action Coding System, the different kinds of facial expressions that can then be used to infer emotion and so on, raised eyebrows and all those kinds of things, which could provide rich information for recognizing and interpreting the face; and the other modalities, including 3D face recognition. There are a lot of exciting areas there that we're not covering; we're just looking at the pure formulation of the face recognition problem: looking at a single 2D image. The open problems here: the first, not often stated and misinterpreted by people, is that most of these face recognition methods start by assuming that you have a bounding box around the face
and, often, assuming a frontal or near-frontal view of the face. But you can do recognition in all kinds of poses, and it's very interesting to think that the way we recognize our friends and colleagues, parents and children, often uses a lot of cues and context information beyond just the pure frontal view of the face: we can do pretty well on profile views, from body language, and so on. How we incorporate all of that into face recognition is open in the field. Then the black-box side is problematic, both for bias and for being able to understand why incorrect decisions are made, so making these face recognition systems more interpretable is another open problem. And then, finally, privacy: the ability to collect the kind of data on which face recognition would perform extremely well while not violating the fundamental aspects of privacy that we value. Activity
recognition: taking the next step forward here into the richer temporal context of what people do, with the same structure again, from recent breakthroughs to future directions of work. What is it? It's classifying human activity from images or from video. Why is it important? Depending on the level of abstraction of the activity, it provides context for understanding the human. What are they doing? Are they playing baseball, are they singing, are they sleeping, are they putting on makeup, knitting, mixing butter? Why is it hard? Again, all the usual problems in image recognition; but the kind of data we're dealing with, video, is just much larger, and the richness of the possibilities that define what an activity is is much larger, so the complexity is much larger. It's often difficult to quantify motion, because the fundamental aspect of activity is change in the world, the motion of things, and it's difficult to determine, from the dynamics and the physics of the world, especially from a 2D view, what's background information, what's noise, and what's essential to understanding the activity. And then there are the subjective, ambiguous elements of activity: when does a particular activity begin, when does it end, and what are all the gray areas when you're partially engaging in that activity? When you start to annotate these things and try to do the detection, it becomes clear that sometimes the activity is partially undertaken, and the beginning and the end are fuzzy. Future impact: utopia, dystopia,
middle path. So the impact here comes from being able to understand the world in time, and to predict. The utopian possibility is that the contextual perception that can occur here can enrich the experience between human and robot. The dystopian view, the flip side, is that being able to understand human activities can let robots sever the relationship, can damage the human-robot interaction to where they just do their own thing. The middle path is just finding useful information in massive amounts of data, like YouTube: there's now a YouTube video dataset, and being able to identify what's going on in a video means being able to infer rich, useful semantic information. So what do we do with video? How do we do perception on
video? Now, the recent breakthrough came with deep learning and C3D, 3D convolutional neural networks that take a sequence of images and are able to determine, in an end-to-end way, the action that's going on in the video. That was a recent breakthrough. The state of the art comes from a different architecture that takes in two streams: one is the image RGB data, the other is optical flow data that's really focusing on the motion in the image. That opened the wave of two-stream networks. Here, from that paper, are the different architectures: on the far right is the two-stream architecture, and C3D is shown under (b), taking in the sequence of images. All of these are just different architectures, and the first one is LSTMs.
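The two-stream idea can be sketched at its simplest: run one classifier on appearance (RGB) and one on motion, then fuse their predictions. Below, frame differencing stands in for real optical flow, and the two "streams" are random linear classifiers, purely to show the data flow; a real system would use trained convolutional networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A tiny 8-frame grayscale "video", 16x16 pixels.
video = rng.random((8, 16, 16))

# Spatial stream input: appearance of a single frame.
rgb_features = video[0].ravel()

# Temporal stream input: frame-to-frame differences, a crude stand-in for optical flow.
flow_features = np.diff(video, axis=0).ravel()

n_classes = 4
spatial_logits  = rng.standard_normal((n_classes, rgb_features.size))  @ rgb_features
temporal_logits = rng.standard_normal((n_classes, flow_features.size)) @ flow_features

# Late fusion: average the per-stream class probabilities.
fused = (softmax(spatial_logits) + softmax(temporal_logits)) / 2
predicted_class = int(np.argmax(fused))
print(predicted_class)
```

The key design choice the lecture points at is exactly this split: appearance and motion carry complementary evidence, and fusing them late lets each stream specialize.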
These are different architectures for how you represent, or how you allow a learning model to capture, the dynamics in the data. The future possibilities have to do, literally, with the future: being able to take single images or sequences of images and predict the future. It's very interesting to think about; in our ability to hallucinate and generate the future from images, you start to think about what the defining qualities of activities are, and in this way augment data and be able to train much more accurate action recognition systems. One topic not covered
is the localization of activity in video. Action recognition, purely defined, is: I give you a clip and you tell me what's going on in this clip. Now, if you take a full YouTube video, you want to be able to localize, to find all the times when particular activities are going on. It could be multi-label: multiple activities going on at the same time, beginning and ending asynchronously. Then there is richer three-dimensional or 2D classification of activity based on human movement: looking at skeletons, like from a Kinect or other 3D sensors, skeleton-based action recognition that provides you more than just the 2D image data. The open problem is that activity recognition is about more than just the way we move our body, or, if it's baseball, a ball in your hand and hitting it with a baseball bat; it also has to do with context. Sitting down, working, looking at something, picking up an item: these can sometimes change profoundly based on the other objects in the scene and the activity of other people in the scene. Being able to work with that kind of context is a totally open problem; it requires reducing a very complex real-world context into something where you can clearly identify an activity.
Body pose estimation is the task of localizing the joints that form the skeleton of the human body.
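Many pose estimators predict one heatmap per joint and read off each joint's image location as the heatmap's peak. A minimal sketch of that decoding step, with made-up heatmaps standing in for network output:

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Given per-joint heatmaps of shape (n_joints, H, W),
    return each joint's (row, col) location at the heatmap maximum."""
    n_joints, h, w = heatmaps.shape
    flat = heatmaps.reshape(n_joints, -1).argmax(axis=1)
    return [(int(i // w), int(i % w)) for i in flat]

# Two toy 5x5 heatmaps: a "wrist" peaked at (1, 2) and an "elbow" at (4, 0).
heatmaps = np.zeros((2, 5, 5))
heatmaps[0, 1, 2] = 1.0
heatmaps[1, 4, 0] = 1.0

print(decode_keypoints(heatmaps))  # [(1, 2), (4, 0)]
```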
So, inferring from visual information the positions of the different joints. Why is it important? To be able to understand body language, the rich information about the body of a human being, from reading body language to animation to activity recognition. It's just a useful representation of the human body: if you're analyzing pedestrians, or in interactive environments, human-robot interaction, being able to understand what the heck the human is trying to do, body pose is really useful. It's hard because, when you look at a 2D image projection of the body, it's a high-dimensional optimization problem figuring out how the raw pixels map to the actual three-dimensional orientation of the human joints, plus the usual computer vision challenges of pose, lighting, and so on. Future impact: it's really
exciting for interactive environments, for a robot to be able to know the position of the human body it's trying to interact with. Whether it's a robot that's trying to get their favorite human a beer, or whatever your choice of drink, it has to be able to find where your hand is so it can do the handoff. Same thing in the car: you have to determine if the person's hands are on the steering wheel, and if their head orientation is such that they're able to physically take control of the vehicle. That's a really exciting set of possibilities, and there are applications in sports and CGI and video games, in all aspects where robot and human have to work together. The dystopian view you can imagine, of course, is that being able to localize all those joints means robots that are able to more effectively hurt humans, and that's always a huge concern, always a dark, dystopian view of a world with so much AI in it. The reality, of course, is that it's just more rich, fulfilling HCI that takes advantage of not just the stuff coming from the face, but also the body of the human that the robot is interacting with. So, it started
with deep learning being applied to the body pose estimation problem in 2014, with DeepPose. The key idea there is looking at the holistic human pose estimation problem: detecting all the different joints of a single person in an image. The power of deep learning is that you no longer have to do handcrafted, expert-engineered features; it automatically determines a set of features, and all the parts are detected for you, so this highly complex problem is all solved with data. The state of the art, from 2017 and beyond, and there have been a few papers from CMU along this line, is doing real-time multi-person 2D pose estimation, but in a bottom-up way: you detect individual joints first, so all the knees in the picture, all the elbows, all the shoulders, all the wrists, and so on, and then stitch them together using part affinity fields to find what is most likely. If you find 17 elbows in a picture, you then have to figure out which elbow belongs to which person. That actually turns out to be an extremely powerful way to detect body pose, especially multiple poses, and especially to deal with occlusions. It's really interesting, and also, because of that separation of the detections, it's able to run in real time, which is also really exciting. A possible future direction is using much more information, using deformable models of the human body, so not just the skeleton but rich volumetric information, to do the detection and then optimize for the most likely orientation of the body.
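The bottom-up grouping step can be sketched as a matching problem: given candidate shoulders and candidate elbows, pick the pairing with the highest affinity. In the real method the affinities come from learned part affinity fields; here, purely for illustration, a made-up affinity is the inverse of image distance, and greedy matching stands in for the paper's bipartite assignment:

```python
import numpy as np

def greedy_match(shoulders, elbows):
    """Greedily pair each shoulder with its best remaining elbow,
    scoring pairs by an affinity (here: inverse Euclidean distance)."""
    pairs, used = [], set()
    for si, s in enumerate(shoulders):
        scores = [
            (1.0 / (1e-6 + np.linalg.norm(np.array(s) - np.array(e))), ei)
            for ei, e in enumerate(elbows) if ei not in used
        ]
        if scores:
            _, best = max(scores)
            used.add(best)          # each elbow belongs to at most one person
            pairs.append((si, best))
    return pairs

# Two people: person 0's joints on the left, person 1's on the right.
shoulders = [(10, 10), (100, 12)]
elbows    = [(104, 40), (14, 38)]  # listed out of order on purpose

print(greedy_match(shoulders, elbows))  # [(0, 1), (1, 0)]
```

Because the joints are detected once for the whole image and only the grouping is per-person, the cost grows slowly with the number of people, which is part of why the bottom-up approach runs in real time.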
The open problem in the field is the fact that pose is not a thing that happens in a single image; pose happens as part of human behavior and as part of movement over time. So here: Monty Python's Ministry of Silly Walks. People walk in funny ways. We collect a lot of data on pedestrians, and I can tell you that people walk in different ways and position their bodies in different ways, and so the temporal aspects of human motion are, for the most part, not incorporated into the body pose estimation problem, and they should be; there are a lot of exciting possibilities in capturing the temporal dynamics. There are a lot of awesome slides here that I'm just skipping through: speech recognition, which 2018 was really big for, recommender systems for Netflix, OkCupid, AI for President. Each one of
these I mentioned briefly today will have a separate mini lecture; I taught an entire course on this at CAI last year. So, deep learning for understanding the human: it's a topic I'm really excited about, because the first step for a machine to be able to interact in a rich way with a human being is to understand them, and it's also the area where the most near-term impact can happen: a system being able to effectively detect what a human being is up to, what they're thinking about, how to best serve them, and enrich the experience of interacting with that human. Let me jump to AI safety, and then the interactive experience between humans and robots, to give examples of some work, some research, in those directions that I'm really excited about. So, AI safety: at the very
basic level, there is an AI system that's making decisions, and we want human beings to supervise those decisions. We've done quite a bit of work here at MIT on that aspect of supervising machines, with arguing machines, and OpenAI has done work on safety by having machines debate each other. The idea is that you can achieve safety by not giving ultimate power to any one decision-maker: the disagreement that emerges when two AI systems, or multiple systems, have to make decisions and agree with each other allows us to produce a signal of uncertainty, based on which human supervision can be sought. Without that, when we have a state-of-the-art black-box AI system that does something like drive a car, all we have is a system that just runs, and we're supposed to have faith that it's always going to be right; we don't have any uncertainty signal coming from the system. So the idea of arguing machines, which we've developed and are working on, is to have multiple AI systems, an ensemble, where, when a disagreement is detected, human supervision is sought. And the idea there is
that when you have a system like Tesla Autopilot, and here we've instrumented a Tesla vehicle, it's telling you nothing about how uncertain it is about the decisions it's making. Once the system is on, it's steering the car for you, and in very rare cases it just disengages, but no matter what, it's not showing you the degree of uncertainty it has about the world around it. So the way we create that uncertainty signal is by adding another, in this case end-to-end, vision system that's looking at the external environment and making steering decisions, and whenever a disagreement between the two is detected, that's when human supervision is sought. And, as shown in the plot there, we can in this way predict with high accuracy the times when the driver chose to disengage the system because they were uncomfortable; so you're using this mechanism to detect risky, challenging situations.
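The disagreement mechanism itself is simple to sketch: run two independent steering predictors on the same input and flag any frame where they differ by more than a threshold for human review. The steering values and the threshold below are made up for illustration:

```python
import numpy as np

def flag_disagreements(primary, secondary, threshold=5.0):
    """Return indices of frames where two steering predictions (in degrees)
    disagree by more than `threshold`, i.e. where human supervision is sought."""
    primary, secondary = np.asarray(primary), np.asarray(secondary)
    return np.flatnonzero(np.abs(primary - secondary) > threshold)

# Per-frame steering angles from the main system and an independent second system.
autopilot_deg  = [0.0, 1.0, -2.0, 10.0, 0.5]
second_net_deg = [0.2, 0.8, -1.5, -3.0, 0.4]

print(flag_disagreements(autopilot_deg, second_net_deg))  # [3] -> frame 3 needs review
```

The point is that neither system has to output a calibrated confidence: the disagreement between two independently trained decision-makers is itself the uncertainty signal.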