MIT 6.S093: Introduction to Human-Centered Artificial Intelligence (AI)
bmjamLZ3v8A • 2019-04-24
Welcome to Human-Centered Artificial Intelligence. The last couple of decades of developments in deep learning have been exciting in the problems we've been able to automate, the problems we've been able to crack with learning-based methods. One of the ideas underlying this lecture and the following lectures is that with a purely learning-based approach, there are certain aspects fundamental to our reality where we're going to hit a wall, and that we have to incorporate the human being deeply into learning-based systems in order to make those systems learn well and operate in the real world.

The first underlying prediction of human-centered AI in this century is that learning-based approaches, which have been successful over the past two decades, like deep learning, AI approaches that learn from data, are going to continue to get better and dominate real-world applications. As opposed to fine-tuned, optimization-based models that do not learn from data, more and more we're going to see learning-based methods dominate real-world applications. That's the underlying prediction we're working with.

If that's the case, the corollary, if learning-based methods are the solution to many of these real-world problems, is that the way we get smarter AI systems is by improving both the machine learning and the machine teaching. Machine learning is the thing we've been talking about quite a bit: the deep learning, the algorithms, the optimization of neural network parameters, where you learn from data. That's the current focus of the community, the current focus of the research, and the thing behind the success of much of the development in deep learning. And then there's machine teaching. That's the human-centered part. It's optimization too, but it's optimizing not the models, not the algorithms, but how you select the data from which the algorithms learn. It's about making better teachers. Just like when you, as a student or as a child, are learning how to operate in this world, the world and the parents and teachers around you inform you with very sparse information, but they provide the kind of information that is most useful for your learning process. The selection of data from which to learn is, I believe, the critical direction of research we have to solve in order to create truly intelligent systems, ones able to work in the real world, and I'll explain why.

Consider the implications of learning-based systems. When you have a system that learns from data, neural networks, machine learning, the fundamental reality is that the model is trying to generalize across the entirety of the reality it will be tasked with operating in, based on a very small subset of samples from that reality. That generalization means there's always going to be a degree of uncertainty, always a degree of incomplete information. So no matter how much we want it, these systems will not be provably safe; we can't put anything concrete down that guarantees safety in some specific way unless the system is extremely constrained. Therefore we need human supervision of these systems. They will not be provably fair, from an ethics perspective, from a discrimination perspective, from all degrees of fairness. Therefore we need human supervision of these systems. And the pipeline by which they make decisions will not be perfectly explainable to our satisfaction as human supervisors, so there, again, human supervision will constantly be required. The solution to this is a whole set of techniques, a whole set of ideas, that we put under the flag of human-centered artificial intelligence.
The core idea is that we need to integrate the human being deeply into the annotation process, and deeply into the human supervision of the real-world operation of the system, so both in the training phase and in the testing phase, the execution, the operation of the system.

This is what deep learning looks like with the human out of the loop. The human contributes to a learning model by helping annotate some data, and that data is then used to train a model that hopefully generalizes in the real world, and that model makes decisions. Deep learning is really exciting because, with a greater and greater degree of autonomy, it's able to form high-level representations of raw data in a way that actually does quite well on certain kinds of tasks that were very difficult before. But fundamentally the human is out of the loop in both training and operation: first you build the data set and annotate it, then the system runs away with it, trains on the data, and the real-world operation does not involve the human except as the recipient of the service the system provides.

In the human-in-the-loop version, the human-centered version, both the annotation and the operation of the system are aided by human beings in a deep way. What does that mean? We can look at human experts, individuals, and at crowd intelligence: the wisdom of the individual and the wisdom of the crowd. At the training phase, the first part is objective annotation. We need to significantly improve objective annotation, meaning annotation where a single human's intelligence is sufficient to look at a sample and annotate it. This is what we think of with ImageNet and all the basic computer vision tasks, where a single human is enough to do a pretty damn good job of determining what's in a particular sample. Then there's subjective annotation: things that are difficult for a single human to determine but that a crowd can converge on. At the low level these are questions of emotion, things that are a little bit fuzzy and require multiple people to annotate. At the high level they are ethical questions about decisions an AI system is tasked with making, questions nobody really knows the right answer to, where a crowd can converge on the right answer. That's where crowd intelligence comes in on the data annotation step.

In operation, once you've trained the model, the supervision of the system again draws on, and I'll give more concrete examples of this, the wisdom of the individual: for example, in an autonomous vehicle, a single driver is tasked with supervising the decisions of the AI system. That's a critical step for a learning-based system that's not guaranteed to be safe and not guaranteed to be explainable. The subjective side, where crowd intelligence is required and a single person is not enough, is again the ethical questions about the operation of autonomous systems: the supervision of autonomous vehicles, the supervision of systems in medical diagnosis, in medicine in general. This is AI operating in the real world, making ethical decisions that are fundamentally difficult even for humans, and that's where crowd intelligence needs to come in.

So we have to transform the machine learning problem by integrating the human being. First, up top, in the training process. On the left is the usual machine learning formulation: a human being doing brute-force annotation of some kind of data set, cats and dogs in ImageNet, segmentation in Cityscapes, video action recognition in the YouTube dataset. Given the data set, humans put in a lot of expensive labor to annotate what's going on in that data, and then the machine learns. The flip side, the machine teaching side, the human-centered side, is that the machine, the learning model, the learning algorithm, mostly neural networks here, is instead tasked with selecting the small, sparse subsets of the data that are most useful for the human to annotate. So instead of the human doing the brute-force annotation task first, the machine queries the human with questions. This is the field called machine teaching, and it's a wide-open research field. The task is to reduce by several orders of magnitude the amount of data that needs to be annotated.

On the real-world operation side, the integration of the human looks like this. On the left, the machine, now equipped with a trained learning model, makes decisions, and the human living in this world receives the service provided by the machine, whether that's a medical diagnosis, an autonomous vehicle, or a system that determines whether you get a loan or not. In the human-centered version, the machine makes a decision but is also able to provide a degree of uncertainty. That's one of the big requirements: to be able to specify a degree of uncertainty for the decision, such that when confidence falls below a certain threshold, human supervision is sought. In any decision that is costly, financially or in terms of human life, human supervision is sought, and the service is received by the very same humans providing the supervision, or by another set of humans, but ultimately the decisions are overseen by human beings. This, I believe, is going to be the defining mode of operation for AI systems in the 21st century: as much as we'd like to, we won't be able to create perfect AI systems that escape the need to work together with human beings at every step.

There are five areas of research, grand challenges, that define human-centered AI.
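As a toy illustration of that deferral loop, here is a minimal sketch. Everything in it, the model, the confidence threshold, the human stand-in, is hypothetical, invented for illustration rather than taken from any real system.

```python
# Sketch of uncertainty-gated human supervision: the machine makes a
# decision, but when its confidence is too low, a human is consulted.
# All names here (toy_model, toy_human, the threshold) are invented.

CONFIDENCE_THRESHOLD = 0.9  # below this, defer to a human supervisor

def decide(sample, model, human):
    """Return a decision, deferring to the human when the model is unsure."""
    label, confidence = model(sample)
    if confidence < CONFIDENCE_THRESHOLD:
        # Costly or safety-critical and uncertain: seek human supervision.
        return human(sample)
    return label

# Toy stand-ins to make the flow runnable.
def toy_model(sample):
    # Pretend the model is confident on short inputs only.
    return ("cat", 0.95) if len(sample) < 5 else ("cat", 0.6)

def toy_human(sample):
    return "dog"  # the human overrides the uncertain prediction

print(decide("img", toy_model, toy_human))         # confident -> "cat"
print(decide("image_long", toy_model, toy_human))  # uncertain -> "dog"
```

The design point is that the deferral policy lives outside the model: the threshold encodes how much risk we tolerate before a human must be in the loop.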
I'll focus on a few today, and on one very much so, and even with that degree of heavy pruning we have 120 slides, so I'll skip around.

For human-centered AI during the learning phase, there is first the research arm of machine teaching: how do we improve supervised learning so that, as opposed to needing ten thousand, a hundred thousand, a million examples, the algorithm queries only the essential elements and is able to learn effectively from very little information, from very few samples? Just like we do as students when we learn fundamental aspects of math, language, and so on: we need only a few examples, but those examples are critical to our understanding. The second part is reward engineering: during the learning process, injecting the human being into the definition of the loss function, of what's good and what's bad. Systems that have to operate in the real world have to understand what our society deems good and bad, and we're not always good at injecting that at the very beginning, so there has to be a continuous process of adjusting the rewards, of reward re-engineering by humans, so that we can encode human values into the learning process.

Then, for human-centered AI during real-world operation, once the system is actually trained, there is the interactive element of robots and humans working together. The part I'll focus on quite a bit today, because there's been a lot of development and progress on the deep learning side, is human sensing: algorithms that understand the human being, algorithms that take raw information, whether it's video, audio, or text, and begin to get a context, a measure of the state of the human being, in the short term and the long term over time, the temporal understanding and the instantaneous understanding. Then there's the interaction aspect. Once you understand the human, which is the perception problem, you have to interact with them, and interact in a way that is continuous, collaborative, and a rich, meaningful experience. We're in the very early days of creating anything like rich, meaningful experiences with AI systems, especially learning-based AI systems. And there is safety in real-world operation: safety and ethics. The results of the engineered rewards that were in place during the learning process now come to fruition, and we need to make sure the trained model does not result in things that are highly detrimental or catastrophic to our safety, or highly detrimental to what we deem good and bad in society: discrimination, ethical considerations, all those kinds of things, the gray area, the line we all walk as a society, as a crowd intelligence. We have to provide bounds on AI systems, and there's an entire body of work there; I'll mention what we're doing in that area.

So first, the machine teaching side: efficient supervised learning. I'd like to do one slide on each of these areas to give you an idea, and do two things for each area, which we'll elaborate on in future lectures and some of which I'll elaborate on today. First, the near-term directions of research, the things that are within our reach now, and then a sort of thought experiment, a grand challenge, such that if we can do it, that'll be damn impressive; that would be a definition of real progress in the area.

The near-term direction of research for machine teaching, for improved supervised learning, for integrating the human into the annotation process, is to annotate not by brute force but by asking the human questions. We have to transform the way we do annotation, so that the process is not defining the data set and then going through the entire data set; instead, a machine teaching system queries the user with questions to annotate. On the algorithm side there's active learning. These are all areas of work where we can be more clever about how we use data and select the data on which to train. Active learning is actively selecting, during the training process, which part of the data to train on and annotate. Data augmentation is taking things that have been supervised by a human and expanding them, modifying the data, warping the data in interesting ways, so that it multiplies the human effort that was injected into understanding what's in the data. One-shot learning, zero-shot learning, and transfer learning are all in that category. And self-play, in the reinforcement learning area, is where the system constructs a model of the world and then goes alone into a room, so to speak, and plays with that model, trying to figure out the constraints in the model and how to achieve good things.

An example grand challenge that would define serious progress in the field: take ImageNet or COCO, the ImageNet challenge or the COCO object detection challenge, and, training only on a totally different kind of data, achieve state-of-the-art results. For example, train only on Wikipedia, on the text and images that are there, and then perform object detection on the state-of-the-art benchmark of COCO; COCO is a data set of different objects with rich annotation of their localization. That, I believe, is exactly the kind of thing for which all the problems in transfer learning, efficient data annotation, and machine teaching have to be solved. Another challenge, if we simplify further: achieve 0.3% error on MNIST, the handwritten digit recognition task everybody always gives as an example, so achieve state-of-the-art accuracy by training on only a single example of each digit, as opposed to thousands. That's something most of us humans can do: given one example of each character of a new language you haven't seen before, after studying them for a little bit, you're able to classify future characters at high accuracy.
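As a sketch of the active-learning idea mentioned above, uncertainty sampling is one of the simplest query strategies: send to the human annotator only the samples the current model is least sure about. The toy probability model below is invented purely for illustration.

```python
import math

# Uncertainty sampling: rank unlabeled samples by the entropy of the
# model's predicted class distribution and query the human for the top few.

def entropy(p):
    """Shannon entropy (in nats) of a class-probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def select_queries(unlabeled_pool, predict_proba, budget):
    """Pick the `budget` most uncertain samples to send to a human annotator."""
    scored = sorted(unlabeled_pool,
                    key=lambda x: entropy(predict_proba(x)),
                    reverse=True)
    return scored[:budget]

# Toy binary "model": each sample is its own class-1 probability,
# so samples near 0.5 are the most uncertain.
def predict_proba(x):
    return [x, 1 - x]

pool = [0.05, 0.5, 0.9, 0.45, 0.99]
print(select_queries(pool, predict_proba, budget=2))  # -> [0.5, 0.45]
```

In a real loop this selection step alternates with retraining: annotate the queried samples, retrain, re-score the pool, and query again, which is how the orders-of-magnitude reduction in annotation is supposed to come about.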
The second part of the learning process where the human needs to be injected, and the near-term direction of research there, is reward engineering and the continuous tuning of those rewards by human beings. OpenAI is doing quite a bit of work here. Here's a game played by a human and an AI; it's really my favorite example of this. On the left, a human is controlling a boat that's finishing a race. On the right is our RL agent, a reinforcement learning agent, controlling a boat that's trying not to finish the race; it's trying to maximize the reward defined, initially, by a human being, and what it finds is that you can get much more reward by collecting the green turbos that appear than by finishing the race. It realizes that finishing the race actually gets in the way of maximizing reward. Those are the unintended consequences of a reward function that was specified beforehand, and most human supervisors, seeing this result, would be able to re-engineer the reward function to get the robot, the AI system here, to finish the race. That kind of continuous monitoring of the performance of the system during the training process is a near-term direction of research that DeepMind, OpenAI, and we ourselves are taking on.

An example grand challenge is allowing an AI system to operate in a context where there's a lot of fuzziness for us humans, a lot of uncertainty, a lot of gray area, a lot of challenging aspects in terms of what is right and what is wrong that we're continuing to improve on. The example I give here is one of the least popular things in the world: the US Congress. Replacing the US Congress, the body of representatives of the people of the United States, who make bills based on the beliefs of the people. That sounds a lot like what Netflix does in recommending what movies you should watch next, representing what people love to watch. That's just a recommender system, so it makes perfect sense that an AI system should be able to take on this challenge. I see that as a grand challenge: replacing some of the fundamental representation of large crowds of people that make ethical decisions with a human-centered AI system. Okay, and in real-world operation, the first thing we have to do before we have a robot and a human working together is have the robot perceive the human.

A question from the audience: do you want to change the way Congress works, make it better, or do you want to take the system that currently exists and automate it? The idea is to take the system as it is supposed to be and automate that, so the system can provide a lot more transparency about its inputs. The idea of Congress is that the only inputs are supposed to be the people and the beliefs of the people, and there's rich information there. For example, and I'm not saying anything about politics, there are certain issues I care a lot about and certain issues I don't care much about, and then there are certain issues I know a lot about and certain issues I know very little about, and those don't actually intersect well: I'm very opinionated about things I don't know anything about. It's very common; all of us are. Being able to put that representation of me into a system, bring our entire nation together, and make bills that represent the people, that's the challenge. And it can't just be a training set after which the system operates, AI running the country. No, there has to be that human element, where we're constantly supervising, just like we're in theory supposed to be supervising our congressmen and congresswomen.
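To make the boat-race example above concrete, here is a toy numeric sketch of reward hacking. All the reward values are invented; the point is only that a reward meant as a proxy for progress (turbo pickups) can come to dominate the true objective (finishing the race).

```python
# Toy reward-hacking arithmetic: under a misspecified reward, the policy
# that never finishes the race earns more than the policy that finishes.
# The numbers below are made up for illustration.

TURBO_REWARD = 10    # reward per green turbo pickup (the proxy)
FINISH_REWARD = 100  # reward for actually finishing (the true goal)

def finish_policy_return():
    # Drive straight to the finish: a few turbos on the way, then done.
    return 3 * TURBO_REWARD + FINISH_REWARD   # = 130

def loop_policy_return(laps=20):
    # Circle forever, hitting respawning turbos, never finishing.
    return laps * 2 * TURBO_REWARD            # = 400 for 20 laps

print(finish_policy_return(), loop_policy_return())
# The "wrong" behavior wins, so a human supervisor would re-engineer the
# reward, e.g. by raising FINISH_REWARD or capping total turbo reward.
```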
Human sensing is the first part: in order to have an AI system that works with a human being, the system has to perceive and understand the state of the human being, at the very simplest level and at the more complex temporal, contextual, over-time level. The near-term directions of research are purely the perception problem, where deep learning shines: taking data, whether it comes in as video, audio, text, and so on, and being able to classify the physical, mental, and social state, the social context, of the person. This is what I'll cover a little of today: everything from face detection, face recognition, and emotion recognition to natural language processing, body pose estimation, those same recommender systems, and speech recognition, all those conversions of raw data that captures something about the human being into meaningful, actionable information.

The grand challenge there is emotion recognition. There have been a lot of companies and ideas claiming that we've somehow cracked emotion recognition, that we're able to determine the mood of a person. But really, for those who were here last year with Lisa Feldman Barrett, if you're very honest and you study emotional intelligence, emotion, and the expression of emotion, it's a fascinating area, and we're not even close to being able to build perceptual systems that detect emotion well. What we're mostly doing is detecting very simple facial expressions that correspond to our storybook versions of emotions: smiling, crying, frowning in a caricatured way. So consider a system with high accuracy at real emotion recognition, stated here as an AI system that solves the binary classification problem of whether you want to be left alone or not with 95 percent accuracy, after collecting data about you for 30 days. That, I think, is a really clean formulation of exactly the kind of human understanding we need to be able to build into our learning models, and we're very far away from it, especially the long temporal aspect, being able to integrate data over a long period of time.

The second part of human-robot interaction in real-world operation is the experience. This is where we're just now beginning to consider the interactive experience: how do we create a rich, fulfilling experience? We have autonomous vehicles, for example, semi-autonomous vehicles, whether that's Tesla, Volvo, or Super Cruise in the Cadillac, a bunch of systems that now have greater and greater degrees of automation in the car, and the human gets to interact with that AI system, and we're trying to figure out how to make that a rich, fulfilling experience. Currently, in the Volvo system, that experience is more limited: there's a little icon, and it's a more traditional driving situation. In the Tesla you have a much bigger display of what's going on. In the Cadillac Super Cruise system there's a camera looking at your eyes, determining whether you're awake or not, paying attention or not. There's an experience there that we're trying to create, and in the Tesla case the miles are racking up; we have real data here at MIT, where we're studying this exact interaction. There are now over a billion miles driven in Teslas, and on the fully autonomous side, Waymo has now reached 10-plus million miles driven autonomously, and a lot of people are experimenting with this. But it's that collaborative interaction, that going back and forth: the AI system being able to express its degree of uncertainty about the environment, to express when it needs help, to communicate its limitations and capabilities, to trade off control, to seek human supervision. There's a dance there that takes into consideration everything from neurobiological research to psychology to deep learning to pure robotics to HRI, the human-robot interaction aspects.
One grand challenge: Tesla has now driven 1 billion miles under Autopilot, under the semi-autonomous mode. The grand challenge here is when we start getting to the kind of mileage we see in the United States every year, when we get into the hundreds of billions of miles driven semi-autonomously, when we get to see teenagers, 16, 17, 18, using these systems for the first time, and older folks who don't necessarily drive or use any kind of AI in their lives getting to use these systems, and we start to explore that aspect. That's the real challenge. And of course there's the Turing test, the old Turing test now reimagined by the Alexa Prize challenge of social bots. Natural language is such a beautiful thing with which to explore human-robot interaction, on both the audio side and the text side, and passing the Turing test in a real way, where you want to have conversations with a robot for prolonged periods of time, maybe more than with some of your other friends, is the true grand challenge there.

On the other side of friends is the risk, the catastrophic risk that's possible when you have an AI system learning from data. The near-term direction of research is purely the human supervision of AI decisions in terms of safety and ethics. For a lot of systems, like cars or medical diagnosis and so on, there's a safety-critical aspect whose safety we want to be able to supervise. And there are ethical decisions, in terms of who gets a loan or not, who gets a certain criminal penalty or not; to any degree that AI systems are incorporated into those, you have to consider ethical questions. Even for crude, low-level perception systems like face recognition, you want to make sure your face recognition systems are not discriminating based on skin color or gender or age and so on. You want to make sure that at that basic, fundamental level of ethics, the systems are trained in a way that maintains our human values, the better angels of our nature, the brighter aspects of our values. And beyond maintaining values, which is looking at the mean of the distribution, we also want to control the outliers, so the AI system doesn't do anything catastrophic. For the unintended consequences, when something happens that you didn't anticipate, you want to be able to put boundaries on that.

The grand challenge there really all boils down to the ability of an AI system to say that it's uncertain about something, and that measure of uncertainty has to be good. It has to be able to make a prediction always accompanied by uncertainty, even on things it hasn't seen before. That's the real challenge: to be trained on cats and dogs, then see a giraffe and say, "I'm not sure what that is." We're quite far away from that; right now it would probably confidently say it's a dog, depending on the giraffe. But we want AI systems to have extremely high accuracy in determining their own uncertainty, to know what they don't know, because from that comes the supervision, from that comes the ability to stop, when faced with things it's uncertain about, before catastrophic events.

The first aspect of real-world operation is understanding the human. One of the places where deep learning has really shined is the perception problem. It all begins with the ability to look at raw data and convert it into meaningful information. That's where understanding the human comes in. Not the kind of understanding you gain when you're in a relationship with somebody, when you're friends with somebody over a long period of time and come to know their quirks, limitations, and capabilities, that's really fascinating, but the first step is just to be able to, when you see them, recognize who they are and what's on their mind.
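The "know what you don't know" requirement above, trained on cats and dogs, shown a giraffe, and able to say "I'm not sure", can be sketched as follows. Plain softmax entropy is used here purely for illustration; in practice it is often overconfident on out-of-distribution inputs, which is exactly the open problem. The logits and threshold are invented.

```python
import math

# Attach an uncertainty estimate to every prediction and abstain when it
# is too high. Softmax entropy is a simple (and known-to-be-imperfect)
# stand-in for a well-calibrated uncertainty measure.

CLASSES = ["cat", "dog"]
ENTROPY_THRESHOLD = 0.6  # nats; above this, say "not sure"

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_or_abstain(logits):
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    if entropy > ENTROPY_THRESHOLD:
        return "not sure"  # defer to a human supervisor
    return CLASSES[probs.index(max(probs))]

print(predict_or_abstain([4.0, 0.5]))  # confident -> "cat"
print(predict_or_abstain([1.1, 1.0]))  # ambiguous -> "not sure"
```

The catch, as noted in the lecture, is that a network shown a giraffe may still produce confidently dog-like logits, so this gate only works as well as the uncertainty estimate feeding it.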
What is their body language? What are they saying with their mouth? All those basic raw perception tasks are where deep learning really shines, and I'd like to cover the state of the art in those various perception tasks. So first, face recognition. Now, there's a full slide presentation behind this and I'm skipping around; the full slide presentation has the following structure for each of these topics. It has the motivation, description, excitement, worry, and future impact as the first part. Then there are five papers: paper one is the quote-unquote old-school seminal work that opened the field; paper two is the early progress that defined the field; paper three is the recent breakthrough, often associated with deep learning; paper four is the current state of the art; and paper five is the thing, or the possible set of things, that defines the future direction. And then there are the open problems in the field, where future research is very much needed. That's the structure of every topic I'll cover here, as quickly as possible.

Face recognition. So what is it? The face contains so much rich information about the state of the human being that understanding the human being really starts at the face. Detecting the face is the first step: detecting the body, and then that there's a head on top of that body. Then there's the task of face recognition, an exceptionally active area of research because it has a lot of applications, and through that research we're now able to study a lot of aspects of how we perform perception on the face. Recognition, purely stated, is recognizing the identity of a human face: who is this? Detection is just detecting a face; recognition means there's a database of identities, what is it, seven billion of them on Earth, and you're trying to determine which of them it is, which of the seven billion, or whatever the database is. The face verification problem is what your phone uses when you unlock it with your face: is it you or not, is it Lex or somebody else? It's a database of one person versus everybody else. There are a lot of applications here, obviously, from identification to all the security aspects of using the face as a sort of fingerprint of your identity, and all the interactive elements of AI systems and software-based systems in this world.

Okay, so why is it hard? All the usual computer vision problems come in: lighting variation, pose variation. Computer vision is just really hard: you get these raw numbers and you have to infer so many things that we humans take for granted. So there's the basic computer vision stuff, but there's more on top of that. Faces are like cats versus dogs: there are thousands of breeds of dogs and thousands of breeds of cats, and in that same way faces can look very similar to each other, so the classes you're trying to separate can be very, very close together and intermingled. There's a lot of face data available now, because of the applications, because of the financial benefits of such datasets, but for any one individual, unless you're Brad Pitt or Angelina Jolie or a celebrity, there are not many samples available, so the individual on whom the classification is to be made often comes with very little data. Then there's a lot of variation: in a face recognition task you have to be invariant to all the hairstyles, everything that changes over time, the weight gain, the weight loss, the beard you decided to grow, the glasses you wear sometimes and not others, different styles of glasses, makeup or no makeup. All of these are still you, still the same identity, and you have to be able to classify that, and the kind of accuracy required, especially for security applications, is extremely high.
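The verification problem described above, one enrolled identity versus everybody else, is commonly framed as a distance check between face embeddings: a network maps a face image to a vector, and verification compares that vector against the enrolled user's. The sketch below assumes a hypothetical embedding step has already produced the vectors; the numbers and threshold are invented for illustration.

```python
import math

# Face verification as an embedding-similarity check. The embeddings and
# MATCH_THRESHOLD here are made-up toy values, not from any real system.

MATCH_THRESHOLD = 0.8  # cosine similarity above this counts as a match

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(probe_embedding, enrolled_embedding):
    """True iff the probe face matches the enrolled identity."""
    return cosine_similarity(probe_embedding, enrolled_embedding) >= MATCH_THRESHOLD

enrolled = [0.9, 0.1, 0.4]        # owner's enrolled face embedding
same_person = [0.85, 0.15, 0.45]  # new photo of the owner
impostor = [0.1, 0.9, 0.2]        # somebody else

print(verify(same_person, enrolled))  # True
print(verify(impostor, enrolled))     # False
```

The threshold is where the security trade-off lives: raising it lowers the false-accept rate at the cost of more false rejects, which is why the accuracy bar for security applications is so high.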
There's a lot of possibility, but there's also a lot of concern. So, the future impact: utopia, dystopia, and the more reasonable middle path. The face provides a very user-friendly way of letting your devices recognize you and say hello. Voice is certainly another, but one of the most powerful signals for classifying identity at a distance is the face. So what does that mean? The utopian view, the brightest possible future, is that you can use your face as a passport. You replace the license and all the security measures we put up, from the passwords on our devices to the credit card and so on. Apple Pay becomes face pay: you show up, and it automatically connects to all your devices, all your banking information, and so on. Obviously the flip side, just rephrasing that sentence, is dystopian: complete violations of privacy, being watched at any time, being tied to your Facebook and social media and all your devices, being identifiable anywhere, making it impossible for you to hide from society. Those are the fundamental aspects of privacy that many of us value greatly. The middle path is that it's really just a useful way to unlock your phone.

The recent breakthroughs here started with DeepFace. The essential idea there is applying deep neural networks to the task of face recognition. As with a lot of the breakthroughs on the perception side, we're not covering the old-school papers and the full historical context here; the biggest breakthroughs came with deep learning in 2006, 2007, 2008, the last ten years or so, and the same is true of face recognition. DeepFace was the first big application that achieved near-human performance on one of the big benchmarks at the time, Labeled Faces in the Wild, by using a very large dataset to form a good representation. The state of the art, or at least close to the state of the art, is FaceNet. The key idea there is using those same deep architectures to now optimize for the representation itself directly. The notebook we'll be putting out, or have shared with some of you, for the assignment describes face recognition. The challenge is that it's not like the traditional classification problem. You have to form an embedding of the face into a small, compressed vector, such that in that embedding, faces that are similar (identities that match) are close in the Euclidean sense, and people that are very different are far away. You then use that embedding to do the classification. That's really the only way to deal with datasets for which you have so little information on any one individual person. And so FaceNet optimizes that embedding in a way that directly optimizes the Euclidean distance between non-matching identities.

There's still a lot of excitement about face recognition. There are a lot of benchmark competitions and a lot of work in this area, and bigger, badder networks and more data really are among the main ways to crack this problem. There's a public large-scale dataset with 672,000 identities and 4.7 million photos; that's from 2017, and it just keeps scaling up and up. Now, we also have to be honest here about the possible future directions of work: even though the benchmarks are growing, that's still a tiny subset of the people in the world. We're still not quite there in being able to do general face recognition applicable to the entirety of the population, or a large swath of the population of the world.

In this brief coverage of the topic, we're not covering all the aspects of the face, especially temporal ones, that are useful in face recognition, or useful for seeing a lot of things about the face: FACS, the Facial Action Coding System, the different kinds of facial expressions.
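FaceNet's triplet loss, which pushes that Euclidean structure into the embedding, can be sketched in a few lines. The margin value and the toy 2-D embeddings below are illustrative only (real FaceNet embeddings are 128-D):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on embedding vectors: the anchor should
    be closer (in squared Euclidean distance) to the positive (same
    identity) than to the negative (different identity) by `margin`.
    The margin value here is illustrative."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

# A well-separated triplet incurs zero loss; a crowded one is penalized,
# and that penalty is the training signal that spreads identities apart.
a = np.array([0.0, 0.0])        # anchor embedding (toy 2-D)
p = np.array([0.1, 0.0])        # same identity, nearby
n_far = np.array([1.0, 0.0])    # different identity, far away
n_near = np.array([0.1, 0.1])   # different identity, too close
```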
Those expressions can then be used to infer emotion and so on: raised eyebrows and all those kinds of things that could provide rich information for recognizing and interpreting the face. And there are other modalities, including 3D face recognition, that we're not covering. There are a lot of exciting areas there; we're just looking at the pure formulation of the face recognition problem on a single 2D image.

The open problems here: the first, not often stated and often misinterpreted, is that most of these face recognition methods start by assuming you have a bounding box around the face, and they're assuming a frontal or near-frontal view. But recognition can happen in all kinds of poses, and it's very interesting to think that the way we recognize our friends, colleagues, parents, and children often uses a lot of cues and contextual information beyond just the pure frontal view of the face. We can do pretty well on profile views, from body language, and so on. How to incorporate all of that into face recognition is open in the field. Then there's the black-box side, which is problematic both for bias and for being able to understand why incorrect decisions are made: making these face recognition systems more interpretable. And finally, privacy: the ability to collect the kind of data on which face recognition would perform extremely well while not violating the fundamental aspects of privacy that we value.

Activity recognition: taking the next step forward into the richer temporal context of what people do. Again, the same structure, from recent breakthroughs to the future direction of work. What is it? It's classifying human activity from images or from video. Why is it important? It provides context, depending on the level of abstraction of the activity, for understanding the human. What are they doing? Are they playing baseball, are they singing, are they sleeping, are they putting on makeup, knitting, mixing butter, and so on? Why is it hard? Again, all the usual problems in image recognition apply, but the kind of data we're dealing with is just much larger. With video, the richness of possibilities that define what an activity is is much larger, so the complexity is much larger. It's often difficult to quantify motion, because the fundamental aspect of activity is change in the world, the motion of things. It's difficult to determine the dynamics, the physics of the world, especially from a 2D view: what's background information, what's noise, and what's essential to understanding the activity. And there are subjective, ambiguous elements of activity. When does a particular activity begin? When does it end? What about all the gray areas where you're partially engaging in the activity? When you start to annotate these things and try to do the detection, it becomes clear that sometimes an activity is only partially undertaken, and the beginning and the end are fuzzy.

Future impact: utopia, dystopia, middle path. The impact here comes from being able to understand the world in time and being able to predict. The utopian possibility is that the contextual perception that can occur here can enrich the experience between the human and the robot. The dystopian view, the flip side, is that being able to understand human activities could let robots sever the relationship; it can damage the human-robot interaction to the point where they just do their own thing. The middle path is simply finding useful information in massive amounts of data, like YouTube. There's now a YouTube video dataset: being able to identify what's going on in a video, being able to infer rich, useful semantic information.

So what do we do with video? How do we do perception on video? The recent breakthrough came with deep learning and C3D: 3D convolutional neural networks that take a sequence of images and are able to determine, in an end-to-end way, the action that's going on in the video. The state of the art comes from a slightly different architecture that takes in two streams: one is the RGB image data, the other is optical flow data, which really focuses on the motion in the image. That paper opened the wave of two-stream networks. The figure from that paper shows the different architectures: on the far right is the two-stream architecture, C3D is shown under (b), taking in the sequence of images, and the first one is LSTMs. These are all just different architectures for how you allow a learning model to capture the dynamics in the data.

The future possibilities have to do, literally, with the future: being able to take single images or sequences of images and predict the future. It's very interesting to think about. In our ability to hallucinate the future, to generate the future from images, you start to think about what the defining qualities of activities are, and in this way you can augment data and train much more accurate action recognition systems.

Topics not covered: the localization of activity in video. Action recognition, purely defined, is: I give you a clip and you tell me what's going on in this clip. But if you take a full YouTube video, you want to be able to localize, to find all the times when particular activities are going on. It could be multi-label: multiple activities going on at the same time, beginning and ending asynchronously. And then there's the richer three-dimensional (or 2D) classification of activity based on human movement: looking at skeletons, like from a Kinect, from 3D sensors. Skeleton-based action recognition from such sensors provides you more than just the 2D image data.
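The core C3D idea, convolving across time as well as space, can be illustrated with a naive single-channel 3D convolution. This is a from-scratch sketch for intuition only; real C3D stacks many such layers with learned, multi-channel kernels:

```python
import numpy as np

def conv3d(clip, kernel):
    """Naive 'valid' 3D convolution of a single-channel video clip of
    shape (T, H, W) with a spatio-temporal kernel of shape (kT, kH, kW).
    Each output value pools information across several frames at once."""
    T, H, W = clip.shape
    kT, kH, kW = kernel.shape
    out = np.zeros((T - kT + 1, H - kH + 1, W - kW + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(clip[t:t+kT, y:y+kH, x:x+kW] * kernel)
    return out

clip = np.ones((16, 8, 8))          # a 16-frame grayscale clip
kernel = np.ones((3, 3, 3)) / 27.0  # a 3x3x3 averaging filter (C3D's kernel size)
feat = conv3d(clip, kernel)         # spatio-temporal feature map
```

Note how the output shrinks along the time axis as well as the spatial axes: the temporal dimension is treated exactly like a spatial one, which is what lets the network learn motion features end to end.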
The open problems: activity recognition is more than just the way we move our body, or, if it's baseball, a ball in your hand and hitting it with a baseball bat. It also has to do with context. Sitting down, working, looking at something, picking up an item: the meaning of these can sometimes change profoundly based on the other objects in the scene and the activity of other people in the scene. Being able to work with that kind of context is a totally open problem. It requires reducing a very complex real-world context into something where you can clearly identify an activity.

Body pose estimation is the task of localizing the joints that form the skeleton of the human body: inferring from visual information the positions of the different joints. Why is it important? To be able to understand body language, the rich information carried by the body of the human being, from reading body language to animation to activity recognition. It's just a useful representation of the human body, whether you're analyzing pedestrians or working in interactive environments, in human-robot interaction, where being able to understand what the heck the human is trying to do makes the body pose really useful. It's hard because when you look at a 2D image projection of the body, figuring out how the raw pixels map to the actual three-dimensional orientation of the human joints is a high-dimensional optimization problem, on top of the usual computer vision challenges of pose, lighting, and so on.

The future impact: it's really exciting for interactive environments, for a robot to be able to know the position of the human body it's trying to interact with. Whether it's a robot that's trying to get its favorite human a beer, or whatever your favorite choice of drink, it has to be able to find where your hand is so it can do the handoff. Same thing in the car: you have to determine whether the person's hands are on the steering wheel, and whether their head orientation is such that they're able to physically take control of the vehicle. That's a really exciting set of possibilities, and there are applications in sports, CGI, video games, and all the settings where robots and humans have to work together. The dystopian view you can imagine, of course, is that being able to localize all those joints means robots that are able to more effectively hurt humans, and that's always a huge concern, always the dark dystopian view of a world with so much AI in it. The reality is that it's just richer, more fulfilling HCI that takes advantage of not just the stuff coming from the face but also the body of the human the robot is interacting with.

It started with deep learning being applied to the body pose estimation problem in 2014 with DeepPose. The key idea there is looking at the holistic human pose estimation problem: detecting all the different joints of a single person in an image. The power of deep learning is that you no longer have to do handcrafted, expert-engineered features; it automatically determines a set of features, with all the parts detected for you, so this highly complex problem is all solved with data. The state of the art, from 2017 and beyond (there have been a few papers from CMU along this line), is real-time multi-person 2D pose estimation done in a bottom-up way. You detect individual joints first: all the knees in the picture, all the elbows, all the shoulders, all the wrists, and so on. Then you stitch them together using part affinity fields, asking what grouping is most likely. So if you find 17 elbows in a picture, you then have to figure out which elbow belongs to which person. That actually turns out to be an extremely powerful way to detect body pose, especially for multiple people, and especially for dealing with occlusions. It's really interesting.
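The bottom-up stitching step can be caricatured with a toy example. The real system scores candidate limbs with part affinity fields; the greedy nearest-neighbor rule and the coordinates below are a simplified stand-in, purely for illustration:

```python
def group_joints(elbows, shoulders):
    """Toy bottom-up grouping: greedily assign each detected elbow to its
    nearest unclaimed shoulder. The real OpenPose-style system instead
    scores candidate limbs with part affinity fields; this distance rule
    is only a simplified stand-in for illustration."""
    pairs, free = [], list(range(len(shoulders)))
    for ei, (ex, ey) in enumerate(elbows):
        if not free:
            break
        best = min(free, key=lambda si: (ex - shoulders[si][0]) ** 2
                                        + (ey - shoulders[si][1]) ** 2)
        free.remove(best)
        pairs.append((ei, best))
    return pairs

# Two people in frame: person A near x=0, person B near x=10.
elbows = [(0.0, 1.0), (10.0, 1.2)]
shoulders = [(10.0, 0.0), (0.2, 0.0)]  # detected in arbitrary order
```

Here `group_joints` correctly matches each elbow to the shoulder of the same (toy) person, even though the detectors returned the joints in unrelated orders, which is the essence of the bottom-up approach.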
And because of that separation of the detections, it's also able to run in real time, which is really exciting. A possible future direction is using much more information, using deformable models of the human body, so not just the skeleton but rich volumetric information, to do the detection, and then optimizing for the most likely orientation of the body. The open problem in the field is the fact that pose is not a thing that happens in a single image. Pose happens as part of human behavior, as part of movement over time. Here's Monty Python's Ministry of Silly Walks: people walk in funny ways. We collect a lot of data on pedestrians, and I can tell you that people walk in different ways and position their bodies in different ways. The temporal aspects of human motion are, for the most part, not incorporated into the body pose estimation problem, and they should be. There are a lot of exciting possibilities in capturing the temporal dynamics.

There are a lot of awesome slides here that I'm just skipping through: speech recognition, where 2018 was really big, recommender systems for Netflix, OkCupid, and so on. Each one of the topics I mentioned briefly today will have a separate mini lecture; I taught an entire course on this at CAI last year. Deep learning for understanding the human is a topic I'm really excited about, because understanding the human is really the first step for a machine to be able to interact in a rich way with a human being, and it's also the area where the most near-term impact can happen: a system being able to effectively detect what a human being is up to, what they're thinking about, how to best serve them, and enrich the experience of interacting with that human.

Let me jump to AI safety, and then the interactive experience between humans and robots, to give examples of some work, some research in that direction that I'm really excited about. AI safety, at the very basic level: there's an AI system making decisions, and we want human beings to supervise those decisions. We've done quite a bit of work here at MIT on that aspect of supervising machines, with arguing machines, and OpenAI has done work on safety by having machines debate each other. The idea is that you can achieve safety by not giving ultimate power to any one decision maker. The disagreement that emerges when two AI systems, or multiple systems, have to make decisions and agree with each other allows us to produce a signal of uncertainty, based on which human supervision can be sought. Without that, when we have a state-of-the-art black-box AI system that does something like drive a car, all we have is a system that just runs, and we're supposed to have faith that it's always going to be right. We don't have any uncertainty signal coming from the system. So the idea of arguing machines, which we've developed and are working on, is to have an ensemble of AI systems where, when a disagreement is detected, human supervision is sought.

When you have a system like Tesla Autopilot (here we've instrumented a Tesla vehicle), it's telling you nothing about how uncertain it is about the decisions it's making. Once the system is on, it's steering the car for you, and in very rare cases it will just disengage, but no matter what, it's not showing you the degree of uncertainty it has about the world around it. The way we create that signal of uncertainty is by adding another, in this case end-to-end, vision system that looks at the external environment and makes steering decisions, and whenever a disagreement between the two is detected, human supervision is sought. As shown in the plot, we can predict in this way, with high accuracy, the times when the driver chose to disengage the system because they were uncomfortable. You're using this mechanism to detect risky, challenging situations.
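At its core, the arguing-machines signal reduces to thresholding the disagreement between two independent models. A minimal sketch, where the steering units and the threshold value are made up for illustration:

```python
def needs_supervision(steer_primary, steer_secondary, threshold=0.1):
    """Flag a moment for human supervision when two independently trained
    steering models disagree by more than `threshold`. The units and the
    threshold here are illustrative; the arguing-machines idea is the
    disagreement rule itself, not these particular numbers."""
    return abs(steer_primary - steer_secondary) > threshold

# Both models agree on a gentle left turn: no escalation needed.
# The models pull in opposite directions: seek human supervision.
```

The point of the design is that neither model needs to report a calibrated confidence; the uncertainty signal comes for free from the disagreement between the two black boxes.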