Turing Test: Can Machines Think?

MGW_Qcqr9eQ • 2020-04-27

Kind: captions
Language: en
in this video I propose to ask the
question that was asked by Alan Turing
almost seventy years ago in his paper
Computing Machinery and intelligence can
machines think this is the first paper
in a paper reading club that we started
focused on artificial intelligence but
also including mathematics physics
computer science you know science all
the scientific and engineering
disciplines on the surface this is a
philosophical paper but really it's one
of the most impactful and important
first steps towards actually engineering
intelligent systems by providing a test
benchmark that we call today the Turing
test of how we can actually know
quantifiably that a system has become
intelligent so I'd like to talk about an
overview of ideas in the paper provide
some of the objections inside the paper
and external to the paper consider some
alternatives to the test proposed within
the paper and then finish with some
takeaways like I said the title of the
paper was Computing Machinery and
intelligence published almost 70 years
ago in 1950 author Alan Turing and to me
now we can argue about this on the slide
I say it's one of the most impactful
papers to me it probably is the most
impactful paper in the history of
artificial intelligence while only being
a philosophy paper I think the number of
researchers from inside computer science
and from outside that it has inspired and
made to dream at a collective intelligence
level of our species that this
is possible I think is immeasurable for
all the major engineering breakthroughs
and computer science breakthroughs and
papers stretching all the way back to
the 30s and 40s with even the work by
Alan Turing with the Turing machine
some of the mathematical foundations of
computer science to today with deep
learning a sequence of papers from the
very practical
AlexNet paper to the backpropagation
paper so all of these papers that
underlie the actual successes of the
field I think the seed was planted
the dream was born with this paper
and it happens to have some of my
favorite opening lines of any paper I've
ever read it goes I propose to consider
the question can machines think this
should begin with the definitions of the
meaning of the terms machine and think
the definition might be framed so as to
reflect so far as possible the normal use
of the words but this attitude is
dangerous if the meaning of the words
machine and think are to be found in
examining how they're commonly used it
is difficult to escape the conclusion
that the meaning and the answer to the
question can machines think is to be
sought in a statistical survey such as a
Gallup poll but this is absurd instead
of attempting such a definition I shall
replace the question by another which is
closely related to it and is expressed
in relatively unambiguous terms and he
goes on to define the imitation game the
construction that we today call the
Turing test which goes like this there's
a human interrogator on one side of the
wall and there's two entities one a
machine one a human on the other side
and the human interrogator communicates
with the two entities on the other side
of the wall by written word by passing
notes back and forth and after some time
of this conversation the human
interrogator is tasked with making a
decision which of the other two entities
is a human and which is a machine I
think this is a powerful leap of
engineering which is take an ambiguous
but a profound question like can
machines think and convert it into a
concrete test that can serve as a
benchmark for intelligence but there's
echoes in this question to some of the
other profound questions that we often
ask so not only can machines think but
can machines be conscious can machines
fall in love can machines create art
music poetry can machines enjoy a
delicious meal
piece of chocolate cake I think these
are really really important questions
but very difficult to ask when we're
trying to create a non-human system that
tries to achieve human-level
capabilities so that's where Turing
formulates this imitation game and his
prediction was that by the year 2000 or
in 50 years since the paper that a
machine with 100 megabytes of storage
will fool 30% of humans in a five-minute
test of conversation another broader
societal prediction he made which i
think is also interesting is that people
will no longer consider a phrase like
thinking machine contradictory
so basically artificial intelligence at
a human level becomes so commonplace that
we would just take it for granted and
the other part that he goes into at length
towards the end of the paper to describe
is that he believes that learning machines
or machine learning will be a critical
component of this success I think it's
also useful to break apart two implied
claims within the paper open claims open
questions one is that the imitation game
as Turing proposes is a good test of
intelligence and the second is that
machines can actually pass this test so
when you say can machines think you're
both proposing an engineering benchmark
for the word think and raising the
question can machines pass this
benchmark one of the perhaps tragic but
also exciting aspects of this whole area
of work is that we still have a lot of
work to do
so throughout this presentation I will
not only describe some of the ideas in
the paper and outside of it in the years
since but also some of the open
questions that remain both at the
philosophical the psychological and the
technical levels so here the open
question stands is it even possible to
create a test of intelligence for
artificial systems that will be
convincing to us or will we always raise
the bar a corollary of
that question is looking at the
prediction that was made that people
will no longer find the phrase thinking
machines contradictory why do we still
find that phrase contradictory why do we
still think that computers are not at
all intelligent for many people the game
of chess
was seen as the highest level of
intelligence in these early days in fact
we assign a lot of intelligence to Garry
Kasparov for being one of the greatest
if not the greatest chess players of all
time as a human why do we not assign at
least an inkling of that to IBM Deep Blue
when it beat Garry Kasparov now of course
you might start saying there's a
brute-force algorithm or in the case of
AlphaGo and AlphaZero
you know how the learning mechanisms
behind those algorithms work when they
mastered the game of Go and the game of
chess
and we'll get to some of those
objections but there's something deeply
psychological within those objections
that almost fear an artificial
intelligence that passes the test so the
Turing test is very interesting as a
thought experiment as a philosophical
construct but it's also interesting as a
real engineering test and one of the
implementations of it has been called
the Loebner Prize which has been running
since 1991 to today and the awards
behind it the award structure is 25,000
dollars for a system that using text
alone passes the test and $100,000 for one that
uses other modalities like visual and
auditory input the rules of the
competition have changed through the
years but they currently are as follows
it's a 25-minute conversation and in
order to win to pass the test you have
to fool 50 percent of the judges with
which the system communicates Mitsuku
and Rose from Steve Worswick and Bruce
Wilcox have been dominating the past ten
years winning all but one of the years
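As a rough illustration of the pass rules mentioned above, here is a minimal Python sketch. It assumes each judge's verdict is recorded as a boolean (True when the judge mistook the machine for the human) and that "fool 30 percent" and "fool 50 percent" mean "at least that fraction". The function and constant names are made up for this example and do not come from the paper or from the competition rules.

```python
from fractions import Fraction

# Thresholds as described in the talk; reading them as "at least" is an assumption.
TURING_PREDICTION = Fraction(30, 100)  # Turing's prediction for a five-minute test
LOEBNER_RULE = Fraction(50, 100)       # Loebner Prize rule as described in the talk


def fooled_fraction(verdicts):
    """Return the exact fraction of judges who were fooled.

    verdicts: list of bools, True when a judge mistook the machine for the human.
    """
    if not verdicts:
        raise ValueError("need at least one judge verdict")
    return Fraction(sum(verdicts), len(verdicts))


def passes(verdicts, threshold):
    """True when the fooled fraction reaches the given threshold."""
    return fooled_fraction(verdicts) >= threshold


if __name__ == "__main__":
    # Hypothetical panel with made-up data: 10 of 30 judges fooled.
    panel = [True] * 10 + [False] * 20
    print(fooled_fraction(panel))
    print(passes(panel, TURING_PREDICTION))
    print(passes(panel, LOEBNER_RULE))
```

The same panel can clear one bar and miss the other, which is why the exact rules of a given competition matter so much when someone claims the test was passed.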
some details
Mitsuku and Rose are both mostly scripted
rule-based chatbots so they're not
end-to-end learning systems I believe there
is a little bit of machine learning but
as I understand at the core they're
mostly scripted and on the next slide
we'll go through a transcript of a
conversation with Mitsuku to give you a
sense of what those conversations are
like but I'd like to mention sort of a
tragic note really that as far as I
understand the prize is no longer funded
but money aside it shows that there's
not a broad interest in this
construction both inside and outside the
research community it's really
surprising to me that google deepmind
facebook AI major groups major
individual researchers in the field of
AI have not taken on this challenge I
hope to actually be a small part of
changing that a little bit because I
think there is a lot of value in
considering the Turing test I don't
think it's a distraction and I'll return
to that point later on in this
presentation so you can pause and maybe read
parts of this transcript I'll highlight
on the right at the top of the right
column the human says you're a robot
Mitsuku answers how can a person be a
robot human many people live lives akin
to robots Mitsuku how many more than a
million human yes Mitsuku that seems
like rather a lot human do you think
Southampton will beat Swansea Mitsuku I
think so human why Mitsuku no
reason it's just what I think human you
must have a reason Mitsuku nope
no real reason human which team do you
support
Mitsuku I support Leeds United human
Leeds exclamation point how's the season
going Mitsuku yes it is a city in the
north of England about half a million
people live there it is the same as ever
so you see in this transcript it's just
some little gap of inhumanity that
becomes apparent there's a lingering and
a repeatability of points there's
certain gaps in understanding and
ability to follow tangents all those
kinds of things but it's still not clear
to me as an open question how to make
explicit where exactly the point of the
failure of the test is I believe that
hasn't actually been really researched
that well in these constructions as
opposed to decision making at the very
end of a conversation is this human or
not rather marking parts of a
conversation as more or less human-like
suspicious parts that make you wonder if
this is not human I think it'll be
really interesting to see if it's
possible to make explicit what aspects
of the conversation are the failure
points one of the times that it was claimed
that the Turing test was passed I think
most famously was in 2014 at an
exhibition event that marked the 60th
anniversary of Turing's death Eugene
Goostman fooled 33% of the event judges
and the method he used was to portray a
13 year old Ukrainian boy that had a
bunch of different personality quirks
and obviously the language barrier and
had some humor and a constant sort of
drive towards misdirecting the
conversation back to the places where it
was comfortable doing so there's some
criticism in the community of this event
due to some sort of smoke and mirrors
kind of the PR marketing side of things
that I think is always there with
these kind of exhibition events but
setting that aside I think the
interesting lesson here is that the
parameters the rules of the actual
engineering of the Turing test can
determine whether it contains sort of
the spirit of the Turing test which is
the test that captures the ability of
an agent to have a deep meaningful
conversation so in this case you can
argue that a few tricks were used to
circumvent the need to have a deep
meaningful conversation and 30% of
judges were fooled without rigorous
thorough transparent open domain testing
on the left is a transcript with Scott
Aaronson the famed computer scientist
quantum computing researcher talked to
him on the podcast brilliant guy he
posted some of the conversation that he
had with Eugene he was one of the judges
on his blog that I think is really
interesting so it shows that the judge
the interrogator when they're an expert
they can drive they can truly put the
bot to the test Scott really
didn't allow the kind of misdirection
that Eugene nonstop tried to do and
you could see that in the transcript
Scott refuses to take the misdirection
so as I mentioned despite the waning I
guess popularity of the Loebner Prize and
the Turing test idea in general Google
has published a paper and proposed a
system called Meena
that's a chatbot that's an end-to-end
deep learning system
the goal of the 2.6 billion
parameters is to capture the
conversational context well to be able
to generate the text that fits the
conversation context well now one
interesting aspect of this besides being
a serious attempt at creating a
learning-based system for open domain
conversational agents is that a new
metric is proposed and it's a two-part
metric of sensibleness and specificity
now sensibleness is that a bot's
responses have to make sense in context
they have to fit the context just to give
you a sense humans have 97
percent sensibleness so the ability to match
what we're saying to the context
now the reason you need another side of
that metric is because you can be
sensible you can fit the context by
being boring by being generic by making
statements like I don't know or that's a
good point so it's these generic
statements that fit a lot of different
kinds of contexts so the other side of
the metric is specificity basically the
goal being there is don't be boring is
to say something very specific to this
context so not only does it match the
context but it captures something very
unique to this particular set of lines
of conversation that form the context I
think it's fair to say that the
beauty the music the humor the wit
of conversation comes from that ability
to play with the specifics the
specificity metric so both are really
important
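The two-part metric described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: each bot response gets two human labels (sensible and specific), a response that is not sensible is never counted as specific, and the combined score is the plain average of the two rates. The averaging rule reflects my understanding of how the Meena paper defines the combined score, but treat it as an assumption here. The names Label and ssa_scores and the sample labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Label:
    sensible: bool  # does the response make sense in context?
    specific: bool  # is it specific to this context rather than generic?


def ssa_scores(labels):
    """Return (sensibleness, specificity, average) as fractions in [0, 1]."""
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    sensibleness = sum(l.sensible for l in labels) / n
    # Assumption: a response only counts as specific if it is also sensible.
    specificity = sum(l.sensible and l.specific for l in labels) / n
    return sensibleness, specificity, (sensibleness + specificity) / 2


if __name__ == "__main__":
    # Made-up labels for four responses.
    labels = [
        Label(True, True),    # sensible and specific
        Label(True, False),   # sensible but generic, like "I don't know"
        Label(False, False),  # does not fit the context
        Label(True, True),
    ]
    print(ssa_scores(labels))
```

A bot that always answers "I don't know" would score high on the first rate and near zero on the second, which is the reason for having both sides of the metric.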
humans achieve 86% sensibleness and
specificity Meena achieves 79
percent compared to Mitsuku which achieves
56% now take this all with a grain of
salt I want to be very careful here
because there is also not to throw shade
but it's closed source currently and
there's a little bit of a feeling of a
PR marketing situation here naturally
perhaps the paper is made in such a way
the methodology and the results are made
in such a way that benefit the way the
learning framework was constructed
now that's I don't want to over
criticize that because I think there's
still a lot of interesting ideas in this
paper but in terms of looking at the
actual percentages of 86 percent human
performance and 79 percent Meena
performance I think we're quite a way
from being able to make conclusive
statements about a system achieving
human level conversational capabilities
so those plots should be taken with a
grain of salt but the actual content of
the ideas I think is really
interesting I think quite obviously the
future long term but hopefully short
term is in end-to-end
learning-based approaches to open domain
conversation so just like Turing
described funny enough 70 years ago in
this paper that machine learning
would be essential to success I
believe the same it's a lot less
interesting and revolutionary to think
so today but I believe that machine
learning will also need to be a very
central part of achieving human level
conversational capabilities so let's
talk through some objections nine of
them are highlighted by Turing himself
in his paper here I provide some informal
highly informal summaries the first
objection is religious which connects
thinking to quote unquote the soul and
God presumably is the giver of the soul
to humans now Turing's response to that is
God is all-powerful there is no reason
why he can't assign souls to anything
biological or artificial so it doesn't
seem that whatever mechanism by which
the soul arrives in the human cannot
also be repeated for artificial
creatures the second objection is the
quote unquote head in the sand it's a
bit of a ridiculous one but I think it's
an important one because it keeps coming
up often even in today's context
highlighted by folks like you know Elon
Musk Stuart Russell and so on the head
in the sand objection is that AGI is
scary so human level and superhuman
level intelligence is
scary today we talk about existential
threats it seems like the world would be
totally transformed if we have something
like that and it could be transformed in
a highly negative way so let's not think
about it because it kind of seems far
away so it probably won't happen so
let's just not think about it that's
kind of the objection to the Turing test
it's so far away it's not worthwhile to
even think about a test for this
intelligence or what human level
intelligence means or what superhuman
level intelligence means the response
quite naturally is that it doesn't
matter how you feel about something on
whether it's going to happen or not so
we kind of have to set our feelings
aside and not allow fear or emotion to
muddle our thinking or distract us from
thinking about it at all the third
objection is from Gödel's incompleteness
theorem saying there's limits to
computation this is the Roger Penrose
line of thinking that basically if a
machine is a computation system it has
limited capabilities in that it can
never be a perfectly rational system
Turing's response to this is that humans
are not rational either they're flawed
nowhere does it say that intelligence
equals infallibility in fact it could
probably be argued that fallibility is
at the core of intelligence the fourth
objection is that consciousness may be
required for intelligence Turing's
response to this is to separate whether
something is conscious and whether
something appears to be conscious so the
focus of the Turing test is how
something appears and so in some sense
humans to us as far as we know only
appear to be conscious we can't prove
that they're actually conscious humans
outside of ourselves and so since humans
only appear to be conscious there's no
reason to think that machines can't also
appear to be conscious and that's at the
core of the Turing test
so the Turing test kind of skirts around
the question of whether something is or
isn't intelligent whether it is or isn't
conscious the fundamental question is
does it appear to be intelligent does it
appear to be conscious so he actually
doesn't respond to the idea that
consciousness is or isn't required
for intelligence he just says that if it
is there's no reason why you can't fake
it and that will be sufficient to
achieve the display of intelligence the
fifth objection is the negative Nancy
objection of machines will never be able
to do X whatever X is you can make it
love joke understand or generate
humor
eat enjoy food create art music poetry
and so on so there's a lot of things we
could put in that X the machines could
never do and basically highlighting our
human intuition about the limitations of
machines just like with the second
objection
naturally the response here is that the
objection that machines will never do X
doesn't have any actual reasoning behind
it is just a vapid opinion based on the
world today refusing to believe that the
world of tomorrow will be different the
sixth objection probably the most
important one the most interesting comes
by way of Ada Lovelace Lady Lovelace the
mother of computer science it's the
basic idea that machines can only do
what we program them to do now this is
an objection that appears in many forms
throughout before Turing and after
Turing and I think it's a really
important objection to think about so in
this particular case I think Turing's
response is quite shallow but it is
nevertheless pretty interesting and
we'll talk about it again later on his
response is well if machines can only do
what we programmed them to do we can
rephrase that statement as saying
machines can't surprise us and when you
rephrase it that way it becomes clear
that machines actually surprise us all
the time a system that is sufficiently
complex will no longer be one of which
we have a solid intuition of how it
behaves even if we built all the
individual pieces of code for those of
you who have programmed things and I've
written a lot of programs
in the initial design stage you have an
intuition about how it should behave
there's a design there's a plan
you know what the individual functions
do but as the piece of code grows your
ability to intuit exactly the mapping
from input to output fades with the size
of the code base even if you understand
everything about the code and even if
you set logical and syntactic bugs
aside the seventh
objection looks to the brain and looks
to the continuous analog nature of that
particular neural network system so
Turing's response to that is sure the
brain might be analog and
digital computers are discrete but if
you have a big enough digital computer
it can sufficiently approximate the
analog system meaning to a sufficient
degree that it would appear intelligent
the eighth objection is the freewill
objection right is that when you have
deterministic rules laws algorithms
they're going to result in predictable
behavior and this kind of exactly
deterministic predictable behavior
doesn't quite feel like the mind that we
know us humans as possessing this kind
of feeling that underlies what's
required for intelligence for a mind I
think is behind the Chinese room thought
experiment that we'll talk about next
so Turing's response here is that
humans very well could be a complex
collection of rules there's no
indication that we're not just because
we don't understand or don't even have
the tools to explore the kind of rules
that underlie our brain doesn't mean
it's not just a collection of
deterministic perfectly predictable sets
of rules objection number nine is kind
of fun
quite possibly Turing is trolling us but
more likely the ideas of mind-reading
extrasensory perception telepathy were a
little bit more popular in his time so
the objection here is what if
mind-reading was used to cheat the test
so basically if human to human
communication through telepathy could be
used then a machine can't achieve that
same kind of telepathic communication
and so that can be used to
circumvent the effectiveness of the test
now
Turing's response to this is well you
just have to design a room that not
only protects you from being able to see
whether it's a robot or a human but also
design a telepathy proof room that
prevents telepathic communication again
could be Turing trolling us but I think
more importantly I think it's a nice
illustration at the time and even still
today that there's a lot of mystery
about how our mind works
if you chuckle and completely laugh off
the possibility of telepathic
communication I think you're assuming
too much about your own knowledge about
how our mind works I think we know very
little about how our mind works it is
true we have very little scientific
evidence of telepathic communication but
you shouldn't take the
next leap and have a feeling like you
understand that telepathic communication
is impossible you should nevertheless
maintain an open mind but as an
objection it doesn't seem to be a very
effective one I wanted to dedicate just
one slide to probably the most famous
objection to the Turing test proposed by
John Searle in 1980 in his paper Minds
Brains and Programs commonly known as
the Chinese room thought experiment and
it's kind of a combination of number
four number six and number eight
objections in the previous slide which
is that consciousness is required for
intelligence the Ada Lovelace objection
that programs can only do what we
program them to do and the deterministic
free will objection that deterministic
rules will lead to predictable behavior
and that doesn't seem to be like what
the mind does so there's echoes of all
those objections that Turing anticipated
all put together into the Chinese room
as a small aside it is now 6 a.m. I did
not sleep last night so this video is
brought to you by this magic potion
called nitro cold brew an excessively
expensive canned beverage from Starbucks
that fuels
me this wonderful Saturday morning
here's to you dear friends okay the
Chinese room involves following
instructions of an algorithm so there's
a human sitting inside a room that
doesn't know how to speak Chinese but
there's notes being passed to them
inside the room from outside in Chinese
and all they do is follow a set of rules
in order to respond to that language so
the idea is if the brain inside the
system that passes the Turing test is
simply following a set of rules then
it's not truly understanding it is not
conscious it does not have a mind the
objection is philosophical so there's
not for my computer science engineering
self there's not enough meat in it to
even make it that interesting it's very
human centric but allow us to explore it
further
so the key argument is that programs
computational systems are formal and so
they can capture syntactic structure
minds or brains have mental content so
they can capture semantics and so the
claim that I think is the most important
the clearest in the paper is that syntax
by itself is neither constitutive of nor
sufficient for semantics so just because
you can replicate the syntax of the
language doesn't mean you can truly
understand it and this is the same kind
of criticism we hear of language models
of today with transformers that OpenAI's
GPT-2 really doesn't understand the
language
it's just mimicking the statistics of it
so well that it can generate
syntactically correct text and even have
echoes of semantic structure that
indicates some kind of understanding but
it doesn't to me that argument is not
very interesting from an engineering
perspective because it just sounds like
saying humans can understand things
humans are special
therefore machines cannot understand
things
it's a very human centric argument
that's not allowing us to rigorously
explore what exactly does understanding
mean from a computational perspective or
put in other words if understanding
intelligence consciousness either one of
those is not achievable through
computation then where is the point that
computation hits the wall the most
interesting open questions to me here
are on the point of faking things or
mimicking or the appearance of things
does the mimicking of thinking equal
thinking does the mimicking of
consciousness equal consciousness does
the mimicking of love equal love this is
something that I think a lot about and
depending on the day go back and forth
but I tend to believe from an
engineering perspective I tend to agree
with the spirit and the work of Alan
Turing in that at this time as engineers
we can only focus on building the
appearance of thinking the appearance of
consciousness the appearance of love I
think as we work towards creating that
appearance we'll actually begin to
understand the fundamentals of what it
means to be conscious what it means to
love what it means to think you may have
even heard me say sometimes that the
appearance of consciousness is
consciousness I think that's me being a
little bit poetic but I think from our
perspective from our exceptionally
limited understanding both problems are
in the same direction so it's not like
if we focus on creating the appearance
of consciousness that's going to lead us
astray in my personal view it's going to
lead us very far down the road of
actually understanding and maybe one day
engineering consciousness and now I'd
like to talk about some alternatives and
variations of the Turing test that I find
quite interesting so there's a lot of
kind of natural variations and
extensions to the Turing test first the
total Turing test proposed in 1989 it
extends the Turing test in the natural
language conversation domain to
perception computer vision and obviously
manipulation of robotics so it takes it
into the
world the interesting question here to
me is whether adding extra modalities
like audio visual manipulation makes the
test harder or easier to me it's very
possible that a test with a narrow
bandwidth of communication such as the
natural language communication the
Turing test is actually harder to pass
than the one that includes other
modalities but anyway one of the
powerful things about the original
Turing test is that it is so simple the
Lovelace test proposed in 2001 builds on
the Ada Lovelace objection to form the
test that says the machine has to do
something surprising that the creator or
the person who's aware of how the program
was created cannot explain so it should
be truly surprising there is also the
Lovelace 2.0 test proposed in 2014 which
emphasizes a more constrained definition
of what surprising is because it's very
difficult to pin down to formalize the
idea of surprise and explain right in
the original formulation of the Lovelace
test but with Lovelace 2.0 it emphasizes
sort of creativity art so on so it's
more concrete than surprise especially
if you define constraints to which
creative medium we're operating in you
basically have to create an impressive
piece of artistic work I think that's an
interesting conception but it takes us
into a land that's much more not less
subjective than the original Turing test
but this brings us to the open and the
very interesting question of surprise
which i think is really at the core of
our conception of intelligence I think
it is true that our idea of what makes
an intelligent machine is one that
really surprises us so when we one day
finally create a system of human level
or superhuman level intelligence
we will surely be surprised so we have
to think what kind of behavior is one
that will surprise us to the core to
me I have many examples in mind that
I'll cover in future videos but one
certainly one of the hardest ones is
humor and finally the truly total Turing
test proposed in 1998 proposes an
interesting philosophical idea that we
should not judge the performance of an
individual agent in an isolated context
but instead look at the body of work
produced by a collection of intelligent
agents throughout their evolution with
some constraints on the consistency
underlying you know the evolutionary
process it's interesting to suggest that
the way we conceive of intelligence
amongst us humans is grounded in the
long arc of history of the body of work
we've created together I don't find that
argument convincing but I do find
interesting as an open
question the idea that we should measure
systems not in the moment or a
particular five-minute period or 20
minute period but over a period of
months and years perhaps condensed in a
simulated context so really increase the
scale at which we judge interactions by
several orders of magnitude that to me
is a really interesting idea you know to
judge AlphaZero's performance not on a
single game of chess but looking at
millions of games and not looking at a
million games for a static set of
parameters but looking at the millions
of games played as the system was
trained from scratch and became better
and better and better there's something
about that full journey that may capture
intelligence so intelligence very well
could be the journey not the destination
I think there's something there it's
very imprecise in this construction but
it struck me as a very novel idea
for a benchmark not to measure
instantaneous performance but
performance over time and the improvement
of performance over time it appears that
there's something to that but I can't
quite make it concrete and I'm not sure
it's possible to formalize in the way that
the original Turing test is formalized
another kind of test is the Winograd
schema challenge which i think is really
compelling in many ways so first to
explain it with an example there's a
sentence really two sentences let's say
the trophy doesn't fit into the brown
suitcase because it's too small and the
trophy doesn't fit into the brown
suitcase because it is too large and the
question is what is too small what is
too large the answer for the small what
is too small is the suitcase is too
small the trophy doesn't fit into the
brown suitcase because it is too small
and then the second question is what is
too large the answer there is the trophy
the trophy doesn't fit into the brown
suitcase because it is too large the
basic idea behind this challenge is the
ambiguity in the sentence can only be
resolved with common-sense reasoning
about ideas in this world and so the
strength of this test is it's quite
clear quite simple and yet requires at
least in theory this deep thing
that we think makes us human which is
the ability to reason at the very basic
level of common sense reasoning the
other nice thing is it can be a
benchmark like we're used to in the
machine learning world that doesn't
require subjective human judges there's
literally a right answer the weakness
here that holds for other similar
challenges in the space is that it's
very difficult to come up with a large
amount of questions I mean each one is
handcrafted and so that means you can't
build a benchmark of millions or
billions of questions it has to be on a
small scale variations of the Winograd
schema are included
in some natural language benchmarks of
today that people use in the machine
learning context the Amazon Alexa
Prize
I think captures nicely the spirit of
the Turing test I think it's actually
quite an amazing challenge and
competition that uses voice conversation
in the wild so with real people and they
can use a I think it's called a social
bot skill on their Alexa devices and I
don't want to wake up my own Alexa
devices but basically say her name and
say let's chat and that brings up one of
the bots involved in the challenge and
then you can have a conversation and
then the bar that's to be reached is for
you to have a twenty minute or longer
conversation with the bot and for
two-thirds or more of the interactions
to be that long so the basic metric of
successful interaction is the duration
of the interaction and as of today we're
still really really far away from that
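The bar described above can be written down as a simple check. This is a sketch under assumptions: the function name `meets_duration_bar` and the sample durations are hypothetical, and the thresholds (20 minutes, two-thirds of conversations) come from the description in this talk rather than from the official competition rules.

```python
from fractions import Fraction


def meets_duration_bar(durations_min, min_minutes=20, required_fraction=Fraction(2, 3)):
    """Return True if enough conversations lasted at least `min_minutes`.

    `durations_min` is a list of conversation lengths in minutes.
    Exact fractions are used so that "exactly two-thirds" is not lost
    to floating-point rounding.
    """
    if not durations_min:
        return False
    long_enough = sum(1 for d in durations_min if d >= min_minutes)
    return Fraction(long_enough, len(durations_min)) >= required_fraction


if __name__ == "__main__":
    # Hypothetical conversation lengths in minutes.
    print(meets_duration_bar([25, 30, 5]))   # two of three conversations reach 20 minutes
    print(meets_duration_bar([25, 5, 5]))    # only one of three does
```

The design choice worth noting is that the metric never inspects what was said; it only records whether people chose to stay in the conversation.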
so why is this a good metric and I do
think it's a really powerful metric as
opposed to us judging the quality of
conversation in retrospect we speak with
our actions so a deep meaningful
conversation is one we don't want to
leave when we have other things
contending for our time when we make the
choice to stay in that conversation
that's as powerful a signal as any to
show that that conversation has content
has meaning is enjoyable I think that
it's what passing the Turing test in its
original spirit actually is and I should
mention that as of today no team has
even come close to passing the Turing
test as it is constructed by the Alexa
prize there are several things that are
really surprising about this challenge
one is that it's not a lot more popular
and two that Amazon chose to limit it to
students only I mean almost making it an
educational exercise as opposed to a
moonshot challenge for our entire
generation of researchers I mentioned
before but I'll say it again here that
it's surprising to me that the biggest
research labs in
industry and academia have not focused
on this problem have not found the magic
within the Turing test problem and the
Alexa Prize as it formulates I
believe the spirit of the Turing test
quite well a very different kind of test
is the Hutter Prize by Marcus
Hutter which I think is really
fascinating on both a philosophical and
mathematical angle underlying it is the
idea that compression is strongly
correlated with intelligence put another
way the ability to compress knowledge
well requires intelligence and the
better you compress that knowledge the
more intelligent you are I think this is
a really compelling notion because then
we can make explicit we can quantify how
intelligent you are by how well you're
able to compress knowledge as the prize
webpage puts it being able to compress
well is closely related to acting
intelligently thus reducing the slippery
concept of intelligence to hard file
size numbers so the task is to take one
gigabyte of Wikipedia data and compress
it down as much as possible the current
best is an eight point five eight
compression factor so down from one
gigabyte to one hundred seventeen
megabytes and the awards for each one
percent improvement you win five
thousand euros I find this competition
just amazing and fascinating on many
levels I think it's a really good
formulation of an intelligence challenge
but it's not a test that's one of its
kind of limitations at least in the
poetic sense that it doesn't set a bar
beyond which we're really damn impressed
meaning it's harder to set a bar like
the one formulated by the Turing test
beyond which we feel it would be human
level intelligence now the bar that's
set by Alan Turing and others
the Loebner Prize Alexa Prize are also
arbitrary but it feels like we're able
to intuit a good bar in that context
better than
being able to intuit the kind of bar we
need to set for the compression
challenge another fascinating challenge
is the abstraction and reasoning
challenge put forth by Francois Chollet
just a few months ago so this is very
exciting it's actually ongoing as a
competition on Kaggle I think with the
deadline in May it's a really really
interesting idea I haven't internalized
it fully yet and perhaps we'll do a
separate video on just this paper alone
and I'll talk to Francois I'm sure on
the podcast and other contexts in the
future about it
I think there's a lot of brilliant ideas
here that I still have to kind of digest
a little bit but let me describe the
high level ideas behind this benchmark
so first of all the name is Abstraction and
Reasoning Corpus or challenge ARC the
domain is in a grid world of patterns
not limited in size but the grid world
is filled with cells that can be of
different colors and the spirit of the
set of tests that Francois proposes is
to stay close to IQ tests so psychometric
intelligence tests that we use to measure
the intelligence of human beings
now the Turing test is kind of at a
higher level of natural language in this
construction of ARC it goes as close as
possible to the very basic elements of
reasoning just like in an IQ test of
patterns it gets to the very core such
that we can then make explicit the
priors the concepts that we bring to the
table of those tests and if we can make
them explicit it reduces the test as
close as possible to the measure of the
system's ability to reason now the
concepts that are brought to this grid
world here's just a couple of examples of
priors that Francois shows in his paper
I recommend highly it's called On the
Measure of Intelligence here prior
concept is not referring to a previous
concept it's referring to a prior set of
knowledge that you bring to the table so
this first row of illustrations of the
two
worlds illustrates the idea of object
persistence with noise so we're able to
understand that large objects when there
is some visual noise occluding our
ability to see them that they still
exist in the world and if that noise
changes the object is still unchanged so
that that idea of object persistence in
the world is a prior that we bring to
the table of understanding this grid
world another prior is on the left at
the bottom is objects are defined by
spatial contiguity so so objects in this
grid world when the cells are the same
color and they're touching each other
they're probably part of the same object
and if there's black cells that separate
those groupings of cells that means
there's multiple objects so this kind of
spatial contiguity of colored cells
defines the entity of the object and on
the right at the bottom is the color
based contiguity which means that even
if the cells of different colors are
touching if their colors are different
that means it likely belongs to a
different object that's the basic prior
and there's a few others by the way just
beautiful pictures in that paper that
make you really think about the core
elements of intelligence I love that
paper worth worth looking at there's a
lot of interesting insights in there
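The spatial and color contiguity priors described above can be sketched as a flood fill that groups touching cells of the same color into one object. This is a minimal illustration, not code from the paper: the name `find_objects`, the use of 0 for black background cells, and 4-neighbor adjacency are all assumptions.

```python
from collections import deque

BACKGROUND = 0  # assumption: 0 stands for a black (empty) cell


def find_objects(grid):
    """Return a list of (color, cells) pairs, one per contiguous object.

    Two cells belong to the same object when they share an edge and have
    the same non-background color.
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == BACKGROUND or (r, c) in seen:
                continue
            color = grid[r][c]
            cells = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                cells.add((y, x))
                # Visit the four edge-adjacent neighbors.
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append((color, cells))
    return objects


# Hypothetical 4x4 grid: colors are small integers, 0 is black.
example_grid = [
    [1, 1, 0, 2],
    [1, 0, 0, 2],
    [0, 0, 3, 2],
    [1, 0, 3, 0],
]

if __name__ == "__main__":
    for color, cells in find_objects(example_grid):
        print(color, sorted(cells))
```

In the example grid, the color-3 cells touch the color-2 cells but form a separate object because the colors differ, and the lone color-1 cell in the bottom row is separated from the other color-1 cells by black cells, so it is its own object.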
just to give you some examples of what
the actual task for the machine in this
test looks like it's similar to the kind
of task we've seen in an IQ test so here
there's three pairings and the task is
for the fourth pairing of images to
generate the grid world that fits the
other three that fits the generating
pattern of the other three so in this
case figure four from the paper a task
where the implicit goal is to complete a
symmetrical pattern the nature of the
task is specified
by the three input-output examples the
test-taker must generate the output grid
corresponding to the input grid of the
test input bottom right so here
you're tasked with understanding in the
first three pairings is that the input
has a perfect global symmetry to it and
also that there's parts of the image
that are missing that can be filled in
order to complete that perfect symmetry
now that's relying on another prior
another basic concept of symmetry which
I think underlies a lot of our
understanding of visual patterns again
so the intelligent system has to have a
good representation of symmetry in
various contexts this is fascinating and
beautiful beautiful images okay another
example figure 10 from the paper a task
where the implicit goal is to count
unique objects and select the object
that appears the most times the actual
task has more demonstration pairs than
these three so figure 10 here from the
paper a task where the implicit goal is
to count unique objects and select the
object that appears the most times so
again there's three pairings you see in
the first one there's three blue objects
and the second one is four yellow
objects and the third one there's three
red objects so you have to figure that
out and then the output is the grid
cells capturing that object that appears
the most times and so apply that kind of
reasoning to complete the output of the
fourth pairing one of the challenges for
this kind of test is it's difficult to
generate but just like I said I think
there's a lot of really interesting
technical and philosophical ideas here
that are worth exploring so let's
quickly talk through a few takeaways so
zooming out
is the Turing test a good measure of
intelligence and can it serve as an
answer to the big ambiguous but profound
philosophical question of can machines
think so first some notes on the
underlying challenges of the Turing test
let's talk about intelligence so if we
compare human behavior and intelligent
behavior it's clear that the Turing test
hopes to capture the intelligent parts
of human behavior but if we're trying to
really capture human level intelligence
it's also possible that we want to
capture the unintelligent irrational
parts of human behavior so it's an open
question whether natural conversation is
a test of intelligence or humanness
because if it's a test of intelligence
it's focusing only on kind of rational
systematic thinking if it's a test of
humanness then you have to capture the
full range of emotion the mess the
irrationality the laziness the boredom
all the things that make us human and
all the things that then project
themselves into the way we carry out
through conversation as I mentioned in
the previous objections the Turing test
really focuses on the external
appearances not the internal processes
so like I said from an engineering
perspective I think it's very difficult
to create a test for internal processes
for some of these concepts that we have
a very poor understanding of like
intelligence like consciousness I think
the best we can do right now in terms of
quantifying and having a measure of
something we have to look at the
external performance of the system as
opposed to some properties of the
internal processes another challenge for
the Turing test as Scott Aaronson's
conversation with Eugene Goostman indicates is
that the skill of the interrogator is
really important here that's both on the
just the conversational skill of how
much you can stretch and challenge the
conversation with
and two on the human side of it the
ability of the interrogator to identify
the humanness of both the human and the
machine so the ability to have a
conversation that challenges the bot and
the ability to make the actual
identification of human or machine
those are both skills that are essential
to the Turing test also to me is really
interesting the anthropomorphization
of human to inanimate object interaction
I think is really fascinating and it's
an open question whether in some
construction of the Turing test whether
anthropomorphism is leveraged to
convince the human whether that's
cheating the Turing test or in fact
that's an essential element to
convincing us humans that something is
intelligent perhaps as a starting point
we have to anthropomorphize something
before we allow it to be intelligent in our
subjective judgment of its intelligence
and finally another limitation of the
Turing test that could be narrowly
stated as why do we expect a bot to talk
what is it why what if it doesn't feel
like talking does it still fail I think
a more general way to phrase that is why
do we judge the performance of a system
on such a narrow window of time I think
as I mentioned before this there could
be something interesting on expanding
the window of time over which we analyze
the intelligence of the system looking
not just at the average performance but
the growth of its performance as it
interacts with you as the individual I
think one key aspect of intelligence is
a social aspect and a social connection
I think in part may require getting to
know the person and there's something to
rethink in the Turing test that relies
on us building a relationship with a
person as part of the test so you could
think of it as kind of the ex machina
Turing test where they
spent a series of conversations together
several days together all those kinds of
things that feels like an interesting
extension of the Turing test which could
reveal the significant limitation of the
current construction of the Turing test
which is a limited window of time one
time at the end interrogator judgment of
whether it's human or machine now my
view overall on the Turing test is that
yes something like the Turing test as
originally constructed so the natural
language conversation is close to the
ultimate test of intelligence and
moreover this is where I disagree I
think I disagree with Francois Chollet
and other world-class researchers in the
area Stuart Russell and so on that
I think the Turing test is not a
distraction for us to think about it
doesn't pull us away from actually
making progress in the field I think it
keeps us honest I think truly analyzing
where we stand in natural language
conversation will help us understand how
far away we are and more than that I
think there should be active research on
this field I think the Loebner Prize
type of formulations the Alexa Prize
formulations should be more popular than
they are and I think researchers should
take them very seriously now that
doesn't mean that the work on
the ARC benchmark with the IQ test type
of intelligence tests is not also going
to be fruitful potentially very fruitful
but I think ultimately the real test
of human level intelligence will occur
in something like the construction of
the Turing test with natural language
open domain conversation that results in
deep meaningful connection between human
and machine zooming out a little bit I
think in general I think AI researchers
don't like and try to avoid the
messiness of human beings as is captured
by the human robot interaction field and
set of problems I think
more than just embracing the Turing test
I think we should embrace the messiness
of the human being in all the different
domains of computer vision of natural
language of robotics autonomous vehicles
I've been a longtime advocate that semi
autonomous vehicles are here to stay for
a long time we're going to have to
figure out the human robot interaction
problem and for that we have to embrace
perceiving everything about the human
inside the car perceiving everything
about the humans outside the car as I
mentioned this presentation of the paper
is actually part of our paper reading
club focused on artificial intelligence
where we discuss a couple of times a
week on the discord server called Lex
plus AI podcast you're welcome to
join we have an amazing community of
brilliant people there that discuss all
kinds of topics in artificial
intelligence and beyond this particular
illustration that I just love is from
Will Scobie who's an illustrator from
the United Kingdom who is part of this
discord community so he contributed it
and in general aside from the amazing
conversations I encourage and hope to
see other members of the community
contribute art code visualizations
slides ideas for these kinds of videos
I'm really excited by the kind of
conversations I've seen if you're
watching this video and want to join in
click on the Discord link in the
description on the slide
join the conversation new paper every
week it's fun just to give you a little
sense of the ideas behind this AI paper
reading Club like what the goals are so
what is it I think the goal is to take a
seminal paper in the field that doesn't
just focus in on the specific sort of
paragraph to paragraph section by
section analysis of what the paper's saying
but actually use the paper to discuss
the history the big-picture development
of the field within the context of that
paper now that could be philosophical
papers like the Turing test paper or it
could be very specific papers in the
field again physics mathematics computer
science and probably quite a bit of deep
learning so the hope is to prioritize
beautiful powerful impactful insights as
opposed to full coverage of all the
contents of the paper and the actual
meetings on Discord hopefully are
less one person presenting and more
discussion there's a lot of brilliant
people there civil so you can have 300
400 people on voice chat which is a
really intimate setting and yet people
aren't interrupting each other it's not
chaos it's quite an amazing community
the other goal I'd love to see is even
if we cover technical papers the goal is
for it to be accessible to everyone
so both high school students people
outside o

Summary


# Unpacking Machine Intelligence: An In-Depth Analysis of Alan Turing's Paper and the Evolution of the Turing Test

### Executive Summary
This video discusses Alan Turing's iconic 1950 paper *"Computing Machinery and Intelligence"*, which became the philosophical and technical foundation of modern Artificial Intelligence (AI). The discussion covers the original concept of the "Imitation Game", now known as the Turing Test, Turing's predictions about machine capabilities, and various philosophical and technical objections to the idea of thinking machines. The video also evaluates the test's relevance in the modern AI era, reviews alternative benchmarks such as ARC and Google Meena, and stresses the importance of natural language interaction and an understanding of human context in developing AGI (Artificial General Intelligence).

---

### Key Takeaways
*   **Historical Impact:** Turing's 1950 paper is regarded as the most influential work in the history of AI, inspiring developments from the early Turing machine to modern *deep learning*.
*   **The Imitation Game:** Turing turned the philosophical question "Can machines think?" into a concrete engineering challenge: can a machine fool a human into believing it is another human?
*   **Technological Evolution:** AI has moved from script-based *chatbots* (such as Mitsuku) toward *end-to-end* learning (such as Google Meena) that can grasp context and the particularities of a conversation.
*   **Philosophical Objections:** Many arguments against the possibility of thinking machines have been raised, from theological and mathematical ones (Gödel's theorem) to Searle's "Chinese Room" argument, which separates syntax (rules) from semantics (understanding).
*   **Alternative Benchmarks:** Besides the Turing Test, there are other evaluation methods such as the *Winograd Schema* (common-sense reasoning), the *Hutter Prize* (data compression), and *ARC* (Abstraction and Reasoning Corpus), which focuses on general reasoning ability.
*   **Future Relevance:** Despite the criticism, the Turing Test is still considered important for measuring an AI's ability to build human connection and empathy, not just logical intelligence.

---

### Detailed Breakdown

#### 1. Foundations: "Computing Machinery and Intelligence" and the Imitation Game
*   **Historical Context:** About 70 years ago, Alan Turing posed a revolutionary question: "Can machines think?" His paper became a footing for AI researchers.
*   **The Imitation Game (Turing Test):**
    *   **Setup:** A human interrogator communicates through written notes with two entities separated by a wall: one human and one machine.
    *   **Goal:** The interrogator must guess which is the machine and which is the human.
    *   **Value:** The test turns the abstract question of "thinking" into a measurable engineering standard.
*   **Turing's Prediction:** He predicted that by the year 2000, a machine with 100 MB of storage would be able to fool 30% of humans in a 5-minute conversation, and that the term "thinking machine" would no longer be considered contradictory.
*   **Real-World Implementation (Loebner Prize):** A competition running since 1991 with a cash prize. However, winners such as Mitsuku and Rose are mostly script-based *chatbots*, not *end-to-end* learning systems. Interest in the competition from major AI labs (such as Google DeepMind) has declined.

#### 2. The Evolution of Modern AI and Early Objections
*   **The Eugene Goostman Case (2014):** A bot fooled 33% of the judges by posing as a 13-year-old boy from Ukraine. It was criticized as a "smoke and mirrors" trick (misdirection) that exploited the language barrier rather than deep intelligence.
*   **Google Meena:** Represents the modern approach, using *end-to-end deep learning* (2.6 billion parameters).
    *   **Metrics:** Uses *Sensibleness* (makes sense in context) and *Specificity* (not generic or boring).
    *   **Performance:** Meena's score (79%) approaches the human score (86%) and far exceeds Mitsuku's (56%).
*   **Turing's Objections (1-4):**
    *   **Theological:** Only humans have souls. Turing's response: God is omnipotent and could give a soul to anything, including a machine.
    *   **"Head in the Sand":** The consequences of AI are too frightening, so it is better to ignore them. Response: fear does not change the fact that we have to think about it.
    *   **Mathematical:** The limits of computation (Gödel's incompleteness theorem). Turing argued that machines can still do remarkable things despite mathematical limits.

#### 3. Further Objections and the Chinese Room Argument
*   **Objection 5 (Negative Nancy):** The claim that machines cannot do X (such as enjoy food, make jokes, or create art). Response: this is only an opinion without scientific basis that rules out future possibilities.
*   **Objection 6 (Ada Lovelace):** Machines only do what they are told (they cannot surprise us). Response: complex systems often behave in ways their creators did not anticipate, because the input-output relationship is no longer fully understood.
*   **Objection 8 (Free Will):** Machines are deterministic, while the human mind requires free will. Response: humans may also be a very complex set of rules that *feels* like free will.
*   **The Chinese Room Argument (John Searle, 1980):** Combines the objections about consciousness, Lovelace, and free will. Searle argues that if a person inside a room follows rules to manipulate Chinese symbols without understanding them, the system "passes" the test but does not truly "understand". The conclusion: syntax (computation) is not the same as semantics (mental understanding).

#### 4. Alternatives to and Variations of the Turing Test
Because of the Turing Test's limitations, various alternatives have emerged:
*   **Total Turing Test (1989):** Adds perception (computer vision) and manipulation (robotics) to the conversation.
*   **Lovelace Test & 2.0:** Requires the machine to do something surprising that its creator cannot explain, focusing on creativity.
*   **Truly Total Turing Test:** Judges a collection of agents over a long period of evolution (not a single session), looking at the learning journey (for example, AlphaZero).
*   **Winograd Schema Challenge:** A test based on ambiguous sentences that require common-sense reasoning to resolve (for example: "The trophy doesn't fit into the brown suitcase because *it* is too small/large").
*   **Alexa Prize (Amazon):** A voice conversation competition "in the wild" whose success metric is based on conversation duration (20+ minutes).
*   **Hutter Prize:** A mathematical approach in which intelligence is correlated with data compression ability (the more efficiently you compress Wikipedia, the more intelligent you are).
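The compression idea behind the Hutter Prize can be illustrated with a small calculation. The video quotes a compression factor of 8.58, taking roughly one gigabyte down to roughly 117 megabytes. The sketch below checks that arithmetic and measures a compression factor on a toy input. It is only an illustration: `zlib` is a general-purpose stand-in and not the compressor used by prize entries, the function names are hypothetical, and one gigabyte is assumed to mean 1000 megabytes.

```python
import zlib


def compression_factor(data: bytes, level: int = 9) -> float:
    """Original size divided by compressed size (higher means better compression)."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)


def compressed_size_mb(original_mb: float, factor: float) -> float:
    """Size after compression, given the original size and a compression factor."""
    return original_mb / factor


# Toy input: highly repetitive text, so it compresses far better than real prose.
sample = b"the trophy does not fit into the brown suitcase " * 200

if __name__ == "__main__":
    print(f"toy compression factor: {compression_factor(sample):.1f}")
    # Assumption: 1 GB is treated as 1000 MB.
    print(f"1 GB at factor 8.58: {compressed_size_mb(1000, 8.58):.1f} MB")
```

Because the metric is a file size, entries can be ranked without human judges, which is the property the video highlights.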

#### 5. Abstraction and Reasoning Corpus (ARC) and Criticism of the Turing Test
*   **Compression vs. the Turing Test:** The compression challenge (Hutter Prize) is mathematically sound, but it is hard to define a "bar" of impressiveness the way the more intuitive Turing Test does.
*   **ARC (Francois Chollet):**
    *   A competition based on a grid world (color patterns) that stays close to psychometric IQ tests.
    *   Focuses on reasoning about basic concepts (such as object persistence and spatial contiguity).
    *   Task: complete the output pattern based on the preceding input-output pairs.
*   **Criticism of the Turing Test:**
    *   Does the test measure intelligence or "humanness"? Intelligence is rational thinking, whereas humanness includes irrationality and emotion.
    *   The test judges only external appearance, not internal processes (consciousness).
    *   The narrow window of time makes it hard to judge the depth of a relationship.
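Returning to the ARC task format summarized in this section, the symmetry-completion example from the video can be sketched as follows. This is an illustrative toy solver for one kind of task only, not a general ARC solver and not code from the paper. The names `complete_symmetry` and `solves`, the use of 0 to mark a blanked-out cell, and the assumption that the grid is mirror-symmetric both left-right and top-bottom are all hypothetical choices for the example.

```python
MISSING = 0  # assumption: 0 marks a cell that was blanked out


def complete_symmetry(grid):
    """Fill missing cells from their mirror images.

    Assumes the intended pattern is symmetric left-right and top-bottom.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if out[r][c] != MISSING:
                continue
            mirrors = [
                (r, cols - 1 - c),             # left-right mirror
                (rows - 1 - r, c),             # top-bottom mirror
                (rows - 1 - r, cols - 1 - c),  # both at once
            ]
            for mr, mc in mirrors:
                if grid[mr][mc] != MISSING:
                    out[r][c] = grid[mr][mc]
                    break
    return out


def solves(solver, pairs):
    """Check a candidate solver against input-output pairs by exact grid match."""
    return all(solver(inp) == expected for inp, expected in pairs)


# Hypothetical demonstration pair: a symmetric grid with two cells blanked out.
full = [
    [1, 2, 2, 1],
    [3, 4, 4, 3],
    [3, 4, 4, 3],
    [1, 2, 2, 1],
]
damaged = [
    [0, 2, 2, 1],
    [3, 0, 4, 3],
    [3, 4, 4, 3],
    [1, 2, 2, 1],
]

if __name__ == "__main__":
    print(complete_symmetry(damaged) == full)
```

The hard part of a real task is not applying the rule but discovering it from a handful of demonstration pairs; here the symmetry rule is simply hard-coded.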

#### 6. Conclusion: The Future of AI and the Community
*   **In Defense of the Turing Test:** The speaker disagrees with researchers such as Francois Chollet and Stuart Russell who regard the Turing Test as a distraction. The test is seen as keeping researchers honest and as a yardstick for progress in natural language conversation.
*   **The Core of Intelligence:** The test of intelligen


file updated 2026-02-13 13:24:47 UTC