In 2011, an article was published in the reputable Journal of Personality and Social Psychology. It was called "Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect," or, in other words, proof that people can see into the future. The paper reported on nine experiments. In one, participants were shown two curtains on a computer screen and asked to predict which one had an image behind it; the other just covered a blank wall. Once the participant made their selection, the computer randomly positioned an image behind one of the curtains, then the selected curtain was pulled back to show either the image or the blank wall. The images were randomly selected from one of three categories: neutral, negative, or erotic. If participants selected the curtain covering the image, this was considered a hit.

Now, with there being two curtains and the images positioned randomly behind one of them, you would expect the hit rate to be about 50 percent, and that is exactly what the researchers found, at least for negative and neutral images. For erotic images, however, the hit rate was 53 percent. Does that mean that we can see into the future? Is that slight deviation significant?

Well, to assess significance, scientists usually turn to p-values, a statistic that tells you how likely a result at least this extreme is if the null hypothesis is true. In this case, the null hypothesis would just be that people couldn't actually see into the future, and the 53 percent result was due to lucky guesses. For this study, the p-value was 0.01, meaning there was just a 1 percent chance of getting a hit rate of 53 percent or higher from simple luck. P-values less than 0.05 are generally considered significant and worthy of publication, but you might want to use a higher bar before you accept that humans can accurately perceive the future and, say, invite the study's author on your news program. But hey, it's your choice. After all, the 0.05 threshold was arbitrarily selected by Ronald Fisher in a book he published in 1925.

But this raises the question: how much of the published research literature is actually false? The intuitive answer seems to be 5 percent. I mean, if everyone is using p less than 0.05 as a cutoff for statistical significance, you would expect 5 of every 100 results to be false positives. But that unfortunately grossly underestimates the problem, and here's why.

Imagine you're a researcher in a field where there are a thousand hypotheses currently being investigated. Let's assume that 10 percent of them reflect true relationships and the rest are false, but no one, of course, knows which are which; that's the whole point of doing the research. Now, assuming the experiments are pretty well designed, they should correctly identify around, say, 80 of the 100 true relationships. This is known as a statistical power of 80 percent, so 20 results are false negatives; perhaps the sample size was too small or the measurements were not sensitive enough. Now consider that of those 900 false hypotheses, using a p-value of 0.05, 45 will be incorrectly considered true. The rest will be correctly identified as false, but most journals rarely publish null results; they make up just 10 to 30 percent of papers, depending on the field. Which means that the papers that eventually get published will include 80 true positive results, 45 false positive results, and maybe 20 true negative results. Nearly a third of published results will be wrong, even with the system working normally.

Things get even worse if studies are underpowered, and analysis shows they typically are, if there is a higher ratio of false to true hypotheses being tested, or if the researchers are biased. All of this was pointed out in a 2005 paper entitled "Why Most Published Research Findings Are False."
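To double-check that arithmetic, here's a minimal Python sketch of the same bookkeeping. The inputs (a thousand hypotheses, a 10 percent base rate of true effects, 80 percent power, a 0.05 false positive rate, and roughly 20 published null results) are the illustrative figures from the scenario above, not measured values.

```python
# Illustrative numbers from the scenario above, not real survey data.
n_hypotheses = 1000   # hypotheses under investigation
base_rate    = 0.10   # fraction that reflect true relationships
power        = 0.80   # chance a true effect is correctly detected
alpha        = 0.05   # false positive rate at p < 0.05

n_true  = n_hypotheses * base_rate         # 100 true hypotheses
n_false = n_hypotheses - n_true            # 900 false hypotheses

true_positives  = power * n_true           # 80 correct detections
false_negatives = n_true - true_positives  # 20 missed effects
false_positives = alpha * n_false          # 45 lucky false alarms

# Journals mostly publish positive results; suppose about 20 null
# results also make it into print.
published_nulls = 20
published = true_positives + false_positives + published_nulls

print(f"Wrong fraction of published results: "
      f"{false_positives / published:.0%}")   # -> 31%, nearly a third
```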
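And going back to the precognition result for a moment: a one-sided p-value for a hit rate is just the binomial probability of guessing at least that well by pure luck. The study pooled trials across participants in its own way, so the trial count below is purely hypothetical, chosen only to show how a 53 percent hit rate can land near p = 0.01.

```python
from math import comb

def p_at_least(k, n):
    """Exact one-sided p-value: P(at least k hits in n fair coin flips).
    Uses integer arithmetic throughout, so it stays exact for large n."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# Hypothetical numbers for illustration: 1,500 two-choice trials,
# of which 795 (53 percent) were hits.
print(f"p = {p_at_least(795, 1500):.3f}")   # -> roughly 0.01
```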
So recently, researchers in a number of fields have attempted to quantify the problem by replicating some prominent past results. The Reproducibility Project repeated 100 psychology studies but found only 36 percent had a statistically significant result the second time around, and the strength of the measured relationships was, on average, half that of the original studies. An attempted verification of 53 studies considered landmarks in the basic science of cancer managed to reproduce only six, even working closely with the original studies' authors. These results are even worse than I just calculated.

The reason for this is nicely illustrated by a 2015 study showing that eating a bar of chocolate every day can help you lose weight faster. In this case, the participants were randomly allocated to one of three treatment groups: one went on a low-carb diet, another went on the same low-carb diet plus a 1.5-ounce bar of chocolate per day, and the third group was the control, instructed just to maintain their regular eating habits. At the end of three weeks, the control group had neither lost nor gained weight, but both low-carb groups had lost an average of five pounds per person. The group that ate chocolate, however, lost weight 10 percent faster than the non-chocolate eaters. The finding was statistically significant, with a p-value less than 0.05. As you might expect, this news spread like wildfire: to the front page of Bild, the most widely circulated daily newspaper in Europe, then to the Daily Star, the Irish Examiner, The Huffington Post, and even Shape magazine.

Unfortunately, the whole thing had been faked. Kind of. I mean, the researchers did perform the experiment exactly as they described, but they intentionally designed it to increase the likelihood of false positives. The sample size was incredibly small, just five people per treatment group, and for each person, 18 different measurements were tracked, including weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, and so on. So if weight loss didn't show a significant difference, there were plenty of other factors that might have, and the headline could have been "chocolate lowers cholesterol" or "improves sleep quality" or something. The point is, a p-value is only really valid for a single measure. Once you're comparing a whole slew of variables, the probability that at least one of them gives you a false positive goes way up, and this is known as p-hacking.

Researchers can make a lot of decisions about their analysis that decrease the p-value. For example, let's say you analyze your data and find it nearly reaches statistical significance, so you decide to collect just a few more data points to be sure. Then, if the p-value drops below 0.05, you stop collecting data, confident that these additional data points could only have made the result more significant if there were really a true relationship there. But numerical simulations show that relationships can cross the significance threshold by adding more data points, even though a much larger sample would show that there really is no relationship. In fact, there are a great number of ways to increase the likelihood of significant results, like having two dependent variables, adding more observations, controlling for gender, or dropping one of three conditions. Combining all of these strategies together increases the likelihood of a false positive to over 60 percent, and that is using p less than 0.05.
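To put a number on that multiple-comparisons problem: if each of the chocolate study's 18 measures were tested independently at the 0.05 level (a simplifying assumption, since real measures are correlated), the chance of at least one false positive is already around 60 percent.

```python
alpha      = 0.05   # significance threshold per measure
n_measures = 18     # outcomes tracked in the chocolate study

# Probability that at least one measure crosses p < 0.05 by chance
# alone, assuming (illustratively) independent measures.
p_any_false_positive = 1 - (1 - alpha) ** n_measures
print(f"{p_any_false_positive:.0%}")   # -> about 60%
```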
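And the peeking strategy is easy to demonstrate with a small Monte Carlo simulation. In this sketch, both groups are drawn from the same distribution, so the true effect is zero and every "significant" result is a false positive; the starting sample size, step size, and cap are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def run_experiment(n_start=20, n_max=60, step=5, alpha=0.05):
    """One simulated study with optional stopping: test after every few
    new data points and stop as soon as p < alpha. Both groups come from
    the same distribution, so any 'significant' result is false."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        p = ttest_ind(a, b).pvalue
        if p < alpha:
            return True       # declared "significant" -- a false positive
        if len(a) >= n_max:
            return False      # gave up: correctly non-significant
        a.extend(rng.normal(size=step))   # collect a few more points
        b.extend(rng.normal(size=step))

n_sims = 5000
hits = sum(run_experiment() for _ in range(n_sims))
print(f"False positive rate with peeking: {hits / n_sims:.1%}")
```

Run as written, the false positive rate comes out well above the nominal 5 percent, even though every individual test used p < 0.05.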
Now, if you think this is just a problem for psychology, neuroscience, or medicine, consider the pentaquark, an exotic particle made up of five quarks, as opposed to the regular three for protons or neutrons. Particle physics employs particularly stringent requirements for statistical significance, referred to as five sigma, or one chance in 3.5 million of getting a false positive. But in 2002, a Japanese experiment found evidence for the theta-plus pentaquark, and in the two years that followed, 11 other independent experiments looked for and found evidence of that same pentaquark, with very high levels of statistical significance. From July 2003 to May 2004, a theoretical paper on pentaquarks was published, on average, every other day. But alas, it was a false discovery. Further experimental attempts to confirm the theta-plus pentaquark using greater statistical power failed to find any trace of its existence. The problem was that those first scientists weren't blind to the data: they knew how the numbers were generated and what answer they expected to get, and the way the data were cut and analyzed, or p-hacked, produced the false finding.
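For reference, that five-sigma threshold converts to a p-value like this (taking it as the one-sided tail of a normal distribution):

```python
from scipy.stats import norm

# One-sided tail probability of a five-sigma fluctuation on a standard
# normal distribution: the discovery threshold in particle physics.
p = norm.sf(5)
print(f"p = {p:.2e}")             # -> about 2.9e-07
print(f"about 1 in {1/p:,.0f}")   # -> about 1 in 3.5 million
```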
Now, most scientists aren't p-hacking maliciously. There are legitimate decisions to be made about how to collect, analyze, and report data, and these decisions impact the statistical significance of results. For example, 29 different research groups were given the same data and asked to determine if dark-skinned soccer players are more likely to be given red cards. Using identical data, some groups found there was no significant effect, while others concluded dark-skinned players were three times as likely to receive a red card. The point is that data doesn't speak for itself; it must be interpreted. Looking at those results, it seems that dark-skinned players are more likely to get red-carded, but certainly not three times as likely. Consensus helps in this case, but for most results, only one research group provides the analysis.

And therein lies the problem of incentives. Scientists have huge incentives to publish papers; in fact, their careers depend on it. As one scientist, Brian Nosek, puts it: "There is no cost to getting things wrong. The cost is not getting them published." Journals are far more likely to publish results that reach statistical significance, so if a method of data analysis results in a p-value less than 0.05, then you're likely to go with that method. Publication is also more likely if the result is novel and unexpected. This encourages researchers to investigate more and more unlikely hypotheses, which further decreases the ratio of true to spurious relationships that are tested.

Now, what about replication? Isn't science meant to self-correct by having other scientists replicate the findings of an initial discovery? In theory, yes, but in practice it's more complicated. Take the precognition study from the start of this video: three researchers attempted to replicate one of those experiments, and what did they find? Well, surprise, surprise, the hit rate they obtained was not significantly different from chance. When they tried to publish their findings in the same journal as the original paper, they were rejected. The reason? The journal refused to publish replication studies. So if you're a scientist, the successful strategy is clear: don't even attempt replication studies, because few journals will publish them, and there is a very good chance that your results won't be statistically significant anyway, in which case, instead of being able to convince colleagues of the lack of reproducibility of an effect, you will be accused of just not doing it right. A far better approach is to test novel and unexpected hypotheses and then p-hack your way to a statistically significant result.

Now, I don't want to be too cynical about this, because over the past ten years things have started changing for the better. Many scientists acknowledge the problems I've outlined and are starting to take steps to correct them. There have been more large-scale replication studies undertaken in the last ten years. Plus, there's a site, Retraction Watch, dedicated to publicizing papers that have been withdrawn. There are online repositories for unpublished negative results, and there is a move towards submitting hypotheses and methods for peer review before conducting experiments, with the guarantee that the research will be published regardless of results, so long as the procedure is followed. This eliminates publication bias, promotes higher-powered studies, and lessens the incentive for p-hacking.

The thing I find most striking about the reproducibility crisis in science is not the prevalence of incorrect information in published scientific journals. After all, getting to the truth we know is hard, and mathematically, not everything that is published can be correct. What gets me is the thought that even trying our best to figure out what's true, using our most sophisticated and rigorous mathematical tools, peer review, and standards of practice, we still get it wrong so often. So how frequently do we delude ourselves when we're not using the scientific method? As flawed as our science may be, it is far and away more reliable than any other way of knowing that we have.

This episode of Veritasium was supported in part by these fine people on Patreon and by Audible.com, the leading provider of audiobooks online, with hundreds of thousands of titles in all areas of literature, including fiction, non-fiction, and periodicals. Audible offers a free 30-day trial to anyone who watches this channel; just go to audible.com/veritasium so they know I sent you. A book I'd recommend is called The Invention of Nature by Andrea Wulf, which is a biography of Alexander von Humboldt, an adventurer and naturalist who actually inspired Darwin to board the Beagle. You can download that book, or any other of your choosing, for a one-month free trial at audible.com/veritasium. So, as always, I want to thank Audible for supporting me, and I really want to thank you for watching.