Michael Kearns: Differential Privacy

vaFhEgLoUe0 • 2019-11-21

so is there hope for any kind of privacy
in a world where a few likes can
identify you so there is differential
privacy right what is differential privacy
differential privacy basically is a kind
of alternate much stronger notion of
privacy than these anonymization ideas
and you know it's a technical
definition but the spirit of it is
we compare two alternate worlds
okay so let's suppose I'm a researcher
and you know there's a
database of medical records and one of
them's yours and I want to use that
database of medical records to build a
predictive model for some disease so
based on people's symptoms and test
results and the like I want to you know
build a model predicting the
probability that people have the disease so
you know this is the type of scientific
research that we would like to be
allowed to continue and in differential
privacy you ask a very particular
counterfactual question
we basically compare two alternatives
one is when I do this I build this model
on the database of medical records
including your medical record and the
other one is where I do the same
exercise with the same database with
just your medical record removed so
basically you know it's two databases
one with n records in it and one with n
minus one records in it the n minus one
records are the same and the only one
that's missing in the second case is
your medical record so differential
privacy basically says that any harms
that might come to you from the analysis
in which your data was included are
essentially nearly identical to the
harms that would have come to you if the
same analysis had been done without
your medical record included so in other
words this doesn't say that bad things
cannot happen to you as a result of data
analysis it just says that these bad
things were going to happen to you
already
even if your data wasn't included and to
give a very concrete example right you
know we discussed at some
length the study that was done
in the 50s that
established the link between
smoking and lung cancer and we make the
point that like well if your data was
used in that analysis and you know the
world kind of knew that you were a
smoker because you know there was no
stigma associated with smoking before
those findings real harm might have
come to you as a result of that study
that your data was included in in
particular your insurer now might have a
higher posterior belief that you might
have lung cancer and raise your premiums
so you've suffered economic damage but
the point is that if the same
analysis had been done with all the
other n minus 1 medical records and just
yours missing the outcome would have
been the same your data wasn't
idiosyncratically crucial to
establishing the link between smoking
and lung cancer because the link between
smoking and lung cancer is like a fact
about the world that can be discovered
with any sufficiently large database of
medical records but that's a very low
value of harm yes so that's showing that
very little harm is done great but how
what is the mechanism of differential
privacy so that's the kind of beautiful
statement of it well what's the
mechanism by which privacy is preserved
yeah so it's it's basically by adding
noise to computations right so the basic
idea is that every differentially
private algorithm first of all or every
good differentially private algorithm every
useful one is a probabilistic algorithm
so on a given input if you
gave the algorithm the same input multiple
times it would give different
outputs each time from some distribution
and the way you achieve differential
privacy algorithmically is by kind of
carefully and tastefully adding noise to
a computation in the right places and
you know to give a very concrete example
if I want to compute the average of a
set of numbers right the non private way
of doing that is to take those numbers
and average them and release like a
numerically precise value for the average okay in
differential privacy you wouldn't do
that you would first compute that
average to numerical precision and
then you'd add some noise to it right
you'd add some kind of a zero mean you
know Gaussian or exponential noise to it
so that the actual value you output
right is not the exact mean but it'll be
close to the mean and the
noise that you add will sort of guarantee
that nobody can kind of reverse engineer
any particular value that went into the
average so noise is the savior how
many algorithms can be aided by
adding noise yeah so I'm a relatively
recent member of the differential
privacy community my co-author Aaron
Roth is you know really one of the
founders of the field and has done a
great deal of work and I've learned a
tremendous amount working with him on it
grown-up field already yeah by now
it's pretty mature but I must admit the
first time I saw the definition of
differential privacy my reaction was like
wow that is a clever definition and it's
really making very strong promises and
you know I first saw the
definition in much earlier days and my
first reaction was like well my worry
about this definition would be that it's
a great definition of privacy but that
it'll be so restrictive that we won't
really be able to use it like you know
we won't be able to compute many
things in a differentially private way
so that's one of the great
successes of the field I think is in
showing that the opposite is true and
that you know most things that we know
how to compute absent any privacy
considerations can be computed in a
differentially private way so for
example pretty much all of statistics
and machine learning can be done
differentially privately so pick your
favorite machine learning algorithm back
propagation in neural networks you know
CART for decision trees support vector
machines boosting you name it as well as
classic hypothesis testing and the like
in statistics
none of those algorithms are
differentially private in their original
form
all of them have modifications
that add noise to the
computation in different places in
different ways that achieve differential
privacy so this really means that to the
extent that you know we've become a you
know a scientific community very
dependent on the use of machine learning
and statistical modeling and data
analysis we really do have a path to
kind of provide privacy guarantees to
those methods and and so we can still
you know enjoy the benefits of kind of
the data science era while providing you
know rather robust privacy guarantees to
individuals

# Transcript Summary: Differential Privacy

### Overview
This video discusses **differential privacy**, a notion of data privacy that is much stronger than anonymization. It requires that the result of a data analysis be nearly the same whether or not any one individual's data is included, which protects individuals while preserving the scientific benefits of large-scale data analysis.

### Key Points
*   **A strong definition:** Differential privacy offers a stronger guarantee than anonymization by limiting how much any single person's record can influence the output of an analysis.
*   **The "two worlds" principle:** The definition compares the analysis in two scenarios: one run on all n records, and one run on the same database with your record removed (n-1 records).
*   **The guarantee about harm:** The goal is not to prevent every bad outcome of data analysis, but to ensure that any harm to you would have occurred *regardless of* whether your data was included.
*   **The noise mechanism:** Privacy is achieved by adding random noise to the computation, which prevents reverse engineering of the individual values that went into it.
*   **Broad applicability:** Most statistical and machine learning algorithms have differentially private variants.

### Detailed Notes

**1. The Basic Concept and the Limits of Anonymization**
In a world where a few "likes" can identify a person, traditional anonymization is no longer enough. Differential privacy is introduced as a much stronger notion. Its core requirement is that the output of a data analysis look almost the same whether or not your data is in the database. If including your record does not change the result significantly, then the analysis reveals almost nothing about you in particular.
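
The spirit described above has a standard formal statement, given here as a sketch. The symbols are not used in the talk and are assumptions of notation: $M$ is the randomized algorithm, $D$ and $D'$ are any two databases that differ only in one person's record, $S$ is any set of possible outputs, and $\varepsilon \ge 0$ is the privacy parameter (a smaller value means a stronger guarantee).

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S .
```

Because the roles of $D$ and $D'$ can be swapped, the inequality holds in both directions: every outcome is almost equally likely in the world with your record and the world without it.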

**2. Understanding "Harm" Through the Smoking Study**
Differential privacy does not guarantee that data analysis has no negative consequences for you. It guarantees that those consequences are not caused by your specific data.
*   *Example:* Suppose your data was used in the 1950s study that established the link between smoking and lung cancer, and your insurer, knowing you smoke, then raises your premiums. That is real economic harm.
*   However, the link between smoking and lung cancer is a fact about the world that can be discovered from any sufficiently large database of medical records, so leaving your record out would not have changed the study's result. In other words, the premium increase would have happened even without your participation.

**3. How It Works: Adding Noise**
To achieve this kind of privacy, the algorithm is made probabilistic: given the same input several times, it produces a different output each time, drawn from some distribution.
*   *Method:* When computing a statistic such as an average, the system does not release the exact value. It first computes the average and then adds zero-mean random noise, for example Gaussian or exponential noise.
*   *Result:* The released number is close to the true value but not exactly equal to it. This prevents an outsider from reverse engineering any particular value that went into the average.
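
The noisy average can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation, and it rests on assumptions not stated in the talk: the function names `laplace_noise` and `private_mean` are hypothetical, every value is clipped to a known range `[lower, upper]`, the number of records is treated as public, two databases count as neighbors when one record is changed, and the noise is Laplace (two-sided exponential) scaled to `(upper - lower) / (n * epsilon)`, which is the standard Laplace mechanism for a bounded mean.

```python
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample with the given scale.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Return the mean of `values` plus Laplace noise (illustrative sketch).

    Assumes each value belongs in [lower, upper]; out-of-range values are
    clipped so that one record can move the mean by at most
    (upper - lower) / n.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if epsilon <= 0 or upper <= lower:
        raise ValueError("need epsilon > 0 and upper > lower")
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)  # max effect of one record
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    ages = [34, 51, 29, 62, 45, 38, 57, 41]
    # Repeated calls on the same input give different outputs.
    print(private_mean(ages, lower=0, upper=100, epsilon=1.0))
    print(private_mean(ages, lower=0, upper=100, epsilon=1.0))
```

The noise shrinks as the number of records grows, so on a large database the released value stays close to the true mean. A production system would also need to guard against floating-point leakage in the noise sampler, which this sketch does not attempt.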

**4. Scope and Maturity of the Field**
The speaker, Michael Kearns, credits his co-author Aaron Roth as one of the founders of the field and describes it as pretty mature by now.
*   *Initial worry:* His first reaction to the definition was that it might be so restrictive that few things could be computed in a differentially private way.
*   *Reality:* The opposite turned out to be true. Most things we know how to compute without privacy considerations can also be computed in a differentially private way.
*   *Applications:* This covers pretty much all of statistics and machine learning, including backpropagation in neural networks, CART for decision trees, support vector machines, boosting, and classic hypothesis testing. None of these algorithms is differentially private in its original form, but each has modifications that add noise to the computation in different places and in different ways, so the scientific community can keep the benefits of data science while providing robust privacy guarantees to individuals.
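
As one hedged illustration of "adding noise in the right places" for gradient-based learning, the sketch below follows the general recipe used by DP-SGD-style training: clip each example's gradient to a maximum norm, sum the clipped gradients, and add Gaussian noise before averaging. The talk does not describe this procedure; the function names, the least-squares linear model, and the parameters `max_norm` and `noise_multiplier` are assumptions made for illustration. Turning `noise_multiplier` into an actual privacy guarantee requires a privacy accountant, which is not shown.

```python
import math
import random


def clip(vec, max_norm):
    """Scale `vec` down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(x * x for x in vec))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * factor for x in vec]


def noisy_gradient_step(weights, batch, lr, max_norm, noise_multiplier, rng):
    """One noisy gradient step for least-squares linear regression.

    `batch` is a list of (features, target) pairs. Each per-example
    gradient is clipped so no single record can dominate the update,
    then Gaussian noise proportional to `max_norm` is added to the sum.
    """
    total = [0.0] * len(weights)
    for features, target in batch:
        error = sum(w * x for w, x in zip(weights, features)) - target
        grad = [2.0 * error * x for x in features]  # per-example gradient
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    sigma = noise_multiplier * max_norm
    noisy_avg = [(t + rng.gauss(0.0, sigma)) / len(batch) for t in total]
    return [w - lr * g for w, g in zip(weights, noisy_avg)]


if __name__ == "__main__":
    rng = random.Random()
    data = [([1.0, x], 3.0 + 2.0 * x) for x in (0.0, 0.5, 1.0, 1.5)]
    w = [0.0, 0.0]
    for _ in range(200):
        w = noisy_gradient_step(w, data, lr=0.05, max_norm=1.0,
                                noise_multiplier=0.5, rng=rng)
    print(w)
```

Clipping bounds the influence of any one record on the update, and the noise is scaled to that bound, which is the same pattern as the noisy average applied inside the training loop.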

### Conclusion
Differential privacy balances the need for in-depth data analysis against the protection of individual privacy. By carefully adding noise to computations, we can extract valuable insights from data while ensuring that any harm to a contributor would have occurred even without their data. The approach has proven flexible enough to apply to a wide range of modern statistical and machine learning algorithms.

file updated 2026-02-13 13:23:38 UTC