Anonymizing Sequential
Releases
Ke Wang
Benjamin C. M. Fung
Simon Fraser University
Simon Fraser University
[email protected]
[email protected]
ACM SIGKDD 2006
Motivation: Sequential Releases
• Previous works address single release only.
• Data are typically released sequentially in
multiple versions.
– New information become available.
– A tailored view for each data sharing purpose.
– Separate releases for sensitive and identifying
information.
2
T1: Current Release
T2: Previous Release
Pid Name
Job
Class
Pid Job
Disease
1
Alice
Banker
c1
1
Banker
Cancer
2
Alice
Banker
c1
2
Banker
Cancer
3
Bob
Clerk
c2
3
Clerk
HIV
4
Bob
Driver
c3
4
Driver
Cancer
5
Cathy
Engineer
c4
5
Engineer
HIV
The join on T1.Job = T2.Job
Pid
Name
Job
Disease
Class
1
Alice
Banker
Cancer
c1
2
Alice
Banker
Cancer
c1
3
Bob
Clerk
HIV
c2
4
Bob
Driver
Cancer
c3
5
Cathy
Engineer
HIV
c4
-
Alice
Banker
Cancer
c1
-
Alice
Banker
Cancer
c1
Do not want Name to
be linked to Disease
in the join of the two
releases.
3
T1: Current Release
T2: Previous Release
Pid Name
Job
Class
Pid Job
Disease
1
Alice
Banker
c1
1
Banker
Cancer
2
Alice
Banker
c1
2
Banker
Cancer
3
Bob
Clerk
c2
3
Clerk
HIV
4
Bob
Driver
c3
4
Driver
Cancer
5
Cathy
Engineer
c4
5
Engineer
HIV
The join on T1.Job = T2.Job
Pid
Name
Job
Disease
Class
1
Alice
Banker
Cancer
c1
2
Alice
Banker
Cancer
c1
3
Bob
Clerk
HIV
c2
4
Bob
Driver
Cancer
c3
5
Cathy
Engineer
HIV
c4
-
Alice
Banker
Cancer
c1
-
Alice
Banker
Cancer
c1
join sharpens
identification:
{Bob, HIV} has
groups size 1.
4
T1: Current Release
T2: Previous Release
Pid Name
Job
Class
Pid Job
Disease
1
Alice
Banker
c1
1
Banker
Cancer
2
Alice
Banker
c1
2
Banker
Cancer
3
Bob
Clerk
c2
3
Clerk
HIV
4
Bob
Driver
c3
4
Driver
Cancer
5
Cathy
Engineer
c4
5
Engineer
HIV
The join on T1.Job = T2.Job
Pid
Name
Job
Disease
Class
1
Alice
Banker
Cancer
c1
2
Alice
Banker
Cancer
c1
3
Bob
Clerk
HIV
c2
4
Bob
Driver
Cancer
c3
5
Cathy
Engineer
HIV
c4
-
Alice
Banker
Cancer
c1
-
Alice
Banker
Cancer
c1
join weakens
identification:
{Alice,
Cancer} has
groups size 4.
lossy join: combat
join attack.
5
T1: Current Release
T2: Previous Release
Pid Name
Job
Class
Pid Job
Disease
1
Alice
Banker
c1
1
Banker
Cancer
2
Alice
Banker
c1
2
Banker
Cancer
3
Bob
Clerk
c2
3
Clerk
HIV
4
Bob
Driver
c3
4
Driver
Cancer
5
Cathy
Engineer
c4
5
Engineer
HIV
The join on T1.Job = T2.Job
Pid
Name
Job
Disease
Class
1
Alice
Banker
Cancer
c1
2
Alice
Banker
Cancer
c1
3
Bob
Clerk
HIV
c2
4
Bob
Driver
Cancer
c3
5
Cathy
Engineer
HIV
c4
-
Alice
Banker
Cancer
c1
-
Alice
Banker
Cancer
c1
join enables
inferences
across tables:
AliceCancer
has 100%
confidence.
6
Related Work
• k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06]
– Quasi-identifier (QID): e.g., {Job, birth date,
Zip}.
Explicit ID
(removed)
QID (anonymized
to groups of size
≥ k)
Sensitive
attributes
– The database is made anonymous to its local
QID.
– In sequential releases, the database must be
made anonymous to a global QID spanning the
join of all releases thus far.
7
Related Work
• l-diversity [MGK06]
– Sensitive values are “well-represented” in
each QID group (measured by entropy).
• Confidence limiting [WFY05, WFY06]:
qid  s, confidence < h
where qid is a QID group, s is a sensitive
value.
8
Related Work
• View releases
– T1 and T2 are two views in one release, both
can be modified before the release.
– [MW04, DP05] measures information
disclosure of a view set wrt a secret view.
– [YWJ05, KG06] detects privacy violation by a
view set over a base table.
– Detect, not eliminate, violations.
9
Sequential Release
• Sequential release:
– Current release T1. Previous release T2.
– T1 was unknown when T2 was released.
– T2 cannot be modified when T1 is released.
• Solution #1: k-anonymize all attributes in T1 excessive distortion.
• Solution #2: generalize T1 based on T2 monotonically distort the later release.
• Solution #3: anonymize a complete cohort of all
potential releases at one time – must predict all
future releases
10
Intuition of Our Approach
• A lossy join hides the true join relationship
to cripple a global QID.
• Generalize T1 so that the join with T2
becomes lossy enough to disorient the
attacker.
• Two general privacy notions: (X,Y)anonymity and (X,Y)-linkability, where X
and Y are sets of attributes.
11
(X,Y)-Privacy
• k-anonymity: # of distinct records for each
QID group ≥ k.
• (X,Y)-anonymity: # of distinct Y values for
each X group ≥ k.
• (X,Y)-linkability: the maximum confidence
of having a Y value given having a X value
is ≤ k.
• Generalize k-anonymity [SS98] and
confidence limiting [WFY05, WFY06].
12
Example: (X,Y)-Anonymity
Pid
1
1
1
2
2
2
2
Job
Banker
Banker
Banker
Clerk
Clerk
Clerk
Clerk
Zip
123
123
123
456
456
456
456
PoB
Canada
Canada
Canada
Japan
Japan
Japan
Japan
Test
HIV
Diabetes
Eye
HIV
Diabetes
Eye
Heart
• k-anonymity uses # of records as
anonymity, fails to ensure k distinct
patients.
13
Example: (X,Y)-Anonymity
• Anonymity wrt patients (instead of
records):
– X = {Job, Zip, PoB} and Y = Pid
– Each X group is linked to at least k distinct
values on Pid.
• Anonymity wrt tests:
– X = {Job, Zip, PoB} and Y = Test
– Each X group is linked to at least k distinct
tests.
14
Example: (X,Y)-Linkability
Pid
1
2
3
4
5
6
Job
Banker
Banker
Banker
Banker
Clerk
Clerk
Zip
123
123
123
123
456
456
PoB
Canada
Canada
Canada
Canada
Japan
Japan
Test
HIV
HIV
HIV
Diabetes
Diabetes
Diabetes
• {Banker,123,Canada}  HIV (75% confidence).
• With Y = Test, (X,Y)-linkability states that no
test can be inferred from a X group with
15
confidence > a given threshold.
Problem Statement
• The data holder made previous release T2 and
now makes current release T1, where T2 and T1
are projections of the same underlying table.
• Want to ensure (X,Y)-privacy on the join of T1
and T2, where X and Y are attribute sets on the
join.
• Sequential anonymization: generalize T1 on X ∩
att(T1) so that the join satisfies (X,Y)-privacy and
T1 remains as useful as possible.
16
Generalization / Specialization
• Each generalization replaces all child
values with the parent value.
– A cut contains exactly one
value on every root-to-leaf
Professional
path.
Engineer
Lawyer
Job
ANY
Admin
Banker
Clerk
• Alternatively, each specialization replaces
the parent value with a consistent child
value in the record.
17
Match Function
• The attacker applies prior knowledge to
match the records in T1 and T2.
• So, the data holder applies such prior
knowledge in sequential anonymization
• We consider prior knowledge:
– schema information of T1 and T2.
– taxonomies for attributes.
– the inclusion-exclusion principle.
18
Match Function
• Let t1  T1 and t2  T2.
• Inclusion Predicate: t1.A matches t2.A if
they are on the same generalization path
for attribute A.
– e.g., Male matches Single Male.
• Exclusion Predicate: t1.A matches t2.B
only if they are not semantically
inconsistent (based on common sense).
– To exclude impossible matches.
– e.g., Male and Pregnant are semantically
inconsistent, so are Married Male and 6
Month Pregnant.
19
Algorithm Overview
Top-Down Specialization
Input: T1, T2, (X,Y)-privacy, a taxonomy tree for each
attribute in X1=X ∩ att(T1).
Output: a generalized T1 satisfying the privacy requirement.
1.
2.
3.
4.
5.
6.
7.
generalize every value of Aj to ANYj where Aj  X1;
while there is a valid candidate in ỤCutj do
find the winner w of highest Score(w) from ỤCutj;
specialize w on T1 and remove w from ỤCutj;
update Score(v) and the valid status for all v in ỤCutj;
end while
output the generalized T1 and ỤCutj;
20
Anti-Monotone Privacy
• Theorem 1: On a single table, (X,Y)privacy is anti-monotone wrt specialization
on X: if violated, remains violated after a
specialization.
• On the join of T1 and T2, (X,Y)-privacy is
not anti-monotone wrt specialization of T1.
– Specializing T1 may create dangling records,
e.g., by specializing “CA” into “LA” and “San
Francisco”, “LA” records in T1 no longer
match “San Francisco” records in T2.
21
Anti-Monotone Privacy
• Theorem 2: Assume that T1 and T2 are
projections of the same underlying table,
(X,Y)-privacy on the join of T1 and T2 is
anti-monotone wrt specialization of T1 on
X ∩ att(T1).
22
Score Metric
• Each specialization gains some information and
loses some privacy. We maximize gain per loss
• InfoGain(v) is measured on T1.
• PrivLoss(v) is measured on the join of T1 and T2.
23
Challenges
Each specialization affects the matching of
join, Score(v), and privacy checking.
•
•
rejoining T1 and T2 for each specialization is
too expensive.
Materializing the join is impractical because
a lossy join can be very large.
Our solution: Incrementally maintains some
count statistics without executing the join
–
extension of Top-Down Specialization
[FWY05][WFY05]
24
Empirical Study
• The Adult data set. 45222 records.
Categorical attributes only.
25
Schema for T1 and T2
• T1 contains the Class Income level
Department Attribute
# of
Leaves
Taxation
Education (E)
16
(T1)
Occupation (O)
14
Work-class (W)
8
Common
Marital-status (M) 7
(T1 & T2)
Relationship (Ra) 6
Sex (S)
2
Immigration Native-country (Nc) 40
(T2)
Race (Ra)
5
# of
Levels
5
3
5
4
3
2
5
3
Empirical Study
• Classification metric
– Classification error on the generalized testing set
of T1.
• Distortion metric [SS98]
– 1 unit of distortion for generalization of each
value in each record.
– Normalized by the number of records.
27
(X,Y)-Anonymity
• TopN attributes: most important for classification.
• Join attributes are Top3 attributes.
• X contains
– TopN attributes in T1 (to ensure that the
generalization is performed on important
attributes),
– all join attributes,
– all attributes in T2 (to ensure X is global).
28
Distortion of (X,Y)-anonymity
• Ki denotes the key in Ti.
• XYD: our method with Y = K1.
• KAD: k-anonymization on QID=att(T1).
29
Classification error of (X,Y)-anonymity
•
•
•
•
•
XYE: our method with Y = K1.
XYE(row): our method with Y={K1,K2}.
BLE: the unmodified data.
KAE: k-anonymization on QID=att(T1).
RJE: removing all join attributes from T1.
30
(X,Y)-Linkability
• Y contains TopN attributes.
– If not important, simply remove them.
• X contains the rest of the attributes in T1 and
T2.
• Focus on classification error because no
previous work studies distortion for (X,Y)linkability.
31
Classification error of (X,Y)-linkability
•
•
•
•
XYE: our method with Y = TopN.
BLE: the unmodified data.
RJE: removing all join attributes from T1.
RSE: removing all attributes in Y from T1.
32
Scalability
(X,Y)-anonymity (k=40)
(X,Y)-linkability (k=90%)
33
Conclusion
• Previous k-anonymization focused on a single
release of data.
• Studied the sequential anonymization problem
when data are released sequentially and a global
QID may span several releases.
• Introduced lossy join to hide the join relationship
and weaken the global QID.
• Addressed challenges due to large size of lossy
join.
• Extendable to more than two releases T2,…,Tp.
34
References
[BA05] R. Bayardo and R. Agrawal. Data privacy
through optimal k-anonymization. In IEEE ICDE,
pages 217.228, 2005.
[DP05] A. Deutsch and Y. Papakonstantinou.
Privacy in database publishing. In ICDT, 2005.
[FWY05] B. C. M. Fung, K. Wang, and P. S. Yu.
Top-down specialization for information and
privacy preservation. In IEEE ICDE, pages
205.216, April 2005.
[KG06] D. Kifer and J. Gehrke. Injecting utility into
anonymized datasets. In ACM SIGMOD,
Chicago, IL, June 2006.
35
References
[LDR05] K. LeFevre, D. J. DeWitt, and R.
Ramakrishnan. Incognito: Efcient full-domain kanonymity. In ACM SIGMOD, 2005.
[MGK06] A. Machanavajjhala, J. Gehrke, and D.
Kifer. l-diversity: Privacy beyond k-anonymity. In
IEEE ICDE, 2006.
[MW04] A. Meyerson and R. Williams. On the
complexity of optimal k-anonymity. In PODS,
2004.
[SS98] P. Samarati and L. Sweeney. Protecting
privacy when disclosing information: k-anonymity
and its enforcement through generalization and
suppression. In IEEE Symposium on Research in
Security and Privacy, May 1998.
36
References
[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu.
Template-based privacy preservation in
classification problems. In IEEE ICDM, pages
466.473, November 2005.
[WFY06] K. Wang, B. C. M. Fung, and P. S. Yu.
Handicapping attacker's condence: An alternative
to k-anonymization. Knowledge and Information
Systems: An International Journal, 2006.
[WYC04] K. Wang, P. S. Yu, and S. Chakraborty.
Bottom-up generalization: A data mining solution
to privacy protection. In IEEE ICDM, November
37
2004.
References
[WLFW06] R. C. W. Wong, J. Li., A. W. C. Fu, and
K. Wang. (,k)-anonymity: An enhanced kanonymity model for privacy preserving data
publishing. In ACM SIGKDD, 2006.
[YWJ05] C. Yao, X. S. Wang, and S. Jajodia.
Checking for k-anonymity violation by views. In
VLDB, 2005.
38
Descargar

Anonymizing Sequential Releases