PRIVACY PRESERVING DATA PUBLISHING
WITH MULTIPLE SENSITIVE ATTRIBUTES
Computer Science and Engineering, Ph.D. Dissertation, 2012
Assoc. Prof. Yücel Saygın (Thesis Supervisor), Assoc. Prof. Albert Levi, Assoc. Prof. Mehmet Keskinöz, Asst. Prof. Ali İnan (Işık Üniversitesi), Asst. Prof. Mehmet Ercan Nergiz (Zirve Üniversitesi)
Date & Time: August 2nd, 2012 - 13:00
Place: FENS G029
Keywords: Privacy, Multiple Sensitive Attributes, Data mining, Probabilistic Algorithms.
Data mining is the process of extracting hidden predictive information from large databases. It has great potential to help governments, researchers, and companies focus on the most significant information in their data warehouses. High-quality data and effective data publishing are needed for data mining to have a high impact. However, there is a clear need to preserve individual privacy in the released data. Privacy-preserving data publishing is the research area concerned with eliminating privacy threats while still providing useful information in the released data. Datasets typically include many sensitive attributes and may contain static or dynamic data. Dynamic datasets may need to be published as multiple updated releases with different time stamps.
As a concrete example, public opinions contain highly sensitive information about individuals and may reflect a person's perspective, understanding, particular feelings, way of life, and desires. On the one hand, public opinion is often collected through a central server, which keeps a user profile for each participant and needs to publish this data so that researchers can analyze it in depth. On the other hand, this raises new privacy concerns and can put users' privacy at risk. A user's opinion is sensitive information and must be protected both before and after data publishing. Each participant typically expresses opinions on only a few issues, while the total number of issues is huge. We therefore treat the data as having multiple sensitive attributes in order to develop an efficient model. Furthermore, because opinions are gathered and published periodically, correlations between sensitive attributes in different releases may occur. Thus the anonymization technique must take into account previous releases as well as the dependencies between released issues.
This thesis identifies a new privacy problem concerning public opinions. In addition, it presents two probabilistic anonymization algorithms, based on the concepts of k-anonymity and ℓ-diversity, that address both publishing datasets with multiple sensitive attributes and publishing dynamic datasets. The proposed algorithms provide a heuristic solution for multidimensional quasi-identifiers and multidimensional sensitive attributes using a probabilistic ℓ-diverse definition. Experimental results show that these algorithms clearly outperform existing algorithms in terms of anonymization accuracy.
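For readers unfamiliar with the two underlying concepts, the following minimal Python sketch checks a small table for k-anonymity and for the simplest (distinct) form of ℓ-diversity applied separately to each sensitive attribute. It is not the thesis's probabilistic algorithms or its probabilistic ℓ-diverse definition; the function names, column names, and sample records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_distinct_l_diverse(records, quasi_identifiers, sensitive_attributes, l):
    """True if, in every equivalence class, each sensitive attribute
    takes at least l distinct values (distinct l-diversity, checked
    per attribute; an assumption made for this sketch)."""
    groups = equivalence_classes(records, quasi_identifiers)
    for group in groups.values():
        for attr in sensitive_attributes:
            if len({rec[attr] for rec in group}) < l:
                return False
    return True


# Hypothetical generalized release: two quasi-identifiers and
# two sensitive opinion attributes.
records = [
    {"age": "20-30", "zip": "341**", "opinion_tax": "for", "opinion_health": "against"},
    {"age": "20-30", "zip": "341**", "opinion_tax": "against", "opinion_health": "for"},
    {"age": "30-40", "zip": "342**", "opinion_tax": "for", "opinion_health": "for"},
    {"age": "30-40", "zip": "342**", "opinion_tax": "against", "opinion_health": "for"},
]
qi = ["age", "zip"]

print(is_k_anonymous(records, qi, k=2))
print(is_distinct_l_diverse(records, qi, ["opinion_tax"], l=2))
print(is_distinct_l_diverse(records, qi, ["opinion_tax", "opinion_health"], l=2))
```

In this sample the table is 2-anonymous and 2-diverse with respect to opinion_tax alone, but the check fails once opinion_health is included, because the second group holds a single value for that attribute. This shows why a release that protects one sensitive attribute may still leak another.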