By: Bart Baesens, Seppe vanden Broucke
This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience”, where we answer a data science or analytics question of 140 characters maximum. Want to submit your own question? Just tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.
You asked: Can you highlight some important privacy issues that come with big data and analytics and discuss ways to address them?
Our answer:
The introduction of new technology, such as big data and analytics, brings new privacy concerns. Privacy issues can arise in two ways. Firstly, data about individuals can be collected without these individuals being aware of it. Secondly, people may be aware that data is being collected about them, but have no say in how that data is used. Furthermore, big data and analytics raise additional privacy concerns compared to simple data collection and data retrieval from databases.
Big data and analytics entail the use of massive amounts of data, possibly combined from several sources, including the internet, to mine for hidden patterns. Hence, this technology allows for the discovery of previously unknown relationships that neither the customer nor the company could have anticipated. Think about an example where three independent pieces of information about a certain customer lead to the customer being classified as a long-term credit risk, whereas none of the individual pieces of information would ever have led to this conclusion on its own. It is exactly this kind of discovery of hidden patterns that forms an additional threat to a customer’s privacy. As the example illustrates, analytics is more than just data collection and information retrieval from vast databases. This is recognized by the definitions used in several government reports. In the August 2006 Survey of DHS Data Mining Activities, the Department of Homeland Security (DHS) Office of the Inspector General (OIG) defined data mining, a term often used interchangeably with analytics, as:
“… the process of knowledge discovery, predictive modeling, and analytics. Traditionally, this involves the discovery of patterns and relationships from structured databases of historical occurrences.”
Several other definitions have been given, and generally these definitions imply the discovery of hidden patterns and the possibility for predictions. Thus, simply summarizing historical data is generally not considered analytics.
Several regulations are in place to protect an individual’s privacy. The Fair Information Practice Principles (FIPPs), stated in a 1973 report of the U.S. Department of Health, Education and Welfare, served as the main inspiration for the U.S. Privacy Act of 1974. In 1980, the Organisation for Economic Co-operation and Development (OECD) defined its “Guidelines on the Protection of Privacy and Transborder Flows of Personal Data”. The guidelines comprise eight principles: the collection limitation principle, the data quality principle, the purpose specification principle, the use limitation principle, the security safeguards principle, the openness principle, the individual participation principle and the accountability principle. These guidelines are widely accepted, have been endorsed by the U.S. Department of Commerce, and are the foundation of privacy laws in many other countries (e.g. Australia and Belgium).
Given the increasing importance and awareness of privacy in the context of analytics, more and more research is being conducted on privacy-preserving data mining algorithms. Consider an example where explicit identifiers are removed from a data set, but a combination of variables (e.g. age, zip code, gender) serves as a quasi-identifier (QID). This means that a record can be linked, by means of the QID, to the record owner in another data set. To preserve privacy, there should be several records in the data set with the same QID, so that no record can be singled out. There are several methods to anonymize data. Most of these methods remove information from the quasi-identifiers until the records are no longer individually identifiable, as illustrated in the table below.
Zip Code | Age | Gender | ⇒ | Zip Code | Age | Gender
83661 | 26 | M | ⇒ | 836** | 2* | M
83659 | 23 | M | ⇒ | 836** | 2* | M
83645 | 58 | F | ⇒ | 836** | 5* | F
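
The generalization step shown in the table can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not a production anonymization routine: the function names (`generalize`, `group_sizes`, `smallest_group`) and the choice to keep three zip code digits and the age decade are hypothetical and simply mirror the table. The size of the smallest group of records sharing a QID is what the literature commonly calls the k in k-anonymity.

```python
from collections import Counter

# Records from the table above: (zip code, age, gender).
records = [
    ("83661", 26, "M"),
    ("83659", 23, "M"),
    ("83645", 58, "F"),
]


def generalize(record, zip_digits=3):
    """Coarsen the quasi-identifier: keep a zip code prefix and the age decade.

    The masking levels are illustrative assumptions chosen to match the table.
    """
    zip_code, age, gender = record
    masked_zip = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    masked_age = f"{age // 10}*"
    return (masked_zip, masked_age, gender)


def group_sizes(rows):
    """Count how many records share each quasi-identifier value."""
    return Counter(rows)


def smallest_group(rows):
    """Size of the smallest QID group, i.e. the k in k-anonymity."""
    return min(group_sizes(rows).values())


generalized = [generalize(r) for r in records]

# Show each generalized QID together with the number of records sharing it.
for qid, count in group_sizes(generalized).items():
    print(qid, count)

print("smallest group before generalization:", smallest_group(records))
print("smallest group after generalization:", smallest_group(generalized))
```

On these three rows alone, the two male records end up sharing the same generalized QID, while the female record is still the only one in its group. In practice, one would keep generalizing (or suppress such outlying records) until the smallest group reaches the size deemed acceptable.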