Contributed by: Bart Baesens, Seppe vanden Broucke, Wilfried Lemahieu
This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!
In this column, we would like to elaborate on the concept of data security. It is based upon our upcoming book Principles of Database Management, The Practical Guide to Storing, Managing and Analyzing Big and Small Data, see www.pdbmbook.com for more details.
Although security is often related to privacy, the two are not synonyms. Data security can be defined as the set of policies and techniques that ensure the confidentiality, availability and integrity of data at all times. Data privacy, on the other hand, means that the parties accessing and using the data do so only in ways that comply with the agreed-upon purposes of data use for their role. These purposes can be expressed as part of a company’s policy, but are also subject to legislation. In this way, several aspects of security can be considered necessary instruments to guarantee data privacy.
More concretely, data security pertains to the following concerns:
- Guaranteeing data integrity: preventing data loss or data corruption as a consequence of malicious or accidental modification or deletion of data. Here, the replication and recovery facilities of the database management system (DBMS) play important roles. Replication means copying the updates in a source system in (near) real time to a target data store which serves as an exact replica. The replica can then serve as a fallback system in case something goes wrong with the source system. Recovery is the activity of ensuring that, whichever problem occurs (e.g., hard disk failure, application, operating system or DBMS crash, power outage), the database is afterwards returned to a consistent state without any data loss. Modern DBMSs have advanced transaction management facilities to ensure recovery at all times.
- Guaranteeing data availability: ensuring that the data is accessible to all authorized users and applications, even in the event of partial system malfunctions. Many techniques exist to safeguard data by means of backup and/or replication. Examples are tape backup, hard disk backup, electronic vaulting, replication and mirroring.
- Authentication and access control: access control refers to the tools and formats to express which users and applications have which type of access (read, add, modify, …) to which data. Relevant techniques here are SQL privileges and views. An SQL privilege corresponds to the right to use certain SQL statements such as SELECT, INSERT, DELETE, UPDATE, etc. on one or more database objects. Privileges can be granted or revoked. Views are part of the external data model. A view is defined by means of an SQL query and its content is generated upon invocation of the view by an application or by another query. In this way, it can be considered as a virtual table without physical data tailored towards the needs of one or more applications or users. An important condition for adequate access control is the availability of authentication techniques, which allow for unambiguously identifying the user or user category for which the access rights are to be established. The most widespread technique here is still the combination of a user id and password, although several other approaches are gaining ground, such as fingerprint readers or iris scanning.
- Guaranteeing confidentiality: this is the flipside of access control, guaranteeing that users and other parties cannot read or manipulate data to which they have no appropriate access rights. This is the data security concern most closely related to privacy. One possible technique here, especially in the context of analytics, is anonymization, which is the process of transforming sensitive data so that the exact value cannot be recovered by other parties. Another important tool is encryption, which renders data unreadable to unauthorized users who do not possess the appropriate key to decrypt the data back into a readable format.
- Auditing: especially in heavily regulated settings such as the banking and insurance sector, it is key to keep track of which users performed which actions on the data (and at what time). Most DBMSs automatically track these actions in a rudimentary fashion by means of the log file. Regulated settings require a much more advanced form of auditing, with extensive tracking and reporting facilities, maintaining a detailed inventory of all database accesses and data manipulations, including the users and user roles involved.
- Mitigating vulnerabilities: this class of concerns pertains to detecting and resolving shortcomings or downright bugs in applications, DBMSs or network and storage infrastructure that give malicious parties opportunities to circumvent security measures with respect to the aforementioned concerns. Examples here are wrongly configured network components or bugs in application software that provide loopholes to hackers. A very important concept in the context of DBMSs is avoiding SQL injection, in which an attacker injects malicious fragments into normal-looking SQL statements. The well-known three-layer database architecture consisting of an internal, logical and external layer is also instrumental to this purpose. By hiding implementation details from users and the outside world by means of logical and physical data independence, it becomes much harder to discover and exploit potential vulnerabilities.
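To make the SQL injection risk from the last concern more tangible, here is a minimal sketch in Python using the standard sqlite3 module. The customers table, its contents and the two lookup functions are hypothetical and serve only as illustration. The sketch contrasts building a query by string concatenation with passing the user input as a bound parameter, which is a common way to avoid injection.

```python
import sqlite3

# Hypothetical in-memory database and table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, balance) VALUES (?, ?)",
    [("alice", 100.0), ("bob", 250.0)],
)


def find_unsafe(name):
    # Vulnerable: the user input becomes part of the SQL text itself.
    query = "SELECT name, balance FROM customers WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_safe(name):
    # Safer: the input is bound as a value and is never parsed as SQL.
    return conn.execute(
        "SELECT name, balance FROM customers WHERE name = ?", (name,)
    ).fetchall()


# Input crafted so that the WHERE clause becomes always true.
malicious = "nobody' OR '1'='1"

unsafe_rows = find_unsafe(malicious)  # the injected OR clause matches every row
safe_rows = find_safe(malicious)      # no customer has this literal name
```

With the concatenated query, the crafted input returns every customer. With the parameterized query, the same input is treated as an ordinary (non-matching) name.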
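The role of views in access control, mentioned above, can also be sketched briefly. The example below again uses Python's sqlite3 module, with a hypothetical employee table and employee_directory view. Note that SQLite has no user accounts or privileges, so the privilege part appears only as a comment: in a multi-user DBMS, one would typically grant SELECT on the view rather than on the base table.

```python
import sqlite3

# Hypothetical base table containing a sensitive column (salary).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee (name, department, salary) VALUES (?, ?, ?)",
    [("An", "Sales", 4200.0), ("Bo", "IT", 5100.0)],
)

# The view is defined by a query and exposes only non-sensitive columns.
conn.execute(
    "CREATE VIEW employee_directory AS SELECT name, department FROM employee"
)
# In a multi-user DBMS (not SQLite), access could then be restricted with
# something like: GRANT SELECT ON employee_directory TO some_role;
# where some_role is a hypothetical role name.

cursor = conn.execute("SELECT * FROM employee_directory ORDER BY name")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

# The view stores no data of its own: its content is generated on invocation,
# so a new row in the base table is visible through the view right away.
conn.execute(
    "INSERT INTO employee (name, department, salary) VALUES (?, ?, ?)",
    ("Cy", "HR", 3900.0),
)
view_count = conn.execute(
    "SELECT COUNT(*) FROM employee_directory"
).fetchone()[0]
```

An application that queries only employee_directory never sees the salary column, which illustrates how a view acts as a virtual table tailored to a particular group of users.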
To summarize, in this column we zoomed in on data security and illustrated how DBMSs can assist with this.
For more information, we are happy to refer to our upcoming book, Principles of Database Management, see www.pdbmbook.com.