GDPR: Approaches for Protecting Personally Identifiable Information (PII) and Sensitive Personal Information (SPI)

Jeremy Wittkop, CTO


Many companies are at different stages of projects to comply with the European Union’s General Data Protection Regulation (GDPR) ahead of the May 2018 enforcement deadline. Many vendors and service providers speak generally about GDPR and, in my view, often oversimplify the solutions to the issues it raises. Rather than try to address the whole of the regulation, I want to speak specifically about a practical issue that most companies will, at some point, need to address.

GDPR covers two categories of personal information: Personally Identifiable Information (PII) and Sensitive Personal Information (SPI). The two types are very different from each other and require separate approaches to identify and protect them accurately as they flow through an organization’s data environment.

Protecting Personally Identifiable Information (PII)

The first category of information that GDPR protects is commonly referred to around the world as Personally Identifiable Information. This category covers information that is generally accepted as personally identifiable, such as names and national identifiers like Social Security Numbers (SSNs) in the U.S., driver’s license numbers in the U.K., and Italy’s Codice Fiscale. It is important to note that GDPR expands the definition of PII to include things like email and IP addresses.

While the definition of PII has been expanded to include new types of identifiable information, the identifiers have one thing in common: they generally follow defined formats and are relatively easy to program into a content analytics system through the use of regular expressions. Because of this, Data Loss Prevention (DLP) technologies are ideal for identifying and protecting this type of information. DLP technologies can be enterprise class or integrated into other products like firewalls, cloud access security brokers (CASBs), or web gateways.
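To make the point about defined formats concrete, here is a minimal sketch of the kind of regular-expression matching a content analytics engine performs. The pattern names and the `find_pii` helper are my own illustrations, not any DLP vendor's API, and the patterns are deliberately simplified; a production policy would add validation, proximity rules, and many more identifier types.

```python
import re

# Illustrative, simplified patterns (assumptions for this sketch only).
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return a dict mapping each pattern name to the matches found in text."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com from 192.168.1.10, SSN 123-45-6789."
    print(find_pii(sample))
```

Because each identifier has a predictable shape, a handful of patterns covers a large share of PII, which is why this category fits DLP tooling so well.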

There are two key areas within GDPR that, in my view, point to DLP as the optimal solution for PII protection. First, the sections related to data security stipulate that the organization have reasonable controls to monitor the flow of data throughout the environment. In my interpretation, this means that an organization must be able to monitor the use of personal information at the endpoint, in transit via web and email channels, and wherever it is stored. That should also include visibility into how information is stored in cloud applications and how it is transferred between cloud environments. Second, as a practical matter, I cannot imagine a scenario in which an organization could honor the Right to Erasure (the "Right to be Forgotten") without the capability to find that PII throughout all of its systems, including cloud applications, and remove it. Therefore, a DLP capability, while not making an organization compliant in and of itself, is a required element of achieving compliance.

It should be said that building a proper DLP program for the purposes of complying with the relevant GDPR articles requires planning, coordination between business units, and a good deal of care and feeding. However, protecting PII has been a best practice for more than a decade and many people have experience building such programs.

Protecting Sensitive Personal Information is a far greater operational challenge.

Protecting Sensitive Personal Information (SPI)

Sensitive Personal Information refers to information that does not identify an individual, but is related to an individual and communicates something that is private or could harm that individual should it be made public. SPI includes things like biometric data, genetic information, sex life, trade union membership, and sexual orientation. The challenge in protecting SPI with traditional data security tools like DLP is that many of these terms exist in common usage without being related to an individual. It is very difficult to program a content analytics engine to find information that is in scope for GDPR without also finding large volumes of information that is not. In my experience, the most elegant solution for protecting SPI is to add a Data Classification program to the overall security program and integrate it with DLP.
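A small sketch shows why content matching alone struggles here. The keyword list and the two sample sentences are invented for illustration; the point is only that a keyword scan cannot tell a statement about a specific person from the same term in general usage.

```python
# Hypothetical keyword list for illustration; not an exhaustive or official list.
SPI_KEYWORDS = ["biometric", "genetic", "trade union", "sexual orientation"]


def keyword_hits(text):
    """Return the SPI keywords that appear anywhere in the text."""
    lowered = text.lower()
    return [kw for kw in SPI_KEYWORDS if kw in lowered]


# Relates to an identifiable person, so it is likely in scope.
in_scope = "Employee Jane Doe disclosed her trade union membership to HR."

# General usage with no individual involved, so it is likely out of scope.
out_of_scope = "Our blog post reviews trade union history and new biometric door locks."

if __name__ == "__main__":
    print(keyword_hits(in_scope))
    print(keyword_hits(out_of_scope))
```

Both sentences are flagged, and the out-of-scope one actually produces more hits, which is the false-positive problem in miniature.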

Data Classification allows a user to tag data by selecting a classification from a list. Many people are familiar with the classification schemas used by governments and militaries, which classify information by levels of secrecy, such as public, sensitive, secret, and top secret. The most effective Data Classification tools are very flexible, allowing for multiple levels of classification and customizable fields. For unstructured SPI data, an organization could develop a classification schema with simple drop-down menus that ask the user, with yes or no choices, whether a document contains PII and whether it contains SPI. The Data Classification solution then applies metadata tags to those documents, and security tools like DLP use those tags to apply rules to the information. This is a far more efficient and effective method of protecting SPI than trying to distinguish every instance of a sensitive category that references an individual from the same terms in common usage.
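The hand-off from classification to DLP can be sketched as follows. Everything here is hypothetical: the tag names (`contains_pii`, `contains_spi`), the channel names, and the actions are assumptions chosen for illustration, not the behavior of any particular classification or DLP product.

```python
from dataclasses import dataclass, field

# Hypothetical channel names that leave the organization's control.
EXTERNAL_CHANNELS = {"email_external", "web_upload", "cloud_share"}


@dataclass
class Document:
    name: str
    tags: dict = field(default_factory=dict)  # metadata applied at classification


def classify(doc, contains_pii, contains_spi):
    """Record the user's yes/no drop-down answers as metadata tags."""
    doc.tags["contains_pii"] = "yes" if contains_pii else "no"
    doc.tags["contains_spi"] = "yes" if contains_spi else "no"
    return doc


def dlp_action(doc, channel):
    """Choose an action from the tags alone, without inspecting content."""
    if "contains_pii" not in doc.tags or "contains_spi" not in doc.tags:
        return "prompt_for_classification"
    external = channel in EXTERNAL_CHANNELS
    if external and doc.tags["contains_spi"] == "yes":
        return "block"
    if external and doc.tags["contains_pii"] == "yes":
        return "encrypt"
    return "allow"


if __name__ == "__main__":
    survey = classify(Document("member_survey.docx"), contains_pii=True, contains_spi=True)
    print(dlp_action(survey, "email_external"))
```

The design choice worth noting is that the rule reads only the tag, so the person who knows whether the document concerns an individual makes the judgment that a pattern match cannot.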

Data Classification programs can also communicate effectively in a human-readable fashion. Many people interact with PII and SPI frequently without really thinking about the potential sensitivity of the information they handle. A large part of the spirit of GDPR is to cause people to think about the information they are handling and to handle it with due care. Complying with the spirit of the regulation will require a culture change in some organizations, which a Data Classification program can aid considerably. With visible classifications, users can easily recognize when they are handling sensitive information and handle it with more care as they go about their daily routine. Many Data Classification solutions can also communicate with the end user through tips or pop-up messages to reinforce the behavioral change.

Breaches of personal data can happen in a variety of ways. Those that garner the most attention are large-scale breaches, often caused by incorrect technical configurations or a lack of due care on an industrial scale. Far more frequently, however, information is compromised on a small scale through carelessness or a general lack of awareness. In these cases, Data Classification can help significantly.


Many organizations have what I call GDPR fatigue, meaning that there have been so many technology and service providers using fear to sell products and services without addressing specific solutions to the challenges posed by GDPR that many organizations have stopped listening. I do not look at GDPR as a reason for fear, but rather a positive way for organizations to enhance their security programs to protect critical client data and personal information.

GDPR compliance is relatively straightforward. However, the basis of compliance is understanding how to identify and protect Personally Identifiable Information (PII) and Sensitive Personal Information (SPI). Programs that enable PII and SPI identification and protection are therefore the foundational elements of compliance from a tools and capabilities perspective. Data Loss Prevention and Data Classification form a powerful combination for protecting both PII and SPI. The challenge then becomes one of leveraging those capabilities properly to fulfill controller and processor obligations and protect data subject rights.