Seven Critical Tools for Boosting Data Protection and Innovation

April 22, 2021
The winds of change are gusting through the business world.

More than a dozen state governments are crafting data compliance legislation, while Virginia recently passed a data privacy law mostly similar to California's Consumer Privacy Act. This legislative flurry adds to a long list of established regulations such as GDPR, HIPAA, PCI DSS, and Sarbanes-Oxley. At the same time, consumers are attuned to how businesses control their personal data.

Responsible businesses feel these headwinds and recognize customers and employees have a fundamental right to privacy—and these organizations are taking active steps to ensure sensitive data is fully protected.

But it's a big undertaking—especially as organizations look to unlock the full value of data while keeping it secure. Advanced analytics, machine learning (ML), cloud, the Internet of Things (IoT), and extended supply chains place new and often onerous demands on organizations. As data streams in from various sources and across organizational boundaries, protecting trade secrets and personally identifiable information (PII) is imperative.

Techniques for preserving privacy can be divided into three categories, each with its own benefits and constraints: reversible (R) data transformations, non-reversible (N) software-based techniques, and hardware-based security mechanisms (TEE).

Here is a look at seven privacy-preserving tools and technologies that rely on those techniques. They could benefit your organization, especially as it further pursues data-driven AI projects.

Differential privacy (N)

This framework allows organizations to publicly share information from a dataset by using an algorithm to compile statistics about the dataset. It extracts characteristics and patterns without revealing personal or private data, regardless of how unique an individual's data is. This sharply limits what anyone can learn about any single individual in the database from the published results.

However, differential privacy and the related k-anonymity technique work best on larger datasets, and it's important to note that differential privacy works by adding "noise" to results. This means that in some cases, the technique can degrade the accuracy of a statistical operation.
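As a rough illustration of how noise protects individuals, here is a minimal Python sketch of one common approach, the Laplace mechanism, applied to a simple count. The function names, the sample ages, and the epsilon value are assumptions made for this example; a production system would normally rely on a vetted differential privacy library instead.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centered on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that match `predicate`."""
    rng = rng or random.SystemRandom()
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1, so the
    # sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: ages of survey respondents.
ages = [23, 35, 41, 29, 62, 57, 38, 44, 19, 71]
rng = random.Random(0)  # seeded only so the demo is repeatable
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of respondents aged 40 or older: {noisy:.2f}")
```

A smaller epsilon means more noise and stronger privacy. On a dataset this small the noise can swamp the true answer, which is why the technique tends to suit larger datasets.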

Data de-identification (N or R)

Frequently, there's a need to strip identifying information from data. This may include names, phone numbers, medical record numbers, and sensitive dates. Data de-identification accomplishes this task. In some cases, it can function as a subset of differential privacy. The technology doesn't alter or impact the original dataset; it simply extracts relevant data and creates a new dataset that's typically referred to as the "destination dataset."

De-identification technologies frequently work at the dataset level, the FHIR store level and the DICOM store level, depending on the requirements of a project. They can be used to share information with non-privileged groups or to assemble datasets, and they include reversible two-way methods and non-reversible one-way methods.
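To make the idea of a destination dataset concrete, here is a small Python sketch that strips direct identifiers and generalizes a date while leaving the source records untouched. The field names and the sample record are hypothetical. Real healthcare tooling for FHIR or DICOM stores exposes its own configuration and is not modeled here.

```python
import copy

# Hypothetical field names; a real project would take these from its schema.
DIRECT_IDENTIFIERS = {"name", "phone", "medical_record_number"}


def deidentify(records):
    """Build a new destination dataset without modifying the source."""
    destination = []
    for record in records:
        clean = {
            key: copy.deepcopy(value)
            for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS
        }
        # Generalize a sensitive date to its year. This is a one-way step:
        # the month and day cannot be recovered from the output.
        if "admission_date" in clean:
            clean["admission_year"] = clean.pop("admission_date")[:4]
        destination.append(clean)
    return destination


source = [
    {
        "name": "Ada Example",
        "phone": "555-0100",
        "medical_record_number": "MRN-001",
        "admission_date": "2021-03-14",
        "diagnosis": "asthma",
    },
]
destination = deidentify(source)
print(destination)
```

Dropping and generalizing fields is a non-reversible method. A reversible variant would replace identifiers with tokens or encrypted values that an authorized party can map back.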

Tokenization (R)

This technique is increasingly popular because it preserves the format of data. It substitutes a sensitive data element with a non-sensitive, randomized equivalent that can be used for various analytics and ML tasks and is resistant to attacks from quantum computers. The replacement data element is known as a token, which essentially serves as a mapping or translation mechanism.

Tokenization is suitable for analytical applications as well as other applications that may require fast operations. It also can search on encrypted data values, in some cases translating clear text values or enabling "fuzzy search" on protected data. The high level of flexibility built into tokenization technology is appealing to businesses.
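The mapping idea can be sketched in a few lines of Python. The TokenVault class below is a hypothetical, in-memory illustration that swaps each digit for a random digit and keeps separators so the token keeps the original format. Commercial tokenization products work differently under the hood and add access control, persistence, and auditing.

```python
import secrets


class TokenVault:
    """Toy in-memory vault that maps sensitive values to random tokens."""

    def __init__(self):
        self._value_to_token = {}
        self._token_to_value = {}

    def tokenize(self, value):
        if not any(ch.isdigit() for ch in value):
            raise ValueError("this toy vault only tokenizes values with digits")
        # Reuse the existing token so the same input always maps the same way.
        if value in self._value_to_token:
            return self._value_to_token[value]
        while True:
            # Replace each digit with a random digit and keep separators,
            # so the token has the same length and layout as the original.
            token = "".join(
                secrets.choice("0123456789") if ch.isdigit() else ch
                for ch in value
            )
            if token != value and token not in self._token_to_value:
                break
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token):
        # Only a party with access to the vault can reverse the mapping.
        return self._token_to_value[token]


vault = TokenVault()
card = "4111-1111-1111-1111"
token = vault.tokenize(card)
print(token)
print(vault.detokenize(token) == card)
```

Because the token is random rather than derived from the original value, it reveals nothing about the card number unless the vault itself is compromised.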

Format-preserving encryption (R)

This technique is used to preserve the format of data. It substitutes a sensitive data element with a non-sensitive encrypted equivalent that can be used for various analytics and ML tasks. It's often used when there's a need for a masked dataset, such as when validating production data, or when it's desirable to maintain the actual number of digits in a credit card number or Social Security number so the data can be used by legacy systems or adhere to regulatory standards. It is less secure than tokenization and approximately ten times slower than traditional encryption. Like the Advanced Encryption Standard (AES), it is not resistant to attacks from quantum computers.
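To show what "same number of digits in, same number of digits out" looks like, here is a toy Python sketch built on a Feistel network with HMAC-SHA256 as the round function. It is an illustration only, not the standardized FF1 or FF3-1 algorithms and not suitable for production. The round count, key, and function names are assumptions made for this example.

```python
import hashlib
import hmac

ROUNDS = 10  # assumption for this toy; not taken from any standard


def _round_value(key, round_no, half, modulus):
    # Pseudorandom round function built from HMAC-SHA256.
    message = bytes([round_no]) + half.encode("ascii")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _feistel(key, digits, decrypting):
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two ASCII digits")
    u = len(digits) // 2          # width of the left half
    v = len(digits) - u           # width of the right half
    left, right = int(digits[:u]), int(digits[u:])
    order = range(ROUNDS - 1, -1, -1) if decrypting else range(ROUNDS)
    sign = -1 if decrypting else 1
    for i in order:
        if i % 2 == 0:
            # Even rounds update the left half from the right half.
            f = _round_value(key, i, f"{right:0{v}d}", 10 ** u)
            left = (left + sign * f) % (10 ** u)
        else:
            # Odd rounds update the right half from the left half.
            f = _round_value(key, i, f"{left:0{u}d}", 10 ** v)
            right = (right + sign * f) % (10 ** v)
    return f"{left:0{u}d}{right:0{v}d}"


def fpe_encrypt(key, digits):
    return _feistel(key, digits, decrypting=False)


def fpe_decrypt(key, digits):
    return _feistel(key, digits, decrypting=True)


key = b"demo-key-not-for-production"
ssn = "123456789"
ciphertext = fpe_encrypt(key, ssn)
print(ciphertext, fpe_decrypt(key, ciphertext))
```

The ciphertext is still a nine-digit string, so a legacy system that expects that shape can store it unchanged, and anyone holding the key can recover the original.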

Hashing (N)

This mathematical function converts data of any size into a fixed-length value known as a hash or hash value. While encryption is designed to work two ways (encrypt and decrypt), hashing involves an irreversible one-way operation. Although the technique can be used for a variety of purposes, it is particularly valuable when it's applied to certain aspects of security and privacy.

For example, hashing technology makes it possible to store passwords and other sensitive data without revealing the actual string. In fact, a company or website can't view the plaintext password, so if a user forgets or loses it, a complete reset is required. Yet there are risks associated with hashing, including incorrect use of the technology, such as hashing passwords without a salt or with a fast, outdated algorithm, which can open the door to security breaches.
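Here is a minimal Python sketch of password storage using only the standard library. It uses a random salt, a deliberately slow key-derivation function (PBKDF2 with HMAC-SHA256), and a constant-time comparison. The function names and the iteration count are assumptions for this example, and the right work factor depends on your hardware and current security guidance.

```python
import hashlib
import hmac
import secrets


def hash_password(password, iterations=600_000):
    """Return (salt, iterations, digest) for storage instead of the password."""
    salt = secrets.token_bytes(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected_digest):
    """Re-derive the digest from the attempt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected_digest)


salt, iterations, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, iterations, stored))
print(verify_password("wrong guess", salt, iterations, stored))
```

Only the salt, the iteration count, and the digest are stored. The system never needs to keep the plaintext, which is why a forgotten password has to be reset and cannot be looked up.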

Trusted execution environments (TEE)

Another tool for protecting data is a TEE. It relies on an isolated area on a processor that functions independently from the main operating system. This trusted environment allows data to be stored and processed while in a protected state. The technology is already widely used in smartphones, tablets, smart TVs, set-top boxes and IoT devices.

TEE often complements encryption and essentially uses a root of trust, a set of functions that can always be trusted, usually because a TEE resides at the silicon level and cannot be accessed by outside devices. Another thing that makes the technology attractive is that it can operate on clear-text information, meaning it's faster and more scalable than homomorphic encryption, particularly in clouds.

Homomorphic encryption (R)

This emerging technique makes it possible to perform computations on encrypted data. Because the underlying data remains invisible, it's ideal for industries like finance and healthcare, or when there's a need for multi-party computing. For example, a group of credit card companies could use it to share data to improve fraud detection without revealing customer data.

Homomorphic encryption is well suited to advanced analytics and ML tasks. There's partially homomorphic encryption (PHE), which is simpler but supports only one type of operation on encrypted data, such as addition or multiplication, and fully homomorphic encryption (FHE), which supports arbitrary computations on encrypted data. The technology remains in a nascent state, and it isn't widely used due to relatively slow speeds. But algorithms are improving, and homomorphic encryption will likely be a powerful tool for protecting data when more powerful quantum computers appear.
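Computing on encrypted data can feel abstract, so here is a toy Python sketch of the Paillier cryptosystem, a well-known additively homomorphic scheme. It adds encrypted transaction amounts without decrypting any of them. The primes, function names, and sample amounts are assumptions chosen for readability. The key is far too small to be secure, and real deployments use audited libraries.

```python
import math
import secrets

# Two well-known Mersenne primes. Far too small for real security.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (requires Python 3.8+)
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n, m):
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    # With generator g = n + 1, g**m mod n^2 simplifies to 1 + m * n.
    return ((1 + m * n) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, c):
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n


def add_encrypted(n, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts.
    return (c1 * c2) % (n * n)


n, private_key = keygen()
amounts = [120, 75, 305]  # hypothetical transaction amounts
encrypted_total = encrypt(n, 0)
for amount in amounts:
    encrypted_total = add_encrypted(n, encrypted_total, encrypt(n, amount))
print(decrypt(n, private_key, encrypted_total))
```

The party doing the addition only ever sees ciphertexts and the public value n. Only the holder of the private key can read the total.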

In some cases, an organization may want to use more than one of these seven methods with the same data, or at various points in the data lifecycle. Also, specific industry standards and solutions call for different privacy-preserving techniques. In the end, it's critical to understand how and where you need the different data protection tools and how they can lead you down a path toward responsible AI.

By preserving privacy in analytics and machine learning, it's possible to enjoy the best of both worlds: extracting the maximum value from data while building greater trust with customers, business partners, and others. Effective use of these techniques can fuel business growth and success. Privacy and innovation go together. You can no longer have one without the other, and neither is optional.
