Posted on: August 21, 2018

Building a Secure Cloud Data Lake with AWS, Protegrity and Talend

The Rise of the Cloud Data Lake

Cloud computing is rapidly changing the IT landscape and brings many business benefits.  The question is how to balance the promise of the cloud against concerns over the erosion of data security controls and the challenge of scaling security into hybrid and public clouds.

Migrating to the cloud securely is a concern that every enterprise needs to tackle.  In this blog, we will show you how you can build your data lake securely with Talend and Protegrity.  This solution is a real-world use case that generated value for a large financial services firm in Canada.  Leveraging Talend and Protegrity, the firm can protect sensitive data on its way from an on-premises Oracle source database to a Cloudera data lake on AWS.  Let us look at how this financial firm accomplished its cloud migration securely.

 Solution Overview

[Diagram: secure cloud data lake]

Any legacy data you may be dealing with will need to be migrated to the cloud securely.  In the case we will walk you through today, data is migrated from an Oracle database to Cloudera Impala.  As shown in the diagram, Talend was instrumental in building a data pipeline that sourced data from Oracle, transformed it, protected it with the Protegrity Application Protector, and loaded the secured datasets into a Cloudera Hadoop data lake running on AWS.

When the many components Talend provides do not cover a needed capability, developers can add custom logic through Java code routines.  To secure the data, Protegrity provides a Java Application Protector: a high-performance, versatile solution that packages an interface for integrating comprehensive, granular security and auditing into enterprise applications.  It eliminates the need for application developers to master the complexities of cryptography, while keeping the security team in control of sensitive data protection and access.

The Protegrity Java Application Protector, along with the ESA (Enterprise Security Administrator) security policy server, features flexible deployment with simple, API-based integration.  It is implemented as a small-footprint server application with a variety of deployment options to match the target application and data requirements.  The easy-to-use interface can be accessed from several programming languages, including C, C++, Java, and .NET.  For the financial services firm, Java was used in Talend’s tJava component.
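A routine of the kind described above might be shaped like the following sketch.  Talend routines are Java classes that expose static methods, so the protection call can be wrapped once and reused from job components.  Everything named here is an assumption made for illustration: `FieldProtector`, `ProtectRoutineSketch`, the data element name `"SSN"`, and the placeholder implementation are hypothetical and are not the Protegrity Application Protector API, which this post does not show.

```java
import java.util.Objects;

/** Hypothetical abstraction over a protection library; NOT the Protegrity API. */
interface FieldProtector {
    String protect(String dataElement, String clearValue);
}

/**
 * Sketch of a Talend-style routine: a class exposing static helper methods.
 * (A real Talend routine would be a public class in the routines package;
 * it is package-private here so the sketch compiles as a single file.)
 */
final class ProtectRoutineSketch {

    // Placeholder with NO security value: it only tags and reverses the input
    // so the call flow can be exercised without a vendor library.
    static final FieldProtector DEMO_PLACEHOLDER =
        (dataElement, clearValue) ->
            "[" + dataElement + "]" + new StringBuilder(clearValue).reverse();

    private static volatile FieldProtector protector = DEMO_PLACEHOLDER;

    private ProtectRoutineSketch() {}

    /** Swap in the real protector once, for example at job start. */
    static void setProtector(FieldProtector p) {
        protector = Objects.requireNonNull(p, "protector");
    }

    /** Null and empty values pass through untouched, as ETL rows often contain them. */
    static String protect(String dataElement, String clearValue) {
        if (clearValue == null || clearValue.isEmpty()) {
            return clearValue;
        }
        return protector.protect(dataElement, clearValue);
    }

    public static void main(String[] args) {
        System.out.println(protect("SSN", "123-45-6789"));
    }
}
```

From a row-level Java component, a call could then look like `output_row.ssn = ProtectRoutineSketch.protect("SSN", input_row.ssn);`, with the column names again being illustrative.  Keeping the vendor call behind one small interface means job code does not change if the protector or its configuration does.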

Given the variety of data types and formats in use today, and the multiple regulations that may apply to your data, it’s vital to select a protection method that meets the needs of both the sensitive data and the application.  Policies are designed in ESA’s easy-to-use web interface and deployed to endpoints.  The Java Application Protector is one such policy enforcement endpoint.  Protegrity provides both tokenization and NIST Format-Preserving Encryption (FPE) technologies to allow for maximum flexibility and transparency.  You can choose between these options when designing a policy in ESA.

With CDH (Cloudera’s Distribution including Apache Hadoop) installed on AWS EC2, the data lake can use Protegrity’s Big Data Protector (BDP) for CDH.  BDP is installed on each Cloudera node and is optimized for the CDH environment.  It comes with support for Apache Hive, Spark, and Impala.

Consistency between environments is a key strength of the Protegrity platform.  A single data element policy can be applied across all of them: if, for example, the policy defines that Social Security numbers are always encrypted and dates of birth always tokenized, those rules hold on-premises and in the cloud alike.  When the same encryption keys and code books are used, the protected output is consistent, so records can be joined on these ‘fake’ values without exposing the originals, in keeping with separation of duties (SoD) principles.  In addition to the security benefit, Hadoop’s scalability can be exploited to protect and unprotect data at scale as needed.
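The point about joining on protected values can be illustrated with a small sketch.  It uses a keyed HMAC-SHA256 from the standard Java library as a stand-in for deterministic protection; this is an assumption chosen for illustration, not Protegrity’s tokenization or FPE, and unlike those it is not reversible.  The class and method names are hypothetical.  What it shows is the property the paragraph relies on: the same key and the same input give the same output in every environment, so two datasets protected separately can still be joined.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class ConsistentTokenSketch {

    private ConsistentTokenSketch() {}

    /** Deterministic keyed stand-in token: first 8 bytes of HMAC-SHA256, hex-encoded. */
    static String token(byte[] key, String clearValue) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] digest = mac.doFinal(clearValue.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    /** Inner join of two token-keyed tables; matching rows become "left|right". */
    static Map<String, String> joinOnToken(Map<String, String> left, Map<String, String> right) {
        Map<String, String> joined = new HashMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            String match = right.get(e.getKey());
            if (match != null) {
                joined.put(e.getKey(), e.getValue() + "|" + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        byte[] sharedKey = "demo-key-not-for-production".getBytes(StandardCharsets.UTF_8);

        // Environment A (for example, the ETL server) protects account rows.
        Map<String, String> accounts = new HashMap<>();
        accounts.put(token(sharedKey, "123-45-6789"), "account-1");

        // Environment B (for example, the cluster) protects claim rows separately.
        Map<String, String> claims = new HashMap<>();
        claims.put(token(sharedKey, "123-45-6789"), "claim-7");

        // The join succeeds without either side handling the clear value again.
        System.out.println(joinOnToken(accounts, claims).values());
    }
}
```

If the two environments used different keys, the tokens would differ and the join would return nothing, which is why the shared keys and code books matter.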

Talend Integration with Protegrity Data Protectors

 The following diagram depicts the solution in more detail:

[Diagram: Secure Cloud Data Lake]

Ingestion (Protection):

  • Using the Protegrity Java Application Protector installed on the on-premises Talend Job server, sensitive data is protected (either encrypted or tokenized) during the ETL transformation phase, before ingestion into the data lake (the Cloudera Big Data cluster on AWS)
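The ingestion step can be pictured as a per-row transform that protects only the sensitive columns and passes everything else through unchanged.  The sketch below assumes rows are simple column-to-value maps and takes the protection function as a parameter; the class name, column names, and the function itself are hypothetical and stand in for whatever the Talend job and the protector actually provide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

final class IngestProtectSketch {

    private IngestProtectSketch() {}

    /**
     * Returns a copy of the row in which every non-null value of a sensitive
     * column has been replaced by protector.apply(value). Column order is kept.
     */
    static Map<String, String> protectRow(Map<String, String> row,
                                          Set<String> sensitiveColumns,
                                          UnaryOperator<String> protector) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            String value = e.getValue();
            boolean sensitive = sensitiveColumns.contains(e.getKey());
            out.put(e.getKey(), sensitive && value != null ? protector.apply(value) : value);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("customer_id", "42");
        row.put("ssn", "123-45-6789");
        row.put("city", "Toronto");

        // Placeholder protector with no security value, for demonstration only.
        UnaryOperator<String> placeholder = v -> "PROTECTED(" + v.length() + ")";

        System.out.println(protectRow(row, Set.of("ssn"), placeholder));
    }
}
```

Because the transform runs before the load step, only protected values ever leave the on-premises environment for the cluster.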

Consumption (Unprotection):

  • On the Big Data cluster (on AWS), only authorized users can see the data in clear or masked form when using the Protegrity Big Data Protector User Defined Functions (UDFs), depending on the roles they hold in the ESA policy
  • The ESA policies are created with the appropriate permissions (roles) for each data element (encryption key or tokenization code book), and the policy is deployed to the Cloudera cluster
  • Data is unprotected when users run Hive/Impala queries that have Protegrity Big Data Protector UDFs embedded in them
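The role-dependent behaviour described in these bullets can be sketched as a small policy lookup: a role maps to clear, masked, or no access for a data element, and a user with no matching role only ever sees the protected value.  This is a simplified model for illustration, not ESA’s policy format or the Big Data Protector UDFs; the role names, the masking rule (keep the last four characters), and the `unprotect` function passed in are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

final class RoleBasedViewSketch {

    enum Access { CLEAR, MASKED, PROTECTED }

    // Hypothetical policy for one data element: role -> level of access.
    static final Map<String, Access> SSN_POLICY = new HashMap<>();
    static {
        SSN_POLICY.put("fraud_analyst", Access.CLEAR);
        SSN_POLICY.put("support_agent", Access.MASKED);
    }

    private RoleBasedViewSketch() {}

    /** Replaces all but the last four characters with '*'. */
    static String mask(String clear) {
        int keep = Math.min(4, clear.length());
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < clear.length() - keep; i++) {
            sb.append('*');
        }
        return sb.append(clear.substring(clear.length() - keep)).toString();
    }

    /** What a user in the given role sees; unknown roles get the protected value. */
    static String view(String role, String protectedValue, UnaryOperator<String> unprotect) {
        Access access = SSN_POLICY.getOrDefault(role, Access.PROTECTED);
        switch (access) {
            case CLEAR:
                return unprotect.apply(protectedValue);
            case MASKED:
                return mask(unprotect.apply(protectedValue));
            default:
                // The unprotect function is never called for unauthorized roles.
                return protectedValue;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real unprotect call; it just returns a fixed value.
        UnaryOperator<String> unprotect = token -> "123-45-6789";
        for (String role : new String[] {"fraud_analyst", "support_agent", "intern"}) {
            System.out.println(role + " -> " + view(role, "tok_ab12", unprotect));
        }
    }
}
```

In the solution described here, the equivalent decision is made by the policy deployed from ESA each time a UDF runs inside a query, so access can be changed centrally without rewriting queries.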

Why Choose Talend and Protegrity on your cloud journey?

No organization is too big or too small to leverage the power of the cloud.  Integrated solutions deployed with Talend and Protegrity benefit from:

  • Experience working with strict cloud compliance frameworks in on-premises, cloud, and hybrid deployments, according to security standards such as GDPR, PCI-DSS, and more
  • Partnership with the industry’s largest and most powerful cloud platforms (AWS, Google and Azure)
  • Cost savings by migrating your existing legacy datasets or implementing new cloud services in a secure cloud environment
  • Specialized solutions to safeguard your sensitive data within a Talend Job, according to your business needs
  • Sensitivity to your time-to-market needs and scalability requirements
  • Significant experience providing industry-specific cloud solutions to health, financial, and SaaS-focused organizations


 As discussed in this blog, Talend and Protegrity can help you migrate to the cloud securely.  A data lake can be hosted in the cloud or run on-premises.  Combining Talend with Protegrity protectors such as the Application Protector and the Big Data Protector helps you manage the life cycle of sensitive data.  In the solution described here, Protegrity’s Application Protector is used within the Talend ETL process to protect data as it is ingested, and Protegrity’s Big Data Protector for Cloudera is used to consume the protected data securely and at scale in the cloud.


Talend is an enterprise data integration software vendor. Talend delivers a single platform for data integration across public, private, and hybrid cloud, as well as on-premises environments. The company provides enterprise software solutions for big data, data integration, data management, master data management, data quality, data preparation and enterprise application integration.

 Protegrity protects sensitive data – that hackers try to reach – wherever it exists: at rest, in transit and in use. Protegrity is the only enterprise data security software platform that leverages scalable, data-centric tokenization, encryption and masking to help businesses secure sensitive information while maintaining data usability. Built for complex, heterogeneous business environments, the Protegrity Data Security Platform provides unprecedented levels of data security certified across applications, mainframes, databases, data warehouses, big data and cloud environments. Protegrity is offered on a subscription basis that bundles software, support and consulting services in flexible tiers based on how much sensitive data is protected.