Protegrity & StreamSets
This integration embeds Protegrity AP Java directly into StreamSets pipelines, enabling in-flight tokenization, masking, and encryption without relying on external security services or post-processing steps.
Integration type
- Non-Native
- Streaming
Partner
Yes
Supported platforms
- AWS
- Azure
- GCP
Overview
Experience data protection that runs inside your data pipelines, not alongside them. Protegrity’s Application Protector (AP) Java is purpose‑built to embed directly within the StreamSets environment, securing sensitive data continuously as it flows. Whether you’re orchestrating complex multi‑cloud movement or managing real‑time streaming ingestion, protection is applied directly within the active dataflow—before data ever reaches a downstream system.
The result: StreamSets users can scale modern data architectures with confidence, maintaining high‑speed ingestion and routing while meeting strict global privacy and compliance requirements.
Key Integration Feature
Modern data pipelines move fast—but security often lags behind. Protegrity’s integration with StreamSets closes that gap by embedding Application Protector (AP) Java directly within StreamSets pipelines, so sensitive data is secured in motion, not after the fact. This eliminates the need for external security services, network calls, or post-processing controls that introduce latency and risk. Teams can build, modify, and scale streaming pipelines knowing sensitive data is protected by default, without sacrificing throughput or architectural flexibility.
Features & Capabilities
See how Protegrity protects data at stream speed inside StreamSets—combining in-flight enforcement, centralized policy, and flexible pipeline integration for modern data movement.
01
Vaultless Tokenization at Stream Speed
Why It Matters
Sensitive fields need protection that preserves structure and analytical value as data moves. Vaultless tokenization keeps schemas intact and avoids lookup latency, so downstream systems can continue to join, validate, and analyze data without changes.
How It Works
A retailer streams customer transactions through StreamSets into a cloud data lake. AP Java tokenizes identifiers in flight, allowing analytics teams to work on production-scale data while real identities stay protected.
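The property described above—structure-preserving tokens with no vault lookup—can be illustrated with a minimal sketch. This is not Protegrity's actual algorithm; the key, function name, and transform are assumptions for illustration only.

```python
import hmac
import hashlib

DEMO_KEY = b"demo-only-key"  # illustrative; real keys come from central policy, never code

def tokenize_digits(value: str, key: bytes = DEMO_KEY) -> str:
    """Derive a same-length, digits-only token deterministically (no vault lookup).

    Handles values up to 32 characters in this sketch (one digit per digest byte).
    """
    digest = hmac.new(key, value.encode(), hashlib.sha256).digest()
    return "".join(str(b % 10) for b in digest)[: len(value)]

card = "4111111111111111"
token = tokenize_digits(card)
# Same length and character class as the input, so downstream schema
# checks, joins, and validations continue to work unchanged.
```

Because the mapping is deterministic under the key, repeated values tokenize identically, which preserves joins; a production vaultless scheme would also guarantee policy-controlled reversibility, which this sketch omits.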
02
In-Flight Protection with Embedded AP Java
Why It Matters
Streaming security must keep pace with data velocity. By running AP Java inside StreamSets, Protegrity avoids external API calls and centralized bottlenecks, helping protection scale with pipeline throughput.
How It Works
A healthcare provider ingests continuous patient telemetry. Protection runs locally within StreamSets processors, securing sensitive fields before the data lands in downstream repositories.
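A simplified, standalone sketch of this pattern, modeled loosely on how a StreamSets scripting evaluator iterates over a batch of records. The field names and the `protect()` placeholder are assumptions, not the AP Java API; the point is that protection runs locally per record with no external call.

```python
# Stand-in for tokenization that runs locally on the execution node;
# in the real integration this work is done by the embedded AP Java library.
def protect(value: str) -> str:
    return "TOK_" + value[::-1]  # placeholder transform, not a real algorithm

SENSITIVE_FIELDS = {"patient_id", "device_serial"}  # hypothetical field names

def process_batch(records: list[dict]) -> list[dict]:
    """Protect sensitive fields in each record before it leaves the stage."""
    out = []
    for record in records:
        for field in SENSITIVE_FIELDS & record.keys():
            record[field] = protect(record[field])  # no network round-trip
        out.append(record)
    return out

batch = [{"patient_id": "p-1001", "heart_rate": "72"}]
secured = process_batch(batch)
```

Non-sensitive telemetry passes through untouched, so throughput scales with the pipeline rather than with an external security service.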
03
Dynamic Policy Enforcement Within Live Pipelines
Why It Matters
Not every field or destination requires the same treatment. Dynamic policy enforcement lets a single pipeline apply different protection methods based on sensitivity, destination, or use case without duplicating flows.
How It Works
When an HR data stream is routed to multiple targets, Social Security Numbers are tokenized before analytics delivery while less-sensitive attributes continue through the pipeline unchanged.
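The per-field treatment described above can be sketched as a policy lookup that selects a protection method per field. The policy mapping and both transforms are hypothetical; in the real integration these rules live in Protegrity's central policy, not in pipeline code.

```python
# Hypothetical per-field rules; unlisted fields pass through unchanged.
POLICY = {"ssn": "tokenize", "salary": "mask"}

def tokenize(v: str) -> str:
    return "TOK-" + format(sum(ord(c) for c in v) % 100000, "05d")  # placeholder

def mask(v: str) -> str:
    return "*" * max(len(v) - 4, 0) + v[-4:]  # keep last four characters visible

METHODS = {"tokenize": tokenize, "mask": mask, "passthrough": lambda v: v}

def apply_policy(record: dict) -> dict:
    """Apply the method each field's sensitivity calls for, in one pass."""
    return {f: METHODS[POLICY.get(f, "passthrough")](v) for f, v in record.items()}

hr_record = {"ssn": "123-45-6789", "salary": "95000", "department": "HR"}
governed = apply_policy(hr_record)
```

A single flow can therefore serve multiple destinations with different sensitivity requirements, without duplicating the pipeline per treatment.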
04
Embedded Pipeline Integration Without External Services
Why It Matters
Data engineering teams need security that fits existing workflows. Embedding protection directly in StreamSets reduces operational risk by removing external services, hard-coded secrets, and extra network dependencies from the pipeline design.
How It Works
Teams implement AP Java within StreamSets processors or evaluators, where cryptographic operations and policy checks run locally on execution nodes and sensitive keys stay out of pipeline logs and configurations.
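One aspect of this design can be sketched in isolation: resolving the protection key at stage startup rather than storing it in pipeline configuration, and redacting it from anything that might reach logs. The environment variable name and class are illustrative assumptions, not how AP Java actually obtains keys (it uses its policy session).

```python
import os

class ProtectStage:
    """Sketch of a stage that resolves its key at startup and never logs it."""

    def __init__(self, key_env_var: str = "PROTECTION_KEY"):  # hypothetical name
        key = os.environ.get(key_env_var)
        if key is None:
            # Fail fast instead of falling back to a hard-coded secret.
            raise RuntimeError("protection key unavailable; refusing to start stage")
        self._key = key.encode()

    def __repr__(self) -> str:
        # Keep the key material out of pipeline logs and error messages.
        return "ProtectStage(key=<redacted>)"

os.environ["PROTECTION_KEY"] = "demo-only"  # simulated secret injection for the demo
stage = ProtectStage()
```

The same idea extends to pipeline export: because nothing secret lives in stage configuration, exported pipeline definitions stay safe to share.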
05
Centralized Policy Management Across StreamSets Flows
Why It Matters
Streaming environments become hard to govern when each pipeline is configured differently. Centralized policy management helps ensure the same rules are enforced consistently across flows, teams, and destinations without manual updates.
How It Works
A financial institution defines a global credit-card policy in ESA. That rule is then enforced across StreamSets pipelines automatically, regardless of destination system or cloud platform.
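The single-source-of-policy idea can be sketched as two independent flows consulting one shared rule definition, a stand-in for a rule fetched from the central policy server (ESA in the text). Function and field names are illustrative.

```python
# Hypothetical stand-in for a centrally managed rule; every pipeline
# consults the same definition, so there is no per-pipeline drift.
CENTRAL_POLICY = {"credit_card": "tokenize"}

def fetch_policy(field: str) -> str:
    return CENTRAL_POLICY.get(field, "passthrough")

def ingest_pipeline(record: dict) -> dict:
    return {f: f"TOK({v})" if fetch_policy(f) == "tokenize" else v
            for f, v in record.items()}

def reporting_pipeline(record: dict) -> dict:
    # A different flow, same central rule: updating CENTRAL_POLICY once
    # changes enforcement everywhere, with no manual pipeline edits.
    return ingest_pipeline(record)

a = ingest_pipeline({"credit_card": "4111", "amount": "10"})
b = reporting_pipeline({"credit_card": "4111", "amount": "10"})
```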
Architecture & Sample Data Flow
Protegrity’s StreamSets architecture is designed for high‑velocity, in‑flight protection. By embedding Application Protector (AP) Java directly into the pipeline, security becomes a built‑in layer of the data flow—not a downstream control. This approach ensures sensitive data is consistently protected as it moves between sources, pipelines, and target systems, across on‑premise, cloud, and hybrid environments—before exposure risk ever exists.
The data journey
01
Protect data at ingestion
As data enters StreamSets from source systems, Protegrity can apply tokenization, masking, or encryption immediately within the active flow. This helps ensure sensitive values are protected before they move into downstream pipelines, lakes, or warehouses.
02
Enforce protection during pipeline transformation
Protegrity runs inside StreamSets through embedded AP Java, allowing protection to happen in flight as data is transformed, routed, enriched, or filtered. This keeps sensitive fields governed without introducing external calls, added latency, or separate post-processing steps.
03
Deliver governed data to downstream platforms
Protected data can be streamed safely into destinations such as Databricks, Cloudera, cloud storage, or analytics platforms. Because sensitive fields remain tokenized or masked in transit, downstream teams can work with governed data without direct exposure to clear-text values.
04
Enable controlled unprotection and auditability
When approved users, systems, or workflows require access to original values, Protegrity supports controlled unprotection at the destination or within authorized consuming applications. Protection and access events are logged to support audit readiness, compliance reporting, and centralized governance.
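The four journey steps above can be sketched end to end, assuming a toy reversible transform in place of real tokenization and a simple in-memory audit trail; the role names, function names, and log format are illustrative, not the product's behavior.

```python
AUDIT_LOG: list[tuple[str, str]] = []  # in-memory stand-in for centralized audit

def protect(value: str) -> str:
    """Steps 1-2: protect in flight; the toy transform stands in for tokenization."""
    AUDIT_LOG.append(("protect", "ok"))
    return "TOK_" + value[::-1]

def deliver(record: dict) -> dict:
    """Step 3: downstream systems only ever receive the protected form."""
    return dict(record)

def unprotect(token: str, role: str) -> str:
    """Step 4: controlled unprotection for authorized roles only, always logged."""
    if role != "authorized":
        AUDIT_LOG.append(("unprotect_denied", role))
        raise PermissionError("role not permitted to unprotect")
    AUDIT_LOG.append(("unprotect", role))
    return token[len("TOK_"):][::-1]

token = protect("4111")
landed = deliver({"card": token})
original = unprotect(landed["card"], "authorized")
```

Every protect and unprotect event, including denials, lands in the audit trail, which is the raw material for the compliance reporting the step above describes.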
Use Cases
See how organizations use Protegrity + StreamSets to protect sensitive data in motion—so teams can modernize streaming, ingestion, and cloud data pipelines without exposing raw values or slowing throughput.
Finance
Protecting payment and customer data before it reaches cloud and analytics platforms.
Challenge
Financial institutions often move payment, customer, and transaction data through streaming and ingestion pipelines into lakes, warehouses, and fraud-analysis environments. The challenge is securing sensitive values in motion without adding latency, external dependencies, or separate protection steps that slow delivery.
Solution
Protegrity applies vaultless tokenization and policy-based protection directly inside StreamSets pipelines, so sensitive fields are protected before they land in downstream systems. This allows the same flow to support analytics and reporting use cases while reducing exposure of clear-text PCI and PII data.
Result
Organizations can move sensitive financial data into modern cloud and analytics environments with stronger protection by default, while preserving pipeline throughput and reducing the risk of unprotected data moving across systems.
Healthcare Payers
Protecting PHI in motion across high-volume data pipelines and analytics feeds.
Challenge
A healthcare organization needed to move sensitive patient and telemetry data through StreamSets into downstream analytics environments while maintaining HIPAA-aligned controls. The challenge was protecting PHI in motion without introducing bottlenecks or increasing the complexity of the pipeline architecture.
Solution
Protegrity was embedded directly within StreamSets using AP Java, enabling in-flight tokenization and masking as data moved through active pipelines. Policies were managed centrally in ESA and enforced locally on execution nodes, allowing sensitive health data to be secured before it ever reached target platforms.
Result
The organization validated that sensitive data could be de-identified in motion while remaining analytically usable downstream. Teams preserved pipeline performance, improved compliance posture, and gained confidence in using StreamSets for governed, high-speed data movement.
DEPLOYMENT
Deploy Protegrity inside StreamSets so protection runs in the active dataflow—across on-prem, cloud, edge, and Kubernetes environments—without introducing external bottlenecks or post-processing controls.
Embedded AP Java in pipeline execution
Custom processors and scripting evaluators
High-speed ingestion and streaming pipelines
Cloud, edge, and Kubernetes deployment patterns
RESOURCES
Comprehensive documentation and guides, with information for both developers and non-developers.
Docs Center
Explore product documentation, policy guidance, and implementation patterns for tokenization, masking, encryption, and AP Java-based protection across streaming and pipeline environments.
IBM StreamSets Documentation
Official product documentation for StreamSets Control Hub and Data Collector.
Frequently Asked Questions
Here are five common questions about the integration, deployment, and features of the StreamSets and Protegrity solution, with their answers:
How does Protegrity integrate with StreamSets?
Protegrity integrates with StreamSets by embedding protection directly into the active pipeline using AP Java. In practice, that means sensitive fields can be tokenized, masked, or encrypted while records are being processed in flight, rather than waiting until after the data lands in a downstream platform. StreamSets Control Hub is used to design and monitor pipelines, while Data Collector engines execute the data processing.
Which StreamSets components are involved in the integration?
The main StreamSets components for this integration are Control Hub and Data Collector. Control Hub provides the web-based interface for building, deploying, and monitoring pipelines, while Data Collector is the engine that runs the pipelines and processes records. StreamSets also supports stage libraries and scripting/evaluator stages, which makes it possible to embed protection logic directly into the flow where it is needed.
Is sensitive data protected before it reaches downstream destinations?
Yes. In this integration pattern, protection is applied in flight within the StreamSets pipeline before data is written to a downstream target. That allows teams to deliver already-governed data into destinations such as cloud object storage, warehouses, or analytics environments, reducing the risk of clear-text sensitive data landing unprotected. IBM StreamSets is designed to process streaming pipelines continuously as data becomes available, which aligns well with this in-motion protection model.
Can the integration run in on-premise, cloud, and Kubernetes environments?
Yes. IBM StreamSets supports self-managed environments, cloud-based deployments, and Kubernetes-based deployment patterns for engines. That means the same integration approach can be used across on-prem, hybrid, and cloud environments, as long as the pipeline runtime includes the required Protegrity components and policy connectivity.
Do teams need to redesign existing pipelines to add protection?
Not necessarily. The goal of this integration is to place protection logic inside the running pipeline so data engineers can apply protection where it makes architectural sense, rather than relying on a completely separate downstream security step. StreamSets already supports reusable stages, stage libraries, and scripting evaluators, which helps teams introduce protection into existing flows without redesigning the whole platform.
See the Protegrity platform in action
Accelerate data access and turn data security into a competitive advantage with Protegrity’s uniquely data-centric approach to data protection.
Get an online or custom live demo.