Unicode: Name a Language, And We’ll Protect the Data

September 22, 2021

Data is international currency. No matter the language, kinetic data empowers businesses around the world to develop products and services, and to answer customers’ needs.

While data is indeed valuable in many ways, it must be made worthless in the hands of cybercriminals. That means data protection should also be universal. No matter the language, data protection should render data useless if it is breached in a hack or ransomware attack.

With the recent release of version 8.1 of the Protegrity Data Protection Platform, we’re improving how we tokenize the languages of the world. Gen 2 Unicode Tokenization is our new language-preserving capability for tokenization. It allows organizations to safeguard their data in English, French, German, Russian and several other languages. It’s an upgrade that our French customers will hopefully say is “parfait,” or “perfect.”

Providing Safe Passage

Unicode is an international standard for the consistent encoding, representation, and handling of text expressed in the world’s writing systems, both modern and ancient. It provides the basis for processing, storage, and interchange of text data in any language in all modern software and information technology protocols, according to the Unicode Consortium, the non-profit that maintains the Unicode Standard. It includes technical symbols, punctuation marks, and many other characters used in writing text, including those lovable (or, depending on your taste for flair, disdained) emojis that color digital conversations.

With Gen 2 Unicode Tokenization in v8.1, Protegrity is bringing all the advantages of our industry-leading tokenization solution to all text made up of one- and two-byte Unicode characters. These improved capabilities preserve both the character encoding and the length of the original value, without performance penalties, allowing organizations to easily tokenize text from all Western alphabets.
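To make the byte counts concrete, here is a small sketch of how many bytes individual characters occupy. The article does not name the encoding behind its byte counts, so this assumes UTF-8; the helper name `utf8_width` and the sample characters are chosen purely for illustration.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8 (assumed encoding)."""
    return len(ch.encode("utf-8"))


# Illustrative sample characters, not a list taken from the product.
samples = {
    "a": "basic Latin",
    "ü": "Latin with umlaut",
    "я": "Cyrillic",
    "ω": "Greek",
    "漢": "CJK ideograph",
    "😀": "emoji",
}

for ch, label in samples.items():
    print(f"{ch} ({label}): {utf8_width(ch)} byte(s)")
```

Under that assumption, the Latin, Cyrillic and Greek samples fit in one or two bytes, while the CJK ideograph and the emoji need three and four bytes respectively.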

Gen 2 Unicode Tokenization enables the tokenization of languages such as Italian, French, German, Russian, Greek, Turkish, and Hindi. The tokenization is “language-aware”: if the input is Russian, for example, the token output will also consist of Russian characters. Those previously unprotected Cyrillic characters now have safe passage on their tokenization journey.
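The idea of language-aware, length-preserving output can be sketched with a toy substitution. This is not Protegrity's algorithm and it is not secure; the alphabets, the key handling, and the name `toy_tokenize` are all assumptions made for illustration. Each character is replaced by another character from the same alphabet, so Cyrillic input yields Cyrillic output of the same length.

```python
import hashlib
import hmac

# Hypothetical alphabets for illustration only; a real product defines its own.
ALPHABETS = {
    "latin_lower": "abcdefghijklmnopqrstuvwxyz",
    "latin_upper": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "cyrillic_lower": "абвгдежзийклмнопрстуфхцчшщъыьэюя",
    "cyrillic_upper": "АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
}


def _shift(key: bytes, position: int, size: int) -> int:
    """Derive a keyed, position-dependent shift in the range [0, size)."""
    digest = hmac.new(key, str(position).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % size


def toy_tokenize(text: str, key: bytes, decode: bool = False) -> str:
    """Toy, NON-SECURE substitution that keeps each character in its own alphabet."""
    out = []
    for i, ch in enumerate(text):
        for alphabet in ALPHABETS.values():
            idx = alphabet.find(ch)
            if idx != -1:
                shift = _shift(key, i, len(alphabet))
                if decode:
                    shift = -shift
                out.append(alphabet[(idx + shift) % len(alphabet)])
                break
        else:
            # Characters outside every alphabet pass through unchanged.
            out.append(ch)
    return "".join(out)


token = toy_tokenize("Привет", b"demo-key")
print(token, len(token))
print(toy_tokenize(token, b"demo-key", decode=True))
```

The token has the same number of characters as the input and stays within the Cyrillic alphabet, and decoding with the same key restores the original. Note the pass-through branch: anything outside the configured alphabets is emitted as-is.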

Indeed, other methods of Unicode tokenization don’t always keep data safe on its journeys around cloud applications, databases, and on-premises systems. These methods can leak data.

Say you have the English alphabet defined for tokenization but need to enter a German name. Some traditional tokenization settings would not cover an umlaut, for instance, and the letter would be left unprotected. Gen 2 Unicode Tokenization allows our customers to use both languages in tokenization, giving the umlauted character and many other unique characters the safe voyage that organizations expect when they exchange and use data around the world.
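The umlaut gap described above can be illustrated by checking which letters of a value fall outside a configured alphabet. This is a minimal sketch: the alphabet sets and the helper `uncovered_letters` are hypothetical and do not reflect how any particular product stores its settings.

```python
import string

# Hypothetical alphabet definitions for illustration.
BASIC_LATIN = set(string.ascii_letters)
LATIN_PLUS_GERMAN = BASIC_LATIN | set("äöüÄÖÜß")


def uncovered_letters(text: str, alphabet: set) -> list:
    """Return the letters in `text` that the alphabet does not cover.

    In a tokenizer that only substitutes characters from its alphabet,
    these letters would pass through in the clear.
    """
    return [ch for ch in text if ch.isalpha() and ch not in alphabet]


name = "Jürgen Müller"
print(uncovered_letters(name, BASIC_LATIN))        # letters left unprotected
print(uncovered_letters(name, LATIN_PLUS_GERMAN))  # nothing left uncovered
```

With only the basic Latin letters configured, both occurrences of ü are reported as uncovered; once the German additions are included, the list is empty.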

More Tokenized Languages to Come

We’re not done translating, so to speak. For our next iteration of Unicode tokenization, which we hope to release later this year, we’re aiming to support the three- and four-byte Unicode characters used to write Chinese, Japanese and Korean.

The ability to preserve language when tokenizing data gives organizations confidence to share secure data around the world. With more languages to choose from, they can expand markets and reach out to customers and partners regardless of geography. 

As our Italian customers would say, “la flessibilità fa bene agli affari.” Or, flexibility is good for business.