Machine Translation and Confidentiality
Confidentiality is a rarely discussed problem in machine translation. Once text is typed into a translation engine such as Babylon, Google Translate, or Bing, who owns the translation? When working with a language service provider, it is standard practice to sign a Non-Disclosure Agreement (NDA) to guarantee the privacy of the information being shared and produced. However, most websites that offer automatic translation stipulate that, once text is uploaded to their engine, they reserve certain rights to use, analyze, or publish the source or translated text. (And that is before considering that many websites and third-party programs send data over unencrypted HTTP by default when communicating with a cloud-based translation engine.) So, this seems like a pretty big issue. What does it mean for us in the translation industry?
What is Machine Translation?
Machine translation (MT), as opposed to computer-assisted translation tools such as digital glossaries, refers to a form of artificial intelligence that translates content from one language to another. MT systems fall into three major approaches, though there is significant overlap: rule-based, statistical, and neural.
Rule-based approaches try to map linguistic universals (i.e., grammar) between languages, but they fail to handle the complexity of syntax, semantics, and idiom. Statistical approaches, the dominant model for engines such as Microsoft's and Google's until neural models overtook them, essentially take large amounts of bilingual text (such as UN or EU policy documents) and make mathematical predictions based on the correspondences between them, with no attempt to “model” the language’s grammar itself. Neural machine translation (NMT) builds on statistical models but uses breakthroughs in AI, namely deep learning models, to accomplish similar or better results, often with more fluent output. Each has its own strengths and weaknesses, but all of them require a human linguist to review and finalize their output to guarantee quality, because all of them can produce wildly inaccurate or nonsensical results, depending on the particulars of the text in question.
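To make the statistical idea concrete, here is a toy sketch in Python. It is not how any production engine works: the four-sentence corpus, the function names, and the choice of the Dice coefficient as the scoring measure are all assumptions made for illustration. It shows only the core intuition, that word correspondences can be guessed from co-occurrence counts in bilingual text without any grammar rules.

```python
from collections import Counter
from itertools import product

# Hypothetical miniature English-German parallel corpus (illustration only).
PARALLEL_CORPUS = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a house", "ein haus"),
    ("a book", "ein buch"),
]


def train(corpus):
    """Count how often each word, and each source/target word pair, appears."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        # Every source word co-occurs with every target word in the sentence pair.
        pair_counts.update(product(src_words, tgt_words))
    return src_counts, tgt_counts, pair_counts


def best_match(word, model):
    """Return the target word with the highest Dice score, or None if unseen."""
    src_counts, tgt_counts, pair_counts = model
    scores = {
        tgt: 2 * pair_counts[(word, tgt)] / (src_counts[word] + tgt_counts[tgt])
        for tgt in tgt_counts
        if pair_counts[(word, tgt)]
    }
    return max(scores, key=scores.get) if scores else None


MODEL = train(PARALLEL_CORPUS)
print(best_match("house", MODEL))
print(best_match("the", MODEL))
```

Real statistical engines work with phrases rather than single words, use far larger corpora, and add a language model for the target side, but the principle of predicting from observed correspondences is the same.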
We’ve written about MT in passing a few times here: talking about its relationship to translator productivity, the role of translation, and what we offer. As a growing sub-field of computational linguistics and a key innovation in the localization industry, machine translation is a prominent topic in almost any forum dealing with translation. While MT technology and theory have evolved greatly since the 1950s, they still have major limitations. Most of these are well known: lack of accuracy, lack of contextual specificity, and difficulty with highly idiomatic or morphologically complex languages.
Why Use MT at All?
So why use it at all? The main advantage it has over pure human translation is speed. You can insert text of almost any length and receive near-instantaneous output in your target language. This is a far cry, especially for high-volume work, from the days- or weeks-long process of translation done entirely by humans. So, when there is time pressure, automatic translation can seem like the intuitive choice for large jobs. Not surprisingly, then, two areas that have seen growing interest in the benefits of MT are clinical trials (pharmaceutical industry) and e-discovery (a legal procedure).
Pharmaceutical companies and law firms are two of the heaviest users of translation services. Both fields are frequently confronted with large quantities of multilingual material that needs to be processed quickly. For the life science industry and healthcare translations, there is a tight regulatory framework which limits the use of MT since the quality of all documents is strictly monitored and reviewed. This means that inaccurate translation can impact the approval process and – down the road – a product’s time to market. There are also high standards in the legal world, but there is some flexibility in the discovery process, since discovery is an internal process wherein each side of a lawsuit sifts through information that may be useful in the case. Only the relevant pieces of evidence need to be subjected to a full, human translation to be used in court. This seems like an ideal scenario for automatic translation services – you only need to know the “gist” of most of these documents to decide what warrants further investigation. But what about data privacy? If private documents made available for discovery were leaked inadvertently online, that could be very bad news. Let’s look at two of the biggest players to see where their policies stand on machine translation and confidentiality.
A Look at Policies: Google and Microsoft
Here’s what Google has to say generally about their products and services, which applies to the public Google Translate interface:
When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones. This license continues even if you stop using our Services (for example, for a business listing you have added to Google Maps). Some Services may offer you ways to access and remove content that has been provided to that Service. Also, in some of our Services, there are terms or settings that narrow the scope of our use of the content submitted in those Services. Make sure you have the necessary rights to grant us this license for any content that you submit to our Services.
Not very comforting. There are no guarantees that content put into Google Translate would not end up reproduced or shared in another context down the line. In fact, the terms say just the opposite.
Although few people think of Microsoft Translator itself as a go-to translation resource, this engine powers a multitude of products, such as Bing, Office, and Skype. Microsoft says this about its Translator products:
Microsoft Translator (which includes apps for Android, iOS, Windows, Translator Hub, Translator for Bing, and Translator for Microsoft Edge, collectively “Translator”) processes the text, image, and speech data you submit, as well as device and usage data. We use this data to provide Translator, personalize your experiences, and improve our products. Microsoft has implemented business and technical measures designed to help de-identify the data you submit to Translator. For example, when we randomly sample text to improve Translator, we delete identifiers and certain text, such as email addresses and some number sequences, detected in the sample that could contain personal data.
While this is not as far-reaching as Google’s broad policy, it does mean that Microsoft uses any text or speech you provide to train its engine and to develop further technologies or products. Depending on how you feel about the “de-identify” process, you may or may not feel that your content is being kept totally confidential.
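For a sense of what this kind of de-identification might look like, and of its limits, here is a minimal Python sketch. It is not Microsoft's actual process, which is not public in detail: the regular expressions, the placeholder tokens, and the `redact` function are all hypothetical. It simply removes email addresses and longer number sequences from a piece of text.

```python
import re

# Assumed patterns, for illustration only; real systems are far more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A run of six or more digits, optionally separated by spaces or hyphens.
NUMBER_RE = re.compile(r"\d(?:[ -]?\d){5,}")


def redact(text):
    """Replace email addresses and long digit sequences with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return NUMBER_RE.sub("[NUMBER]", text)


sample = "Contact jane.doe@example.com or 555-123-4567 about case 42."
print(redact(sample))
```

Note what such pattern matching leaves behind: names, addresses, case details, and anything else that identifies a person or a matter through context rather than format. That gap is why the “de-identify” wording deserves a careful read.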
Just as Microsoft Translator powers a variety of translation programs across platforms, Google also offers a Cloud Translation API for integration into other products, which has much stricter controls than its public site and app. For the Cloud Translation API, Google states:
Google does not use any of your content for any purpose except to provide you with the Cloud Translation API service.
When you send text to Cloud Translation API, we must store that text for a short period of time in order to perform the translation and return the results to you. The stored text is typically deleted after 7 days, but can be temporarily stored up to 14 days in the event of a service failure.
Google does not use the content you send to train and improve our Google Translation features.
So, this suggests that the Google cloud-based API is a much more secure service, which does protect your content – and importantly does not use it to train its NMT for public translations via Google Translate.
Conclusion for machine translation and confidentiality?
When choosing translation services, there is a lot to consider: price, quality, language support, project management, capacity, goals, and workflow. With MT becoming a necessary part of an increasingly connected and digitized market, it behooves everyone involved to also consider data use and privacy. We’ve seen that machine translation and confidentiality are not mutually exclusive, but are also not guaranteed to go hand in hand. So, when evaluating an agency or product, ask about their privacy guidelines and what technology is being used. And maybe think twice before popping that paragraph into Google Translate to see if it’s relevant to your study or court case! Better to be safe than sorry and protect both your data and your client’s.