Language Weaver’s statistically based translation software represents an important advance in the state of the art in automated translation. The software applies statistical techniques from cryptography, using algorithms that learn to translate automatically from existing translations. Because the software learns directly from human translations, what it learns is up to date, appropriate, and idiomatic.
Creating new language pairs and tuning existing language pairs to a specific domain involves a process called “training.” For statistically based translation software, training material consists of previously translated data. The translation system learns the statistical relationships between two languages based on the samples that are fed into the learning system. Because it looks for patterns, the more samples the system sees, the stronger the statistical relationships become.
Translated material for training may come in any number of formats, as shown on the left side of Figure 1 below. It ranges from translation memory data and glossaries to translated archives and target-language texts that are available on intranets, hard drives, and websites.
Once translated data is collected, parallel documents (the original and its translation) are identified and aligned sentence by sentence to create a “Parallel Corpus.” The Language Weaver Learner processes this corpus and extracts statistical probabilities, patterns, and rules, which are called the “Translation Parameters” and “Language Model.” The Translation Parameters are used to find the most accurate translation, while the Language Model is used to find the most fluent translation. Both of these components are used to create a new language pair and become part of the delivered translation software for each language pair.
Figure 1: Language Pair Training and Development. The above figure outlines the process Language Weaver goes through to create new language pairs, or to train existing pairs for a specific domain. The LW Learner learns statistical patterns and relationships from human translations. It distills the patterns of language and translation observed across millions of words of pre-translated text. The resulting Translation Parameters and Language Model capture the statistical models of language and of the translation correspondences between languages.
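The training step described above can be illustrated with a small sketch. It is not Language Weaver's Learner, whose internals are not described here; it assumes a textbook stand-in, IBM Model 1 style expectation-maximization for the word-level Translation Parameters and an add-one smoothed bigram model for the Language Model. The toy corpus and the names `parallel_corpus`, `train_translation_parameters`, and `train_language_model` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy sentence-aligned parallel corpus (hypothetical data): (source, target) pairs.
parallel_corpus = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("une maison", "a house"),
]


def train_translation_parameters(corpus, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word).

    Assumption: IBM Model 1 style EM, used here only as an illustration.
    """
    pairs = [(src.split(), tgt.split()) for src, tgt in corpus]
    target_vocab = {word for _, tgt in pairs for word in tgt}
    # Start with no preference: every target word is equally likely.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for tw in tgt:
                # Share one unit of evidence for tw among the source words
                # of the same sentence, in proportion to the current estimates.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction
        # Re-estimate the probabilities from the fractional counts.
        t = defaultdict(
            float, {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
        )
    return t


def train_language_model(sentences):
    """Return prob(prev, cur): an add-one smoothed bigram probability."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1

    def prob(prev, cur):
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))

    return prob


translation_parameters = train_translation_parameters(parallel_corpus)
language_model = train_language_model([tgt for _, tgt in parallel_corpus])

# After training, "house" is the most probable translation of "maison",
# and the target-side word order "the house" outscores "house the".
print(translation_parameters[("house", "maison")])
print(language_model("the", "house"), language_model("house", "the"))
```

Even on three sentence pairs the estimates move toward the intuitive word correspondences, which is a small-scale view of why more training samples strengthen the statistical relationships.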
The user receives translation software that consists of the Language Weaver Decoder, Translation Parameters, and Language Model for the language(s) purchased (see Figure 2 below). New texts to be translated are submitted to the Decoder. Using the pre-populated probability tables in the Translation Parameters and Language Model that were created during the training process, the Decoder uses statistical algorithms to generate the highest-probability translation output.
Figure 2: Run-Time Process. The deliverable software consists of the Decoder (together with interfaces and control software components) and the Translation Parameters and Language Model for the language pair. When a user submits new texts to the Decoder, the Decoder consults the Translation Parameters and Language Model to complete the translation.
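The run-time step can be sketched in the same spirit. The example below is an assumption-laden illustration, not the Language Weaver Decoder: it exhaustively enumerates word choices and word orders for a short sentence and returns the candidate with the highest combined log-probability. The tables `TRANSLATION_PARAMETERS` and `LANGUAGE_MODEL`, the `UNSEEN` floor value, and every probability in them are hypothetical, hand-filled stand-ins for what training would produce.

```python
import math
from itertools import permutations, product

# Hypothetical Translation Parameters: P(target word | source word).
TRANSLATION_PARAMETERS = {
    "la": {"the": 0.9, "it": 0.1},
    "maison": {"house": 0.8, "home": 0.2},
    "bleue": {"blue": 0.9, "sad": 0.1},
}

# Hypothetical bigram Language Model: P(word | previous word).
LANGUAGE_MODEL = {
    ("<s>", "the"): 0.5,
    ("the", "blue"): 0.2,
    ("blue", "house"): 0.3,
    ("house", "</s>"): 0.4,
    ("the", "house"): 0.3,
    ("house", "blue"): 0.01,
    ("blue", "</s>"): 0.1,
}
UNSEEN = 1e-4  # floor probability for bigrams missing from the table


def language_model_logprob(words):
    """Fluency score: sum of log bigram probabilities over the sentence."""
    padded = ["<s>", *words, "</s>"]
    return sum(
        math.log(LANGUAGE_MODEL.get(bigram, UNSEEN))
        for bigram in zip(padded, padded[1:])
    )


def decode(source_sentence):
    """Return the highest-probability candidate translation.

    Brute force over word choices and orderings; only practical for toy
    inputs. Source words missing from the table raise KeyError.
    """
    source_words = source_sentence.split()
    options = [TRANSLATION_PARAMETERS[w].items() for w in source_words]
    best_score, best_translation = -math.inf, None
    for choice in product(*options):
        # Accuracy score: how well each chosen word translates its source word.
        accuracy = sum(math.log(p) for _, p in choice)
        words = [w for w, _ in choice]
        for order in permutations(words):
            score = accuracy + language_model_logprob(order)
            if score > best_score:
                best_score, best_translation = score, " ".join(order)
    return best_translation


print(decode("la maison bleue"))
```

In this sketch the Translation Parameters score "the blue house" and "the house blue" identically, since both use the same word choices; the Language Model alone separates them, which mirrors the division of labor between accuracy and fluency described for the two components.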