Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Automatic Transliteration Can Help Alexa Find Data Across Language Barriers

Alexa Blog: “As Alexa-enabled devices continue to expand into new countries, finding information across languages that use different scripts becomes a more pressing challenge. For example, a Japanese music catalogue may contain names written in English or the various scripts used in Japanese: Kanji, Katakana, or Hiragana. When an Alexa customer, from anywhere in the world, asks for a certain song, album, or artist, we could have a mismatch between Alexa’s transcription of the request and the script used in the corresponding catalogue.

To address this problem, we developed a machine-learned multilingual named-entity transliteration system. Named-entity transliteration is the process of converting a name from one language script to another. We describe the design challenges of building such a system in a paper we are presenting this month at the 27th International Conference on Computational Linguistics (COLING 2018).

The first challenge is obtaining a large dataset that contains name pairs in different languages. Since we could not find a publicly available dataset that satisfied our needs, we created a new dataset based on Wikidata, a central knowledge base for Wikipedia and other Wikimedia projects. We have released our dataset online, together with our code, under a Creative Commons license.

The Wikidata page for a given person will usually list versions of his or her name in multiple languages. We automatically collected all available pairings of English versions of names with Japanese, Hebrew, Arabic, or Russian versions. We then applied a few heuristics to filter out noisy pairs, which we detail in the paper. (We initially collected data on titles of works as well, but they too frequently involved translation, not just transliteration.)

In most names, the pronunciation of the last name is independent of the first or middle names. So it makes sense to train a transliteration system on independent pairs of first names, last names, and so on. Wikidata doesn’t include separate tags for first, middle, and last names, but there are systematic correspondences between the positions of names in different transliterations. So we wrote some scripts that use those correspondences to extract pairs of one-name transliterations. For example, the English/Russian Wikidata label pair [“Amy Winehouse”, “Эми Уайнхаус”] would produce two data instances in our training set: [“Amy”, “Эми”] and [“Winehouse”, “Уайнхаус”]. The result was a dataset containing almost 400,000 one-name pairs.

We then used our dataset to train several machine-learning systems, employing both traditional approaches and more recent neural approaches that have yielded strong results on machine translation tasks. We achieved the best results using the Transformer, a neural-network architecture that dispenses with some of the complexities of convolutional or recurrent networks and instead relies on attention mechanisms, which focus the network on particular aspects of the data passing through it…”
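To make the position-based pairing described in the quote concrete, here is a minimal Python sketch. It is not the authors' released code: the function name `split_name_pair` is hypothetical, and the rule it applies (split each label on whitespace and pair tokens by position only when the token counts match) is an assumption standing in for the heuristics the paper details.

```python
def split_name_pair(source_label, target_label):
    """Pair the tokens of two name labels by position.

    Assumption: labels whose token counts differ are treated as noisy
    and skipped. The actual filtering heuristics are in the paper.
    """
    source_tokens = source_label.split()
    target_tokens = target_label.split()
    if len(source_tokens) != len(target_tokens):
        return []  # no reliable positional correspondence
    return list(zip(source_tokens, target_tokens))


# The English/Russian label pair quoted in the post.
pairs = split_name_pair("Amy Winehouse", "Эми Уайнхаус")
print(pairs)  # [('Amy', 'Эми'), ('Winehouse', 'Уайнхаус')]
```

Whitespace splitting would not carry over to a script such as Japanese, where names are often written without spaces, so a fuller version would likely need per-language handling.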
