Appen is proud to announce that Dorota Iskra will be speaking at the 25th Meeting of Computational Linguistics in the Netherlands (CLIN25) at the City Campus of the University of Antwerp on February 5-6, 2015. Dorota’s presentation is entitled “Approach to Non-standardised Languages in Asian and African Markets.”
As a global provider of language resources and linguistic services, Appen often has to work with languages that have no established standard for orthography, vocabulary or grammar. Although this problem may affect some minority languages in Europe, its full extent becomes apparent only when entering Asian and African markets. Examples are Arabic, Pashto, Urdu and various Indian languages.
This presentation describes the problems we have encountered when moving to Asian and African languages and proposes a methodology for establishing internal standards for non-standardised languages. In the initial phase of a project, when creating orthographic transcriptions from speech, we end up with multiple spellings for the same word. Since there is no spell checker, these are all added to the dictionary. Our linguistic expert then checks the dictionary and decides which words should actually share one spelling. The result is a rough spell checker that can be used in the next phase of the project. Because it is far from comprehensive, an exception list is kept for words it does not cover. Over multiple iterations, the linguistic expert goes through this list and adds words to the dictionary.
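The iterative workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Appen's actual tooling: the class `RoughSpellChecker`, its methods `check` and `review`, and the romanised sample words are all hypothetical. Transcribed words that are not yet covered land on an exception list, and each expert review pass moves them into the dictionary, either as approved spellings or as variants mapped to an approved spelling.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RoughSpellChecker:
    """Hypothetical internal standard for one non-standardised language."""

    # Spellings the linguistic expert has approved.
    dictionary: set[str] = field(default_factory=set)
    # Variant spelling -> approved spelling, as decided by the expert.
    variants: dict[str, str] = field(default_factory=dict)
    # Words seen in transcripts that the checker does not cover yet.
    exceptions: set[str] = field(default_factory=set)

    def check(self, word: str) -> str | None:
        """Return the approved spelling, or None after logging an exception."""
        if word in self.dictionary:
            return word
        if word in self.variants:
            return self.variants[word]
        self.exceptions.add(word)
        return None

    def review(self, decisions: dict[str, str]) -> None:
        """Apply expert decisions mapping each reviewed word to its approved form."""
        for word, approved in decisions.items():
            self.dictionary.add(approved)
            if word != approved:
                self.variants[word] = approved
            self.exceptions.discard(word)


# Illustrative sample words (assumed, not taken from any real project data).
checker = RoughSpellChecker()
transcript = ["shukriya", "shukria", "kitab"]

# First pass: nothing is known yet, so every word goes to the exception list.
normalised = [checker.check(w) for w in transcript]

# Expert review: two spellings are merged, one word is accepted as-is.
checker.review({"shukriya": "shukriya", "shukria": "shukriya", "kitab": "kitab"})

# Second pass: the rough spell checker now normalises the same transcript.
second = [checker.check(w) for w in transcript]
```

In a real project the review step would be repeated for each new batch of transcripts, so the exception list shrinks as the dictionary grows.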
Language technology expects consistent data, which, in the absence of standards, can only be achieved through a systematic approach. The approach we propose has been developed and tested in a number of languages and with extensive volumes of data, not only by ourselves, but also by our clients who […]