Modernizing lexicography with technology

Leturia Azkarate, Igor

Computer scientist and researcher

Elhuyar Hizkuntza eta Teknologia

In dictionary making, as in almost every other activity, technology has brought profound changes in recent years. We have gone from taking paper as both starting point and goal, with a great deal of manual work, to using electronic texts and corpora, automating parts of the process and publishing on digital media. In preparing the Elhuyar dictionaries we too have taken this step towards modernization, with the help of language technologies.
New website of the Elhuyar dictionary. - Ed.

Language and Technology is one of Elhuyar's four main departments. It is subdivided into translation services, lexicography and language technologies. Language technologies are many and are useful in many fields. We research, develop and market technologies for many of those fields but, as is natural, we work especially on those that are useful to other areas of Elhuyar. For translation services, for example, we work on machine translation and translation memory technologies that can provide a competitive advantage, and we also work on many technologies of interest to lexicography.

Facilitating the work process

One of the tasks in compiling a dictionary is selecting the words. We have developed tools to support this task: applied to text corpora, they combine linguistic and statistical techniques to extract the most significant words, terms or locutions.

One of them is Erauzterm. Given a corpus of Basque texts specialized in a particular field, Erauzterm detects the terms that appear in it. Like any automatic tool, it is not perfect, but it has an interface for manually reviewing the results.
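The statistical side of term detection can be illustrated with a minimal sketch. It is not Erauzterm's actual method, which the article does not describe in detail; it only assumes that a candidate term is a word that is relatively much more frequent in the specialized corpus than in a general one. The function names, thresholds and sample texts are hypothetical, and a real system for Basque would also need lemmatization and multiword handling.

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())

def candidate_terms(domain_text, general_text, min_count=2, min_ratio=2.0):
    """Rank words that are relatively more frequent in the domain corpus."""
    domain = Counter(tokenize(domain_text))
    general = Counter(tokenize(general_text))
    domain_total = sum(domain.values())
    general_total = sum(general.values())
    scored = []
    for word, count in domain.items():
        if count < min_count:
            continue  # too rare to judge
        domain_rate = count / domain_total
        # Add-one smoothing so words unseen in the general corpus do not divide by zero.
        general_rate = (general[word] + 1) / (general_total + 1)
        ratio = domain_rate / general_rate
        if ratio >= min_ratio:
            scored.append((word, ratio))
    return sorted(scored, key=lambda item: item[1], reverse=True)

domain_text = "zelula mintza zelula nukleoa zelula mintza eta eta"
general_text = "eta etxea eta mendia eta itsasoa eta kalea eta eta"
for word, ratio in candidate_terms(domain_text, general_text):
    print(word, round(ratio, 2))
```

The ranked list is only a set of candidates, which is why a manual review step like the one Erauzterm offers remains necessary.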

ElexBI does something similar, but bilingually. From a parallel corpus (a collection of texts that are translations of each other, aligned at the sentence level), it extracts term equivalences, that is, pairs of terms in the two languages. This tool has been made available as a web service under the name Itzulterm, and the Vocational Training dictionary was compiled with it.
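To give an idea of how equivalences can be drawn from sentence-aligned text, here is a small sketch based on the Dice coefficient over co-occurrence counts. This is a common textbook association measure, used here as an assumption; the article does not say which measure ElexBI uses. The function name and the sample sentence pairs are hypothetical.

```python
from collections import Counter
from itertools import product

def dice_equivalents(aligned_pairs, min_cooccurrence=2):
    """For each source word, keep the target word with the highest Dice score."""
    source_freq, target_freq, pair_freq = Counter(), Counter(), Counter()
    for source_sentence, target_sentence in aligned_pairs:
        # Count each word once per sentence pair.
        source_words = set(source_sentence.lower().split())
        target_words = set(target_sentence.lower().split())
        source_freq.update(source_words)
        target_freq.update(target_words)
        pair_freq.update(product(source_words, target_words))
    best = {}
    for (source_word, target_word), together in pair_freq.items():
        if together < min_cooccurrence:
            continue  # too little evidence
        score = 2 * together / (source_freq[source_word] + target_freq[target_word])
        if source_word not in best or score > best[source_word][1]:
            best[source_word] = (target_word, score)
    return best

pairs = [
    ("etxe handia", "casa grande"),
    ("etxe berria", "casa nueva"),
    ("mendi handia", "monte grande"),
]
print(dice_equivalents(pairs))
```

With so little data only the words seen at least twice get an equivalent; real parallel corpora contain many thousands of sentence pairs, and real terms are often multiword units rather than single words.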

AzerHitz does the same as ElexBI, but instead of taking parallel corpora as raw material it uses comparable corpora, since parallel corpora are not as plentiful or as large as one would like, especially in specialized fields or for certain language pairs. Comparable corpora are collections of texts in several languages that deal with the same subject without being translations of each other. AzerHitz is able to extract bilingual terminology from this type of corpus.
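Since the texts of a comparable corpus are not aligned, a different kind of evidence is needed. One widely used idea, sketched here as an assumption and not as a description of AzerHitz itself, is that a word and its translation tend to appear in similar contexts: the contexts of the source word are mapped into the target language with a small seed dictionary and compared with the contexts of each candidate. All names, the seed dictionary and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def context_vector(sentences, word, seed=None):
    """Count words sharing a sentence with `word`; map them through `seed` if given."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if word not in tokens:
            continue
        for other in tokens:
            if other == word:
                continue
            if seed is None:
                vector[other] += 1
            elif other in seed:
                vector[seed[other]] += 1  # keep only words the seed can translate
    return vector

def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_translation(word, source_sentences, target_sentences, seed, candidates):
    source_vector = context_vector(source_sentences, word, seed)
    return max(candidates,
               key=lambda c: cosine(source_vector, context_vector(target_sentences, c)))

basque = ["medikuak gaixoa ospitalean artatu du", "gaixoa ospitalean dago"]
spanish = ["el doctor atiende al paciente en el hospital",
           "el paciente sale del hospital",
           "el coche rojo en la calle"]
seed = {"ospitalean": "hospital", "medikuak": "doctor"}
print(best_translation("gaixoa", basque, spanish, seed, ["paciente", "coche"]))
```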

Another tool for extracting lexicographic information from texts is Konemat. It extracts word combinations, collocations, phraseology and the like from Basque texts. At the moment it extracts the most common combinations involving nouns and adjectives.
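A simple way to picture how frequent word combinations can be spotted is to score adjacent word pairs by pointwise mutual information (PMI), which rewards pairs that occur together more often than their individual frequencies would predict. This is only an illustrative sketch: the article does not state which measures Konemat uses, and the function name and sample text are hypothetical. A real tool would also use part-of-speech information to keep only the patterns of interest.

```python
import math
from collections import Counter

def collocations_by_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)
    scored = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scored.append(((first, second), math.log2(p_pair / (p_first * p_second))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

tokens = "adar jo zuen eta adar jo du eta lo egin du".split()
for pair, pmi in collocations_by_pmi(tokens):
    print(pair, round(pmi, 2))
```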

We also have the PopLex tool, which creates new dictionaries from two existing dictionaries that share a bridge language. Five Basque dictionaries created with it have been published online in the portal of automatically built dictionaries, as we reported in July.
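The basic idea of going through a bridge language can be sketched in a few lines: every translation of a source word into the bridge language is looked up in the second dictionary, and the results are collected. This is a naive sketch under stated assumptions, not the tool's actual procedure; the function name and the toy Basque-English and English-French entries are hypothetical.

```python
def pivot_dictionary(source_to_pivot, pivot_to_target):
    """Build a source-to-target dictionary by chaining through a pivot language."""
    result = {}
    for source_word, pivot_words in source_to_pivot.items():
        candidates = set()
        for pivot_word in pivot_words:
            candidates.update(pivot_to_target.get(pivot_word, ()))
        if candidates:
            result[source_word] = sorted(candidates)
    return result

basque_english = {"etxe": ["house", "home"], "banku": ["bank"]}
english_french = {
    "house": ["maison"],
    "home": ["maison", "foyer"],
    "bank": ["banque", "rive"],
}
print(pivot_dictionary(basque_english, english_french))
```

Naive chaining like this produces wrong pairs whenever the bridge word is ambiguous (here English "bank" brings in "rive", a river bank), so some filtering of the candidates is needed before a dictionary built this way can be published.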

Corpora, the raw material for the work

As you have seen, many of these technologies need corpora, which is why building digital corpora is one of the areas we work on most. Together with the IXA Group of the UPV/EHU we created the Corpus of Science and Technology; for the Eroski Foundation we built the multilingual corpus of the magazine Consumer; and for Euskaltzaindia we are building the Corpus of the Lexicon Observatory together with the IXA Group and UZEI.

However, since producing corpora is expensive, in recent years we have been creating tools that use the web to build them. To make it possible to query the Internet as if it were a corpus, a few years ago we launched the online service Fere Eus. We also have tools that automatically build large general corpora, specialized corpora, parallel corpora and comparable corpora from the web. A large general Basque corpus built automatically from the web, a large Basque-Spanish parallel corpus, and the word combinations extracted from the general corpus with the aforementioned tool can be consulted in the Portal de Corpus Web, as we mentioned in February.

New website of Elhuyar Hiztegiak

In addition to facilitating the dictionary-making process and providing electronic corpora as raw material, technology in general and language technologies in particular can greatly improve the experience of dictionary users. Since dictionaries began to be put on the web a few years ago, most have offered a search box for quick lookups instead of making the user browse an alphabetically ordered list (although some still simply put PDFs of the dictionaries online). But the results shown after the search are similar to what paper dictionaries offer. In the new website of Elhuyar Hiztegiak (http://hiztegiak.elhuyar.org/), which hosts the Basque-Spanish and Basque-French dictionaries, we wanted to go further and offer more advanced options.

For example, you can hear how the word you searched for is pronounced, in two ways: through recordings made by users of the Forvo website, or through TTS (text-to-speech, or speech synthesis) technology, that is, a synthetic voice generated by the computer. The TTS system we use was developed by the Consultab Group of the UPV, and we market it.

In addition, as we type a word in the search box, the site shows the list of words that begin with what we have typed so far, which saves us from writing the whole word and reduces the chance of misspelling it.
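This kind of search-as-you-type suggestion can be sketched with a binary search over an alphabetically sorted headword list. It is only an illustration of the idea; how the Elhuyar website implements it is not described in the article, and the function name, the `limit` parameter and the sample headwords are hypothetical.

```python
import bisect

def suggest(sorted_headwords, prefix, limit=10):
    """Return up to `limit` headwords that start with `prefix`."""
    # All words with the given prefix are contiguous in a sorted list,
    # so a binary search finds where they begin.
    start = bisect.bisect_left(sorted_headwords, prefix)
    suggestions = []
    for word in sorted_headwords[start:]:
        if not word.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(word)
    return suggestions

headwords = sorted(["etxe", "etxeko", "etxekoandre", "euri", "eguzki", "hitz", "hiztegi"])
print(suggest(headwords, "etx"))
```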

As for usage examples, in addition to the usual ones included in the dictionary by its authors, the new website can show examples found in the aforementioned Basque-Spanish parallel corpus extracted from the web. These examples are not only in the target language; they are pairs of sentences that are translations of each other.

Moreover, besides the usual search among source-language entries, the site can also search among target-language equivalents. In the future we also want to offer the possibility of searching within the examples.

Options are also offered to personalize the dictionary, such as saving recent searches or keeping some searches in a list of personal favorites.

These are the innovations we have published so far, but we plan to add more little by little. For example: going directly to a search of the aforementioned word combinations, showing results from other dictionaries and corpora, suggesting the correct word when one is misspelled, showing the declensions or inflected forms of the word searched for…

And more in the future!

In the coming years we also want to bring more technology to our lexicography section. We continue to work on corpus building, improving our automatic corpus construction tools and creating new ones, with which we are building more corpora, larger corpora and corpora for new language pairs. Our intention is to put these new corpora online in the Portal de Corpus Web as well.

But the main novelty will come from the automation of lexicography. Most of the technologies of this kind that we have worked on so far extract words and terms for the dictionary, along with their equivalents, from corpora; but a dictionary also needs definitions, senses and examples. We have now started working on how to obtain these automatically as well, that is, on the automatic extraction of definitions, senses and suitable examples from texts and/or websites.

By continuing to exploit the language technologies we already have and developing the ones we have just started on, we want the Elhuyar dictionaries to be a key element in keeping Basque in contact with other languages in an increasingly globalized world.
Sponsors
Department of Industry, Trade and Tourism of the Basque Government