OpenTrad, against the Tower of Babel

Galarraga Aiestaran, Ana

Elhuyar Zientzia

Basque, Catalan, Spanish and Galician: four languages that converge in one system, the OpenTrad machine translation system. It automatically translates texts and websites from Spanish into Basque, Galician and Catalan, and also translates texts in Galician and Catalan into Spanish. Moreover, the system has been developed as open source, so that we can understand one another without obstacles.
01/04/2006 | Galarraga Aiestaran, Ana | Elhuyar Zientzia Komunikazioa
(Photo: A. Galarraga)

El Periódico de Catalunya is published daily in two languages, Spanish and Catalan. It does not need twice as many staff as other newspapers to do so: its secret is an automatic translator. Journalists write the paper in Spanish and the machine translator then renders it in Catalan. The text passes through several proofreaders and is then ready to hit the streets alongside the Spanish edition.

El Periódico de Catalunya is a telling example of the value of automatic translators. Nor is the newspaper's translator the only one from Spanish to Catalan; there are many others. For example, the University of Alicante created interNOSTRUM for the Mediterranean Savings Bank. It translates in both directions, and anyone can now use it free of charge on the website of the same name, although it only accepts texts of up to 16,384 characters.

In the Spanish state there is also an automatic translator from Galician to Spanish, but it is a very closed and limited product. And what about Basque? So far, very little. The IXA group of the UPV/EHU Faculty of Computer Science was developing an automatic system for translating from English into Basque, but progress was slower than they wanted.

That was the situation two or three years ago. In 2004, however, the OpenTrad development project was launched. The researchers who developed interNOSTRUM already knew the IXA group, and Eleka Ingeniaritza Linguistikoa and IXA work together. They joined forces with teams doing similar work in Galicia and began to build an open-source automatic translator, thanks to a grant from the Ministry of Industry, Tourism and Commerce.

According to Iñaki Arantzabal of Eleka, they set objectives at two levels from the outset: "On the one hand, we wanted a good, fast, open-source automatic translator for the Galician-Spanish and Catalan-Spanish pairs and, on the other, a prototype to translate from Spanish into Basque. It should be noted that the starting point was not the same for all the languages: the Spanish-Catalan pair was quite advanced while, at the other extreme, almost everything remained to be done for automatic translation from Spanish into Basque."

Close languages, on the surface

Naturally, the distance between the languages has a great influence here. Spanish, Galician and Catalan are clearly much closer to one another than any of them is to Basque. Consequently, it is much easier to build a good translation system between Romance languages than one that involves Basque.

(Caption: As these examples show, automatic translators serve not only to translate texts but also to browse the web in the chosen language.)

For this reason, OpenTrad has two automatic translation engines: Apertium, for translation between Romance languages, and Matxin, for translation from Spanish into Basque.

Both are based on linguistic rules. There are several approaches to machine translation, but the main ones are those based on collections of previously translated texts, that is, corpora, and those based on linguistic rules (word order in the sentence, declension, verbs and so on).

Iñaki Alegría, from IXA, explained that "systems based on linguistic rules work in three phases. First they perform a morphological and syntactic analysis of the original text, then they transfer it to the other language, and finally they generate the text in that second language."
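
The three phases Alegría describes can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the tiny Spanish-Catalan word lists, the tags and the function names are invented here and are not taken from Apertium or Matxin.

```python
# Toy rule-based pipeline: analysis -> transfer -> generation.
# The lexicons and tags below are illustrative assumptions, not OpenTrad data.

# Analysis lexicon: surface form -> (lemma, tag)
ANALYSIS = {
    "el": ("el", "det"),
    "gato": ("gato", "noun"),
    "come": ("comer", "verb.3sg"),
}

# Transfer lexicon: Spanish lemma -> Catalan lemma
TRANSFER = {"el": "el", "gato": "gat", "comer": "menjar"}

# Generation lexicon: (Catalan lemma, tag) -> surface form
GENERATION = {
    ("el", "det"): "el",
    ("gat", "noun"): "gat",
    ("menjar", "verb.3sg"): "menja",
}


def analyse(text):
    """Phase 1: map each word to (lemma, tag); unknown words pass through."""
    return [ANALYSIS.get(w, (w, "unknown")) for w in text.lower().split()]


def transfer(units):
    """Phase 2: replace each source lemma with its target-language lemma."""
    return [(TRANSFER.get(lemma, lemma), tag) for lemma, tag in units]


def generate(units):
    """Phase 3: turn each (lemma, tag) pair into a surface word."""
    return " ".join(GENERATION.get(unit, unit[0]) for unit in units)


def translate(text):
    return generate(transfer(analyse(text)))


print(translate("el gato come"))  # el gat menja
```

In this sketch a word missing from the lexicons is simply copied to the output untranslated; a real engine needs a far richer analysis than a word-by-word lookup.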

For translation between Romance languages, good results are obtained even though the transfer is superficial. This is what interNOSTRUM does, and it was the starting point for developing the Apertium engine. In a way, Apertium is an improved, open-source version of interNOSTRUM.

Above all, that is what the Catalan side has gained: the code is now open. In addition, OpenTrad keeps the code completely separate from the language resources. As a result, the system is easy to work with and to adapt to the user's needs, and it is ready to take on whatever changes people want to make to enrich and improve it.

Apertium does more than syntactic transfer. It also has several 'filters' to refine the translation. For example, it can detect structures typical of one language and replace them with their equivalents in the other, which yields a higher-quality translation. The translator for the Spanish-Catalan pair, for instance, is 95% reliable; that is, only five out of every hundred translated words are wrong.

Distant languages, in depth

However, Apertium cannot be used to translate from Spanish into Basque. The languages are so different that superficial syntactic transfer is not enough. Sentence structure also changes radically, so what is needed is an engine with deep morphological and syntactic analysis, capable of building a dependency tree, carrying out the transfer and producing the text in Basque. For that purpose they created Matxin.
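
Why a dependency tree helps can be seen in a small sketch. It is an illustration only: the `Node` class, the example tree and the two ordering functions are assumptions made here, not Matxin's actual data structures, and the words are left in Spanish so that only the reordering is shown. Once the sentence is a tree, whole subtrees can be moved, which a word-by-word superficial transfer cannot do.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A word together with the words that depend on it."""
    word: str
    children: list = field(default_factory=list)


def head_first(node):
    """Linearise the tree with every head before its dependents."""
    words = [node.word]
    for child in node.children:
        words.extend(head_first(child))
    return words


def head_last(node):
    """Linearise the tree with every head after its dependents."""
    words = []
    for child in node.children:
        words.extend(head_last(child))
    words.append(node.word)
    return words


# Hypothetical tree for the Spanish phrase "zapatos de piel de señora":
# "zapatos" is the head and both "de" phrases are assumed to hang from it.
tree = Node("zapatos", [
    Node("de", [Node("piel")]),
    Node("de", [Node("señora")]),
])

print(" ".join(head_first(tree)))  # zapatos de piel de señora
print(" ".join(head_last(tree)))   # piel de señora de zapatos
```

The second ordering places each dependent before its head, which is closer to the order Basque tends to use; the decision to attach "de señora" to "zapatos" rather than to "piel" is exactly the kind of choice the analysis has to make.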

The IXA group acknowledges that developing Matxin has been "hard work" and that the result is not as good as what Apertium offers for translation between Romance languages. In any case, they have achieved the initial goal, which was to build the infrastructure.

(Figure: Analysis, transfer and generation. IXA Group)
Translation quality has been one of the main concerns in developing the machine translator, but the team has also worked on the speed of the system, and on that front they say they are satisfied. That speed makes it possible to browse web pages in the translated language. To take one of Arantzabal's examples, gipuzkoa.net, whose original is in Spanish, can be browsed in Catalan and Galician through OpenTrad.

Looking forward

So far, the result is a good, fast automatic system that translates in both directions for the Galician-Spanish and Catalan-Spanish pairs, together with a prototype that translates from Spanish into Basque. In the words of the head of Eleka, "we have achieved the goal".

But they have no intention of stopping there. "We want to keep improving and completing the system. One way to improve results is to focus on specific fields. Each field uses its own language, with fewer ambiguity problems than in general use. Quality therefore increases when the translator is adapted to a field, for example by incorporating the corresponding terminology." They hope this will improve reliability.
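
As a rough illustration of adapting the translator to a field, the sketch below lets a field glossary override a general dictionary. The dictionaries contain invented English glosses rather than OpenTrad resources, and the function name is hypothetical.

```python
# General dictionary: an ambiguous Spanish word keeps all its candidate senses.
GENERAL = {"piel": ["skin", "leather"], "banco": ["bank", "bench"]}

# Field glossaries (assumed for illustration): each field fixes one sense.
DOMAINS = {
    "footwear": {"piel": "leather"},
    "finance": {"banco": "bank"},
}


def candidates(word, domain=None):
    """Return the possible senses of a word, narrowed by the field if known."""
    glossary = DOMAINS.get(domain, {})
    if word in glossary:
        return [glossary[word]]
    return GENERAL.get(word, [word])


print(candidates("piel"))              # ['skin', 'leather']
print(candidates("piel", "footwear"))  # ['leather']
```

Without a field the engine still has to choose between two senses; with the footwear glossary loaded, the ambiguity disappears before any rule is applied.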

They also intend to combine the rule-based technology with other approaches; specifically, they want to use parallel corpora. "In this way, when you want to translate a sentence, the system will first check whether it has already been translated or whether there is something similar. If there is an earlier translation, it will start from there. If there is nothing similar, it will use the rule-based technology."
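
The lookup-then-fallback idea in the quote can be sketched as follows. This is a simplification under stated assumptions: the two stored sentence pairs, the 0.8 similarity threshold and the `rule_based` stand-in are all invented here, and the sketch returns the stored translation directly, whereas the plan described above is to use a similar sentence as a starting point.

```python
from difflib import SequenceMatcher

# Hypothetical parallel corpus: source sentence -> stored translation.
MEMORY = {
    "buenos días": "bon dia",
    "muchas gracias": "moltes gràcies",
}


def rule_based(sentence):
    """Stand-in for the rule-based engine; it only marks the fallback path."""
    return f"<rules:{sentence}>"


def translate(sentence, threshold=0.8):
    """Reuse the closest stored translation, else fall back to the rules."""
    best_score, best_source = 0.0, None
    for source in MEMORY:
        score = SequenceMatcher(None, sentence, source).ratio()
        if score > best_score:
            best_score, best_source = score, source
    if best_source is not None and best_score >= threshold:
        return MEMORY[best_source]
    return rule_based(sentence)


print(translate("buenos días"))   # bon dia
print(translate("el gato come"))  # <rules:el gato come>
```

The threshold controls how similar a stored sentence must be before it is reused; set too low, it would return translations of unrelated sentences.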

(Caption: Results are better when the translator is adapted to a field, since each field uses its own language and there are fewer ambiguity problems. Photo: A. Galarraga)

Besides improving and completing the system, they want to create an automatic translator from Basque into Spanish. With it, people from outside would have the chance to find out what is being produced in Basque. Another future goal is to translate from English into Basque.

To make these advances, Arantzabal hopes to have the support of the Basque Government. In fact, a few years ago the Basque Government commissioned a Catalan company to develop a machine translation prototype. Now, OpenTrad is the most advanced system in the Spanish state. That is why Arantzabal says: "We want to convince the Basque Government to back our system. We believe that, at the very least, it cannot stay on the sidelines."

· http://www.opentrad.net

· http://apertium.sourceforge.net

· http://matxin.sourceforge.net

Participants and distribution of tasks and responsibilities
Eleka Ingeniaritza Linguistikoa, S.L.: coordinator, responsible for integrating all the systems.
Transducens Group of the University of Alicante: development of the Apertium engine.
IXA Group of the UPV/EHU: development of the Matxin engine.
Polytechnic University of Catalonia: analysis of Spanish and Catalan language resources.
University of Vigo: Galician language resources.
Imaxin Software: verification of the Galician section.
Elhuyar Fundazioa: verification of the Basque section and Basque language resources.
Iñaki Alegría, coordinator of the IXA group: "The biggest problem is ambiguity"
The IXA group of the Faculty of Computer Science of the UPV/EHU was in charge of developing the translation engine from Spanish to Basque. Consequently, they know perfectly what the main difficulties are.
Iñaki, did you get the hardest job?
The truth is that it is not easy, especially compared with translation between Romance languages. The Catalans were further ahead than us in this area. They had a solid foundation, and superficial transfer is enough for them because they are translating between related languages.
Our case is very different. In truth we did not start from zero either: we have spent years researching and working on this subject, and we have also built on FreeLing.
What is FreeLing?
(Photo: R. Cardboard)
FreeLing is an analyzer for Spanish created by the Polytechnic University of Catalonia. It builds the full syntactic tree, which is necessary because a superficial analysis is not enough to translate into Basque. The order of the elements within the sentence is very different, so you have to build the whole tree, then carry out the transfer and construct the sentence correctly in Basque.
Besides syntax, the lexicon must give you headaches too...
Of course. The other languages are similar to one another, but between Basque and Spanish there are many more ambiguous meanings. A word in Spanish may have two or more equivalents in Basque, and the problem is that one of them has to be selected. So there is a lexical-semantic difficulty. And then there is the morphological difficulty: for each Spanish preposition, one has to choose the right case in Basque.
Could you give an example?
I'll give you the usual one: the Spanish for "ladies' leather shoes". Whose skin is it? For a machine, the skin could be the lady's; that is a syntactic problem. Then there is the preposition: does it mean from where, of where, or whose? The Basque case to choose is ambiguous. Apart from this, the Spanish word can mean either leather or skin, and if the machine chooses skin, it is wrong.
These are the main problems. Between related languages they are much milder, but for us they mean a lot of work. We are still working to solve these and other problems, but at least we have laid the foundations, and we intend, and wish, to keep moving forward.
Sponsor: Department of Industry, Trade and Tourism of the Basque Government