Corpora are reference samples of written language: large collections of texts and words. They are the foundation of language technologies. Machine translation systems and speech recognition applications would not exist without corpora, nor would modern dictionaries.
Under the direction of the Ixa research group of the Faculty of Informatics of the UPV, Igor Leturia, a researcher in the Language and Technology unit of Elhuyar, has turned to the web to build corpora in Basque. When the research began, the largest corpus in Basque had 25 million words; “other languages exceeded 100 million words in the 1990s,” explains Leturia. “We set ourselves the goal of breaking that barrier when we started analyzing whether the web could be a good source for building corpora in Basque,” he added.
Leturia has adopted the “web as corpus” approach, since using the web as a source makes it easier to obtain large corpora than compiling them by hand. Extracting corpora from the web by automatic methods yields varied, up-to-date and large corpora much faster and more cheaply. In fact, the greatest limitation of traditional corpora is their cost: collecting and adapting texts that come in different formats and from different places takes a great deal of labor before the reference word collections of the language can be extracted from them.
Through this research, Leturia has shown that it is possible to query the web directly as if it were a Basque corpus and, using the tools developed, has built from the web a general corpus of 210 million words (available for consultation on the Web-Corpus Portal). “More than 95% of the words that appear in the hand-built corpora are also in ours,” explains Leturia, “along with many others that they do not include.”
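The coverage figure Leturia cites can be pictured as a comparison between two vocabularies. The sketch below is only a toy illustration and not the method used in the research: it assumes plain regular-expression tokenization with no lemmatization (real work on Basque would likely need morphological analysis), and the function names and sample sentences are invented for the example.

```python
import re


def vocabulary(text):
    """Return the set of distinct lowercased word forms in a text."""
    return set(re.findall(r"\w+", text.lower()))


def coverage(reference_text, web_text):
    """Fraction of the reference vocabulary that also occurs in the web text."""
    ref = vocabulary(reference_text)
    web = vocabulary(web_text)
    if not ref:
        return 0.0
    return len(ref & web) / len(ref)


def only_in_web(reference_text, web_text):
    """Word forms found in the web text but missing from the reference text."""
    return vocabulary(web_text) - vocabulary(reference_text)


# Invented sample data standing in for a hand-built corpus and a web corpus.
reference = "etxea handia da eta mendia ederra da"
web = "etxea handia da eta itsasoa urdina da mendia"

print(f"coverage: {coverage(reference, web):.2%}")
print("only in web:", sorted(only_in_web(reference, web)))
```

A high coverage value together with a non-empty "only in web" set corresponds to the situation described in the quote: the web corpus contains most of the hand-built vocabulary plus words the hand-built corpora lack.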
In addition to building general corpora, Leturia has shown that the web is useful for building corpora in specific areas of knowledge, both for obtaining text collections entirely in Basque and for creating bilingual text collections. In both cases, the domain corpora extracted from the web have proved comparable to those created manually. He has worked with corpora on computer science, particle physics and tourism, among others.
For this work, Leturia has used automatic methods applied in other languages, adapting them to the characteristics of Basque and seeking solutions suited to those characteristics. “Because Basque has a smaller mass of text than other languages and is more complex for automatic processing, we have faced more difficult problems,” explained Leturia, and this has led them to develop tools that “large” languages do not have. According to Leturia, they have had the opportunity to make original and innovative contributions to the field of language technology, useful not only for Basque but also for other languages with similar needs and characteristics.