Lexical-semantic resources for the language industry

A new language industry is emerging, one that aims to process language by computer. For this field to advance, it needs lexical resources that capture the meaning of words. The criteria of the European Union's Language Engineering programme underline the fundamental role of lexical resources.

The Hiztegia 2002 project, which is also supported by the European Union (ERDF, 2FD97-2000-2001), is related to the following projects aimed at creating lexical resources: WordNet, EuroWordNet and ITEM. With this project, the IXA group aims to produce:

  • A structured version of the Basque Dictionary that follows the guidelines of the TEI (Text Encoding Initiative), encoded in SGML (Standard Generalized Markup Language).
  • A lexical knowledge base of the Basque Dictionary, composed of the semantic relations extracted from it.
  • Euskal WordNet: an adaptation of EuroWordNet that links English concepts to Basque ones.

These resources are intended to support the development of, among others, the following commercial products:

  • A structured electronic version of the Basque Dictionary (on CD-ROM, on the Internet and/or integrated in word processors).
  • A thesaurus for the Basque language integrated in word processors, for consulting synonymy, hypernymy, hyponymy and other relations between concepts.

Historically, lexical resources were built by hand, but because the amount of information involved demanded a great effort, automatic and semi-automatic methods have been adopted over the last decade. Lexical knowledge bases (LKBs) have been developed from the information contained in dictionaries and corpora. An LKB is a structured lexical resource that holds information about words and their senses. For example, in WordNet, an LKB for English that is distributed free of charge, each meaning is expressed as a set of synonymous words (the synset), and all the meanings are organized in a hierarchy. EuroWordNet is another LKB of the same design that has been extended to eight European languages (German, Spanish, Estonian, French, English, Italian, Dutch and Czech). Since most LKBs have been created for English, all other languages are at a disadvantage when it comes to new technologies. To address this situation we see two complementary solutions:

1. Creating LKBs from the corpora and dictionaries available for each language. In our case, we have used the Basque Dictionary as the lexical source. The first task was to structure the Basque Dictionary following the SGML-TEI standards. This gives anyone who studies Basque a useful working tool. By analyzing the definitions in this structured version, we will obtain a series of lexical-semantic relations: synonymy, hypernymy (class-subclass relation; for example, an insect is a kind of animal), meronymy (whole-part relation; for example, bird-beak, in Basque txori-moko), etc.

2. Using LKBs built for English to create LKBs for other languages. In our case, starting from EuroWordNet, we want to build a WordNet for Basque by linking English concepts to Basque ones. To build this Basque WordNet we will use semi-automatic methods and then review the results manually.
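
As a rough illustration of the second solution, the sketch below proposes Basque words for English synsets through a bilingual word list and sets aside uncertain candidates for manual review. It is not the method used in the project: the names Synset, Proposal and propose_basque, the acceptance rule and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Synset:
    """One concept: a set of synonymous words and an optional hypernym link."""
    synset_id: str
    words: List[str]
    hypernym_id: Optional[str] = None


@dataclass
class Proposal:
    """Basque candidates proposed for one English synset (hypothetical type)."""
    synset_id: str
    accepted: List[str] = field(default_factory=list)      # proposed automatically
    needs_review: List[str] = field(default_factory=list)  # left for manual checking


def propose_basque(synset: Synset, bilingual: Dict[str, List[str]]) -> Proposal:
    """Collect the Basque translations of every English word in the synset.

    Toy rule (an assumption, not the project's rule): a candidate is accepted
    when every word of the synset translates to it; any other candidate is
    flagged for manual review.
    """
    votes: Dict[str, int] = {}
    for word in synset.words:
        # set() guards against a translation listed twice for the same word
        for candidate in set(bilingual.get(word, [])):
            votes[candidate] = votes.get(candidate, 0) + 1

    proposal = Proposal(synset.synset_id)
    for candidate, count in sorted(votes.items()):
        if count == len(synset.words):
            proposal.accepted.append(candidate)
        else:
            proposal.needs_review.append(candidate)
    return proposal


# Toy data, invented for the example
synsets = [
    Synset("en-1", ["animal"]),
    Synset("en-2", ["bird"], hypernym_id="en-1"),
    Synset("en-3", ["beak", "bill"]),
]
bilingual = {
    "animal": ["animalia"],
    "bird": ["txori", "hegazti"],
    "beak": ["moko"],
    "bill": ["moko", "faktura"],
}

proposals = {s.synset_id: propose_basque(s, bilingual) for s in synsets}
for synset_id, p in proposals.items():
    print(synset_id, "accepted:", p.accepted, "review:", p.needs_review)
```

With this toy data, "moko" is accepted for the synset {beak, bill} because both English words translate to it, while "faktura" is left for manual review because only "bill" does. Since the proposals keep the English synset identifiers, the hypernym links of the source hierarchy carry over to the Basque side unchanged.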

Sponsors
Department of Industry, Trade and Tourism of the Basque Government