First Online Version of the Corpus of Science and Technology

Gurrutxaga Hernaiz, Antton

Elhuyar Hizkuntza Zerbitzuak

On 14 December, the IXA Group of the UPV/EHU and the Elhuyar Foundation will present the online version of the Corpus of Science and Technology. It is the first specialized corpus in Basque: a structured and tagged collection of Basque texts in the field of science and technology, whose main purpose is to serve as a resource for research on the use of Basque in these areas.
01/12/2006 | Gurrutxaga Hernaiz, Antton | Elhuyar Hizkuntza Zerbitzuak
Presentation of the Corpus of Science and Technology at the LREC congress on linguistic resources
Genoa, 2006
(Photo: A. A. Gurrutxaga)

The corpus draws on science and technology works published between 1990 and 2002. It is classified by field (area of knowledge) and genre.

The corpus is annotated both for text structure and format and at the linguistic level. The linguistic annotation was produced with advanced automatic processing technology for Basque (the IXA group's Eustagger tagger). The lemma and category/subcategory of each word in the text are tagged. This version of the corpus contains 8 million words, of which 1.6 million have been manually revised, disambiguated and corrected. The corpus is encoded in XML following the TEI standard.
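As an illustration of what word-level annotation of this kind can look like, here is a minimal Python sketch that reads a TEI-style sentence in which each word carries a lemma and a category. The element and attribute names (`s`, `w`, `lemma`, `pos`), the sample sentence and the tag values are assumptions made for this example. They are not taken from the corpus itself or from Eustagger output.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment: one sentence, one <w> element per word.
# Attribute names and values are illustrative, not the corpus's real schema.
SAMPLE = """
<s>
  <w lemma="zelula" pos="noun">Zelulak</w>
  <w lemma="energia" pos="noun">energia</w>
  <w lemma="erabili" pos="verb">erabiltzen</w>
  <w lemma="edun" pos="aux">du</w>
</s>
"""


def read_tokens(xml_text):
    """Return a list of (form, lemma, category) tuples from the fragment."""
    root = ET.fromstring(xml_text.strip())
    return [(w.text, w.get("lemma"), w.get("pos")) for w in root.iter("w")]


tokens = read_tokens(SAMPLE)
for form, lemma, category in tokens:
    print(f"{form}\t{lemma}\t{category}")
```

Keeping the surface form as element text and the analysis in attributes means the running text stays readable, while lemma and category remain available for searching.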

A powerful query interface has been built for the corpus, in which users can run simple and complex searches using a wide set of parameters: lemma, word form, category, field, genre, corpus section (manually corrected/complete corpus...). Results are of two types: on the one hand, short contexts (KWIC) and extended contexts of the search term; on the other, quantitative information presented in tables and graphs (frequencies, publications, distribution by res, etc.).
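The two kinds of result mentioned above can be sketched in a few lines of Python: a keyword-in-context (KWIC) listing for a lemma, and a simple lemma frequency count. This is a toy example over a hand-made token list, not the interface's actual implementation. The function name `kwic`, the `width` parameter and the sample tokens are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical (form, lemma) pairs; a real corpus would supply these
# from its annotated texts.
TOKENS = [
    ("Zelulak", "zelula"), ("energia", "energia"),
    ("erabiltzen", "erabili"), ("du", "edun"), ("eta", "eta"),
    ("zelulek", "zelula"), ("energia", "energia"),
    ("behar", "behar"), ("dute", "edun"),
]


def kwic(tokens, lemma, width=2):
    """Return (left context, keyword, right context) for each hit of `lemma`."""
    lines = []
    for i, (form, lem) in enumerate(tokens):
        if lem == lemma:
            left = " ".join(f for f, _ in tokens[max(0, i - width):i])
            right = " ".join(f for f, _ in tokens[i + 1:i + 1 + width])
            lines.append((left, form, right))
    return lines


hits = kwic(TOKENS, "energia", width=1)
lemma_freq = Counter(lem for _, lem in TOKENS)

for left, keyword, right in hits:
    print(f"{left:>12} [{keyword}] {right}")
print(lemma_freq.most_common(3))
```

Searching by lemma rather than by word form is what lets a single query match both "Zelulak" and "zelulek", which matters in a highly inflected language such as Basque.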

The corpus will be available at www.ztcorpusa.net. In addition, from 2007 it will be available among the OECD resources for commercial use under license.

The texts included in this first version of the corpus were collected in digital format from various providers under agreements signed with them. Our sincere thanks to all of them.

The Corpus of Science and Technology project began within the Hizking21 strategic research project. Hizking21 has received the following grants: the Etortek Program of the Department of Industry of the Basque Government (2002-2004) and the Gipuzkoako Zientzia, Teknologia eta Berrikuntza Sarea program of the Provincial Council of Gipuzkoa (2004). The Corpus of Science and Technology has also been supported by the Department of Culture of the Basque Government through its Euskara and New Technologies 2005 program.

Sponsor: Department of Industry, Trade and Tourism of the Basque Government (Eusko Jaurlaritzako Industria, Merkataritza eta Turismo Saila)