Nowadays, more and more texts are computed with the help of a text processing program that offer great ease and help for subsequent correction and revision tasks.
In the production of texts in Basque, in addition to the problems of any language (typing errors, etc. ), other aspects also appear from the point of view of correction. There are, among other things, errors of all kinds that derive from the fact that the person who models the text or composes it in the printing press is an unliterate person, or in Spanish, the problems that affect the particular situation of current linguistic unity, that in recent years the Basque language has spread to countless new fields, etc.
Therefore, various "irregularities" continue to appear in the texts in Basque. Less and less. It seems that fewer and fewer errors are due to ignorance of the rules or negligence. But the need for correction is there and it will be there.
And what can the computer help? It has already been commented that the use of the computer in the current production is increasingly frequent due to the facilities offered by word processing programs. Among these facilities are, among others, aids to adapt the format of the text, to change from one text to another, to introduce new texts without the need to rewrite the whole text, etc. In addition, systems are appearing that allow you to automatically check the spelling of what we write in several languages, especially in English, of course, but also in others closer. Some of these systems want to go further, offering assists in dot correction, syntax and style.
Therefore, from now on we will have the opportunity to perform a spelling correction in the menus of the word processing programs, but it is evident that the system that will give this help is not for all languages, but must have a different version for each language. And for the Basque language, I think that if we don't do it here, at least for a while. Soon we will have – already exist – systems that will offer us this help, although in their original in Spanish they allow it.
Let's see what problems this automated correction presents. At the moment we will limit ourselves to the field of spelling.
We have two types of programs or systems: on the one hand, verifying spellings, that is, they let us know the words that appear poorly written in that text and then we must correct. On the other hand, we have corrective spelling that allows us to perform a spelling review and perform a correction in interaction with the computer, proposing possible alternatives to the wrong word or considered erroneous.
The first research in this field dates back to 1957. The first finished product is the program called SPELL (1971).
The only task of the first programs was to provide a list of the different words of a text (usually ordered frequently appearing). Then someone with patience would analyze that list and find misspelled words (note that errors appeared at the end of the list because of their low frequency). The following programs have begun to perform a certain analysis of the words (based mainly on the analysis of digrams and trigrams, that is, taking into account the different frequencies that have the different pairs of letters and triples of letters in each language, one could calculate an index of singularity of the word), which only list those that could be miswritten. But the particularity of current programs is that they are programs that use vocabulary. That is, to know if a word is correctly written or not, we use the dictionary: if the word is there it is given for good and if not.
Dictionary construction is very important in these systems. And the measure of the dictionary goes through a fundamental decision: what to include and what not?. Without thinking too much, it seems that it is best to put everything in. But we will soon realize the danger of doing so: this dictionary will have many obsolete words, most useless, and the possibility of giving good miswritten common words will be greater. Euskera, for example:
with the word 'aueta' in the dictionary
auger
(bn-gar), augeta (bn-sal) serenata, alborada / aubade, sérénade (Col.)
'
will have to accept the word 'auetako', although it is much safer than writing 'these' is a mistake.
The errors of poor vocabulary are also evident, with the danger that well-written words (for not being in the dictionary) will be considered bad.
In addition, with regard to the effectiveness of these systems it is evident that one of the most critical tasks to perform is the search in the dictionary, so the measurement and organization of the dictionary are very important factors. The most used strategy for gaining time in search consists in the treatment of the most frequent words: through a statistical analysis these words will be identified and the search in the dictionary will be divided into two levels: first it will be checked if the word of the text is among the most frequent (this search will be done with aguro, since they are not so many words), and if it does not exist (and only then) will be used to the general dictionary. In this subdictionary of more frequent words, a number of words could be available ranging from 250 to 500, of which approximately 50% of the words in the text are expected to be understood.
So far we have limited ourselves to verifying spelling. However, most of the programs currently on sale also offer help for interactive correction: we have corrective spelling. Its peculiarity lies in the way of working. While the program performs spell checking the user is on the screen. When the program detects an error, it will note the word on the screen and ask the user what to do. Then the user has different options: you can correct the word or ask the system to give possible alternatives to that word and then choose between them, of course correct. Nor will he be denied the possibility of keeping the word. In addition, most systems handle a user's vocabulary in which the user does not know the system and can enter certain words that he usually uses. Once it reappears, the system does not consider them wrong.
In the face of all this, there are months in which a project has been initiated that has as its first objective the spell checker for the Basque language. This project involves the APIKA computer services company, UZEI and the team dedicated to the processing of natural language at the Faculty of Computer Science of San Sebastian. As has been said, the first intention of this group is to offer an interactive corrective spelling to anyone who writes in Basque with the help of the computer. Remember that, for the moment, we refer to spelling and, therefore, to that succession of characters to admit a word (succession of characters between spaces), without becoming aware of its context. I believe that many of the errors that appear in the texts in Basque cannot be corrected in this way, since they are often syntax or other errors.
For example, the teaching, the examination, or the writing of the ins and outs, are considered wrong, but it will not be caught in anything empty, as you have told me, in phrases like today Monday, because the words can be approved individually. To be able to detect them, in addition to the mere spelling information of vocabulary, it is necessary another large amount of information, such as that provided by the morphosyntactic analysis of the sentence. Let's leave this for later, because it's not a slow job, although one day it will have to be addressed.
The team from the natural language treatment area that is part of the project has previously been mentioned. The artificial languages commonly used in the computer world have given rise to a curious denomination of natural language to speak of common language. The importance of this field, which focuses on the understanding of language and the automatic creation of language, is growing. On the one hand, the importance of being able to communicate with computers in natural languages (in our case, in Basque), and on the other, the contribution that this adaptation to the logic and systems of computers supposes to the theoretical knowledge of the language itself.
The existence of a series of general tools and systems for this field of natural language processing makes each language require its own infrastructure: the basics in any language are automatic morphological and syntactic analyzers. Then come the most confused problems of semantics and pragmatics.
A language with a high degree of bending like Basque presents special morphological problems when you want to deal with its automatic treatment. However, solving these problems, the information obtained from the same morphological analysis is much richer than in other languages with a simpler morphology. This information is of great importance in the levels after the analysis, that is, when it is intended to analyze the syntax and semantics of the language. In languages like ours, morphological analysis is the first problem that any linguistic treatment system must solve.
On the other hand, we talked about the importance of the vocabulary measures that the spell checker needs. It is evident that in languages such as Basque, every word spoken with all its push-ups (and we only speak of flexions at the declination level) will greatly increase the vocabulary, since the search time is too long. In other languages (think for example in English morphology) this problem has been underestimated and has sometimes introduced all forms of word into the dictionary. In the most complex languages of morphology, however, this problem must be addressed properly and in the vocabulary there will only be slogans, although the treatment is more complicated.
According to what was said, checking the spelling correction of a word is not just ensuring that that word is in the dictionary. Because the whole word has nothing to do with the dictionary. When the root of the word is in the dictionary of slogans, plus the succession of suffixes that could be associated with that motto, it will be from behind what must be verified, so that the word is accepted. Therefore, a morphological (though not exhaustive) analysis of the word will be performed to verify its correction.
As can be seen, the relationship between the morphological analyzer and the spell checker is extremely narrow and can be seen as a by-product of baptism. Both would like to be the first results of this group, which was based on the task of breaking the path of automatic treatment of Basque.