Speaking of the language of machines: expert voices

Roa Zubia, Guillermo

Elhuyar Zientzia

We have brought together some experts to discuss trends in language processing and the peculiarities of Basque with respect to other languages. We spoke with computer scientists from the IXA group of the UPV/EHU: Kepa Sarasola, Iñaki Alegria and Eneko Agirre. This year the IXA group organized the SEPLN congress on language processing in Donostia, bringing together many experts in the field.
01/11/2009 | Roa Zubia, Guillermo | Elhuyar Zientzia Komunikazioa
What are currently the main challenges of language processing?

Eneko Agirre: I think the main challenges are related to understanding. The research of recent years has produced a great qualitative leap, but that does not mean the machine now "understands" us. Small steps have been taken, and machines understand things in more and more areas: what a place is, for example. Surnames are a perennial problem: is Azpeitia a person or a place? Or a company? Beginning to understand these things is a step forward, and even if they seem very simple to people, without context they are difficult. The challenge, then, is to teach the machine fragments of this kind of knowledge.

In fact, corpus-based mathematical and statistical methods have in some sense played themselves out: they have done what they could and are struggling to advance. Rule-based methods have also given what they had to give and are somewhat stuck. So I believe the challenge now is to learn rules from texts, to contrast them against corpora as they are learned, and to know which rules have been learned well and which badly.
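The idea Agirre describes can be illustrated with a toy sketch: extract candidate rules from labeled text, then keep only the rules the corpus supports unambiguously. All sentences, labels and rules below are invented for illustration; this is not the IXA group's actual method.

```python
# Toy sketch: learn disambiguation rules ("is Azpeitia a person or a place?")
# from a tiny labeled corpus, keeping only rules learned without conflict.
from collections import Counter, defaultdict

# A tiny invented "annotated corpus" for the ambiguous word "azpeitia".
corpus = [
    ("mr azpeitia signed the contract", "PERSON"),
    ("we drove to azpeitia yesterday", "PLACE"),
    ("azpeitia said she would attend", "PERSON"),
    ("the festival in azpeitia starts today", "PLACE"),
    ("the mayor of azpeitia spoke", "PLACE"),
]

def context_features(sentence, target="azpeitia"):
    """Words immediately before and after the target word."""
    words = sentence.split()
    i = words.index(target)
    feats = []
    if i > 0:
        feats.append(("prev", words[i - 1]))
    if i < len(words) - 1:
        feats.append(("next", words[i + 1]))
    return feats

# Count how often each context feature co-occurs with each label.
counts = defaultdict(Counter)
for sentence, label in corpus:
    for feat in context_features(sentence):
        counts[feat][label] += 1

# Keep only features seen with a single label: rules "learned well".
rules = {f: c.most_common(1)[0][0] for f, c in counts.items() if len(c) == 1}

def classify(sentence):
    votes = Counter(rules[f] for f in context_features(sentence) if f in rules)
    return votes.most_common(1)[0][0] if votes else "UNKNOWN"

print(classify("a concert in azpeitia tonight"))  # ("prev", "in") -> PLACE
```

A real system would of course use far richer features and much larger corpora; the point is only the learn-then-contrast loop.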

Kepa Sarasola: To see what challenges we face today, we can look at two levels: that of applications, and that of the internals of the language, the basic tools that applications are then built on. It can be said that lexical needs are by now almost 100% covered. Twenty years ago there were no computerized dictionaries; everything was on paper. Now you can find on the Internet the meaning of every word, how it is said in other languages, and so on. In morphology, even for difficult languages such as Basque, coverage is at 95-98%. In syntax, systems for English get about 90% right.

So where do we go from here? To semantics and pragmatics. And here there has been a tremendous change. Twenty years ago we had no resources describing what any given topic was about. Today we have Wikipedia, or WordNet, the Internet itself, and so on. We now have new resources for understanding the meaning of texts. That has opened a door for us, but it has not yet been worked on much.

Kepa Sarasola.
What was emphasized at the SEPLN 2009 congress?

Iñaki Alegria: The congress featured invited speakers who reflected on these questions. For example, Joakim Nivre, the syntax expert from Uppsala University, pointed out that the syntax problem is not 100% solved, but it has been worked on a great deal. On semantics, Eneko presented the situation he has just described. The KYOTO project was also presented: a system that allows the meanings of words and terms to be defined through a wiki platform. Knowledge extraction from data was also discussed. And in his talk, Horacio Rodríguez, of the Polytechnic University of Catalonia, argued that we should take up some of the challenges of classical artificial intelligence again, but with more data and new methods. I am somewhat of that opinion too.

Along this path, Google has obtained very good results using some basic artificial intelligence methods. But unless they use deeper knowledge, little innovation will emerge in the short term.

You mentioned Google; to what extent are these large companies doing research in language processing?

I. A.: I think Google innovates by exploiting what has already been done. It invests a lot, makes good use of it, has gained fame and made its mark. This knowledge and these tools could be integrated into applications for the general public and at an industrial level. But they do not release enough information, and the demand for applications is lower than expected.

Iñaki Alegria.

E. A.: In research you never know who will come up with the good idea. Even with a large research team, the good ideas may not come from there; it cannot be predicted. For this reason, large companies such as Google, in addition to developing their own projects, sign up successful researchers.

Many people have gone to Google. In the United States it is said that the best researchers have gone there. Many young people have been taken on, and that has been noticed in the universities. People have gone there, and some have later said that not everything at Google is so nice, but very few have left.

I. A.: In this field, the applications that make money are well identified: the killer applications. Historically, three types of applications have belonged to this group: machine translation, proofing tools (that is, tools for text editors, mainly correctors) and search. Google began precisely in the world of search. It is now working on machine translation, and lately also on phone operating systems and proofing tools. In a way, the risk is that Google may monopolize all of these lines of research.

That risk must affect your work, right?

K. S.: On the one hand, we are happy, because it is clear that the techniques we work on are useful; that is shown again and again. On the other hand, we are concerned that only Google has the data. They know what people ask for, what they search for, and what people choose among the search results. For them that is very important for improving the system: if, for a given query, most people click on the fourth result, before long that fourth result will be ranked first. These usage data are very important, but they belong to Google.

E. A.: Google knows that innovation is the way forward. They direct all their energies toward innovation.

Eneko Agirre.

I. A.: And they give priority to money; money comes first there. That has consequences. For example, Google searches very poorly in Basque. They have been told so, but they are not interested. At some point they decided to work with a maximum of about forty languages; in the rest they do a literal search. That is a problem, but the brand has a lot of strength, and it is integrated into many applications. Today, however, the Elebila search engine searches much better in Basque.

What is the situation of Basque relative to other languages when it comes to language processing?

I. A.: English is the reference. For example, a researcher from Ethiopia came to the congress. There they speak their mother tongue, a Semitic language that requires a different kind of keyboard; since mobile phones do not have such keyboards, text messages are sent only in English.

Clearly, Basque is a small language. From an economic point of view demand is low, so there are problems. At the research level, we are satisfied; in some areas, at least, we are a reference for other minority languages. But corpus-based applications require investment to obtain the corpora themselves.

E. A.: As a language, Basque has its own typology, but it is not especially difficult to process computationally compared with other languages. Although its morphology is harder to treat, other areas, such as phonetics, are very easy. Each language has its difficult and easy aspects, but in general, taking all of a language's characteristics into account, the difficulty of all languages is similar.


And to compare with other languages, each language has to be seen in relation to its number of speakers. I believe Basque is quite close to the most widely spoken languages. The most significant difference is the small size of the available corpora, which I think is the main gap for Basque. For English, for example, there are corpora of billions of words, and machines learn from large corpora. But in terms of resources, we are near the top of the list.

K. S.: By number of speakers, I have seen Basque listed in position 256, yet in research we are among the top 50. Why? Because there has been official support, and because I think those of us here do things properly. We have worked in an orderly, planned way. The tools and resources you generate at one point remain valuable in the future. We work incrementally.

The IXA group works on the processing of Basque. They are not the only ones, but they are the reference research group in the effort to make machines speak Basque. If a large company wanted to develop applications in Basque, it would probably have to turn to them. Among other things, they have participated in the ANHITZ project, creating a virtual character that answers scientific questions: in short, a machine that speaks. It is a good example of language processing. Seen from the outside, ANHITZ does not look like a revolutionary application, since it does not respond as quickly and easily as a fictional robot; but anyone who knows the work behind the project judges it very positively. There is much left to do in language processing, no doubt. But what has been done is an enormous job; there is no doubt about that either.

Imma Hernaez: "The voices of today's synthesis systems are perfectly understandable"
Imma Hernaez works in the Aholab laboratory of the UPV/EHU. She is an expert in machine speech recognition and synthesis systems. Among other things, she participated in the ANHITZ project, building a virtual character that answers scientific questions; Hernaez and the Aholab staff developed the character's speech recognition and voice systems.
What are the main difficulties in recognizing and synthesizing speech?
The difficulties are not the same in recognition as in synthesis. For recognition, linguistic variety itself hinders the work: there are dialects, accents, registers, and so on. In addition, the voice varies greatly with different factors; a person's mood, health, the time of day and other circumstances all change speech. And there are environmental problems such as noise, the quality of the audio equipment, and so on.
The hard thing in synthesis is to give the synthetic voice naturalness, spontaneity and humanity, that is, to give the voice the 'identity' we want.
What do you think has been solved, and what has not?
(Photo: Imma Hernaez)
In speech recognition, when the vocabulary to be recognized is small and the system is trained on the speaker's voice, very satisfactory results are obtained, even in adverse environmental conditions. The problems start when you move away from those conditions: for spontaneous conversation (that is, with unrestricted vocabulary and continuous speech), very satisfactory results have not yet been achieved. It is necessary to use a headset ("pilot") microphone and to adapt the system to the speaker's voice, that is, to train it with samples of the speaker's speech.
The voices of today's synthesis systems are perfectly understandable. Naturalness is also achieved when the phrases or paragraphs are short and a neutral reading style is used. When it comes to expressing emotion or expressiveness, however, synthesis systems still fail. The current systems that come closest to naturalness are corpus-based, that is, they use gigantic databases, and the final quality depends on the size of those databases: the larger the database, the better the quality.
In addition, it is always the voice of a single person, and if you want to change the voice, new databases must be created. The better approach would therefore be to use smaller databases and create different voices by modifying certain parameters of the model used to generate the voice; for the moment, however, we do not know exactly what those parameters should be if significant losses in signal quality are to be avoided.
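Why database size matters in corpus-based synthesis can be shown with a toy model of unit selection: the synthesizer chains recorded units together, and a larger database offers more candidates per sound, so smoother joins can be found. The unit names, join costs and databases below are entirely invented; real systems combine join costs with target costs over acoustic features.

```python
# Toy unit selection: pick one recorded unit per position so that the
# total "join cost" between consecutive units is minimal (Viterbi-style).

def best_sequence(candidates, join_cost):
    """candidates: list of candidate-unit lists, one list per position."""
    cost = {u: 0.0 for u in candidates[0]}   # cheapest cost ending in unit u
    backptr = []                             # backpointers per position
    for pos in range(1, len(candidates)):
        new_cost, links = {}, {}
        for u in candidates[pos]:
            prev = min(candidates[pos - 1], key=lambda p: cost[p] + join_cost(p, u))
            new_cost[u] = cost[prev] + join_cost(prev, u)
            links[u] = prev
        cost, backptr = new_cost, backptr + [links]
    last = min(cost, key=cost.get)
    path = [last]
    for links in reversed(backptr):          # trace the cheapest path back
        path.append(links[path[-1]])
    return list(reversed(path)), cost[last]

# Units are named "<sound><take>"; by convention here, units from the
# same recording take (same digit) join smoothly, others do not.
def join_cost(a, b):
    return 0.0 if a[-1] == b[-1] else 1.0

small_db = [["a1"], ["b2"], ["c1"]]                  # few takes per sound
big_db = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]  # more takes per sound

print(best_sequence(small_db, join_cost)[1])  # 2.0: bad joins are forced
print(best_sequence(big_db, join_cost)[1])    # 0.0: a smooth path exists
```

The small database forces two audible joins; the larger one contains a seamless path, which mirrors Hernaez's point that quality grows with database size.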
How does Basque compare with other languages? (I do not know whether it is a distinctive language from the point of view of pronunciation.)
From the research point of view, Basque is not far behind other languages, especially as regards methods and techniques. From a commercial point of view (especially in recognition), building commercial systems requires the standard databases that developer companies use to train and test their systems, so that the same software can be reused across languages; and we have very few of these. Moreover, developments so far have generally been limited to unified Basque (batua), and the reality of spoken Basque is not the same as that of our neighbouring languages (the major European languages, for example). The distance between batua and the dialects can be very large, and if recognition systems do not accommodate the dialects, it may be that only a limited part of society will use them.
Sponsors: Basque Government Department of Industry, Commerce and Tourism