The Tesouro Informatizado da Lingua Galega (TILG) was originally conceived as an aid for the development of a dictionary. Modern Galician lexicography has a fairly long tradition, dating from the appearance of the first printed dictionary in 1863, and has produced several works of note; yet most of them are oriented primarily to the Galician dialects rather than to the literary language. Although occasional passages from written works are cited from Valladares onward, the number of such citations is negligible compared with the attention paid to items of dialect origin or to products of the lexicographer’s own introspection. As a result, countless words and senses of literary Galician never made it into the dictionaries. This alone would be reason enough to justify the creation of a corpus. In addition, modern reference dictionaries, and not just historical ones, generally include various types of information needed to profile lexical items adequately: their meaning and combinatory potential, frequency, chronological distribution and so on. This is the context in which it was decided to develop TILG, which was originally known as the Base de datos lexicográfica para un dicionario da lingua galega.
Up until 1985, the ILG had been building a word bank from materials gathered in dialect work, recorded on index cards and sorted in box files in the traditional way. In the eighties, as computing became widely available and concordancing programmes became commonplace, greatly simplifying the extraction of lexis from written texts, we conceived the idea of developing an electronic system to take advantage of these new technological possibilities. By 1985 the idea of such a database was of course nothing new: one had already been created in France in the sixties (when punched cards were still in use) for the Trésor de la langue française. Closer to home, digital files were being developed in the eighties for the Basque and Catalan languages, under the direction of Ibon Sarasola of the Academy of the Basque Language and Joaquim Rafel of the Institut d’Estudis Catalans, respectively. By talking to the teams in charge of those projects, we were able to benefit from their experience when deciding how to encode the texts and process them on computers.
Taking a text from the printed page to a finished lemmatized database involves many steps. First it must be digitized by scanning followed by optical character recognition, except when the poor quality of the image makes it necessary to type the text in manually. Once in text format, it must still undergo several processes, including the preparation of a special machine-readable edition.
Once the text has been digitized, software automatically performs most of the further processing of the data, except for filling in the headword and part-of-speech fields. There are now automatic tagging programmes capable of generating headwords and parts of speech, but in an under-standardized language such as Galician they create many problems. The multiple ambiguities inherent in any language are hard enough to resolve as it is; here the difficulties are compounded by countless unpredictable forms arising from a profusion of morphological and phonetic variants, some of them dialectal and others the product of a given author’s purist or even hyper-purist notions. An automatic lemmatizer can assign the unmarked form (the dictionary form) to any word form, but it cannot classify non-morphological variants. In other words, it can deduce a headword azucre from the forms azucre and azucres (and identify the former as the singular and the latter as the plural), but that is all. These considerations, plus the fact that tagging programmes were in their infancy when we began this work, led to the decision to write software that helps us fill in the headword and part-of-speech columns semi-automatically, allowing disambiguation at a single keystroke so that empty fields can be filled in without typing their content manually. This allows us to include all of a word’s morphological and phonetic variants under a single headword. So if you look up the headword azucre you will find eight variants (with their corresponding plurals when they exist): asúcar, asucre, azúcar, azucr’, azucre, sucre, zúcaro and zucre. The programme groups all of these under the canonical form azucre, which coincides with the current standard form. This way of working has the drawback of being time-consuming, but its advantage is that it makes it possible to return all the variants of a single lexical item as a group.
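The grouping of variants under a canonical headword can be illustrated with a minimal Python sketch. It is not the TILG software: the table layout and the names `VARIANT_TO_HEADWORD`, `build_index` and `lemmatize` are hypothetical, and only the eight variants of azucre come from the text above. Forms missing from the table are returned with no headword, standing in for the cases that an editor would have to resolve by hand.

```python
from collections import defaultdict

# Variant forms listed in the text for the headword "azucre".
# The table itself is a hypothetical stand-in for the editors' decisions.
VARIANT_TO_HEADWORD = {
    "asúcar": "azucre",
    "asucre": "azucre",
    "azúcar": "azucre",
    "azucr’": "azucre",
    "azucre": "azucre",
    "sucre": "azucre",
    "zúcaro": "azucre",
    "zucre": "azucre",
}


def build_index(variant_to_headword):
    """Invert the table: map each headword to the sorted list of its variants."""
    index = defaultdict(set)
    for variant, headword in variant_to_headword.items():
        index[headword].add(variant)
    return {headword: sorted(variants) for headword, variants in index.items()}


def lemmatize(tokens, variant_to_headword):
    """Pair each token with its headword, or None when the form is unknown.

    A None result marks a form that needs manual disambiguation.
    """
    return [(token, variant_to_headword.get(token.lower())) for token in tokens]


INDEX = build_index(VARIANT_TO_HEADWORD)
```

Looking up `INDEX["azucre"]` returns all eight variants as a group, while `lemmatize(["Zucre", "pan"], VARIANT_TO_HEADWORD)` assigns azucre to the first token and leaves the second without a headword.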
The process summarized here required an enormous effort in terms of human and economic resources. Uninterrupted funding has been received from the Dirección Xeral de Política Lingüística (and its successor the Secretaria Xeral de Política Lingüística) of the Xunta de Galicia. The project’s human resources consist of the staff it has employed and other contributors to the centre’s work.
As noted above, the database was originally intended as material for a dictionary. Initially no one suspected that it would be anything but a rather sui generis file, of little use beyond the purpose for which it had been designed and searchable only in situ. Not long afterwards, however, these limitations on its use and physical location vanished, and the material was adapted to make it instantly searchable for many purposes from anywhere in the world.
The need to expand its possibilities of use as a linguistic resource even further led to its incorporation into the Integrated Resources of the Galician Language (RILG) as of 2006. To this end the corpus was brought up to date in cooperation with the Computer Linguistics Seminar of the University of Vigo. Finally, for the version presented here the project has been revised and updated with a new search interface and several additional tools provided by the team responsible for the project. The following graphs provide information about the make-up and chronological distribution of the corpus.
Total number of lemmas: 95,409
The graph shows the number of lemmas recorded in each chronological period. For each period, separate figures are given for new lemmas (recorded for the first time in that period) and for lemmas already documented in earlier periods.
Total number of words: 26,253,108
Total number of works: 1,958