What is the CLUVI Corpus?

The CLUVI Corpus (ISLRN: 910-993-402-072-9) is an open collection of human-annotated parallel corpora aligned at sentence level, developed by the SLI at the University of Vigo. It was originally designed to cover specific areas of the contemporary Galician language in relation to other languages. With over 49 million words, the CLUVI collection currently comprises twenty-three parallel corpora in nine specialised registers or domains (fiction, computing, popular science, biblical texts, law, consumer information, economy, tourism, and film subtitling) and various language combinations involving Galician, Spanish, English, French, Brazilian Portuguese, European Portuguese, Catalan, Italian, Basque, German, Latin, Simplified Chinese and Traditional Chinese.

At the moment, the CLUVI is the parallel corpus that contains the greatest number and the most varied thematic range of translations from and into Galician. The Galician texts in the CLUVI collection amount to about 12 million words, roughly a quarter of all the tokens in the corpus across all languages and translation domains.

Since September 2003, the SLI has offered the possibility of searching and browsing the CLUVI parallel corpora online at http://sli.uvigo.gal/CLUVI/. Our search application allows for very complex searches of isolated words or sequences of words, and shows the bilingual equivalents of the terms in context, as they appear in real and referenced translations. When the search term is a lemma in one of the languages, the results may also use colour codes to suggest its lexical equivalents in the translation languages.

Furthermore, the multimedia corpora integrated in the CLUVI collection (the VEIGA multimedia corpus of English-Galician film subtitling and the LITTERA audio-textual corpus of English-Spanish literary texts) allow access to video sequences or sound fragments corresponding to the translations offered as a result of the search.

For copyright reasons, the application returns at most 1,500 hits, so as not to exceed the limits of the right to quote. Users can search for terms in either language of the corpus, and can also carry out true bilingual searches, that is, searches for bilingual segments that simultaneously contain one term in the source language and another term in the target language. Search results are displayed in parallel as a list of translation units. In addition, the LEGA Corpus from CLUVI, a corpus of Galician-Spanish legal texts with 6.5 million words, can be freely downloaded from http://hdl.handle.net/10230/20051 under a CC BY-NC-SA 3.0 licence.
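As an illustration only, the idea of a true bilingual search with a capped number of hits can be sketched in a few lines of Python. This is not the CLUVI search application: the `TranslationUnit` class, the `bilingual_search` function and the whole-word, case-insensitive matching rule are assumptions made for the example; only the 1,500-hit limit comes from the description above.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

MAX_HITS = 1500  # cap mentioned in the text; everything else here is hypothetical


@dataclass(frozen=True)
class TranslationUnit:
    """One aligned pair: a source-language segment and its translation."""
    source: str
    target: str


def contains_term(segment: str, term: str) -> bool:
    """Whole-word, case-insensitive match (an assumed matching rule)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, segment, re.IGNORECASE) is not None


def bilingual_search(
    units: Iterable[TranslationUnit],
    source_term: Optional[str] = None,
    target_term: Optional[str] = None,
    max_hits: int = MAX_HITS,
) -> List[TranslationUnit]:
    """Return units matching every term given, stopping at max_hits.

    With both terms set, a unit is kept only if the source segment contains
    source_term AND the target segment contains target_term.
    """
    if source_term is None and target_term is None:
        raise ValueError("at least one search term is required")
    hits: List[TranslationUnit] = []
    for unit in units:
        if len(hits) >= max_hits:
            break
        if source_term is not None and not contains_term(unit.source, source_term):
            continue
        if target_term is not None and not contains_term(unit.target, target_term):
            continue
        hits.append(unit)
    return hits
```

Calling `bilingual_search(units, source_term="casa", target_term="house")` on a list of aligned segments would keep only the units where both conditions hold at once, while passing a single term reduces to an ordinary monolingual search over one side of the corpus.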

The number of aligned works and language pairs available on the website increases regularly, since the CLUVI is an ongoing academic research project.