What is the SensoGal Corpus?

The SensoGal Corpus (ISLRN: 9653-144-288-768-2) is an open collection of sentence-aligned parallel corpora, lemmatised and sense-tagged according to WordNet 3.0. It was developed by the SLI at the University of Vigo from a selection of the corpora in the CLUVI Corpus and from the English-Galician translation of 30 texts from the English SemCor Corpus. Its aim is to explore the possibilities of semantic processing of parallel corpora in research, in teaching and in the development of language technology applications.

As a whole, the SensoGal Corpus comprises 35,731,774 words (843,276 translation units) and covers translations in different language combinations involving Galician, Spanish, English, French, Portuguese, Catalan and Basque, in six specialised registers or domains: fiction, popular science, biblical texts, law and public administration, consumer information and film subtitling.

The English-Galician SemCor Parallel Corpus

The SemCor Corpus is a monolingual, semantically annotated English text corpus created by the WordNet team. It consists of 360,000 words in 352 texts extracted from the Brown Corpus. It is the largest open-access semantically annotated corpus, with 192,639 lexical words (nouns, verbs, adjectives and adverbs) originally annotated for their meaning according to WordNet 1.6. Of the 352 texts, only 186 are fully annotated for part of speech, lemma and meaning, while the other 166 have only the verbs semantically annotated.

The original English SemCor was processed at the SLI to add new semantic labels that refer to WordNet 3.0 ILIs (interlingual indexes) and to link them to Galnet, the Galician WordNet. The resulting re-labelled SemCor-ILI Corpus can be consulted through a query interface at http://sli.uvigo.gal/SemCor/. Both the re-labelled corpus and the mapping used to produce it are available for download at http://sli.uvigo.gal/download/.
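
As a rough illustration of what such a re-labelling step involves, the following Python sketch replaces the sense label of each token by looking it up in a mapping table and keeps track of labels that have no entry. This is only a minimal sketch under assumptions: the `Token` structure, the `relabel` function and all identifiers (`old-sense-1`, `ili-placeholder-1`, the part-of-speech tags) are hypothetical placeholders, not the format of the actual SemCor files or of the mapping distributed by the SLI.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Token:
    """One corpus token (hypothetical structure, for illustration only)."""
    form: str
    lemma: str
    pos: str
    sense: Optional[str]  # sense identifier, or None for untagged tokens


def relabel(tokens, mapping):
    """Return (relabelled tokens, tokens whose sense is missing from the mapping)."""
    relabelled, unmapped = [], []
    for tok in tokens:
        if tok.sense is None:
            # Function words and other untagged tokens are copied unchanged.
            relabelled.append(tok)
        elif tok.sense in mapping:
            relabelled.append(replace(tok, sense=mapping[tok.sense]))
        else:
            # Keep the old label and report the token for manual inspection.
            relabelled.append(tok)
            unmapped.append(tok)
    return relabelled, unmapped


# Placeholder identifiers: these are NOT real WordNet 1.6 or ILI codes.
OLD_TO_ILI = {"old-sense-1": "ili-placeholder-1"}

sentence = [
    Token("The", "the", "DET", None),
    Token("dogs", "dog", "NOUN", "old-sense-1"),
    Token("barked", "bark", "VERB", "old-sense-2"),
]

relabelled, unmapped = relabel(sentence, OLD_TO_ILI)
```

Returning the unmapped tokens separately, rather than silently dropping their labels, is a design choice of this sketch: it makes it easy to see which senses a given mapping does not cover.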

The English-Galician SemCor Parallel Corpus is a corpus under development that forms part of the SensoGal Corpus collection. Its goal is to align the 186 fully annotated SemCor texts with equally annotated translations into Galician. In its current state, it contains the sense-tagged translations of 30 texts, totalling 2,734 translation units, with 61,236 words in English and 62,577 in Galician. Within the SensoGal Corpus as a whole, the English-Galician SemCor Corpus is a quantitatively small section. Its importance lies in the fact that it is the only section to have undergone complete human revision, which gives it a very high degree of reliability in both lemmatisation and semantic labelling.

The CLUVI semantic tagging

With the exception of the English-Galician SemCor Parallel Corpus, the bulk of the SensoGal Corpus comes from the linguistic processing of a selection of 12 of the 24 corpora gathered in the CLUVI Corpus. This selection was lemmatised and semantically tagged with various open source tools provided by FreeLing, IXA pipes and UKB. These automatic linguistic analysis tools are only relatively accurate and therefore introduce incorrect analyses into the tagged corpus, which would require human review.

Unfortunately, we cannot undertake this review at the SLI due to a lack of human resources. Even so, we think that the SensoGal Corpus, in its current state and in spite of all its defects, can be an interesting linguistic resource for language processing and a useful reference tool in the fields of language learning and translation. In any case, we publish it in the hope that its existence will inspire and encourage similar future developments with greater support and a larger workforce.

The search application

Since May 2015, the SLI has offered the possibility of searching and browsing the SensoGal parallel corpora online at http://sli.uvigo.gal/SensoGal/. The search application allows searches for words, lemmas or concepts, optionally restricted by part of speech, and displays the results as bilingual equivalences of the terms in context, as they appear in real, referenced translations and in their tagged version. In addition, the LITTERA Audio-Textual Corpus of English-Spanish literary texts provides access to the sound recordings corresponding to the translations displayed as search results.
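
The kind of query described above (by word, lemma or concept, with an optional part-of-speech filter) can be sketched in Python as follows. This is a hedged illustration only and not the implementation of the SensoGal search application: the `Token` and `TranslationUnit` structures, the `search` function, the concept identifiers (`c-dog`, `c-sleep-v`, `c-sleep-n`), the tag names and the two example sentence pairs are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    pos: str
    concept: Optional[str] = None  # placeholder concept identifier


@dataclass(frozen=True)
class TranslationUnit:
    source: Tuple[Token, ...]
    target: Tuple[Token, ...]
    reference: str  # where the translation comes from (hypothetical label)


def search(units, value, field="lemma", pos=None):
    """Find units whose source side has a token with `field` equal to `value`.

    Returns a list of (unit, matched source tokens, target tokens that carry
    the same concept as a matched source token).
    """
    if field not in ("form", "lemma", "concept"):
        raise ValueError("field must be 'form', 'lemma' or 'concept'")
    hits = []
    for unit in units:
        matched = [
            tok for tok in unit.source
            if getattr(tok, field) == value and (pos is None or tok.pos == pos)
        ]
        if not matched:
            continue
        concepts = {tok.concept for tok in matched if tok.concept is not None}
        equivalents = [tok for tok in unit.target if tok.concept in concepts]
        hits.append((unit, matched, equivalents))
    return hits


# Invented English-Galician examples with placeholder concept identifiers.
UNITS = [
    TranslationUnit(
        source=(Token("The", "the", "DET"),
                Token("dog", "dog", "NOUN", "c-dog"),
                Token("sleeps", "sleep", "VERB", "c-sleep-v")),
        target=(Token("O", "o", "DET"),
                Token("can", "can", "NOUN", "c-dog"),
                Token("dorme", "durmir", "VERB", "c-sleep-v")),
        reference="example-1",
    ),
    TranslationUnit(
        source=(Token("Sleep", "sleep", "NOUN", "c-sleep-n"),
                Token("helps", "help", "VERB", "c-help")),
        target=(Token("O", "o", "DET"),
                Token("sono", "sono", "NOUN", "c-sleep-n"),
                Token("axuda", "axudar", "VERB", "c-help")),
        reference="example-2",
    ),
]

all_hits = search(UNITS, "sleep", field="lemma")
verb_hits = search(UNITS, "sleep", field="lemma", pos="VERB")
dog_hits = search(UNITS, "c-dog", field="concept")
```

In this sketch the lemma query finds both units, while adding `pos="VERB"` keeps only the first one. The target-side equivalents are located through the shared concept label, which is one simple way of pairing terms across languages when both sides are sense-tagged.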

A new utility of the SensoGal Corpus search application is the simultaneous querying of several corpora in the same language combination, aimed at obtaining the highest number of results for a particular pair of languages in the corpus. This added functionality allows the consultation of corpus collections in the Galician-Catalan, Galician-Spanish, Galician-French, Galician-Basque, English-Galician, English-Spanish and Basque-Spanish combinations.
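
The idea of querying several corpora of the same language combination at once can be pictured with the short Python sketch below. It is an assumption-laden illustration, not the application's actual code: the `Corpus` structure, the `query_pair` function, the corpus names, the language codes and the example sentences are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Corpus:
    name: str                           # hypothetical corpus name
    pair: Tuple[str, str]               # (source language, target language)
    units: Tuple[Tuple[str, str], ...]  # (source text, target text) pairs


def query_pair(corpora, pair, word):
    """Search every corpus with the given language pair for a source-side word.

    Returns a list of (corpus name, source text, target text), in corpus order.
    """
    needle = word.lower()
    results = []
    for corpus in corpora:
        if corpus.pair != pair:
            continue  # skip corpora in other language combinations
        for source, target in corpus.units:
            # Naive whitespace tokenisation, enough for this illustration.
            if needle in source.lower().split():
                results.append((corpus.name, source, target))
    return results


# Invented data: two English-Galician corpora and one English-Spanish corpus.
CORPORA = [
    Corpus("corpus-A", ("en", "gl"), (("the dog sleeps", "o can dorme"),)),
    Corpus("corpus-B", ("en", "gl"), (("a dog barks", "un can ladra"),)),
    Corpus("corpus-C", ("en", "es"), (("the dog sleeps", "el perro duerme"),)),
]

en_gl_results = query_pair(CORPORA, ("en", "gl"), "dog")
en_es_results = query_pair(CORPORA, ("en", "es"), "dog")
```

Each result keeps the name of the corpus it came from, so that merged results can still be traced back to a specific collection.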

The How to Search menu, accessible from the header of this application, gives a detailed description of all the search types and options, as well as an explanation of the display options and of the information contained in the results.