About the SensoGal Corpus

Version
2.0

ISLRN
653-144-288-768-2

Project management, web design and development
Xavier Gómez Guinovart (SLI-UVIGO)

URL
https://ilg.usc.gal/sensogal/

Resource Type
Lemmatised and Semantically-Tagged Parallel Corpus

Size
35,731,774 words (843,276 translation units)

Languages
Galician, Spanish, English, French, Portuguese, Catalan and Basque

SensoGal Corpus sections
English-Galician SEMCOR Corpus (123,813 words)
BIBLOGAL4C Corpus of Galician-Catalan-Spanish-Basque biblical texts (2,591,360 words)
BIBLOGAL4P Corpus of Galician-European Portuguese-French-English biblical texts (2,798,904 words)
LEGA Corpus of Galician-Spanish legal texts (6,582,415 words)
UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation (3,724,620 words)
CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information (5,586,431 words)
TECTRA Corpus of English-Galician literary texts (2,465,154 words)
FEGA Corpus of French-Galician literary texts (1,863,959 words)
VEIGA Corpus of English-Galician film subtitling (294,714 words)
LEGE-BI Corpus of Basque-Spanish legal texts (2,384,053 words)
TECTES Corpus of English-Spanish literary texts (3,602,857 words)
LITTERA Audio-Textual Corpus of English-Spanish literary texts (1,968,676 words)
TECTON Corpus of English-Portuguese literary texts (924,425 words)
BETA Corpus of English-Spanish television series subtitling (820,393 words)

SensoGal Corpora Collection
Galician-Catalan SENSOGAL Corpora Collection: CONSUMER and BIBLOGAL4C Corpus (3,997,312 words)
Galician-Spanish SENSOGAL Corpora Collection: CONSUMER, UNESCO, LEGA and BIBLOGAL4C Corpus (12,827,373 words)
Galician-French SENSOGAL Corpora Collection: FEGA, UNESCO and BIBLOGAL4P Corpus (5,075,023 words)
Galician-Basque SENSOGAL Corpora Collection: CONSUMER and BIBLOGAL4C Corpus (3,610,874 words)
English-Galician SENSOGAL Corpora Collection: SEMCOR, TECTRA, VEIGA, UNESCO and BIBLOGAL4P Corpus (6,130,433 words)
English-Spanish SENSOGAL Corpora Collection: UNESCO, TECTES and BETA Corpus (6,313,033 words)
Basque-Spanish SENSOGAL Corpora Collection: CONSUMER, LEGE-BI and BIBLOGAL4C Corpus (6,564,532 words)

Contributors
Susana Brandariz (Galician translations from original English texts in the English-Galician SEMCOR Corpus)
Miguel Anxo Solla Portela (project design and software development)
Michael Lang (project design and interface localisation into English)

Publications

Gómez Guinovart, Xavier and Miguel Anxo Solla Portela (2020): Construction of a WordNet-based multilingual lexical ontology for Galician. In María José Domínguez Vázquez, Mónica Mirazo Balsa and Carlos Valcárcel Riveiro (eds.): Studies on Multilingual Lexicography (ISBN: 978-3-11-060467-2, ISSN: 0175-9264), De Gruyter, Berlin & Boston, pp. 179-196. DOI: https://doi.org/10.1515/9783110607659

Gómez Guinovart, Xavier (2019): Enriching parallel corpora with multimedia and lexical semantics: From the CLUVI Corpus to WordNet and SemCor. In Irene Doval and M. Teresa Sánchez Nieto (eds.), Parallel Corpora for Contrastive and Translation Studies: New resources and applications (ISBN 978-90-272-0234-5), John Benjamins, Amsterdam, pp. 141-158. DOI: https://doi.org/10.1075/scl.90.09gom

Gómez Guinovart, Xavier and Miguel Anxo Solla Portela (2018): Building the Galician wordnet: methods and applications. In Language Resources and Evaluation, 52:1, 317-339 (ISSN 1574-020X). DOI: http://dx.doi.org/10.1007/s10579-017-9408-5

Simões, Alberto and Xavier Gómez Guinovart (2018): Extending the Galician Wordnet Using a Multilingual Bible Through Lexical Alignment and Semantic Annotation. In Pedro Rangel Henriques, José Paulo Leal, António Menezes Leitão and Xavier Gómez Guinovart (eds.): 7th Symposium on Languages, Applications and Technologies (SLATE 2018) (ISBN: 978-3-95977-072-9), Schloss Dagstuhl/Leibniz-Zentrum fuer Informatik, Dagstuhl (Germany), pp. 14:1-14:13. DOI: http://dx.doi.org/10.4230/OASIcs.SLATE.2018.14

Solla Portela, Miguel Anxo and Xavier Gómez Guinovart (2017): Diseño y elaboración del corpus SemCor del gallego anotado semánticamente con WordNet 3.0. In Procesamiento del Lenguaje Natural, 59, 137-140 (ISSN 1135-5948).


Composition of the SensoGal Corpus
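The following table gives, for each section of the SensoGal Corpus, the number of translation units (TUs), the total number of words, and the number of words per language. Language columns use two-letter codes: ES (Spanish), GL (Galician), EN (English), EU (Basque), FR (French), CA (Catalan) and PT (Portuguese).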

| Corpus section | TUs | Words | ES | GL | EN | EU | FR | CA | PT |
|---|---|---|---|---|---|---|---|---|---|
| English-Galician SEMCOR Corpus | 2,734 | 123,813 | | 62,577 | 61,236 | | | | |
| BIBLOGAL4C Corpus of Galician-Catalan-Spanish-Basque biblical texts | 31,279 | 2,591,360 | 706,125 | 656,998 | | 505,043 | | 723,194 | |
| BIBLOGAL4P Corpus of Galician-European Portuguese-French-English biblical texts | 31,279 | 2,798,904 | | 656,998 | 759,824 | | 719,229 | | 662,853 |
| LEGA Corpus of Galician-Spanish legal texts | 145,387 | 6,582,415 | 3,425,126 | 3,157,289 | | | | | |
| UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation | 47,905 | 3,724,620 | 962,085 | 902,232 | 927,698 | | 932,605 | | |
| CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information | 89,780 | 5,586,431 | 1,768,998 | 1,248,520 | | 1,200,313 | | 1,368,600 | |
| TECTRA Corpus of English-Galician literary texts | 85,292 | 2,465,154 | | 1,211,902 | 1,253,252 | | | | |
| FEGA Corpus of French-Galician literary texts | 50,563 | 1,863,959 | | 898,433 | | | 965,526 | | |
| VEIGA Corpus of English-Galician film subtitling | 27,837 | 294,714 | | 126,805 | 167,909 | | | | |
| LEGE-BI Corpus of Basque-Spanish legal texts | 82,549 | 2,384,053 | 1,402,521 | | | 981,532 | | | |
| TECTES Corpus of English-Spanish literary texts | 90,194 | 3,602,857 | 1,756,017 | | 1,846,840 | | | | |
| LITTERA Audio-Textual Corpus of English-Spanish literary texts | 63,508 | 1,968,676 | 985,058 | | 983,618 | | | | |
| TECTON Corpus of English-Portuguese literary texts | 35,726 | 924,425 | | | 470,365 | | | | 454,060 |
| BETA Corpus of English-Spanish television series subtitling | 59,243 | 820,393 | 376,780 | | 443,613 | | | | |
| Total in the SensoGal Corpus | 843,276 | 35,731,774 | 11,382,710 | 8,921,754 | 6,914,355 | 2,686,888 | 2,617,360 | 2,091,794 | 1,116,913 |
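
The per-language figures in the table can be cross-checked against the totals quoted elsewhere on this page. The following Python sketch is illustrative only: the function names (section_words, language_totals, pair_words) are hypothetical, and the rule that a language-pair collection counts only the words of its two languages in each listed section is an inference from the published numbers, not something the page states.

```python
# Words per language for each SensoGal section, copied from the composition table.
SECTIONS = {
    "SEMCOR": {"GL": 62577, "EN": 61236},
    "BIBLOGAL4C": {"ES": 706125, "GL": 656998, "EU": 505043, "CA": 723194},
    "BIBLOGAL4P": {"GL": 656998, "EN": 759824, "FR": 719229, "PT": 662853},
    "LEGA": {"ES": 3425126, "GL": 3157289},
    "UNESCO": {"ES": 962085, "GL": 902232, "EN": 927698, "FR": 932605},
    "CONSUMER": {"ES": 1768998, "GL": 1248520, "EU": 1200313, "CA": 1368600},
    "TECTRA": {"GL": 1211902, "EN": 1253252},
    "FEGA": {"GL": 898433, "FR": 965526},
    "VEIGA": {"GL": 126805, "EN": 167909},
    "LEGE-BI": {"ES": 1402521, "EU": 981532},
    "TECTES": {"ES": 1756017, "EN": 1846840},
    "LITTERA": {"ES": 985058, "EN": 983618},
    "TECTON": {"EN": 470365, "PT": 454060},
    "BETA": {"ES": 376780, "EN": 443613},
}


def section_words(name):
    """Total words in one section: the sum over its languages."""
    return sum(SECTIONS[name].values())


def language_totals():
    """Total words per language across all sections."""
    totals = {}
    for per_language in SECTIONS.values():
        for lang, words in per_language.items():
            totals[lang] = totals.get(lang, 0) + words
    return totals


def pair_words(lang_a, lang_b, section_names):
    """Words of two languages across the given sections.

    Assumption: a language-pair collection counts only these two
    languages in each of its sections.
    """
    return sum(
        SECTIONS[name][lang_a] + SECTIONS[name][lang_b]
        for name in section_names
    )


if __name__ == "__main__":
    # Whole corpus: 35731774 words, matching the Size field.
    print(sum(section_words(name) for name in SECTIONS))
    # Galician-Catalan collection: 3997312 words.
    print(pair_words("GL", "CA", ["CONSUMER", "BIBLOGAL4C"]))
    # English-Galician collection: 6130433 words.
    print(pair_words(
        "EN", "GL",
        ["SEMCOR", "TECTRA", "VEIGA", "UNESCO", "BIBLOGAL4P"],
    ))
```

Under that assumption, the computed pair totals agree with the word counts listed for the Galician-Catalan and English-Galician collections, and the section sums reproduce the overall corpus size.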