ABOUT THE CORPUS

CORTEGAL is a corpus composed of 1,000 handwritten texts produced in 2017 by students from Galicia as part of the university entrance exam known as ABAU, the “Avaliación do Bacharelato para o acceso á Universidade” (Assessment of Baccalaureate for University Access). More specifically, the texts correspond to the essay section of the Galician language and literature exam. In this part of the exam, students are asked to write an argumentative text of between 200 and 250 words on a given topic, which is linked to a previously provided text. The essays, provided by the Interuniversity Commission of Galicia (CIUG), come from both the June and September exams of the 2016–2017 academic year. A total of 8,669 students took the exam in June and 1,197 in September (9,866 exams in all), so the CORTEGAL sample represents approximately 10.1% of all exams submitted.

In each session, two exam options are offered, and students must choose one of them. The corpus includes essays from all four possible options for the essay question, which are listed below:

June 2017

Option A (initial text by Fran Alonso in Dorna 27, 2001)
In recent years, gastronomy and cooking have gained much popularity. Write a text expressing your opinion on this phenomenon: its causes, whether it is a passing trend or a more lasting cultural change...

Option B (initial text by J. Luís Sucasas in Vieiros, 2009)
Write a text about the importance of consumption and production (or consumerism and productivity) in our current way of life.

September 2017

Option A (initial text by Xavier Quiroga from Zapatillas rotas, 2014)
Present your personal opinion, with arguments, on the problem reflected in the text and, more generally, on this type of family conflict between parents and teenage children.

Option B (initial text by Mercedes Queixas in Palavra Comum, 09/10/2015)
The author is critical of the fact that children and young people mostly dream of becoming footballers or models (line 10). Write a text arguing either your agreement or disagreement with her point of view.

The educational institutions attended by students taking the ABAU exams are assigned to 26 Delegated Commissions (hereinafter DCs). Each DC covers a broad geographical area and includes both public and private schools, with students from diverse backgrounds (urban, peri-urban, small towns, or rural areas). The list of DCs and the educational institutions assigned to them for the 2016–2017 academic year can be found in the document Teaching centres associated with the delegated commissions of the ABAU tests (2016–2017).

The number of texts in the CORTEGAL sample is proportional to the total number of exams per DC and exam session (June and September), with two exceptions. First, the 29 exams corresponding to DC 25 were excluded, because this DC gathers exams from students with specific needs from across Galicia rather than from a geographically defined area. Second, although the actual distribution of exams between June and September is 87.9%–12.1%, the CORTEGAL sample uses a slightly adjusted distribution (89.8%–10.2%) to reduce the weight of students who took the exam in both sessions after failing in June.

The exact distribution of exams by session and topic in the CORTEGAL sample is presented in Table 1:

 

Session            Topic                         Number of texts in the sample    Percentage of the total number of texts
June               Gastronomy                    449                              44.9%
June               Consumption and production    449                              44.9%
Total June                                       898                              89.8%
September          Family conflicts              51                               5.1%
September          Young people's role models    51                               5.1%
Total September                                  102                              10.2%
Total                                            1,000                            100%

Table 1. Distribution of exams by exam session and topic in the CORTEGAL sample
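
The session shares quoted above can be cross-checked from the reported exam counts, and the idea of a proportional sample can be sketched in a few lines. The following is only an illustration: the `allocate` function uses a largest-remainder rounding rule, which is an assumption (the rounding procedure actually used for CORTEGAL is not described here), and the three DC counts passed to it are invented for the example.

```python
# Exam counts per session, as reported in the corpus description.
JUNE_EXAMS = 8669
SEPTEMBER_EXAMS = 1197

# Texts per session in the CORTEGAL sample (Table 1).
SAMPLE_JUNE = 898
SAMPLE_SEPTEMBER = 102


def share(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


def allocate(counts: dict[str, int], sample_size: int) -> dict[str, int]:
    """Split sample_size across groups in proportion to their counts.

    Uses largest-remainder rounding, an ASSUMED rule chosen for this
    sketch so that the allocations always add up to sample_size.
    """
    total = sum(counts.values())
    quotas = {name: sample_size * n / total for name, n in counts.items()}
    allocation = {name: int(q) for name, q in quotas.items()}
    leftover = sample_size - sum(allocation.values())
    # Hand the remaining units to the groups with the largest fractional part.
    by_remainder = sorted(
        quotas, key=lambda name: quotas[name] - allocation[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


total_exams = JUNE_EXAMS + SEPTEMBER_EXAMS
sample_total = SAMPLE_JUNE + SAMPLE_SEPTEMBER

# Actual split of exams between sessions: (87.9, 12.1)
actual = (share(JUNE_EXAMS, total_exams), share(SEPTEMBER_EXAMS, total_exams))

# Adjusted split used in the sample: (89.8, 10.2)
adjusted = (share(SAMPLE_JUNE, sample_total), share(SAMPLE_SEPTEMBER, sample_total))

# Hypothetical exam counts for three DCs; NOT real CORTEGAL figures.
example_counts = {"DC A": 410, "DC B": 350, "DC C": 240}
example_allocation = allocate(example_counts, 10)

print(actual, adjusted, example_allocation)
```

With the invented counts, a sample of 10 texts is split 4, 4 and 2 across the three DCs, which shows how rounding has to be handled when the proportional quotas are not whole numbers.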

The CORTEGAL texts are transcribed, tokenised, and annotated on the TEITOK platform (see the Transcription technical manual for further information). Regarding the annotations, it is important to note that non-standard forms written by students have been standardised and coded across six linguistic levels: orthographic, morphological, lexical, grammatical (syntactic), semantic, and discourse. The assigned codes indicate the type of deviation from the standard and, in some cases, also the source of the divergence (e.g. analogy, transfer from Spanish, etc.). A few codes, which affect or may affect word sequences (such as those identifying overly complex utterances), are annotated using a different system from the rest (multi-word annotations using the stand-off format). 

The document EAGLES codes and labels used in the annotation of the texts (only in Galician) provides a full list of code values. The Annotation manual for non-standard forms offers a detailed explanation of the standardisation criteria and code assignment procedures.

CORTEGAL is automatically lemmatised using FreeLing, which assigns a standard lemma and POS to each lexical form (the latter using EAGLES tags, whose values are also available in the aforementioned document). A key methodological feature of CORTEGAL is the manual assignment of a second lemma—called the original lemma—to most of the standardised lexical forms, and in some cases, a second grammatical category—called the original POS. For example, the form platos receives two lemmas: the standard lemma prato, assigned automatically by FreeLing, and the original lemma plato, manually assigned. The original lemma is the citation form representing the word written by the student (after orthographic and morphological standardisation, if applicable), including any inflectional variation found in the corpus. The original POS is also manually added when the written form differs in word class or subclass from the standardised lexical form. A clear example is found in forms labelled L_gen_su, which involve non-standard gender assignment. In such cases, the word leite in the phrase a leite is assigned the standard tag NCMS000 (common noun, masculine, singular), while its original tag is NCFS000 (common noun, feminine, singular).
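
The double lemma and double POS described above can be pictured as a small record per token. The sketch below is a hypothetical data model for illustration only: the class name, field names and helper methods are assumptions and do not reflect the TEITOK or FreeLing data formats. The standardised form pratos and the tag NCMP000 for platos are likewise inferred rather than quoted from the corpus documentation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AnnotatedToken:
    """Hypothetical record for one CORTEGAL token (illustrative only)."""

    written: str                          # form as written by the student
    standard: str                         # standardised form
    lemma: str                            # standard lemma (automatic, FreeLing)
    pos: str                              # standard EAGLES tag (automatic)
    original_lemma: Optional[str] = None  # manually assigned, if any
    original_pos: Optional[str] = None    # manually assigned, if any
    codes: Tuple[str, ...] = ()           # non-standard form codes

    def lemma_pair(self) -> Tuple[str, str]:
        """Return (standard lemma, original lemma), falling back to the standard one."""
        return (self.lemma, self.original_lemma or self.lemma)

    def pos_pair(self) -> Tuple[str, str]:
        """Return (standard tag, original tag), falling back to the standard one."""
        return (self.pos, self.original_pos or self.pos)

    def pos_diverges(self) -> bool:
        """True when a different original POS was recorded for this token."""
        return self.original_pos is not None and self.original_pos != self.pos


# "platos": standard lemma "prato", original lemma "plato".
# The standardised form and the plural tag are assumptions for this sketch.
platos = AnnotatedToken(
    written="platos",
    standard="pratos",
    lemma="prato",
    pos="NCMP000",
    original_lemma="plato",
)

# "leite" in "a leite": standard tag NCMS000, original tag NCFS000.
leite = AnnotatedToken(
    written="leite",
    standard="leite",
    lemma="leite",
    pos="NCMS000",
    original_pos="NCFS000",
    codes=("L_gen_su",),
)

print(platos.lemma_pair())
print(leite.pos_pair(), leite.pos_diverges())
```

Keeping the original values optional mirrors the fact that only standardised forms receive a second lemma, and only some of those receive a second POS.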

Discourse connectors, which link utterances (excluding temporal relations and intra-sentential links), have also been annotated. These textual connectors are classified into 13 groups based on their pragmatic function, as described in the document EAGLES codes and labels used in the annotation of the texts.

Regarding text visualisation, the texts can be consulted in various layers: the full transcription, which includes student deletions (forms added later, such as those inserted between lines or over struck-through text, are highlighted in red); the final version of the student’s text, where deleted forms are removed; and the six standardisation layers mentioned above. The standardised forms are highlighted in different colours, provided that the "Colours" option is enabled. The colour scheme is as follows: orthographic in lilac, morphological in orange, lexical in green, grammatical in salmon, semantic in blue, and discourse in fuchsia.
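
For readers who process the corpus outside the web interface, the colour scheme just listed can be kept as a simple lookup table. This is a minimal sketch: the dictionary and function names are hypothetical, and the colour names are plain labels taken from the description above, not the exact values used by the platform.

```python
from typing import Optional

# Standardisation level -> highlight colour, as listed in the description.
LAYER_COLOURS = {
    "orthographic": "lilac",
    "morphological": "orange",
    "lexical": "green",
    "grammatical": "salmon",
    "semantic": "blue",
    "discourse": "fuchsia",
}


def highlight_colour(level: str, colours_enabled: bool = True) -> Optional[str]:
    """Return the highlight colour for a level, or None when colours are off."""
    if not colours_enabled:
        return None
    return LAYER_COLOURS[level]


print(highlight_colour("lexical"))
print(highlight_colour("lexical", colours_enabled=False))
```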