
HELP FOR THE QUERY APPLICATION

The application allows you to perform different search operations on the texts included in the corpus, either by using the “query builder” or by directly typing into the CQL search box. Additionally, it is possible to obtain information about the distribution of forms according to various criteria. This help section provides information about the search system, but a more detailed explanation can be found in the Advanced text search and visualization manual (only in Galician).

Query Builder

The query builder allows you to search both text and documents.

Text Search

Two general types of search

In the search fields for "Type of deviation from the standard", "Source of the non-standard form", "POS tag (standard)", "POS tag (original)", and "Connector", a selector is provided with different tags. You must select the desired tag and click on "Search" at the bottom left of the query builder.

In other cases, you must type the desired word into the appropriate search box. Specifically, by selecting the corresponding option from the dropdown menu, it is possible to search for:

(a) the full word entered in the search box,
(b) a word that begins with a specific sequence of characters,
(c) a word that ends with a specific sequence, or
(d) a word that contains a given sequence.

Once again, after selecting the search type, click on “Search”.

Concrete types of search

  • Student's Final Version

This allows you to locate a word exactly as it was written by the student. To search for a sequence of words, you can write the first word, then click “Add token” (located at the bottom of the query builder, just above the "Search" button), type the second word in the “Word” field, and repeat these steps for as many words as needed. Finally, click “Search” at the bottom. In any case, searching for word sequences is more easily done using the CQL search box.

It is important to note that in the case of contractions, verbs with enclitic pronouns, combinations of demonstratives with the indefinite outro, and the use of the second form of the article, the system splits the word into its component elements. Therefore, the “Student’s Final Version” search will not find, for example, das, trátase, estoutro, or tódolos. To find these, you must use a CQL query or enter the first component (e.g. trata) and then “Add token” for the next component (e.g. se), as previously described for word sequences. Accordingly, a search for the preposition de in the "Student’s Final Version" will return examples with de both as part of contractions and standalone; a search for trata will return both cases with and without an enclitic pronoun, and so on.
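
For example, the sequence trátase, which the system splits into trata and se, can be retrieved with the following CQL query (a minimal sketch, using the form attribute that appears in the CQL examples later in this help):

[form="trata"] [form="se"]

Each bracketed token matches one word, so this query only returns cases where se immediately follows trata.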

  • Orthographic / Morphological / Lexical / Grammatical / Semantic / Discursive Standard

Some words in the text are standardized at one or more of these six linguistic levels, and the corresponding search boxes allow users to locate the respective standardized forms.

In any case, it should be noted that a search using this method retrieves both forms standardized by the project team and those that were already standard in the student's writing. For example, someone searching in the “Morphological Standard” box for a sequence starting with estea will find both examples such as esté, estemos, esteñan, estén, esten..., which were standardized as estea, esteamos..., and cases where the student wrote the standard form directly.

In some cases, standardization involves suppressing the form written by the student. These suppressed forms, in addition to being recoverable by selecting the “Type of deviation from the standard” with the _ad (addition) label, can also be retrieved by entering two dashes (--) in the standardized form search boxes.

  • Type of Deviation from the Standard

The “Type of deviation from the standard” search box allows you to identify all forms tagged with a non-standard form code at any of the six linguistic levels mentioned above. The codes available in this search selector, along with their meanings and examples, are listed in the document EAGLES codes and labels used in the annotation of the texts (only in Galician).
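
These codes can also be queried through the problem attribute used in the CQL examples below. For instance, the following sketch would retrieve additions at any linguistic level, assuming every addition code contains the _ad label mentioned above:

[problem=".*_ad.*"]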

  • Source of the Non-standard Form

Some forms tagged as non-standard are accompanied by another label that identifies the origin or cause of the deviation. This search box allows users to locate such codes, whose meanings are detailed in the same document mentioned above.

  • POS tag (standard) and POS tag (original)

Searches can be made by morphosyntactic parts of speech (POS). The search tool provides a selector with major POS (noun, adjective, verb, etc.). If a more specific search is required (e.g., masculine singular noun or 1st person singular present indicative), it must be done using the CQL query box. The labels used can be found in the aforementioned EAGLES code document.

As for the difference between standard and original word class, note that some forms are corrected at the lexical level, which may lead to differences between the class of the word written by the student (original POS) and the normalized form (standard POS). For example, frente (masculine) might be corrected to fronte (feminine). Only in such cases will there be a difference between original (NCMS000) and standard (NCFS000) POS. In all other cases, search results for both paths will be identical.
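
A more specific search of the kind mentioned above can be sketched in the CQL box with the pos attribute used in the CQL examples below and the NCMS000 tag cited in this section:

[pos="NCMS000"]

This retrieves masculine singular common nouns; assuming the EAGLES tags keep the NC prefix for all common nouns, a wildcard pattern such as [pos="NC.*"] would cover common nouns of any gender and number.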

  • Standard Lemma and Original Lemma

These fields allow users to locate forms corresponding to a particular lemma (e.g., all occurrences of the verb valer in any of its conjugated forms).

Differences between searches by standard and original lemma stem from lexical-level standardizations. For instance, the original lemma of plato and platos is plato, while the standard lemma is prato. Likewise, the original lemma of articulo, artículo, and artículos is artículo, and the standard lemma is artigo. So, the original lemma represents the canonical form of the word as written by the student, and the standard lemma represents the canonical form of the standard Galician equivalent. If no lexical correction is applied, both lemmas coincide.
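
In the CQL box, the equivalent search uses the lemma attribute shown in the CQL examples below (a minimal sketch; when no lexical correction applies, standard and original lemma coincide):

[lemma="valer"]

This would return every occurrence lemmatized as valer, whatever the conjugated form.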

  • Connector

The “Connector” search box lets users find discourse markers used to link utterances within a text. These are tagged with labels listed in the EAGLES code document used in the CORTEGAL corpus.

  • Multiword Annotations

At the bottom right of the query builder, you can search for multiword annotations. Some deviations from the standard affect sequences of words and are annotated differently from the rest using stand-off annotations. These can only be retrieved using the “Deviation Code” selector in this part of the query tool. The codes and their meanings appear in the EAGLES codes and labels used in the annotation of the texts document (in the list of deviation codes marked as stand-off).

Document Search

Besides text searches, users can search for documents or combine both types. Documents can be filtered by:

  1. Title (e.g., ABAU/2016-2017/CD13/June/04). This field allows you to search for a specific text. The first part of the title (ABAU/2016-2017) is the same for all texts. What varies is the delegated commission number (CD13, CD09, etc.; see point 4), the examination date (June or September), and the final digits, which distinguish texts within the same delegated commission and date.
  2. Topic. You can select from four text topics: “Gastronomy” and “Consumption and Production” (both from the June session), and “Family Conflicts” and “Youth Role Models” (both from the September session). Choose one to filter your search.
  3. Examination date. You may choose to search only exams from the June session or only those from September.
  4. Delegated commission. Each ABAU exam is linked to a delegated commission (CD) assigned to the student’s center. Each CD includes various public and private centers within a specific geographical area. Users can select a particular delegated commission. A list of centers associated with each CD is available in the document Teaching centers associated with the delegated commissions of the ABAU tests (course 2016-2017). In the September session, which has significantly fewer students than June, some CDs are grouped together; this is why some combinations (e.g., 01-03-05) appear in the search tool.
  5. Number of words. You can filter texts by selecting those whose word count falls within a user-defined range. Counting was done using the DContado application, which excludes punctuation marks. Contractions, verbs with enclitic pronouns, and forms accompanied by the second form of the article are counted as single units.
  6. Number of lemmas. You can filter texts by the number of lemmas (distinct words), selecting those within a specified range.
  7. Lexical density. This is calculated by dividing the number of lemmas by the number of words, resulting in a value between 0 and 1. Since the search tool does not account for decimals, we multiply this value by 100 for filtering purposes. For example, a text with a lexical density of 0.54 is assigned the value 54, so the search range runs from 0 to 100.
  8. Number of sentences. You can filter texts by the number of sentences they contain, based on a chosen range.
  9. Words per sentence. Another criterion is the average number of words per sentence, calculated by dividing the total words by the number of sentences. Note that the application only considers the integer part of these averages. For instance, searching for texts with an average between 13 and 15 words per sentence will also include those with averages slightly above 15 (e.g., 15.3).
  10. Words in the longest sentence. You can search for texts whose longest sentence falls within a specified word count range.
  11. Words in the shortest sentence. Similarly, you can search for texts whose shortest sentence falls within a specified word count range.
  12. Number of paragraphs. The application allows filtering texts based on the number of paragraphs, within a user-defined range.
  13. Sentences per paragraph. Finally, you can filter texts by the average number of sentences per paragraph. The same rule about decimals (as mentioned above) applies here.

After performing a search exclusively through the Document Search tool (without combining it with a text search), a list of the retrieved documents is displayed, along with all their attributes. By clicking on the first column (ID), you can access each text. Additionally, clicking the Search button located to the right of the CQL query box—without entering anything in any of the search fields—returns a list of all the texts along with their characteristics.

On the other hand, it is possible to combine a text search with a document search to filter the results. For example, you can work only with the subcorpus of texts on the topic of “Gastronomy” by selecting the corresponding “Topic” in the dropdown menu and, at the same time, use the query builder to search for deviations from the standard, specific lemmas, etc., of interest. In this case, the result will be a concordance showing the lines of text that meet the search criteria—but only within the texts that address the topic of gastronomy.

CQL Searches (“Corpus Query Language”)

CQL searches allow you to find a word or sequence of words written by the student by entering it directly into the “CQL Query” box. The wildcard * can be used to represent zero or more characters.
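
For example, applying this rule, the following query would match any word beginning with cant- (cantar, cantaba, cantan, etc.):

cant*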

But beyond this, CQL queries enable complex searches and allow users to fully exploit the functionalities of the search engine. For instance, we can find any deviation—whether orthographic, morphological, or lexical—that affects accentuation by entering the following in the search box:

[problem = ".*ac.*"]

The logical operator & can be used to combine two search criteria that must be met simultaneously. For example, the query:

[pos = "CS" & lemma = "que"]

retrieves all instances of the lemma que tagged as a subordinating conjunction.

Likewise, we can use the logical operator | to search for several forms at once. For instance, the following query yields a concordance with lines of text containing meu, teu, or seu:

[lemma= "meu"|lemma= "teu"|lemma= "seu"]

Negative filtering is also possible using the ! element, which must be placed immediately before the equal sign (with no spaces in between). This element excludes the items to the right of the equal sign. For example, the following query retrieves all instances of fai that were not corrected to hai at the semantic level:

[form= "fai" & scform!= "hai"]

By default, the search engine is case-sensitive, but this can be disabled using %ci. To do so, we write this element inside the CQL search box, to the right of the target form:

[form="pero" %ci]

If we are searching for a character string, the %ci expression must be placed inside the brackets for the word(s) where case insensitivity is required. For example, to retrieve examples of De feito and de feito, the correct query would be:
[form="de" %ci] [form="feito"]

As previously mentioned, more detailed information on the search system—particularly on CQL queries—can be found in the Advanced text search and visualization manual (only in Galician).

Frequency Queries

Another feature offered by the search engine is the ability to obtain frequency and distribution data. To do this, go to the “Frequency options” section displayed after the results of a search.

For example, if you want to quickly retrieve information on the distribution of preposition addition cases across different prepositions, you should search for the code G_prep_ad under “Type of deviation from the standard.” Once the search is completed, go to the “Frequency options” section, choose “Frequency by”, and select “Standard lemma.” The default output is a table showing distribution by preposition, along with absolute frequency (second column) and the index per ten million words (third column).

Depending on the display mode selected under “Graph”, the results can also be presented as pie charts, bar charts, line charts, scatter charts, or histograms. In the final tab, “Statistics,” you can access a variety of metrics (mean, median, standard deviation, etc.). For this statistical data, the “Count” option allows you to choose WPM (words per million). Additionally, under “Download,” the data can be exported in various formats.

Similarly, if you want to examine the distribution of omitted punctuation marks by type, you should search for D_pm_om, and then under “Frequency by,” select “Discourse standard.” (Note that forms tagged as omissions in the grammatical, semantic, or discourse levels are not lemmatized and therefore cannot be retrieved through lemma-based searches.)
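
Both of these searches can also be launched from the CQL box rather than the query builder (a minimal sketch, assuming the deviation codes are stored verbatim in the problem attribute used in the CQL examples above):

[problem="D_pm_om"]

Once the concordance appears, the “Frequency options” steps described above apply in the same way.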

Another search option provided by the application allows users to directly find the distribution of different “Student final versions” that result from a given search. For example, if you perform a search for a lemma (e.g., facer) and then click on “Student final version,” you will retrieve all forms that were assigned that standard lemma, along with their frequency. Similarly, if you search for all forms with the part-of-speech tag SP, you can then obtain the list of prepositions with their respective frequencies.