The CoEDL corpus platform

Tom Honeyman

October 5, 2017

CoEDL has a “proof-of-concept” corpus platform, currently restricted to corpus contributors. This is a project aimed at making textual materials more readily available to researchers.

This is a basic guide to the platform.

CoEDL corpus platform

The preliminary corpus platform is available, but currently limited to those with a password.

ANNIS is an open source corpus platform. A generic user guide is available, while this is a simple guide to just the features available within the CoEDL version.

Entering the corpus platform

The corpus platform can be accessed at http://go.coedl.net/corpora. Currently access is limited to contributors, via a password. Click on the ‘Login’ button located on the top-right corner of the main page, and then enter your username and password to continue:

password prompt

The main page looks like this:

main annis page

Note the logout button in the top right.

The basic layout

The page is made up of three components: the query panel and corpus list on the left and the results page(s) on the right.

To begin with, the query panel will be empty:

empty query panel

The (sub-)corpus panel will list one or more sub-corpora:

sub-corpus list

The “Visible:” drop down menu filters the sub-corpus list:

visible list

Corpus contributors can browse other corpora (which is encouraged, so you can see what other types of annotations others are contibuting). View all available corpora by choosing “All”:

All sub-corpora

Searching

To search the corpora, select one or more subcorpora in the list. In the example, we are searching in the Gurindji-Kriol corpus:

select a sub-corpus

Searching words/tokens

Every single corpus has a baseline layer called “tok” (for “token”). This is usually a word level representation of the primary text. It is the default layer to search on, and so a basic query can be either a search for a word (in quotes):

search for a word

Or a regular expression between forward slashes (//):

search for a-initial words

More complex searches can be built with the query builder. This is a good way to learn the full syntax of the annis query language (AQL).

Query builder

The easiest way to build a complex search is to use the query builder in the top left. Click on “query builder”:

query builder

After clicking “initialise”, we can begin to construct a query.

Queries can be sequences of one or more “tokens” (i.e., annotations on a specific layer or tier). They can fall under the scope of a “span” (e.g., limited to a specific speaker). Metadata for a file can also be used to constrain the search.

In order to fall under the scope of a span, these spans must first exist in the corpus. Not all corpora have these spans. If you provided segmented text (e.g., utterances) with extra information like speaker turns or translations, then you should have annotations for these categories of information. Spans can be any grouping that interests you. For instance, spans of reported speech, of syntactic units, or of any other grouping that may be of interest to you.

Linguistic sequences

Begin by choosing the “word sequences and meta information” search, and then clicking “initialise”. Then add an element/token to a linguistic sequence. First choose which layer you want to match:

choose a layer

If you choose the default “tok” layer, then you type in a word/token that you’d like to match. Regular expressions can be used, but note that the regular expression must match the whole token, not part of the token.

For any other layer, a list of possible values will appear.

choose a token

If you want to match more than one form, add a second token to match against:

choose a second token

Here we are matching on the same layer, but it is possible to match adjaceny on a different layer too.

In between the two, choose the kind of adjacency relationship:

adjacency relationship

Scope

If there are layers defining useful spans which tokens fall under, results can be limited to a specific scope.

First choose the relevant layer:

choose scope

And choose a value:

scope value

Metadata

Results can be limited to specific metadata values. In the same way, add a relevant metadata label, and choose relevant values:

add metadata

Once you are happy with the search criteria, click “Create AQL Query”

build query

The query is translated into a search in the query box. Click search to search for values:

full search

Search Results

Search results are shown in the panel on the right of the page:

query results

Document metadata for the result can be displayed by clicking on the (i):

document metadata

By default, the baseline text is shown. By clicking on “grid”, a layered display of the corpus is shown instead:

grid view

Document browser

A basic document browser is implemented for each of the corpora. At present the display of the text is fairly basic, but will be improved upon soon.

View the document listing for a given corpus by clicking on the documents/files icon next to the sub-corpus name.

document listing

Click on “full text” to view each individual text.

example document