About Monco

Language changes as we speak. New words and new senses of familiar words are recorded in dictionaries every year. Daily frequencies of 'content words' vary immensely as they are chosen to report events in the media. Words such as 'vape', 'hanger' or 'emoji' are either heavily under-represented or not present at all in reference corpora of English which were compiled only a few years ago. Also, frequencies of words such as 'migrant' or 'refugee' are significantly higher than they were only months ago. Monco is a corpus search engine which can be used to keep track of such change and temporal variation.

The data

Monco started monitoring a selection of English language news websites in late September 2015. Currently, its index contains more than 1.1 billion words and it grows by some 8 million words every day. The complete list of monitored sources and other corpus statistics are available here: http://monitorcorpus.com/stats#sources.

Corpus search

Query syntax

Monco supports queries for word forms, lemmas, phrases and basic lexico-grammatical patterns with open part-of-speech positions. We discuss the main features of the query syntax below.

Word forms

To get a list of concordances for a single word form, you simply have to enter it in the search field and hit enter or click/tap the magnifying glass button. For example, the following query:

  • vape
returns a list of sentences containing the word vape. All queries are case-insensitive, which means that upper-case instances of this word will also be matched.

Simple phrases

To find occurrences of exact phrases, simply type them in the search field as in:

  • great way
  • a great way of
  • due to popular demand

Contracted words are separated before they are added to the index. This means that they currently have to be specified as separate tokens to be matched by the queries. This can be illustrated with the following examples:

  • I 'm not sure if
  • I 'd be surprised to
  • I do n't know if
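As a rough illustration, a phrase can be rewritten into this split form before it is pasted into the search field. The sketch below is not part of Monco: the function name `split_contractions` is hypothetical, and the rules only assume that the index separates "n't" and common clitics ('m, 'd, 's, 're, 've, 'll) in the way the examples above suggest.

```python
import re


def split_contractions(phrase: str) -> str:
    """Rewrite contracted forms as separate tokens, e.g. "don't" -> "do n't".

    Assumption: the index splits contractions as in the examples above;
    other contraction types are not covered by this sketch.
    """
    # Negative contractions: split before "n't" (don't -> do n't).
    phrase = re.sub(r"(\w+)n't\b", r"\1 n't", phrase, flags=re.IGNORECASE)
    # Clitics such as 'm, 'd, 's, 're, 've, 'll: split before the apostrophe.
    phrase = re.sub(r"(\w)'(m|d|s|re|ve|ll)\b", r"\1 '\2", phrase,
                    flags=re.IGNORECASE)
    return phrase


for text in ["I'm not sure if", "I'd be surprised to", "I don't know if"]:
    print(split_contractions(text))
```

Applied to the three phrases in the loop, the helper produces the three queries listed above.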

Morphological expansion

To find different inflections of a base form (aka the 'lemma'), you need to specify this base form and append two asterisks to it. For example, these queries:

  • help**
  • break**
will match the following sets of term variants: help, helps, helping, helped and break, breaks, breaking, broke, broken respectively.

However, it should be noted that the recall of such automatic morphological expansion is not complete. For example, Monco cannot recognize the lemmas of relatively new word forms in the corpus. To be on the safe side, you should use the variant syntax as explained below.

Variants

Variants of query terms can be specified explicitly with the pipe symbol operator (|). For example, to find all forms of the word 'vape' or 'emoji', you could use the following queries:

  • vape|vaped|vaping|vapes
  • emoji|emojis

It is also possible to negate one of the variants by prepending it with an exclamation mark, e.g.

  • break**|!broke

This query returns different forms of the lemma break, except for the form 'broke'.
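If you generate many such queries, it may be convenient to assemble them from word lists. The following is a minimal sketch, not an official client: the helper names `lemma` and `variants` are hypothetical, and they only concatenate strings according to the syntax described above.

```python
from typing import Iterable


def lemma(base: str) -> str:
    """Return the morphological-expansion form of a base word (base**)."""
    return f"{base}**"


def variants(*forms: str, exclude: Iterable[str] = ()) -> str:
    """Join variants with '|' and append negated variants prefixed with '!'."""
    parts = list(forms) + [f"!{form}" for form in exclude]
    return "|".join(parts)


print(variants("vape", "vaped", "vaping", "vapes"))  # vape|vaped|vaping|vapes
print(variants("emoji", "emojis"))                   # emoji|emojis
print(variants(lemma("break"), exclude=["broke"]))   # break**|!broke
```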

Slop factor

Monco offers a convenient way of finding word co-occurrences. This comes in the form of the slop factor parameter which can be specified in the query options form. For example, to find co-occurrences of the words kill, bird and stone, we can type the following query:

  • kill** bird** stone**

and set the slop factor parameter to 3. This means that up to 3 words may occur in between the explicitly specified query terms kill**, bird** and stone**.

The resulting concordances will contain occurrences of the idiom 'kill two birds with one stone'.

Relaxing the word order

To increase the number of potentially relevant results for multiword queries, you can also deselect the “In order” checkbox. For example, the following query:

  • strike** deal**

with a slop factor of 3 will match the following spans:

  • strike a deal
  • struck a deal
  • strikes a Faustian deal
  • strike a peace deal
  • strike a mutually benefitting deal

etc.

The same query with the 'In order' parameter unchecked will additionally match spans such as:

  • deal could be struck
  • deal was struck
  • deal has essentially been struck.
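To make the interplay of the slop factor and the 'In order' option more concrete, here is a small sketch of one possible reading of these parameters. It is not Monco's implementation. It assumes that the slop factor limits the total number of words that may intervene between the query terms, and it replaces morphological expansion with hand-written sets of word forms. The function names are hypothetical.

```python
from itertools import permutations


def _match_from(tokens, terms, start, slop):
    """Match terms in the given order, beginning exactly at tokens[start].

    Returns the end index (exclusive) of the span, or None if no match.
    """
    if tokens[start].lower() not in terms[0]:
        return None
    pos, budget = start, slop
    for term in terms[1:]:
        for gap in range(budget + 1):
            nxt = pos + 1 + gap
            if nxt < len(tokens) and tokens[nxt].lower() in term:
                pos, budget = nxt, budget - gap  # spend 'gap' skipped words
                break
        else:
            return None  # term not found within the remaining budget
    return pos + 1


def find_spans(tokens, terms, slop=0, in_order=True):
    """Return (start, end) spans in which all terms co-occur within slop."""
    n = len(terms)
    orders = [tuple(range(n))] if in_order else list(permutations(range(n)))
    spans = []
    for start in range(len(tokens)):
        for order in orders:
            end = _match_from(tokens, [terms[i] for i in order], start, slop)
            if end is not None:
                spans.append((start, end))
                break
    return spans


# Hand-written stand-ins for kill**, bird**, stone**, strike** and deal**.
KILL = {"kill", "kills", "killed", "killing"}
BIRD = {"bird", "birds"}
STONE = {"stone", "stones"}
STRIKE = {"strike", "strikes", "struck", "striking"}
DEAL = {"deal", "deals"}

idiom = "You can kill two birds with one stone".split()
print(find_spans(idiom, [KILL, BIRD, STONE], slop=3))  # [(2, 8)]
print(find_spans(idiom, [KILL, BIRD, STONE], slop=2))  # []

passive = "A deal was struck yesterday".split()
print(find_spans(passive, [STRIKE, DEAL], slop=3))                  # []
print(find_spans(passive, [STRIKE, DEAL], slop=3, in_order=False))  # [(1, 4)]
```

In the idiom, one word separates 'kill' from 'birds' and two words separate 'birds' from 'stone', so a slop factor of 3 is just enough under this reading. The passive sentence is only found once the word order is relaxed.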

Part of speech tags

Basic lexico-grammatical patterns can be identified by specifying part-of-speech tags. For example, to match spans in which an adjective precedes the word “disaster”, you should use the following query:

  • <tag=jj.*> disaster**

which is likely to match spans such as:

  • complete disaster
  • culinary disaster
  • humanitarian disasters
  • natural disasters

etc. In general, Monco uses the Penn Treebank part-of-speech tagset.

Spans with nouns following the word rare could be found with the following query:

  • rare|rarer|rarest <tag=n.*>

By combining such PoS placeholders with the slop factor parameter, one can retrieve examples of collocational chains with underspecified positions, e.g.:

  • find** rare|rarer|rarest <tag=n.*>

with slop factor set to 2 would yield

  • find rare plants
  • find those rare instances
  • found a rare plant
  • found this rare balance

This query:

  • have** <tag=jj.*> finger|fingers

with slop=2 would match:

  • have green fingers
  • have chubby fingers
  • has two broken fingers
  • have slippery fingers

The most useful PoS tags include:

  • <tag=n.*> for any noun
  • <tag=j.*> for any adjective
  • <tag=v.*> for any verb
  • <tag=rb.*> for any adverb
  • <tag=in.*> for most prepositions

Here is how these additional tags could be used in useful corpus queries:

  • To find prepositions following a noun: information <tag=in.*>
  • To find nouns which may be prepositional objects in a partly specified phrase: heap**|pile** of <tag=n.*>
  • To find adverbs preceding an adjective: <tag=rb.*> unacceptable
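The tag placeholders look like regular expressions over Penn Treebank tags: n.* covers NN, NNS, NNP and NNPS, while jj.* covers JJ, JJR and JJS. The sketch below illustrates this reading on a hand-tagged toy sentence. It assumes that tag patterns are matched case-insensitively against the whole tag, which this page does not state explicitly. The function names and the slot representation are hypothetical.

```python
import re


def tag_matches(pattern: str, tag: str) -> bool:
    """Check a tag pattern such as 'jj.*' against a Penn Treebank tag."""
    return re.fullmatch(pattern, tag, flags=re.IGNORECASE) is not None


def slot_matches(slot, token, tag):
    """A slot is ('tag', regex) or ('word', set of accepted word forms)."""
    kind, value = slot
    if kind == "tag":
        return tag_matches(value, tag)
    return token.lower() in value


def find_matches(tagged, slots):
    """Return adjacent token sequences that satisfy every slot in turn."""
    hits = []
    for i in range(len(tagged) - len(slots) + 1):
        window = tagged[i:i + len(slots)]
        if all(slot_matches(s, tok, tag) for s, (tok, tag) in zip(slots, window)):
            hits.append(" ".join(tok for tok, _ in window))
    return hits


# Hand-tagged toy sentence (not real Monco output).
tagged = [("a", "DT"), ("complete", "JJ"), ("disaster", "NN"),
          ("for", "IN"), ("local", "JJ"), ("farmers", "NNS")]

# Rough equivalent of: <tag=jj.*> disaster**
slots = [("tag", "jj.*"), ("word", {"disaster", "disasters"})]
print(find_matches(tagged, slots))  # ['complete disaster']

print(tag_matches("n.*", "NNS"))   # True
print(tag_matches("in.*", "TO"))   # False: 'to' has its own tag, TO
```

The last line shows one reason why <tag=in.*> covers most rather than all prepositions: in the Penn Treebank tagset the word 'to' is tagged TO, not IN.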

You can also try to find lemmas which have a certain PoS tag associated with them. For example, by typing:

  • <lemma=test tag=v.*>

you are more likely to get 'verb instances' of the lemma 'test', although the exact results depend on the accuracy of the Apache OpenNLP tagger we use to annotate the indexed texts.

Monco syntax offers a few more useful features. We are planning to describe them in our upcoming blog posts.

Sorting

Monco currently supports two types of sorting: deep metadata sorting and surface concordance sorting.

By default, the concordances are sorted by index order, but they can also be ordered by descending timestamps, which means that in the concordance table the most recently published sentences matching the corpus query will be shown first. You can change the sorting order in the search options form.

For queries with partly open term positions (i.e. when one or more words are not fully specified), it is often very useful to sort the concordances by the matching spans. By changing the value of the 'Concordance sort' option to 'match', you can sort the results alphabetically by the matching spans. An example list of concordances for the following query:

  • <tag=j.*> record|records

would result in an alphabetical ordering of concordances, e.g.:

  • criminal record
  • criminal records
  • financial record
  • financial records
  • medical record
  • medical records
  • public records

etc. Needless to say, such sorted lists can be used to identify recurrent co-occurrence patterns which are often the result of phraseological binding.
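The two orderings can be pictured with a few lines of code. This is only a sketch of the idea, not of how Monco sorts internally: the records, the field names `match` and `published`, and the dates are invented for illustration.

```python
from datetime import date

# Invented toy records; field names and dates are illustrative assumptions.
concordances = [
    {"match": "public records", "published": date(2016, 3, 1)},
    {"match": "criminal record", "published": date(2015, 11, 20)},
    {"match": "medical records", "published": date(2016, 1, 12)},
    {"match": "financial record", "published": date(2015, 10, 5)},
]

# Surface concordance sorting: alphabetical by the matching span.
by_match = sorted(concordances, key=lambda row: row["match"].lower())

# Metadata sorting: most recently published sentences first.
by_recency = sorted(concordances, key=lambda row: row["published"],
                    reverse=True)

print([row["match"] for row in by_match])
print([row["match"] for row in by_recency])
```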

Search summary

A simple list of distinct spans matched by the query is provided in the Summary tab of the results page. For example, if you run the following query:

  • popular with|among

set the limit of concordances to 1000 and click on the Summary tab, you will see a table of the distinct matching spans and their counts.

When we ran this query, the table showed that the combination 'popular with' was more frequent in the retrieved set of concordances than the combination 'popular among'.
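Conceptually, such a summary is a frequency count over the matched spans of the retrieved concordances. A minimal sketch, using an invented toy list of spans rather than real Monco output, and assuming that case differences are ignored:

```python
from collections import Counter

# Invented toy data: matched spans from a handful of concordance lines.
matched_spans = [
    "popular with", "popular among", "Popular with",
    "popular with", "popular among",
]

# Count distinct spans, ignoring case (assumption: queries are
# case-insensitive, so the summary is too).
summary = Counter(span.lower() for span in matched_spans)

for span, count in summary.most_common():
    print(f"{span}\t{count}")
```

Because the count is taken over the retrieved set only, raising or lowering the limit of concordances may change the figures.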

Facets

For every corpus query, Monco counts the total number of sentences which seem to match it. Additionally, the list of sources and recent time periods in which the matching spans were identified is also compiled and visualised in the facets tab. These facets are computed for all the results in the index. In other words, they are not only aggregated from the currently displayed set of concordances.

Exporting results

You can export full results of Monco searches as an MS Excel spreadsheet from the Export tab. These spreadsheets contain concordances, facets with total sizes of aggregated categories as well as some corpus statistics. Up to 10 000 concordances can be generated in a single request.

Duplicates

You will notice a lot of duplicated sentences in Monco search results. Sometimes you may want them to be included in the list of concordances, for example to see how many websites republished the same news story or cited the same person saying the same thing. Other duplicates are mainly due to technical difficulties relating to crawling websites.

By default, Monco marks both types of duplicates detected in the current set of concordances.

You can also hide detected duplicates by changing the value of the 'Duplicates' option to 'hide' in the search settings.
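If you post-process exported concordances yourself, duplicates can be handled in a similar way. The sketch below is one simple approach and not a description of Monco's detection method: it assumes that two sentences count as duplicates when they are identical after lower-casing and removing punctuation, and it flags only the second and later copies. The function names are hypothetical.

```python
import re


def normalise(sentence: str) -> str:
    """Lower-case a sentence and collapse punctuation and spacing."""
    return re.sub(r"\W+", " ", sentence.lower()).strip()


def mark_duplicates(sentences):
    """Return (sentence, is_duplicate) pairs; later copies are flagged."""
    seen = set()
    marked = []
    for sentence in sentences:
        key = normalise(sentence)
        marked.append((sentence, key in seen))
        seen.add(key)
    return marked


def hide_duplicates(sentences):
    """Keep only the first occurrence of each normalised sentence."""
    return [s for s, is_dup in mark_duplicates(sentences) if not is_dup]


sentences = [
    "The deal was struck on Monday.",
    "the deal was struck on Monday",
    "Talks continue.",
]
print(mark_duplicates(sentences))
print(hide_duplicates(sentences))
```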

Programmatic access

Contact us if you are interested in obtaining programmatic access to Monco.