Language changes as we speak. New words and new senses of familiar words are recorded in dictionaries every year. Daily frequencies of 'content words' vary immensely as they are chosen to report events in the media. Words such as 'vape', 'hanger' or 'emoji' are either heavily under-represented or not present at all in reference corpora of English which were compiled only a few years ago. Also, frequencies of words such as 'migrant' or 'refugee' are significantly higher than they were only months ago. Monco is a corpus search engine which can be used to keep track of such change and temporal variation.
Monco started monitoring a selection of English language news websites in late September 2015. Currently, its index contains more than 1.1 billion words and it grows by some 8 million words every day. The complete list of monitored sources and other corpus statistics are available here: http://monitorcorpus.com/stats#sources.
Monco supports queries for word forms, lemmas, phrases and basic lexico-grammatical patterns with open part-of-speech positions. We discuss the main features of the query syntax below.
To get a list of concordances for a single word form, you simply have to enter it in the search field and hit enter or click/tap the magnifying glass button. For example, the following query:
To find occurrences of exact phrases, simply type them in the search field as in:
Contracted words are separated before they are added to the index. This means that they currently have to be specified as separate tokens to be matched by the queries. This can be illustrated with the following examples:
To find different inflections of a base form (aka the 'lemma'), you need to specify this base form and append two asterisks to it. For example, these queries:
However, it should be noted that the recall of such automatic morphological expansion is not complete. For example, Monco cannot recognize the lemmas of relatively new word forms in the corpus. To be on the safe side, you should use the variant syntax as explained below.
Variants of query terms can be specified explicitly with the pipe symbol operator (|). For example, to find all forms of the word 'vape' or 'emoji', you could use the following queries:
It is also possible to negate one of the variants by prepending it with an exclamation mark, e.g.
This query returns different forms of the lemma 'break', except for the form 'broke'.
Monco offers a convenient way of finding word co-occurrences. This comes in the form of the slop factor parameter which can be specified in the query options form. For example, to find co-occurrences of the words kill, bird and stone, we can type the following query:
and set the slop factor parameter to 3. This means that up to 3 words may occur in between the explicitly specified query terms kill**, bird** and stone**:
The resulting concordances will contain occurrences of the idiom 'kill two birds with one stone'.
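The effect of the slop factor can be sketched in a few lines of Python. This is only an illustration, not Monco's implementation: it assumes that the sentence has already been lemmatised, that the terms must occur in order, and that the slop factor limits the total number of intervening words across the whole span. The function name find_in_order_spans is made up for this example.

```python
from typing import List, Tuple


def find_in_order_spans(lemmas: List[str], terms: List[str], slop: int) -> List[Tuple[int, int]]:
    """Return (start, end) token indices of spans in which all terms occur
    in order with at most `slop` intervening tokens in total (an assumption
    about how the slop factor is counted)."""
    spans: List[Tuple[int, int]] = []

    def extend(term_idx: int, pos: int, start: int, gaps_used: int) -> None:
        # All terms matched: record the span ending at the previous token.
        if term_idx == len(terms):
            spans.append((start, pos - 1))
            return
        # Look ahead only as far as the remaining slop budget allows.
        limit = min(len(lemmas), pos + (slop - gaps_used) + 1)
        for nxt in range(pos, limit):
            if lemmas[nxt] == terms[term_idx]:
                extend(term_idx + 1, nxt + 1, start, gaps_used + (nxt - pos))

    for i, lemma in enumerate(lemmas):
        if lemma == terms[0]:
            extend(1, i + 1, i, 0)
    return spans


# Lemmatised form of "kill two birds with one stone" (hypothetical input).
sentence = ["kill", "two", "bird", "with", "one", "stone"]
print(find_in_order_spans(sentence, ["kill", "bird", "stone"], slop=3))
print(find_in_order_spans(sentence, ["kill", "bird", "stone"], slop=2))
```

Under these assumptions the idiom needs a slop factor of at least 3, because three words ('two', 'with', 'one') stand between the specified terms.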
To increase the number of potentially relevant results for multiword queries, you can also deselect the “In order” checkbox. For example, the following query:
with a slop factor of 3 will match the following spans:
etc.
The same query with the 'In order' parameter unchecked will additionally match spans such as:
Basic lexico-grammatical patterns can be identified by specifying part-of-speech tags. For example, to match spans in which an adjective precedes the word “disaster”, you should use the following query:
which is likely to match spans such as:
etc. In general, Monco uses the Penn Treebank part-of-speech tagset.
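To make the idea of a part-of-speech placeholder more concrete, here is a minimal Python sketch of the same kind of pattern applied to a sentence that has already been tagged. It does not use Monco's query syntax or internals; the function adjective_before and the sample sentence are invented for illustration. The only fact relied on is that JJ, JJR and JJS are the adjective tags of the Penn Treebank tagset.

```python
from typing import List, Tuple

# Penn Treebank tags for adjectives: base, comparative and superlative.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def adjective_before(tagged: List[Tuple[str, str]], word: str) -> List[Tuple[str, str]]:
    """Return (adjective, word) pairs where an adjective immediately precedes `word`."""
    hits = []
    # Walk over adjacent token pairs: (previous token, current token).
    for (prev_form, prev_tag), (form, _tag) in zip(tagged, tagged[1:]):
        if prev_tag in ADJECTIVE_TAGS and form.lower() == word:
            hits.append((prev_form, form))
    return hits


# Hypothetical tagged sentence: (word form, PoS tag) pairs.
tagged_sentence = [
    ("It", "PRP"), ("was", "VBD"), ("a", "DT"),
    ("natural", "JJ"), ("disaster", "NN"), (".", "."),
]
print(adjective_before(tagged_sentence, "disaster"))
```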
Spans with nouns following the word rare could be found with the following query:
By combining such PoS placeholders with the slop factor parameter, one can retrieve examples of collocational chains with underspecified positions, e.g.:
with slop factor set to 2 would yield
This query:
with slop=2 would match:
The most useful PoS tags include:
Here is how these additional tags could be used in useful corpus queries:
You can also try to find lemmas that have a certain PoS tag associated with them. For example, by typing:
you are more likely to get 'verb instances' of the lemma 'test', although the exact results depend on the accuracy of the Apache OpenNLP tagger we use to annotate the indexed texts.
Monco syntax offers a few more useful features. We are planning to describe them in our upcoming blog posts.
Monco currently supports two types of sorting: deep metadata sorting and surface concordance sorting.
By default, the concordances are sorted by index order, but they can also be ordered by descending timestamps, which means that in the concordance table the most recently published sentences matching the corpus query will be shown first. You can change the sorting order in the search options form:
For queries with partly open term positions (i.e. when one or more words are not fully specified), it is often very useful to sort the concordances by the matching spans. By changing the value of the 'Concordance sort' option to 'match', you can sort the results alphabetically by the matching spans. For example, sorting by match for the following query:
would result in an alphabetical ordering of concordances, e.g.:
etc. Needless to say, such sorted lists can be used to identify recurrent co-occurrence patterns which are often the result of phraseological binding.
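Sorting by match can be reproduced offline on a set of exported concordances. The Python sketch below is an assumption-laden illustration rather than Monco's own code: the Concordance structure, the sample rows and the case-insensitive comparison are all choices made for this example.

```python
from typing import List, NamedTuple


class Concordance(NamedTuple):
    left: str   # context to the left of the matching span
    match: str  # the matching span itself
    right: str  # context to the right of the matching span


def sort_by_match(rows: List[Concordance]) -> List[Concordance]:
    """Order concordances alphabetically by their matching span,
    ignoring case (an assumption made for this sketch)."""
    return sorted(rows, key=lambda row: row.match.casefold())


# Invented rows, for illustration only.
rows = [
    Concordance("It was a", "total disaster", "for the team."),
    Concordance("after the", "natural disaster", "struck."),
    Concordance("an", "Ecological disaster", "was avoided."),
]
for row in sort_by_match(rows):
    print(row.match)
```

Once the rows are ordered this way, identical or near-identical spans end up next to each other, which is what makes recurrent patterns easy to spot.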
A simple list of distinct spans matched by the query is provided in the Summary tab of the results page. For example, if you run the following query:
set the limit of concordances to 1000 and click on the Summary tab, you may see results similar to the following:
What this table tells us is that the combination 'popular with' was more frequent in the current set of concordances than the combination 'popular among'.
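A summary of this kind is essentially a frequency list of the distinct matching spans. A minimal Python sketch, assuming the spans have been collected into a list and that case differences should be ignored (both assumptions; the function name summarise_spans and the sample data are invented):

```python
from collections import Counter
from typing import List, Tuple


def summarise_spans(matches: List[str]) -> List[Tuple[str, int]]:
    """Count distinct matching spans, most frequent first.
    Lower-casing before counting is an assumption of this sketch."""
    counts = Counter(match.lower() for match in matches)
    return counts.most_common()


# Invented matching spans, for illustration only.
matches = ["popular with", "popular among", "Popular with"]
for span, count in summarise_spans(matches):
    print(span, count)
```

Note that such counts describe only the concordances that were retrieved, so they depend on the limit set for the query.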
For every corpus query, Monco counts the total number of sentences which seem to match it. The list of sources and recent time periods in which the matching spans were identified is also compiled and visualised in the Facets tab. These facets are computed for all the results in the index. In other words, they are not only aggregated from the currently displayed set of concordances.
You can export full results of Monco searches as an MS Excel spreadsheet from the Export tab. These spreadsheets contain concordances, facets with total sizes of aggregated categories as well as some corpus statistics. Up to 10 000 concordances can be generated in a single request.
You will notice a lot of duplicated sentences in Monco search results. Sometimes you may want them to be included in the list of concordances, for example to see how many websites republished the same news story or cited the same person saying the same thing. Other duplicates are mainly due to technical difficulties relating to crawling websites.
By default, Monco marks both types of duplicates detected in the current set of concordances as shown below.
You can also hide detected duplicates by changing the value of the 'Duplicates' option to hide in the search settings.
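How Monco detects duplicates is not described here. As a rough illustration of the difference between marking and hiding, the Python sketch below treats two sentences as duplicates when they are identical after whitespace and case normalisation. That criterion, and the function names, are assumptions made for this example only.

```python
import re
from typing import List, Tuple


def normalise(sentence: str) -> str:
    """Collapse whitespace and ignore case before comparing sentences."""
    return re.sub(r"\s+", " ", sentence).strip().casefold()


def mark_duplicates(sentences: List[str]) -> List[Tuple[str, bool]]:
    """Pair each sentence with a flag that is True if an identical
    (normalised) sentence appeared earlier in the list."""
    seen = set()
    marked = []
    for sentence in sentences:
        key = normalise(sentence)
        marked.append((sentence, key in seen))
        seen.add(key)
    return marked


def hide_duplicates(sentences: List[str]) -> List[str]:
    """Keep only the first occurrence of each sentence."""
    return [s for s, is_duplicate in mark_duplicates(sentences) if not is_duplicate]


# Invented sentences, for illustration only.
sentences = [
    "The minister resigned on Monday.",
    "The  minister resigned on Monday. ",
    "Markets reacted calmly.",
]
print(mark_duplicates(sentences))
print(hide_duplicates(sentences))
```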
Contact us if you are interested in obtaining programmatic access to Monco.