The repo, the n-gram and the lemmas

Tuesday, 29 March, 2016

Every day offers opportunities for learning new words. Example sentence: “This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of the Google’s Trillion Word Corpus.”

The repo, there, is short for repository, and a repository (a kind of folder filled with files) is the most basic element of GitHub, the largest host of source code in the world, with 12 million users and some 31 million repositories. As for the n-gram, it is a contiguous sequence of n items, such as words or letters, taken from text or speech; in computational linguistics, n-gram counts underpin a type of language model that predicts the next item in a sequence.
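
For the curious, the n-gram idea can be sketched in a few lines of Python. This is only a toy illustration, not the analysis behind the word list: the sample sentence and the function names (ngrams, build_bigram_model, predict_next) are made up for the example. It counts which word follows which, then guesses the most frequent follower.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_bigram_model(tokens):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for first, second in ngrams(tokens, 2):
        followers[first][second] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of word, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]


# Hypothetical sample text, chosen only to keep the counts easy to follow.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()
model = build_bigram_model(tokens)

print(ngrams(tokens, 3)[:2])       # the first two trigrams
print(predict_next(model, "the"))  # the word seen most often after "the"
```

In this sample "the" is followed twice by "cat" and once by "mat", so the sketch guesses "cat". A real model works the same way in spirit, only with vastly more text behind the counts.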

It was a GitHub mention by Morten Just that inspired all this, and the Dane’s link is a gift that keeps on giving: “This repo is useful as a corpus for typing training programs. According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.” You will have noticed “lemmas” there. A lemma is the dictionary form of a word, so “run” stands in for “runs”, “ran” and “running”. If you want to know more, check out morphology.
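
The "90% of usage" claim is a statement about coverage: add up the counts of the most frequent items and divide by the count of everything. Here is a rough sketch of that arithmetic in Python. The function name coverage and the toy counts are invented for illustration and are not figures from the Oxford English Corpus.

```python
def coverage(counts, top_k):
    """Share of all occurrences covered by the top_k most frequent items."""
    ranked = sorted(counts.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:top_k]) / total if total else 0.0


# Made-up counts: a few very common words and one rare one.
toy_counts = {"the": 50, "be": 30, "of": 10, "and": 6, "zymurgy": 4}

print(coverage(toy_counts, 2))  # share covered by the two most frequent words
```

With these toy numbers the top two words cover 80 of 100 occurrences. The same lopsidedness, on a far larger scale, is why a few thousand lemmas can account for most of what people actually write.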

Google’s Trillion Word Corpus contains lots of gems. With repo, n-gram and lemmas defined, we’ve still got a way to go until we reach the end of the exotics.

