Tag: GitHub

To singularize, or pluralize, that is the question

Monday, 2 May, 2016

The Rainy Day copy of the Concise Oxford English Dictionary, the twelfth edition, dates from 2011 and it’s beginning to show its age. Take the word “singularity,” which all nerds know is the approaching era when “our intelligence will become increasingly nonbiological and trillions of times more powerful than it is today.” According to our Concise Oxford English Dictionary, however, the definition goes like this:

singularity n (pl singularities) 1 the state, fact, or quality of being singular. 2 Physics & Mathematics a point at which a function takes an infinite value, especially a point of infinite density at the centre of a black hole.

The entry on “singularity” is followed by the definition of “singularize” or “singularise”, which is a verb, “1 make distinct or conspicuous. 2 give a singular form to (a word).” Its counterpart, “pluralize/pluralise”, is defined as “1 make something more numerous. 2 give a plural form to a word.” And this brings us to GitHub, the largest host of source code in the world, with 12 million users and some 31 million repositories, where Blake Embrey has added a module titled “pluralize” that uses “a pre-defined list of rules, applied in order, to singularize or pluralize a given word. There are many cases where this is useful, such as any automation based on user input,” he says.
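For the curious, here is a minimal Python sketch of what "a pre-defined list of rules, applied in order" can look like. It is not Blake Embrey's code and not the module's API: the rule lists, `apply_rules`, `pluralize` and `singularize` are all hypothetical names made up for illustration, and the handful of patterns covers only a sliver of English.

```python
import re

# Hypothetical rule lists for illustration only; these are NOT the rules
# shipped with the "pluralize" module. Each rule is (pattern, replacement)
# and the first pattern that matches the word wins.
PLURAL_RULES = [
    (re.compile(r"^(sheep|series|species)$", re.I), r"\1"),  # unchanged in the plural
    (re.compile(r"^(child)$", re.I), r"\1ren"),               # irregular: child -> children
    (re.compile(r"([^aeiou])y$", re.I), r"\1ies"),            # singularity -> singularities
    (re.compile(r"(s|x|z|ch|sh)$", re.I), r"\1es"),           # box -> boxes
    (re.compile(r"$"), "s"),                                  # fallback: append "s"
]

SINGULAR_RULES = [
    (re.compile(r"^(sheep|series|species)$", re.I), r"\1"),  # unchanged in the singular
    (re.compile(r"^(child)ren$", re.I), r"\1"),               # children -> child
    (re.compile(r"([^aeiou])ies$", re.I), r"\1y"),            # singularities -> singularity
    (re.compile(r"(ss|x|z|ch|sh)es$", re.I), r"\1"),          # boxes -> box
    (re.compile(r"s$", re.I), ""),                            # fallback: drop the final "s"
]


def apply_rules(word, rules):
    """Apply the first rule whose pattern matches; later rules are ignored."""
    for pattern, replacement in rules:
        if pattern.search(word):
            return pattern.sub(replacement, word, count=1)
    return word


def pluralize(word):
    return apply_rules(word, PLURAL_RULES)


def singularize(word):
    return apply_rules(word, SINGULAR_RULES)
```

The order matters: the exceptions sit at the top of each list so that "sheep" never reaches the fallback rule and picks up a stray "s".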

Who, apart from lexicographers and coders, cares about such wordy matters? Apple does, and tomorrow we’ll find out why Apple is at war with the singular and the plural of its product(s). Example: “It would be proper to say ‘I have 3 Macintosh.’”

The repo, the n-gram and the lemmas

Tuesday, 29 March, 2016

Every day offers opportunities for learning new words. Example sentence: “This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of the Google’s Trillion Word Corpus.”

The repo, there, is short for repository, and a repository, a kind of folder filled with files, is the most basic element of GitHub, the largest host of source code in the world, with 12 million users and some 31 million repositories. As regards the n-gram, it is a contiguous sequence of n items, such as letters or words, taken from a text or speech sample; in computational linguistics, counts of n-grams underpin a type of language model for predicting the next item in a sequence.
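To make the n-gram a little less abstract, here is a small Python sketch that slices a sentence into bigrams (n = 2) and counts which word follows which. The function names `ngrams` and `next_word_counts` and the sample sentence are assumptions made up for this example; a real language model would be trained on vastly more text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def next_word_counts(tokens, n=2):
    """Map each (n-1)-token context to a Counter of the words that follow it."""
    table = {}
    for gram in ngrams(tokens, n):
        context, following = gram[:-1], gram[-1]
        table.setdefault(context, Counter())[following] += 1
    return table


# Illustrative sample text, split on whitespace.
text = "to singularize or pluralize that is the question to singularize a word"
tokens = text.split()

bigram_counts = Counter(ngrams(tokens, 2))
model = next_word_counts(tokens, 2)

# The most frequently observed word after "to" in this tiny sample.
best_guess = model[("to",)].most_common(1)[0][0]
```

Predicting "the next item in a sequence" then amounts to looking up the context and picking the follower with the highest count.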

It was a GitHub mention by Morten Just that inspired all this, and the Dane’s link is a gift that keeps on giving: “This repo is useful as a corpus for typing training programs. According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.” You will have noticed “lemmas” there and if you’re wondering about its meaning, check out morphology.
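The coverage claim in that quotation is easy to picture with a toy calculation. The Python sketch below measures what share of a text is accounted for by its k most common words; `coverage` and the sample sentence are hypothetical, and the code counts raw word forms rather than lemmas, so it only approximates the kind of analysis described.

```python
from collections import Counter


def coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


# Illustrative sample: "the" appears 3 times, "repo" twice, the rest once.
sample = "the repo the n-gram the lemmas a repo".split()

share_top_1 = coverage(sample, 1)  # 3 of 8 tokens
share_top_2 = coverage(sample, 2)  # 5 of 8 tokens
```

Run over a large corpus, the same sum climbs steeply at first and then flattens, which is why a relatively short word list can cover most everyday usage.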

Google’s Trillion Word Corpus contains lots of gems. With repo, n-gram and lemmas defined, we’ve still got a way to go until we reach the end of the exotics.