in 2008 I was obsessed with editing Wikipedia.
I was in my last year of university, and was tutoring ESL for money.
Canada's west coast has a humongous number of fun, young Koreans.
They're ostensibly there to learn English, but mostly just to get away from their parents.
It's a fun gig, and you get a taste of just how terrible it is to learn English.
Wikipedia has a well-meaning, but not really thought-out, side-project called Simple English Wikipedia.
The idea is that some 3 billion people are learning English, and learning it is nearly impossible,
and if you get a subset of Wikipedia editors willing to pull back on their flowery English, it could be a really helpful resource for them.
Why couldn't people read wikipedia in their own language?
Why invent a new language dripping with condescension?
I'm not sure.
But at the time, it felt like something meant for me.
I had previously written a Wikipedia bot. I knew some Python.
I figured I could write a bot to convert parts of English Wikipedia into Simple English.
I started just doing it without thinking. I made a list of substitutions.
From all the Wikipedia work, I was suffering a repetitive-strain injury in my arms.
I created the list in a notebook, and had my mom type them out into a spreadsheet:
What happened next was hilarious.
I still laugh about this often.
I put the script up on my website ('spencerwaterbed.com'), and although it failed as a Wikipedia bot, it started getting cited in academic papers.
- A lot of them -
From time to time, I google 'the spencerwaterbed system' and see a new paper.
They unfortunately never cite my mom's work.
it turns out that nobody had built such a list of naive (a word I've come to love) substitutions before.
I sometimes wonder what people in linguistics departments do all day.
I know that's cheeky - but really - why was I the one who had to build this?
That was a weird feeling.
Charles Kay Ogden almost did. He was a linguist, and buddies with Bertrand Russell.
He advocated for reducing redundant English, and notably created Basic English: a reduced vocabulary of 850 words.
I mean, there was a time when every church service was in Latin, and nobody understood it.
Esoteric English is an awful prestige signal and a massive force of cultural exclusion.
Working to increase your vocabulary just puts Korean kids through another boring year of ESL.
Of course, reducing a language by force, into a box, is an Orwellian nightmare.
The point is -
amazingly, 1,000 words is actually enough.
it feels like more.
a language feels like a huge thing!
but it isn't.
1,000 words would take a semester to learn.
My favourite description of this effect comes from this BBC documentary.
We often tell ourselves that English is the great multicultural solvent -
you can say a sentence in English that unknowingly combines
Latin, Norman, Arabic, and Germanic etymologies
- and do it without blinking.
the truth, though, is that many imported words carry nothing but cultural cachet,
so english is especially dense with cultural pretension:
the usefulness of a word can't be quantified, but the frequency of a word sure can.
- some words are more common than others.
Word frequencies in English (and every language) fit a 'Zipfian distribution', which is notable for being outrageously steep.
as Ogden observed, the most popular words are basically all we use.
The curve is way steeper than people realise:
[table: # of words vs. % coverage]
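you can see the steepness with a pure Zipf model (the n-th most common word has frequency proportional to 1/n). the 50,000-word vocabulary here is a toy, not compromise's lexicon:

```javascript
// Zipf's law: the n-th most common word appears with frequency ∝ 1/n.
// build a toy 50,000-word vocabulary of raw Zipfian weights.
const vocab = 50000
const weights = Array.from({ length: vocab }, (_, i) => 1 / (i + 1))
const total = weights.reduce((a, b) => a + b, 0)

// cumulative coverage of the top-k most common words
const coverage = (k) =>
  weights.slice(0, k).reduce((a, b) => a + b, 0) / total

// the top 1,000 of 50,000 words already cover about two-thirds of the mass
console.log(coverage(1000).toFixed(2))
```

real corpora are even steeper than this idealized curve, because the top few hundred words are function words ('the', 'of', 'and') that show up everywhere.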
i have some theories about this. They can wait.
as of 2019, compromise has 14,000 words in its lexicon.
i figure that's alright.
I grew up as a jQuery developer. People would complain about the size of that library, then still use it.
it was about 200kb.
to me, that's as big as you can go without load-time being a problem.
i always wanted compromise to be smaller than that.
the source code alone is about 110kb minified.
so it's clear that any lexical data will have to be compressed.
Mark Adler and Jean-loup Gailly, creators of gzip.
if you think about it,
compression is only removing duplicate information.
it can't do more than that.
gzip is the unsung hero of the web.
it makes everything possible.
printing out 14,000 words in plaintext is about 120kb.
running them through gzip drops the filesize to 45kb,
which is amazing.
but english is tremendously redundant already.
and ... compromise can pluralize its nouns, and conjugate its verbs...
so by storing just one form of verb, and just the singular form of nouns,
... then spitting them all out at runtime, that helps a lot.
conjugation is really compression, if you think about it.
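a minimal sketch of the idea, with a hand-rolled pluralizer (compromise's real rules and exception lists are far more complete than this):

```javascript
// store only the singular form; generate the plural with rules at runtime.
// three of the regular English pluralization rules:
const pluralize = (noun) => {
  if (/(s|x|z|ch|sh)$/.test(noun)) return noun + 'es' // bus -> buses
  if (/[^aeiou]y$/.test(noun)) return noun.slice(0, -1) + 'ies' // party -> parties
  return noun + 's' // cat -> cats
}

console.log(pluralize('bus')) // 'buses'
console.log(pluralize('party')) // 'parties'
console.log(pluralize('cat')) // 'cats'
```

every regular noun you can inflect at runtime is a word you don't have to ship.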
I also did some reading about compressing these word-lists.
amazingly, in some bizarre circle of fortune, the best resource I found for this
was from the author of jQuery.
he recommended using a trie data-structure.
so i did.
the final size for 14,000 words was 40kb - just slightly beating gzip.
and amazingly, after a gzip, it is 27kb.
that's 99.99% of the English language.
this gif is 77kb.
I couldn't believe this.
John Resig, a truly brilliant man
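the trick is that a trie stores shared prefixes only once. here's a minimal object-based sketch - the packed string format Resig describes is much more compact than this:

```javascript
// a minimal trie: each node is a plain object, one key per character.
// 'walk', 'walks', 'walked', 'walking' share the 'walk' prefix - stored once.
class Trie {
  constructor() {
    this.root = {}
  }
  add(word) {
    let node = this.root
    for (const ch of word) {
      node = node[ch] = node[ch] || {}
    }
    node.end = true // mark that a word terminates here
  }
  has(word) {
    let node = this.root
    for (const ch of word) {
      node = node[ch]
      if (!node) return false
    }
    return node.end === true
  }
}

const t = new Trie()
for (const w of ['walk', 'walks', 'walked', 'walking']) t.add(w)
console.log(t.has('walked')) // true
console.log(t.has('walkz')) // false
```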
every computer science student has to learn big-O notation, like it will save their life.
and most web-developer job interviews will touch a time-complexity question.
one thing I didn't figure out, until late, was that everything you do to text is fast.
so fast, as to be effectively free.
a bunch of times in my career I've had people laugh at me for throwing a dozen regular expressions into a file.
they are basically instant.
Stephen Cole Kleene, creator of Regular Expressions
today it's more or less impossible to measure the delay of <1,000 regular expressions.
it's nanoseconds, even on a mobile phone.
you can write regular expressions all week, then press enter, and have blink-speed:
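a rough way to see it for yourself in Node - 1,000 made-up rules against one word (process.hrtime is Node-only, and the rules here are synthetic):

```javascript
// time 1,000 regular expressions against a single word.
// each rule is made up, but they all test the same string.
const rules = Array.from({ length: 1000 }, (_, i) => new RegExp('ing$|rule' + i))
const word = 'running'

const t0 = process.hrtime.bigint()
const hits = rules.filter((re) => re.test(word)).length
const t1 = process.hrtime.bigint()
const elapsedMs = Number(t1 - t0) / 1e6

console.log(hits, 'matches in', elapsedMs, 'ms')
```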
my (rough) guess is that every word put into compromise will encounter ~4k regular expressions during processing.
it just doesn't care.
I don't either.