Mean Log Frequency of Lexemes

27th October 2015 / by James Tauber

One component of many readability measures on texts is the mean log word frequency. Here I do a basic calculation across chapters in the Greek New Testament (with code provided).

Usually, the mean log word frequency is used in conjunction with something like the log mean sentence length (for example in the Lexile® framework). The latter is used as a proxy for syntactic complexity but, having a syntactic analysis, I think we can do better and I’ll explore that in a future post.

For now, though, I wanted to get a per-chapter measure just based on mean log frequency of lexemes.

The code is available here. It’s easy to adjust the targets (by default chapters, specified on line 14) and the items (by default lexemes, specified on line 15).

The result of running the script is something like this:

6153 0101 436
5757 0102 457
5471 0103 331
5487 0104 428
5437 0105 821
5532 0106 648

where the first column is -1000 times the mean log frequency (so the higher the number, the harder the chapter is to read), the second column is the book and chapter number (two digits each, so 0101 is chapter 1 of book 01), and the third column is the number of word tokens in that chapter.
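To make the calculation concrete, here is a minimal sketch of how such a per-chapter score could be computed. It is not the script linked above: the function names are made up for illustration, and it assumes that "frequency" means a lexeme's relative frequency over the whole corpus, that the log is the natural log, and that the score is rounded to the nearest integer.

```python
import math
from collections import Counter, defaultdict


def mean_log_frequency(tokens):
    """Return {target: (mean log frequency, token count)}.

    `tokens` is an iterable of (target, item) pairs, for example
    (chapter id, lexeme). Assumption: frequency is the item's relative
    frequency across the whole corpus and the log is the natural log.
    """
    tokens = list(tokens)
    total = len(tokens)
    # Corpus-wide count of each item (lexeme).
    counts = Counter(item for _, item in tokens)

    log_sums = defaultdict(float)
    token_counts = Counter()
    for target, item in tokens:
        log_sums[target] += math.log(counts[item] / total)
        token_counts[target] += 1

    return {
        target: (log_sums[target] / token_counts[target], token_counts[target])
        for target in log_sums
    }


def format_row(target, mean_log, token_count):
    """Format one output row: score, target, token count.

    Assumption: the score is -1000 times the mean, rounded.
    """
    return f"{round(-1000 * mean_log)} {target} {token_count}"


if __name__ == "__main__":
    # Toy corpus with made-up items, just to show the shape of the data.
    toy = [("0101", "a"), ("0101", "b"), ("0102", "a"), ("0102", "a")]
    for target, (mean_log, n) in sorted(mean_log_frequency(toy).items()):
        print(format_row(target, mean_log, n))
```

Switching the targets (say, to whole books) or the items (say, to inflected forms) only changes which pairs are passed in.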

If we sort this output by the first column in ascending order, we get a list of the easiest chapters to read (at least by the measure of mean log lexeme frequency):

4704 2304 449
4746 2305 429
4926 0417 498
4949 2301 207
4973 0414 577
5025 0408 905
5036 2303 467
5044 2302 585
5080 0403 657
5090 2710 291

It is perhaps not surprising that the easiest chapters are from 1 John and John’s gospel (with Rev 10 coming in at number 10).
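The sorting step can be sketched in a few lines of Python. This is only an illustration: `parse_row`, `easiest` and `describe` are hypothetical helpers, and the book-code table contains just the three codes that can be inferred from the list above (04, 23 and 27), not a full mapping.

```python
# Sample rows copied from the script output shown earlier.
SAMPLE = """\
6153 0101 436
5757 0102 457
5471 0103 331
5487 0104 428
5437 0105 821
5532 0106 648
""".splitlines()

# Assumption: only the codes implied by the post are listed here.
BOOK_NAMES = {"04": "John", "23": "1 John", "27": "Revelation"}


def parse_row(line):
    """Split a row into (score, target, token count)."""
    score, target, token_count = line.split()
    return int(score), target, int(token_count)


def easiest(lines, n=10):
    """Return the n rows with the lowest score (easiest first)."""
    return sorted((parse_row(line) for line in lines), key=lambda row: row[0])[:n]


def describe(target):
    """Turn a target such as '2304' into '1 John 4' where the book is known."""
    book, chapter = target[:2], int(target[2:])
    return f"{BOOK_NAMES.get(book, 'book ' + book)} {chapter}"


if __name__ == "__main__":
    for score, target, token_count in easiest(SAMPLE, n=3):
        print(score, describe(target), token_count)
```

Comparing the scores as integers rather than as strings keeps the ordering correct even if a score ever has a different number of digits.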

It will be interesting to see if we get similar results once we factor in some measure of syntactic complexity.

Incidentally, the most difficult chapter to read based on mean log lexeme frequency is 2 Peter 2, although 1 Timothy and Titus also feature quite a bit in the ten most difficult chapters.

J. K. Tauber

at the intersection of computing, linguistics, philology, and learning science

By day I’m an entrepreneur, web technologist and open-source developer but my academic background is in linguistics (along with some classics, comparative philology, and educational statistics) and my main avocation is working on text, annotations, analysis and software relating to historical languages with a particular interest in facilitating better learning.

While my focus has mostly been on Biblical Greek, much of the work is highly relevant to other Hellenistic Greek texts, other dialects of Ancient Greek and, indeed, texts in completely different languages as well.

All code written for this endeavour is open source and text and data is made available under a Creative Commons license to the extent allowed by the sources used.

I can be contacted at jtauber@jtauber.com.
