Mean Log Frequency of Lexemes
Usually, mean log word frequency is used in conjunction with something like log mean sentence length (for example, in the Lexile® framework). The latter serves as a proxy for syntactic complexity but, given that we have a syntactic analysis available, I think we can do better; I’ll explore that in a future post.
For now, though, I wanted to get a per-chapter measure based just on the mean log frequency of lexemes: for each word token in a chapter, take the log of its lexeme’s relative frequency across the whole corpus, then average over all the tokens in the chapter.
The code is available here. It’s easy to adjust the targets (by default chapters, specified on line 14) and the items (by default lexemes, specified on line 15).
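For readers who don’t want to click through, here’s a minimal sketch of this kind of calculation, not the actual script. It assumes a whitespace-separated token file (along the lines of the MorphGNT SBLGNT files) whose first column is a BBCCVV book/chapter/verse reference and whose last column is the lexeme; the filename sblgnt.txt is made up.

```python
import math
from collections import Counter, defaultdict

# hypothetical input: one token per line, first column a BBCCVV
# book/chapter/verse reference, last column the lexeme (along the
# lines of the MorphGNT SBLGNT files)
tokens = []
with open("sblgnt.txt") as f:
    for line in f:
        cols = line.split()
        if not cols:
            continue
        ref, lexeme = cols[0], cols[-1]
        tokens.append((ref[:4], lexeme))  # BBCC identifies the chapter

# relative frequency of each lexeme across the whole corpus
counts = Counter(lexeme for _, lexeme in tokens)
total = len(tokens)

# group the word tokens by chapter
chapters = defaultdict(list)
for chapter, lexeme in tokens:
    chapters[chapter].append(lexeme)

for chapter in sorted(chapters):
    lexemes = chapters[chapter]
    mean_log_freq = sum(
        math.log10(counts[lx] / total) for lx in lexemes
    ) / len(lexemes)
    # scale by -1000 and round so that higher scores mean harder chapters
    print(round(-1000 * mean_log_freq), chapter, len(lexemes))
```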
The result of running the script is something like this:
6153 0101 436
5757 0102 457
5471 0103 331
5487 0104 428
5437 0105 821
5532 0106 648
where the first column is -1000 times the mean log frequency (so the higher the number, the harder the chapter is to read), the second column is the book and chapter number (books numbered in canonical order, so 0101 is Matthew 1), and the third column is the number of word tokens in that chapter.
If we sort this output, we should get a list of the easiest chapters to read (at least by the measure of mean log lexeme frequency):
4704 2304 449
4746 2305 429
4926 0417 498
4949 2301 207
4973 0414 577
5025 0408 905
5036 2303 467
5044 2302 585
5080 0403 657
5090 2710 291
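Producing that sorted list within the sketch above, rather than piping the script’s output through a sort, is just a matter of collecting the rows before printing. This continues the earlier sketch and assumes its chapters, counts and total are already defined.

```python
# continuing the sketch above (reuses chapters, counts, total and math):
# collect (score, chapter, token_count) rows, then sort ascending so
# the easiest chapters come first
rows = []
for chapter, lexemes in chapters.items():
    mean_log_freq = sum(
        math.log10(counts[lx] / total) for lx in lexemes
    ) / len(lexemes)
    rows.append((round(-1000 * mean_log_freq), chapter, len(lexemes)))

for score, chapter, count in sorted(rows)[:10]:
    print(score, chapter, count)
```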
It is perhaps not surprising that the easiest chapters are from 1 John and John’s gospel (with Rev 10 coming in at number 10).
It will be interesting to see if we get similar results once we factor in some measure of syntactic complexity.
Incidentally, the most difficult chapter to read by mean log lexeme frequency is 2 Peter 2, although 1 Timothy and Titus also feature prominently in the ten most difficult chapters.