If Only They Knew That One Rare Word...
I'm going to talk in more detail about alternatives to frequency order in a different thread but I wanted to share the results of a quite striking little test I did.
In my last post, I showed the vocab/coverage table applied to fully inflected forms in the Greek NT rather than lexemes. You may have noticed that the 100% coverage column and even the 95% coverage column said 0.0% of verses for the 100 most frequent forms.
If you did, you might then have wondered: is this just a rounding error? The answer is no. Even if you knew the 100 most frequent inflected forms in the GNT, there is not a single verse you would know all the forms in (of course assuming you couldn't guess).
I wanted to test whether this was often down to just one outlier in a verse. So I modified the code that produced the table (adding 4 extra lines) to instead output the ten targets (i.e. verses) whose second least frequent item (i.e. form) is the most frequent overall.
Here are the results:
032030 2 [1, 2, 1077]
030146 35 [1, 35, 524]
041135 46 [2, 46, 14597]
130528 66 [5, 19, 38, 45, 49, 59, 65, 66, 235]
071623 66 [5, 19, 38, 45, 59, 66, 235]
070323 68 [3, 3, 29, 65, 68, 131]
020940 72 [8, 18, 22, 22, 44, 49, 49, 72, 102]
012425 78 [36, 78, 2846]
060211 96 [8, 14, 18, 22, 79, 96, 4276]
130519 98 [7, 17, 98, 14731]
What this listing shows is that, for example, target 032030 (Luke 20.30) consists of the 1st, 2nd and 1077th most frequent forms; target 030146 (Luke 1.46) consists of the 1st, 35th and 524th most frequent forms. The middle number on each line is the rank of the second least frequent form. So if the rarest word wasn't needed, these verses would drop from needing the top 1077 forms to just the top 2, and from needing the top 524 forms to the top 35.
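For anyone who wants to try the idea on their own data, here is a rough sketch of how a listing like this could be computed. It is not the actual script: the function names and the input shape (a dict mapping each target id to its list of items) are my assumptions for illustration, and ties between equally frequent items are broken arbitrarily, so exact ranks may differ from the real output.

```python
from collections import Counter


def rank_items(targets):
    """Map each item to its frequency rank (1 = most frequent).

    `targets` is assumed to be a dict of target id -> list of items
    (forms or lexemes), one entry per token in the target.
    """
    counts = Counter(item for items in targets.values() for item in items)
    # most_common() orders by count, highest first; tie order is arbitrary
    # for our purposes and may not match the original script.
    ordered = counts.most_common()
    return {item: rank for rank, (item, _) in enumerate(ordered, start=1)}


def if_only(targets, top=10):
    """Return the `top` targets whose second-rarest item is most frequent.

    Each result is (target id, rank of second-rarest item, sorted ranks).
    Targets with fewer than two items are skipped (an assumption).
    """
    ranks = rank_items(targets)
    rows = []
    for target_id, items in targets.items():
        rank_list = sorted(ranks[item] for item in items)
        if len(rank_list) < 2:
            continue
        # The last entry is the rarest item; the one before it is the
        # item you would still need if the rarest one were given to you.
        rows.append((rank_list[-2], target_id, rank_list))
    rows.sort()
    return [(target_id, second, rank_list)
            for second, target_id, rank_list in rows[:top]]
```

Duplicates are kept in each rank list, one per token, to match the listings here. One consequence of that choice is that a verse using its rarest form twice gets no benefit from this measure, which may or may not be what you want.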
Now you may argue that many of these are bad examples because the verse doesn't make sense in isolation (a good reason to be more careful about what to use as targets) or that the one rare word is actually the one carrying most of the semantic weight.
But this little test demonstrates that sometimes a single rare item can massively delay reading an otherwise quite readable target unit.
By the way, here's the same listing based on lexemes rather than fully inflected forms:
032030 2 [1, 2, 346]
030146 9 [2, 9, 509]
011615 9 [3, 4, 5, 7, 8, 9, 9, 33]
032448 13 [4, 13, 415]
090124 14 [1, 2, 6, 7, 14, 267]
021337 16 [4, 5, 9, 9, 12, 16, 588]
040620 17 [1, 3, 5, 7, 8, 9, 17, 180]
041135 19 [1, 19, 4752]
040426 19 [1, 1, 3, 4, 7, 8, 9, 19, 56]
031934 24 [1, 1, 3, 5, 9, 15, 23, 24, 311]
I'll check the code that produces this into the repository shortly.
James
It's now available at
http://code.google.com/p/graded-reader/source/browse/trunk/code/if-only.py
James