Updated Vocabulary Coverage Statistics

26th October 2015 / by James Tauber

In various mailing list posts, blog posts and talks, I’ve shown vocabulary coverage statistics. It’s time to update the code to use more recent data and republish the results here.

The vocabulary coverage tables have a number of different parameters:

what are the items being learnt: lexemes or forms or something else?
what are the targets: verses or sentences or something else?
what ordering is being used: item frequency or something else?

and, of course, what text and lemmatization is being used.

Most of my previously published stats were based on the UBS3 version of MorphGNT. Here I’m going to use the latest MorphGNT based on the SBLGNT (MorphGNT 6.06), and I’m going to explore not just verses but (in follow-up posts) clauses and sentences from the GBI Syntax Trees and paragraphs from the SBLGNT.

I also want to start incorporating the information from my morphological lexicon into the item/target modeling and ordering algorithms.

But first let’s just update the basic stats.

Verses-Lexemes with Frequency Ordering

A target-item file for verses-lexemes can be generated with:

awk '{print $1,$7}' sblgnt/*-morphgnt.txt

If we then feed that to vocab-coverage.py, we get the following result:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
------------------------------------------------------------------
   100    99.91%    91.07%    24.36%     2.13%     0.64%     0.48% 
   200    99.92%    96.83%    51.80%     9.75%     3.43%     2.54% 
   500    99.97%    99.13%    82.23%    36.57%    17.81%    13.81% 
  1000    99.99%    99.71%    93.60%    62.57%    37.28%    29.99% 
  2000   100.00%    99.92%    98.41%    84.95%    65.38%    56.43% 
  5000   100.00%   100.00%   100.00%    99.51%    96.44%    94.58% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

What this table is saying is that if you learn, say, the 200 most frequent lexemes, you’ll be able to read at least 95% of the lexemes in 3.43% of verses.
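To make the construction of such a table concrete, here is a minimal sketch of one way the numbers could be computed from target-item pairs. It is not the actual vocab-coverage.py: the function name `coverage_table`, the token-based counting within each target, and the reading of each percentage column as "at least this share of the target is known" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def coverage_table(pairs, item_counts, thresholds=(0.5, 0.75, 0.9, 0.95, 1.0)):
    """Hypothetical helper: map each item count N to a row of fractions.

    pairs is an iterable of (target, item) tuples, one per token.
    Each row is [ANY, then one entry per threshold], where an entry is the
    fraction of targets whose known share meets that threshold.
    """
    pairs = list(pairs)

    # Frequency ordering: most frequent items are "learnt" first.
    freq = Counter(item for _, item in pairs)
    ranked = [item for item, _ in freq.most_common()]

    # Group the tokens of each target (e.g. each verse) together.
    targets = defaultdict(list)
    for target, item in pairs:
        targets[target].append(item)

    table = {}
    for n in item_counts:
        known = set(ranked[:n])
        # Share of tokens in each target that are among the N known items.
        shares = [
            sum(item in known for item in items) / len(items)
            for items in targets.values()
        ]
        row = [sum(s > 0 for s in shares) / len(shares)]  # the ANY column
        row += [sum(s >= t for s in shares) / len(shares) for t in thresholds]
        table[n] = row
    return table


# Toy data, invented purely for illustration (three "verses", four "lexemes").
toy_pairs = [
    ("v1", "a"), ("v1", "a"), ("v1", "b"),
    ("v2", "a"), ("v2", "c"),
    ("v3", "d"),
]
for n, row in coverage_table(toy_pairs, [1, 2, 4]).items():
    print(n, " ".join(f"{value:8.2%}" for value in row))
```

Counting tokens rather than distinct items within a target is a design choice here; counting distinct items instead would only change how `shares` is computed.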

Verses-Forms with Frequency Ordering

A target-item file for verses-forms can be generated with:

awk '{print $1,$6}' sblgnt/*-morphgnt.txt

If we then feed that to vocab-coverage.py, but with 10000 added as an item count, we get the following result:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00% 
------------------------------------------------------------------
   100    99.82%    57.63%     1.10%     0.04%     0.01%     0.01% 
   200    99.86%    78.86%     6.51%     0.34%     0.05%     0.05% 
   500    99.91%    92.85%    26.95%     2.23%     0.59%     0.52% 
  1000    99.94%    96.95%    51.23%     7.75%     2.31%     1.74% 
  2000    99.96%    98.65%    72.52%    21.74%     7.86%     5.80% 
  5000    99.97%    99.74%    90.97%    52.13%    28.52%    21.61% 
 10000   100.00%    99.94%    98.31%    78.28%    55.19%    45.28% 
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

What this table is saying is that if you learn, say, the 500 most frequent forms, you’ll be able to read at least 75% of the forms in 26.95% of verses.
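The two awk commands used in this post differ only in which column they pick as the item. As a rough Python equivalent, the sketch below yields the same verse-item pairs. The function name `target_item_pairs` and the constants `FORM` and `LEMMA` are hypothetical; the only thing taken from the post is that awk’s `$1` is the verse reference, `$6` the form and `$7` the lexeme, with fields separated by whitespace.

```python
import glob

# Zero-based equivalents of awk's $6 and $7 (assumed column meanings).
FORM = 5
LEMMA = 6


def target_item_pairs(paths, item_column):
    """Yield (verse_reference, item) for every token line in the files."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                fields = line.split()
                # Skip blank or short lines rather than failing on them.
                if len(fields) <= item_column:
                    continue
                yield fields[0], fields[item_column]


if __name__ == "__main__":
    # Same file pattern as the awk commands; prints "target item" per line.
    files = sorted(glob.glob("sblgnt/*-morphgnt.txt"))
    for target, item in target_item_pairs(files, LEMMA):
        print(target, item)
```

Passing `FORM` instead of `LEMMA` gives the verses-forms pairing.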

Various talks, including those at BibleTech in 2010 and 2015, explain a ton of caveats around these numbers, but I wanted to at least refresh them (and the code) with the latest data.


