Some Initial Vocabulary Statistics

29th August 2017 / by James Tauber

Here are some very preliminary statistics from the Greek Vocab site’s first month.

So far 82 people have signed up to http://vocab.oxlos.org/ and 52 have completed at least the first activity, a common noun receptive vocabulary leveling test based on a test form developed (for English) by Paul Nation.

Recall from my initial post on the site that vocabulary items in that activity are classified into one of five buckets based on how many times they occur in the Greek New Testament.

Here are the mean results (with standard error) for each bucket for the first activity (N=52):

bucket	occurrences	mean ± std err
1	32 or more times	0.966 ± 0.008
2	16 to 31 times	0.837 ± 0.028
3	4 to 15 times	0.667 ± 0.041
4	2 or 3 times	0.556 ± 0.049
5	1 time	0.582 ± 0.047
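For readers unfamiliar with the notation, here is a minimal sketch of how a "mean ± std err" figure can be computed. It assumes each person contributes one score per bucket (the proportion of items answered correctly) and that the standard error is the sample standard deviation divided by the square root of N. The function name and the scores are made up for illustration; they are not data from the site.

```python
import math
import statistics


def mean_and_std_err(scores):
    """Return (mean, standard error of the mean) for per-person scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    std_err = statistics.stdev(scores) / math.sqrt(n)
    return mean, std_err


# Invented scores for four hypothetical people in one bucket.
example_scores = [1.0, 0.75, 0.5, 0.75]
mean, std_err = mean_and_std_err(example_scores)
print(f"{mean:.3f} ± {std_err:.3f}")
```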

The first four buckets get increasingly difficult, as one would expect. But notice that buckets 4 and 5 are indistinguishable within the standard error of the two means.
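One way to make "indistinguishable within the standard error" concrete is to check whether the mean ± std err ranges of two adjacent buckets overlap. The sketch below does that for the first activity using the figures from the table above; the helper name `intervals_overlap` is my own, and the overlap rule is a simple reading of the comparison rather than a formal significance test.

```python
# Means and standard errors for the first activity, copied from the table.
first_activity = {
    1: (0.966, 0.008),
    2: (0.837, 0.028),
    3: (0.667, 0.041),
    4: (0.556, 0.049),
    5: (0.582, 0.047),
}


def intervals_overlap(a, b):
    """True if the ranges mean ± std err for a and b overlap."""
    (mean_a, se_a), (mean_b, se_b) = a, b
    return abs(mean_a - mean_b) <= se_a + se_b


# Compare each bucket with the next one down in frequency.
overlaps = {
    (bucket, bucket + 1): intervals_overlap(
        first_activity[bucket], first_activity[bucket + 1]
    )
    for bucket in range(1, 5)
}

for pair, overlapping in overlaps.items():
    label = "overlap" if overlapping else "separate"
    print(f"buckets {pair[0]} and {pair[1]}: {label}")
```

Only the last pair (buckets 4 and 5) overlaps under this rule.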

Here are the results of the next three activities of the same type.

bucket	GNT Nouns 2	GNT Nouns 3	GNT Nouns 4
	N=30	N=19	N=15
1	0.985 ± 0.004	0.991 ± 0.005	0.985 ± 0.007
2	0.894 ± 0.020	0.901 ± 0.021	0.930 ± 0.018
3	0.631 ± 0.046	0.661 ± 0.039	0.689 ± 0.051
4	0.602 ± 0.060	0.570 ± 0.067	0.574 ± 0.059
5	0.450 ± 0.048	0.556 ± 0.064	0.611 ± 0.050

GNT Nouns 2 actually does successfully separate buckets 4 and 5 (apparently the hapax legomena in that test were harder) but it doesn’t do a great job distinguishing buckets 3 and 4. GNT Nouns 3 fails to distinguish buckets 4 and 5 and only barely separates 3 and 4. GNT Nouns 4 likewise doesn’t really distinguish buckets 4 and 5 and only barely separates 3 and 4.
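A rough way to put numbers on "how well separated" is to express the gap between two adjacent bucket means in units of their combined standard error, sqrt(se_a² + se_b²). This is a different yardstick from eyeballing the ± ranges, and it treats the two means as independent, which they are not quite (the same people answer items in every bucket), so take it as a gauge only. The figures are copied from the table above; the function name is hypothetical.

```python
import math

# (mean, std err) per bucket for each activity, copied from the table.
activities = {
    "GNT Nouns 2": {3: (0.631, 0.046), 4: (0.602, 0.060), 5: (0.450, 0.048)},
    "GNT Nouns 3": {3: (0.661, 0.039), 4: (0.570, 0.067), 5: (0.556, 0.064)},
    "GNT Nouns 4": {3: (0.689, 0.051), 4: (0.574, 0.059), 5: (0.611, 0.050)},
}


def gap_in_combined_se(easier, harder):
    """Difference of means divided by the combined standard error.

    Positive when the nominally easier bucket scored higher.
    """
    (mean_e, se_e), (mean_h, se_h) = easier, harder
    return (mean_e - mean_h) / math.sqrt(se_e**2 + se_h**2)


gaps = {
    name: {
        (a, b): gap_in_combined_se(buckets[a], buckets[b])
        for a, b in [(3, 4), (4, 5)]
    }
    for name, buckets in activities.items()
}

for name, pairs in gaps.items():
    for (a, b), gap in pairs.items():
        print(f"{name}: buckets {a} vs {b}: {gap:+.2f} combined std errs")
```

A negative value means the nominally harder bucket actually scored higher, as with buckets 4 and 5 in GNT Nouns 4.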

It should be noted that the ability level of the average person doing an activity increases with each activity. This isn’t clear from the data presented here, but it is clear from other data. The likely cause is that a person who has done reasonably well on one activity is more likely to continue on to further activities.

I could mitigate this problem by only including results for earlier activities from people who have completed all four. But before I do that, I’d actually like to just see more people do all four activities.
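The mitigation amounts to a filter: keep only people with a result for every activity, then recompute the means for the earlier activities. Here is a minimal sketch, assuming results are stored as a mapping from person to per-activity scores. The data layout, activity labels, person names, and scores are all invented for illustration and do not reflect how the site stores its data.

```python
from statistics import fmean

ACTIVITIES = ["activity_1", "activity_2", "activity_3", "activity_4"]

# Hypothetical overall scores, keyed by person and then by activity.
scores = {
    "person_a": {"activity_1": 0.9, "activity_2": 0.8,
                 "activity_3": 0.8, "activity_4": 0.9},
    "person_b": {"activity_1": 0.5},
    "person_c": {"activity_1": 0.7, "activity_2": 0.6,
                 "activity_3": 0.7, "activity_4": 0.6},
    "person_d": {"activity_1": 0.6, "activity_2": 0.5},
}


def completers(scores, activities):
    """Keep only people who have a score for every activity."""
    return {
        person: results
        for person, results in scores.items()
        if all(activity in results for activity in activities)
    }


def mean_by_activity(scores, activities):
    """Mean score per activity over whoever attempted that activity."""
    means = {}
    for activity in activities:
        values = [r[activity] for r in scores.values() if activity in r]
        means[activity] = fmean(values) if values else None
    return means


everyone = mean_by_activity(scores, ACTIVITIES)
finished_all = mean_by_activity(completers(scores, ACTIVITIES), ACTIVITIES)
print("activity_1, everyone:     ", round(everyone["activity_1"], 3))
print("activity_1, finished all: ", round(finished_all["activity_1"], 3))
```

In this toy data the first-activity mean rises once the two people who stopped early are dropped, which is the kind of shift the filter is meant to expose.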

Furthermore, the vast majority of people doing these activities are scoring above 50% and, in fact, no one scoring below 40% has attempted activities beyond the first. I need more beginner-to-intermediate level people to do all four tests! They will better discriminate mid-to-hard difficulty items (more on that concept later).

But preliminary indications are that I haven’t quite got the buckets right yet. Fortunately, I can re-run analyses with different bucketing even if the distribution of items chosen for the tests is based on the existing bucketing scheme.
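Re-running with different bucketing is possible because a bucket is just a function of an item's occurrence count. The sketch below encodes the existing scheme as a list of lower bounds and shows one alternative that merges the two rarest buckets. The alternative boundaries, the function names, and the item counts are hypothetical, chosen only to show the mechanics.

```python
from bisect import bisect_right

# Lower bounds separating the existing buckets: 1 | 2-3 | 4-15 | 16-31 | 32+
CURRENT_BOUNDS = [2, 4, 16, 32]

# Hypothetical alternative: merge "1 time" and "2 or 3 times" into one bucket.
MERGED_BOUNDS = [4, 16, 32]


def bucket_for(count, bounds):
    """Bucket number (1 = most frequent) for an item occurring `count` times."""
    if count < 1:
        raise ValueError("occurrence count must be at least 1")
    # bisect_right gives how many lower bounds are <= count.
    return len(bounds) + 1 - bisect_right(bounds, count)


def rebucket(item_counts, bounds):
    """Group item identifiers by bucket under the given boundaries."""
    groups = {}
    for item, count in item_counts.items():
        groups.setdefault(bucket_for(count, bounds), []).append(item)
    return groups


# Invented occurrence counts for placeholder items.
item_counts = {"item_a": 120, "item_b": 20, "item_c": 7, "item_d": 3, "item_e": 1}

print(rebucket(item_counts, CURRENT_BOUNDS))
print(rebucket(item_counts, MERGED_BOUNDS))
```

Per-item results can then be re-aggregated under the new grouping without changing which items appeared in the tests.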

I’ll continue to blog more statistics over time. Some topics I’d like to explore include inter-test reliability, G-theory, ANOVA, and IRT modeling.

Thank you to everyone who is contributing to this. Please spread the word!


