NT Book Similarity by Jaccard Distance of Lemma Sets

29th July 2017 / by James Tauber

I was thinking about vocabulary differences between books of the New Testament and decided to see what happens when you do a hierarchical clustering analysis of NT books using the Jaccard distance of their lemma sets.

UPDATE: I'm now convinced much (although not all) of this is due to length effects. If you think about it, the Jaccard distance between a large set and a small set is going to be large just by virtue of the large set having more in it than the small set. This will naturally group the non-letters together, the short letters together, Romans and the Corinthian letters together and so on. So until I come up with a way to correct Jaccard distance for text length, I'd take this post with a grain of salt.
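To make the length effect concrete: for two sets of sizes m ≤ n, the intersection can hold at most m elements and the union at least n, so the Jaccard distance can never drop below 1 - m/n, however similar the vocabularies are. Here is a minimal sketch of that floor, using made-up set sizes rather than real lemma counts (the function names are mine, not from any library):

```python
def jaccard_distance(a, b):
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1 - len(a & b) / len(union)


def length_floor(size_a, size_b):
    """Smallest Jaccard distance possible for two sets of these sizes."""
    small, large = sorted((size_a, size_b))
    if large == 0:
        return 0.0
    return 1 - small / large


# Hypothetical sizes: a "long book" with 1000 lemmas and a "short book"
# whose 100 lemmas ALL occur in the long book (the best case for similarity).
long_book = set(range(1000))
short_book = set(range(100))

best_case = jaccard_distance(long_book, short_book)
floor = length_floor(len(long_book), len(short_book))
print(best_case, floor)
```

Even in this best case the short book sits at the floor, far from the long book, which is the kind of length-driven grouping described above.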

This is some old-school stylometry but the results are still pretty interesting. For each book, I calculated the set of lemmas and then, for each pair of books, calculated the Jaccard coefficient (the size of the intersection of the two sets divided by the size of their union). The Jaccard distance is one minus that coefficient.
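The calculation just described can be sketched in a few lines. The book names and lemma sets below are toy values invented for illustration; real input would be the full set of lemmas attested in each book, and the function names are my own.

```python
from itertools import combinations

# Toy lemma sets, invented for illustration only.
lemmas_by_book = {
    "BookA": {"λόγος", "θεός", "εἰμί", "ἀρχή"},
    "BookB": {"λόγος", "θεός", "εἰμί", "πίστις"},
    "BookC": {"πίστις", "νόμος", "χάρις"},
}


def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(union)


def pairwise_jaccard_distances(sets_by_name):
    """Map each unordered pair of names to 1 - Jaccard coefficient."""
    return {
        (x, y): 1 - jaccard_coefficient(sets_by_name[x], sets_by_name[y])
        for x, y in combinations(sorted(sets_by_name), 2)
    }


distances = pairwise_jaccard_distances(lemmas_by_book)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

With these toy sets, BookA and BookB share three of five distinct lemmas, so they come out much closer to each other than either does to BookC.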

I then did a cluster analysis using Ward’s criterion and rendered the results as a dendrogram:

[Figure: dendrogram of the NT books]
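The post doesn't say which software produced the clustering, so the following is only a sketch of the technique: a small pure-Python version of agglomerative clustering under Ward's criterion, using the Lance-Williams update on squared distances. The labels and distances are made up. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the usual choice. Note that Ward's criterion is formally defined for Euclidean distances, so applying it to Jaccard distances is a heuristic.

```python
def ward_linkage(names, dist):
    """Agglomerative clustering with Ward's criterion (Lance-Williams form).

    names: list of item labels
    dist:  dict mapping frozenset({a, b}) -> distance between items a and b
    Returns a list of merges (cluster_1, cluster_2, merge_height), where
    each cluster is a tuple of labels.
    """
    clusters = {(n,): 1 for n in names}  # cluster -> number of items
    keys = list(clusters)
    d2 = {}  # squared distances between the current clusters
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            d2[frozenset((a, b))] = dist[frozenset((a[0], b[0]))] ** 2

    merges = []
    while len(clusters) > 1:
        pair = min(d2, key=d2.get)  # closest pair of clusters
        a, b = sorted(pair)
        na, nb = clusters[a], clusters[b]
        dab = d2.pop(pair)
        del clusters[a], clusters[b]
        merged = a + b
        # Lance-Williams update for Ward: squared distance from the new
        # cluster to every remaining cluster k.
        for k, nk in clusters.items():
            dak = d2.pop(frozenset((a, k)))
            dbk = d2.pop(frozenset((b, k)))
            total = na + nb + nk
            d2[frozenset((merged, k))] = (
                (na + nk) * dak + (nb + nk) * dbk - nk * dab
            ) / total
        clusters[merged] = na + nb
        merges.append((a, b, dab ** 0.5))
    return merges


# Hypothetical distances between four items (not real NT data).
toy = {
    frozenset("AB"): 0.2, frozenset("CD"): 0.3,
    frozenset("AC"): 0.8, frozenset("AD"): 0.9,
    frozenset("BC"): 0.85, frozenset("BD"): 0.9,
}
merges = ward_linkage(["A", "B", "C", "D"], toy)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The list of merges, read from first to last, is the dendrogram from the leaves up: the two tight pairs join first and the final merge is the top-level split.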

Notice that the first split is between the letters and non-letters.

Within the non-letters, John’s Gospel and Revelation cluster together as do Acts and the Synoptics. The Synoptics cluster with each other more than they do with Acts. Matthew and Mark cluster together more than they do with Luke.

The highest division in the letters is between:

the non-pastoral Pauline epistles plus Hebrews, James and 1 Peter
the pastorals plus the rest of the general epistles (2 Peter, the Johannine epistles and Jude)

That first division of letters further clusters into:

Galatians, Ephesians, Philippians, Colossians, 1 Thessalonians, 2 Thessalonians
Romans, 1 Corinthians, 2 Corinthians, Hebrews, James and 1 Peter

Ephesians and Colossians cluster together, the two epistles to the Thessalonians cluster together, and Galatians and Philippians cluster together.

Romans, 1 Corinthians, and 2 Corinthians cluster (although 1 Corinthians clusters closer to Romans than to 2 Corinthians). James and 1 Peter cluster. Hebrews is in the same overall group but clusters closer to the Romans/Corinthian subgroup.

The second division of letters clusters into:

Philemon, 2 John, 3 John
Titus, 1 Timothy, 2 Timothy
Jude, 1 John, 2 Peter

with the second and third clustering slightly closer than the first.

2 John and 3 John cluster much closer to each other than to Philemon. The epistles to Timothy cluster slightly closer together than they do to Titus. 1 John and 2 Peter cluster slightly closer together than they do with Jude.

I haven’t thought about length effects here but they may influence the clustering of very short books together (and possibly very long books). A lot of the clustering does follow similar lengths so it’s definitely worth thinking more about.

Of course, there’s nothing new about this kind of analysis. As I said at the start, it’s old school—the sort of thing I can imagine being published in a “humanities computing” journal in the 80s. But it’s still interesting. And it might be even more interesting to apply to finer-grained text divisions and/or with properties other than lemmas.


J. K. Tauber

at the intersection of computing, linguistics, philology, and learning science


By day I’m an entrepreneur, web technologist and open-source developer but my academic background is in linguistics (along with some classics, comparative philology, and educational statistics) and my main avocation is working on text, annotations, analysis and software relating to historical languages with a particular interest in facilitating better learning.

While my focus has mostly been on Biblical Greek, much of the work is highly relevant to other Hellenistic Greek texts, other dialects of Ancient Greek and, indeed, texts in completely different languages as well.

All code written for this endeavour is open source and text and data is made available under a Creative Commons license to the extent allowed by the sources used.

I can be contacted at jtauber@jtauber.com.
