Generating Readers

7th November 2015 / by James Tauber

Back in April 2014, Brian Renshaw posted a Good Friday Greek Reader. It was presumably produced by hand, but I knew such things could be generated automatically, so I set about building a system to do so.

You can see a sample PDF at https://github.com/jtauber/greek-reader/blob/master/example/reader.pdf which looks roughly like what Brian produced.

From a code point of view, it’s a fairly simple Python 3 script that generates LaTeX that is then typeset using XeTeX. There is also an experimental backend using SILE. The code is open source under an MIT license and is available at https://github.com/jtauber/greek-reader. It assumes you’re comfortable with those tools and with editing text files to tweak things, but my hope is that a website could eventually be built around it.
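To make the "Python generates LaTeX" idea concrete, here is a minimal sketch. It is not the greek-reader code: the function names `latex_escape` and `render_verse`, the token dictionaries, and the footnote layout are all assumptions made for illustration, and the gloss and parse shown are illustrative values.

```python
def latex_escape(text):
    """Escape a few characters that are special in LaTeX (hypothetical helper)."""
    replacements = {
        "\\": r"\textbackslash{}",
        "&": r"\&",
        "%": r"\%",
        "#": r"\#",
        "_": r"\_",
        "{": r"\{",
        "}": r"\}",
    }
    # Work character by character so that inserted backslashes are not re-escaped.
    return "".join(replacements.get(ch, ch) for ch in text)


def render_verse(ref, tokens):
    """Return one LaTeX line: a bold reference, then the words of the verse.

    Each token is a dict with a "text" key and an optional "note" key;
    a token with a note gets a LaTeX footnote attached to it.
    """
    parts = []
    for token in tokens:
        word = latex_escape(token["text"])
        note = token.get("note")
        if note:
            word += r"\footnote{" + latex_escape(note) + "}"
        parts.append(word)
    return r"\textbf{" + latex_escape(ref) + "} " + " ".join(parts)


# Illustrative input only; a real reader would draw this from the data sources.
tokens = [
    {"text": "ἐδάκρυσεν", "note": "δακρύω: weep (AAI 3S)"},
    {"text": "ὁ"},
    {"text": "Ἰησοῦς."},
]
print(render_verse("John 11:35", tokens))
```

The printed line would still need to be wrapped in a LaTeX document that loads a font with polytonic Greek coverage before XeTeX could typeset it.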

To produce a reader like this, whether manually or automatically, you need:

1. a text
2. lemmatization
3. frequency counts
4. glosses
5. full citation forms / headwords (e.g. λαμπάς, άδος, ἡ) for nominals
6. parsing (e.g. AAI 3S) for verbs

MorphGNT gave me items 1, 2, 3 and 6. Item 4 came from Dodson (although you can override the glosses both globally and per verse) and item 5 came from Danker’s Concise Lexicon.
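One way to picture how the six ingredients come together for a single word is the sketch below. The `Word` class, its field names, and the `annotation` function are hypothetical and are not taken from greek-reader; the counts and glosses are illustrative values, not copied from MorphGNT, Dodson, or Danker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str                       # 1. the form as it appears in the text
    lemma: str                      # 2. lemmatization
    count: int                      # 3. frequency count for the lemma
    gloss: str                      # 4. short English gloss
    headword: Optional[str] = None  # 5. full citation form (nominals only)
    parse: Optional[str] = None     # 6. parsing code (verbs only)


def annotation(word):
    """Build the note shown to the reader for one word."""
    # Prefer the full citation form when there is one, else the bare lemma.
    label = word.headword or word.lemma
    note = f"{label}: {word.gloss}"
    if word.parse:
        note += f" ({word.parse})"
    return note


# Illustrative records: one nominal and one verb.
noun = Word("λαμπάδων", "λαμπάς", 9, "torch, lamp", headword="λαμπάς, άδος, ἡ")
verb = Word("ἐδάκρυσεν", "δακρύω", 1, "weep", parse="AAI 3S")

print(annotation(noun))
print(annotation(verb))
```

Keeping each ingredient as its own field means a correction in any one upstream source only has to change that field.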

What’s nice about doing this programmatically, besides the fact that you can make corrections upstream and have them applied to all the generated readers, is that you can make the reader adaptive. In the example, I chose which words to annotate based on frequency, but the choice could just as easily be based on other criteria, such as what a particular student has learnt so far or what a particular textbook has covered up to a given point.
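The adaptive idea can be sketched by treating the selection criterion as a function that can be swapped out. Everything here is an assumption made for illustration: the names `by_frequency`, `by_known_vocabulary`, and `words_to_annotate` are hypothetical, the threshold of 30 is arbitrary, and the counts are round illustrative numbers rather than real frequency data.

```python
def by_frequency(max_count):
    """Annotate any lemma that occurs at most max_count times."""
    def rule(lemma, count):
        return count <= max_count
    return rule


def by_known_vocabulary(known_lemmas):
    """Annotate any lemma the student (or textbook) has not covered yet."""
    known = frozenset(known_lemmas)

    def rule(lemma, count):
        return lemma not in known
    return rule


def words_to_annotate(words, rule):
    """Return the lemmas from (lemma, count) pairs that the rule selects."""
    return [lemma for lemma, count in words if rule(lemma, count)]


# Illustrative (lemma, count) pairs.
words = [("καί", 9000), ("λέγω", 2000), ("λαμπάς", 9), ("δακρύω", 1)]

# Criterion 1: frequency, as in the example reader.
print(words_to_annotate(words, by_frequency(30)))

# Criterion 2: what a particular student already knows.
print(words_to_annotate(words, by_known_vocabulary({"καί", "λέγω", "λαμπάς"})))
```

Because both criteria share the same `rule(lemma, count)` shape, a further one (for example, coverage in a given textbook chapter) could be added without touching the code that produces the reader.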

One major feature I want to add, though, is richer annotation, both morphological AND syntactic, so that it becomes possible to generate something more akin to Zerwick and Grosvenor’s A Grammatical Analysis of the Greek New Testament.

One major motivation for my continuing work on a Morphological Lexicon is being able to provide more focused, helpful annotations for readers indicating not just a lemma but a principal part or some additional information that helps the student understand the form.

For the syntax, I’d like to eventually develop a catalog of constructions so that, much as forms are only annotated if they are less frequent (or otherwise unknown to the student), particular syntactic constructions in a text can be called out based on similar criteria. Some of this is possible with existing syntactic analyses; the trick is knowing which annotations to include and which are already obvious. (I have some ideas for how to crowdsource difficult constructions, but more on that later.)

The greek-reader project is a great example of a pretty simple tool that can do a lot because it builds on rich data. As we get better and better data, we can build better and better tools.


J. K. Tauber

at the intersection of computing, linguistics, philology, and learning science


By day I’m an entrepreneur, web technologist and open-source developer but my academic background is in linguistics (along with some classics, comparative philology, and educational statistics) and my main avocation is working on text, annotations, analysis and software relating to historical languages with a particular interest in facilitating better learning.

While my focus has mostly been on Biblical Greek, much of the work is highly relevant to other Hellenistic Greek texts, other dialects of Ancient Greek and, indeed, texts in completely different languages as well.

All code written for this endeavour is open source, and text and data are made available under a Creative Commons license to the extent allowed by the sources used.

I can be contacted at jtauber@jtauber.com.
