Converting the GBI Syntax Trees to a Dependency Analysis
Non-leaf nodes in the GBI syntax trees have a Head attribute, which indicates the index of the child considered the head.
So the algorithm is fairly straightforward. For each leaf-node:
- walk up the tree until you find a node whose Head attribute is NOT the index of the child we just came from
- follow the Head attributes back down the tree until you hit another leaf-node
- that second leaf-node is the head of the leaf-node you started on
- the “type” of the dependency is the Cat of the second-to-last node you visited walking up in step 1.
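As a rough illustration of those steps, here is a minimal sketch over an in-memory tree. It is not the code from the gist: the Node class, the leaf category names ("verb", "det", "noun") and the toy tree shape are all assumptions made up for the example, and the real GBI data is XML with more attributes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node, invented for this sketch."""
    cat: str
    head: int = 0                       # index of the head child (non-leaf nodes)
    children: list["Node"] = field(default_factory=list)
    word: Optional[str] = None          # set on leaf nodes only
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def __post_init__(self):
        # Record parent links so we can walk up from a leaf.
        for child in self.children:
            child.parent = self


def leaves(node):
    """Yield leaf nodes in left-to-right order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def dependency(leaf):
    """Return (head_leaf, relation); head_leaf is None for the root word."""
    child, parent = leaf, leaf.parent
    # Walk up while the node we came from is its parent's head child.
    while parent is not None and parent.children[parent.head] is child:
        child, parent = parent, parent.parent
    if parent is None:
        return None, child.cat          # ran off the top: this word is the root
    # Follow Head indexes back down until we reach a leaf.
    target = parent.children[parent.head]
    while target.children:
        target = target.children[target.head]
    # `child` is the second-to-last node visited on the way up.
    return target, child.cat


# Toy tree (assumed shape) for "ἠγάπησεν ὁ θεὸς τὸν κόσμον".
tree = Node("CL", 0, [
    Node("V", 0, [Node("verb", word="ἠγάπησεν")]),
    Node("S", 0, [Node("np", 1, [Node("det", word="ὁ"), Node("noun", word="θεὸς")])]),
    Node("O", 0, [Node("np", 1, [Node("det", word="τὸν"), Node("noun", word="κόσμον")])]),
])

result = []
for leaf in leaves(tree):
    head, relation = dependency(leaf)
    result.append((leaf.word, head.word if head else None, relation))
```

On this toy tree, the verb comes out as the root with relation CL, each article depends on its noun as det, and the two nouns depend on the verb as S and O.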
The only catch is that the source data this script uses omits Head altogether in three types of cases. The original GBI analysis treated the Head as being "1" in these cases, so I special-case that in the code. I don’t necessarily agree with the choice, but it’s easy to change (see below).
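One way to handle the missing attribute is a small helper that falls back to a default. This is only a sketch: the element name Node, the placeholder Cat value and the helper name head_index are assumptions for illustration, not taken from the gist.

```python
import xml.etree.ElementTree as ET

DEFAULT_HEAD = 1  # what the original GBI analysis assumed; easy to change


def head_index(node, default=DEFAULT_HEAD):
    """Return the Head attribute as an int, or `default` when it is omitted."""
    head = node.get("Head")
    return int(head) if head is not None else default


# Hypothetical elements: one with an explicit Head, one without.
with_head = ET.fromstring('<Node Cat="X" Head="0"><Node/><Node/></Node>')
without_head = ET.fromstring('<Node Cat="X"><Node/><Node/></Node>')

explicit = head_index(with_head)                   # uses the attribute
fallback = head_index(without_head)                # falls back to DEFAULT_HEAD
overridden = head_index(without_head, default=0)   # a different choice of default
```

Keeping the default in one place means changing the choice of head for these cases is a one-line edit.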
I’ve put the code in a gist: https://gist.github.com/jtauber/c02d0928811b7ed21c9a
The result (on the first part of John 3.16) is:
64003016001 Οὕτως 64003016003 ADV
64003016002 γὰρ 64003016003 conj
64003016003 ἠγάπησεν None CL
64003016004 ὁ 64003016005 det
64003016005 θεὸς 64003016003 S
64003016006 τὸν 64003016007 det
64003016007 κόσμον 64003016003 O
64003016008 ὥστε 64003016013 conj
64003016009 τὸν 64003016010 det
64003016010 υἱὸν 64003016013 O
64003016011 τὸν 64003016012 det
64003016012 μονογενῆ 64003016010 np
64003016013 ἔδωκεν, 64003016003 CL
The dependency relationship color highlighting experiment on this site shows a possible way of conveying this dependency information in a text (in this case, 2 John).
As mentioned, I don’t always agree with the GBI choice of head. However, it’s fairly straightforward to alter the code to override the choice of head in certain contexts.
For example, if you consider the complementizer the head, you can just add code that takes Head="0" where Rule="that-VP" and so on. Similarly with prepositions, determiners, etc.
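A table keyed on Rule is one possible way to express such overrides. The sketch below assumes the nodes are XML elements carrying Head and Rule attributes; the element name Node, the table name HEAD_OVERRIDES, the helper head_index and the second rule name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Rule value -> head index to use instead of the one in the data.
HEAD_OVERRIDES = {"that-VP": 0}  # treat the complementizer as the head


def head_index(node, overrides=HEAD_OVERRIDES, default=1):
    """Pick the head child index, letting per-Rule overrides win."""
    rule = node.get("Rule")
    if rule in overrides:
        return overrides[rule]
    head = node.get("Head")
    return int(head) if head is not None else default


# Hypothetical nodes; "some-other-rule" is a made-up placeholder.
that_clause = ET.fromstring('<Node Cat="CL" Rule="that-VP" Head="1"><Node/><Node/></Node>')
other = ET.fromstring('<Node Cat="np" Rule="some-other-rule" Head="1"><Node/><Node/></Node>')

that_head = head_index(that_clause)   # override applies
other_head = head_index(other)        # data's own Head is used
```

Adding entries for prepositions or determiners would then just mean adding more Rule keys to the table.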
Finally, note that the original tree cannot be fully reconstructed from the dependency data, because the algorithm effectively eliminates information on some intermediate nodes. Some may consider this an advantage.