
Topic Tree

Many of the techniques I have been applying to my corpora were specifically designed for use with “Big Data,” which I do not have, unless one considers eighty-something poems “Big” data. I believe that it is possible to apply quantitative analysis to smaller groups of texts and obtain meaningful results. Ted Underwood would probably disagree with me on this, but I think even he would be surprised at the results I am getting from applying his Topic Tree method to Nájera’s poetry.

Topic modeling is very popular at the moment, but when I started working with modernista poetry I had my doubts about that approach. I didn’t know if it was the right one for my data, as topic modeling tends to emphasize the isolation of key themes in a single text. Because I was treating each poem as an individual work, I needed a technique that could help me establish links among the poems based on the recurrence of similar words/topics. Underwood’s technique (an alternative to topic modeling, which he calls Topic Tree) does exactly that. He applied it to a huge collection of 18th-century documents and produced this dendrogram. He also has a post explaining his technique in detail. Underwood uses a vector space model to compare words across corpora, but instead of employing the tf-idf scores normally used by search engines, he has developed his own formula, which he explains in this Tech Note. In the same Tech Note, he has released his R code (as well as a very handy script to divide large trees into manageable sections).
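To make the vector space idea concrete, here is a minimal sketch in Python (not Underwood's R code). It represents each word as a vector with one weight per poem and compares two words by cosine similarity. Note the assumptions: it uses plain tf-idf weights as a stand-in, since Underwood's own formula is only described in his Tech Note and is not reproduced here; the function names `word_vectors` and `cosine` are hypothetical; and the "poems" are made-up toy strings, not Nájera's text.

```python
import math
from collections import Counter


def word_vectors(docs, vocab):
    """Return {word: [weight per document]} using tf-idf weights.

    Assumption: tf-idf is a stand-in here, not Underwood's formula.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    vectors = {}
    for word in vocab:
        # Document frequency: in how many documents does the word occur?
        df = sum(1 for c in counts if c[word] > 0)
        idf = math.log(n_docs / df) if df else 0.0
        vectors[word] = [c[word] * idf for c in counts]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy documents, invented for illustration only.
poems = [
    "white rose white moon",
    "death shadow tomb",
    "moon rose garden",
    "tomb shadow night",
]
vecs = word_vectors(poems, ["white", "rose", "moon", "death", "shadow", "tomb"])

# Words that recur in the same poems end up with similar vectors.
sim_rose_moon = cosine(vecs["rose"], vecs["moon"])
sim_rose_tomb = cosine(vecs["rose"], vecs["tomb"])
```

In this toy data, "rose" and "moon" always occur in the same poems and never alongside "tomb", so the first similarity is high and the second is zero. That is the kind of signal that links poems through shared vocabulary.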

Let me now show the results I got using the Topic Tree technique. I fed my script 250 common words, and that produced four main branches. As expected, the branches of my tree reflect Nájera’s main topics.
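For readers curious how a set of word vectors turns into branches, the following is a rough sketch of agglomerative clustering in Python. It is an illustration under assumptions, not the script I ran: it uses average linkage on cosine distance (the linkage in Underwood's R code may differ), the function name `cluster_into_branches` is hypothetical, and the four toy vectors are invented.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity; 0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)


def cluster_into_branches(vectors, n_branches):
    """Merge the two closest clusters until n_branches remain.

    Assumption: average linkage; other linkage rules give other trees.
    """
    clusters = [[w] for w in sorted(vectors)]

    def avg_dist(a, b):
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_branches:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy word vectors (one weight per poem), invented for illustration.
toy_vectors = {
    "rose": [2, 0, 1, 0],
    "moon": [1, 0, 2, 0],
    "tomb": [0, 2, 0, 1],
    "shadow": [0, 1, 0, 2],
}
branches = cluster_into_branches(toy_vectors, 2)
```

With these toy vectors, the two branches group "moon" with "rose" and "shadow" with "tomb". A real dendrogram also records the order and height of each merge, which this sketch leaves out for brevity.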

In branch one, the topics are poetry, faith and childhood:

[Image: rama1]

In branch two, all the words are related to the experience of death:

[Image: rama2]

In the third branch the focus is on natural elements, with strong emphasis on flying animals and love:

[Image: rama3]

The fourth and last branch is also the most interesting one. In the top part the main topic is poetry and its representation of beauty/nature, towards the middle the dominant topic is sexual attraction, and the lower part of the branch shows the poetic representation of women in modernista poetry (notice the words that cluster around “white”).

[Image: rama4]

The experiment, in my opinion, was very successful and, of course, I am now curious about what a topic tree of modernista poetry would look like.