Anyone who has studied modernista poetry has an image of what modernista language looks like. There are certain words that one automatically associates with this movement: “cisnes,” “princesas,” “estatuas,” “céfiro,” and so on. But how many of the so-called “modernista” words are also among the most frequent words (MFWs) employed by Darío and his followers? For example, I do not know why I always think of “nácar” as one of the quintessential modernista words. It was surprising to discover that Darío does not use it in any of the poems in my database (I do not know whether he uses it in his prose).
The following is a list of twenty of Darío’s MFWs (excluding articles, prepositions and common verbs):
And then there are those two words one usually associates with Darío’s poetry. They appear much less frequently in his texts:
The gap between these two classical modernista words and the MFWs in Darío’s poetry is clearly shown in the following image.
I know it is not right for me to take Darío’s vocabulary as representative of modernista poetry in general, but I am not sure that adding corpora from other writers and compiling a list of their shared MFWs would solve the problem. And is this “gap” simply a mismatch between perception and reality?
I suppose that in order to contrast the public perception of modernista vocabulary with the actual MFWs in their texts, one needs to find a way to compile a list of modernista words as understood by readers and critics.
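For readers who want to try this kind of comparison on their own texts, here is a minimal sketch of how one might count MFWs and then look up where a “perceived” modernista word falls in the ranking. Everything in it is an assumption made for illustration: the stopword list is tiny and incomplete, the helper names `word_counts` and `rank_of` are hypothetical, and the three “poems” are invented toy strings, not lines by Darío.

```python
import re
from collections import Counter

# Tiny, incomplete stopword list, for illustration only (an assumption,
# not the list used for the counts discussed in this post).
STOPWORDS = {"el", "la", "los", "las", "un", "una", "de", "del",
             "en", "y", "a", "que", "es", "con", "por", "su", "se"}

def word_counts(poems):
    """Count every non-stopword token across a list of poem strings."""
    counts = Counter()
    for poem in poems:
        # Runs of letters only; accented characters such as "á" are kept.
        tokens = re.findall(r"[^\W\d_]+", poem.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

def rank_of(word, counts):
    """Return the 1-based frequency rank of a word, or None if it never occurs."""
    ranked = [w for w, _ in counts.most_common()]
    return ranked.index(word) + 1 if word in ranked else None

# Invented toy strings standing in for a corpus of poems.
poems = ["alma azul alma", "la luz del alma", "cisne azul"]
counts = word_counts(poems)

print(counts.most_common(20))  # the most frequent words
for perceived in ["cisne", "nácar"]:  # words a reader might expect to be frequent
    print(perceived, rank_of(perceived, counts))
```

A word whose rank comes back as `None` never appears in the corpus at all, which is the situation I ran into with “nácar.”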
Many of the techniques I have been applying to my corpora were specifically designed for use with “Big Data,” which I do not have, unless one considers eighty-something poems “Big” data. I believe that it is possible to apply quantitative analysis to smaller groups of texts and obtain meaningful results. Ted Underwood would probably disagree with me on this, but I think even he would be surprised at the results I am getting from applying his Topic Tree method to Nájera’s poetry.
Topic modeling is very popular at the moment, but when I started working with modernista poetry I had my doubts about that approach. I didn’t know if it was the right one for my data, as topic modeling tends to emphasize the isolation of key themes in a single text. Because I was treating each poem as an individual work, I needed a technique that could help me establish links among the poems based on the recurrence of similar words/topics. Underwood’s technique, an alternative to topic modeling that he calls Topic Tree, does exactly that. He applied it to a huge collection of 18th-century documents and produced this dendrogram tree. He also has a post explaining his technique in detail. Underwood uses a vector space model to compare words among corpora, but instead of employing the tf-idf scores normally used by search engines, he has developed his own formula, which he explains in this Tech Note. In the same Tech Note, he has released his R code (as well as a very handy script to divide large trees into manageable sections).
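To give a rough idea of what “vector space model plus tree” means in practice, here is a minimal sketch in Python. It is not Underwood’s method and not his R code: it swaps in raw per-poem word counts and plain cosine similarity where he uses his own formula, and it builds the tree with simple average-linkage clustering. The function names and the four toy “poems” are invented for illustration.

```python
import math
import re
from itertools import combinations

def word_vectors(poems, vocabulary):
    """One vector per word: how many times it occurs in each poem."""
    tokenized = [re.findall(r"[^\W\d_]+", p.lower()) for p in poems]
    return {w: [toks.count(w) for toks in tokenized] for w in vocabulary}

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    # Each cluster is a pair: (tree built so far, flat list of member words).
    clusters = [(w, [w]) for w in sorted(vectors)]

    def similarity(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters whose words are, on average, most similar.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Invented toy strings: two "poems" about death, two about birds.
poems = ["tumba sombra tumba", "sombra tumba", "ala nido", "nido ala ala"]
vectors = word_vectors(poems, ["tumba", "sombra", "ala", "nido"])
tree = cluster(vectors)
print(tree)
```

Words that tend to occur in the same poems end up on the same branch, which is the intuition behind reading each branch of a tree as a topic. With real data one would of course use a proper clustering library and a weighting scheme such as the one described in Underwood’s Tech Note.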
Let me now show the results I got using the Topic Tree technique. I fed my script 250 common words, and that produced four main branches. As expected, the branches of my tree reflect Nájera’s main topics.
In branch one, the topics are poetry, faith and childhood:
In branch two, all the words are related to the experience of death:
In the third branch the focus is on natural elements, with strong emphasis on flying animals and love:
The fourth and last branch is also the most interesting one. In the top part the main topic is poetry and its representation of beauty/nature; towards the middle the dominant topic is sexual attraction; and the lower part of the branch shows the poetic representation of women in modernista poetry (notice the words that cluster around “white”).
The experiment, in my opinion, was very successful and, of course, I am now curious about what a topic tree of modernista poetry as a whole would look like.