
Modernista Vocabulary

Anyone who has studied modernista poetry has an image of what modernista language looks like. Certain words are automatically associated with the movement: “cisnes,” “princesas,” “estatuas,” “céfiro,” and so on. But how many of these so-called “modernista” words are also among the most frequent words (MFWs) employed by Darío and his followers? For example, I do not know why I always think of “nácar” as one of the quintessential modernista words. I was surprised to discover that Darío does not use it in any of the poems in my database (I do not know whether he uses it in his prose).

The following is a list of Darío’s top MFWs with their frequency scores (excluding articles, prepositions and common verbs):

oh           0.2890521
vida         0.2870448
luz          0.2609499
oro          0.2288330
amor         0.2268256
alma         0.1987234
sol          0.1906941
azul         0.1666064
esperanza    0.1605845
canto        0.1565699
tierra       0.1505480
rosa         0.1485407
día          0.1465334
dios         0.1445261
gloria       0.1445261
rosas        0.1445261
sangre       0.1425188
ojos         0.1364968

And then there are the two words one usually associates with Darío’s poetry, which appear much less frequently in his texts:

cisne        0.05821189 
princesa     0.04616805

The gap between these two classical modernista words and the MFWs in Darío’s poetry is clearly shown in the following image.

[Image: vocabulario-dario]
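
As a rough illustration of how a frequency list like the ones above could be produced, here is a minimal Python sketch. It is not the script used for this post: the stop list, the tokenizer and the choice to express each score as a percentage of all tokens are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop list (assumption); the post only says that articles,
# prepositions and common verbs were excluded.
STOPWORDS = {"el", "la", "los", "las", "un", "una", "de", "del",
             "en", "a", "con", "por", "y", "que", "es", "son"}

def relative_frequencies(text, stopwords=STOPWORDS):
    """Return {word: percentage of all tokens} for every non-stop word."""
    # [^\W\d_]+ matches runs of letters, including accented ones such as "día".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t not in stopwords)
    return {word: 100.0 * n / len(tokens) for word, n in counts.items()}

def most_frequent(text, n=20):
    """Return the n highest-scoring (word, score) pairs, ties broken alphabetically."""
    freqs = relative_frequencies(text)
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Toy input, not a line from Darío.
sample = "El cisne azul y la luz azul"
print([word for word, _ in most_frequent(sample, 2)])  # ['azul', 'cisne']
```

Whether stop words are removed before or after the total is counted changes the scores, so the denominator used here (all tokens) is one possible choice among several.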

I know it is not right for me to take Darío’s vocabulary as representative of modernista poetry in general, but I am not sure that adding corpora from other writers and compiling a list of their shared MFWs would solve the problem. And is this “gap” simply a mismatch between perception and reality?

I suppose that, in order to contrast the public perception of modernista vocabulary with the actual MFWs in the texts, one needs to find a way to compile a list of modernista words as understood by readers and critics.
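
Once such a list exists, the contrast itself is simple to compute. The sketch below is only an illustration: the scores are copied from the lists in this post, while the `PERCEIVED` list and the `contrast` helper are hypothetical. The rank it reports is the position among the words supplied, not within the whole corpus.

```python
# Frequency scores copied from the lists in this post (a subset).
MEASURED = {
    "oh": 0.2890521, "vida": 0.2870448, "luz": 0.2609499, "oro": 0.2288330,
    "amor": 0.2268256, "alma": 0.1987234, "sol": 0.1906941, "azul": 0.1666064,
    "ojos": 0.1364968, "cisne": 0.05821189, "princesa": 0.04616805,
}

# Hypothetical "perceived" list (assumption); a real one would have to be
# compiled from readers and critics.
PERCEIVED = ["cisne", "princesa", "azul", "nácar"]

def contrast(perceived, measured):
    """Return (word, score, rank) per perceived word; None where the word is unattested."""
    ranking = sorted(measured, key=measured.get, reverse=True)
    rows = []
    for word in perceived:
        if word in measured:
            rows.append((word, measured[word], ranking.index(word) + 1))
        else:
            rows.append((word, None, None))
    return rows

for word, score, rank in contrast(PERCEIVED, MEASURED):
    print(word, score, rank)
```

Words that readers expect but that never occur in the poems, such as “nácar” here, come back with no score at all, which is itself part of the gap.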


Topic Tree

Many of the techniques I have been applying to my corpora were specifically designed for use with “Big Data,” which I do not have, unless one considers eighty-something poems “Big” data. I believe that it is possible to apply quantitative analysis to smaller groups of texts and obtain meaningful results. Ted Underwood would probably disagree with me on this, but I think even he would be surprised at the results I am getting from applying his Topic Tree method to Nájera’s poetry.

Topic modeling is very popular at the moment, but when I started working with modernista poetry I had my doubts about that approach. I did not know if it was the right one for my data, as topic modeling tends to emphasize the isolation of key themes in a single text. Because I was treating each poem as an individual work, I needed a technique that could help me establish links among the poems based on the recurrence of similar words/topics. Underwood’s technique (an alternative to topic modeling, which he calls Topic Tree) does exactly that. He applied it to a huge collection of 18th-century documents and produced this dendrogram. He also has a post explaining his technique in detail. Underwood uses a vector space model to compare words among corpora, but instead of employing the tf-idf scores normally used by search engines, he has developed his own formula, which he explains in this Tech Note. In the same Tech Note, he has released his R code (as well as a very handy script to divide large trees into manageable sections).
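
To make the general idea concrete, here is a minimal Python sketch of clustering words by the poems they occur in. It is not Underwood’s method or his R code: it swaps in plain cosine distance on raw counts where he uses his own formula, it uses average linkage as an assumed merging rule, and it stops at a chosen number of branches instead of drawing a full dendrogram. The poems and vocabulary are toy data.

```python
import math
import re
from itertools import combinations

def word_vectors(poems, vocabulary):
    """Map each word to its vector of raw counts, one dimension per poem."""
    tokenised = [re.findall(r"[^\W\d_]+", p.lower()) for p in poems]
    return {w: [toks.count(w) for toks in tokenised] for w in vocabulary}

def cosine_distance(u, v):
    """1 - cosine similarity; an all-zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def cluster_words(vectors, n_branches):
    """Average-linkage agglomerative clustering, stopped at n_branches clusters."""
    clusters = [[w] for w in sorted(vectors)]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_branches:
        # Merge the closest pair of clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]

# Toy poems (not Nájera's) and a four-word vocabulary.
poems = ["la muerte y la tumba", "tumba fría, muerte negra",
         "el ala del ave", "ave de ala blanca"]
vectors = word_vectors(poems, ["muerte", "tumba", "ala", "ave"])
print(cluster_words(vectors, 2))  # [['ala', 'ave'], ['muerte', 'tumba']]
```

Words that keep appearing in the same poems end up on the same branch, which is what lets the tree link poems through shared vocabulary.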

Let me now show the results I got using the Topic Tree technique. I fed my script 250 common words, and that produced four main branches. As expected, the branches of my tree reflect Nájera’s main topics.

In branch one, the topics are poetry, faith and childhood:

[Image: rama1]

In branch two, all the words are related to the experience of death:

[Image: rama2]

In the third branch, the focus is on natural elements, with strong emphasis on flying animals and love:

[Image: rama3]

The fourth and last branch is also the most interesting one. In the top part, the main topic is poetry and its representation of beauty/nature; towards the middle, the dominant topic is sexual attraction; and the lower part of the branch shows the poetic representation of women in modernista poetry (notice the words that cluster around “white”).

[Image: rama4]

The experiment, in my opinion, was very successful and, of course, I am now curious about what a topic tree of modernista poetry as a whole would look like.