Distant Reading and Early Twentieth Century Latin American Literature

Slowly but surely, I am working on my Fall 2016 course on Digital Humanities. I am still building a website for that course but, for now, you can see my syllabus here. I have benefited a great deal from Andrew Goldman’s and Ted Underwood’s syllabi for similar courses in English.


The Inevitable José Martí Bookworm

I was adding José Martí’s texts to my modernista database when it occurred to me that it would be interesting to create a bookworm based on his writings. Bookworm, an online tool created by Benjamin M. Schmidt, is rapidly becoming a practical (and fun) way to allow internet users to interact with a project’s data. The data for my bookworm comes from the Edición Crítica de las Obras Completas de José Martí, which is freely available from the website of the Centro de Estudios Martianos (CEM). For the Martí bookworm, I used a total of 984 texts. I chose to include only documents written between 1875 and 1885. CEM’s project of compiling Martí’s complete works is still (no pun intended) incomplete, and they have edited very little material after 1885. In addition to his creative work (poetry, drama, and fiction), I have of course included his crónicas and other newspaper articles, as well as many of his private letters.

Martí and modernismo.

Needless to say, my main interest in Martí lies in his relationship to the modernista style. My expectation, given his commitment to the Cuban independence movement, was that I would find in his writings a specific “political” vocabulary not shared by other modernista writers. That assumption proved correct, but I was still surprised by how strong the presence of that vocabulary was. Words such as “america,” “patria,” “guerra,” “vida,” “muerte” and, of course, “españa,” “cuba” and “isla” have a consistently high frequency.


Is it important that Martí’s use of the word “america” increases after 1881? Probably. Especially when one considers that “america latina,” “hispanoamerica” and “norteamerica” begin to appear more often in his writings from the same period.


But I am not really interested in questions about the transformation of Martí’s discourse after he moves to New York, if there is any. For my project, it is more significant that words that appear frequently in Darío’s poetry (azul, flores, rosa, for example) are not as relevant for Martí. The dramatic contrast can be appreciated when they are placed side by side, as in the following image.


Here one could argue that it is unfair to compare Darío’s poetic style with the entirety of Martí’s texts, including non-literary documents. I agree, but I also believe that this unfair contrast generates valid questions: Would the most frequent words (MFWs) in Martí’s poetic production show signs of being closer to the modernista vocabulary than the MFWs in Martí’s complete works? Should one study the MFWs of a genre, instead of the MFWs of a specific time period? Would an analysis of Darío’s complete works reveal a similar situation? (And if it doesn’t, what does that mean?)

In case you missed it, here is the link to the Martí bookworm.

Bookworm and Spanish Texts (Tech Note).

Bookworms with complex interfaces require installing the software on a machine that fulfills all the requirements (MySQL; Python 2.7 or 3 with the modules nltk, numpy, regex and pandas; GNU parallel; and web server software such as Apache). As this is a simple one, I opted to use Culturomics’ bookworm creator. Even if one chooses to go the Culturomics route, I would still recommend installing a copy of the software on a personal computer, as it is the best way to test whether a bookworm is working properly.

The main issue I had using Bookworm was its inability to accept Spanish accented characters. The Bookworm site says that their “system does pretty decent job of encoding ugly characters, but after too many of them it starts to get upset and may cause your Bookworm to fail when building.” Well, it turns out that accented letters and other characters that are part of the Spanish language are treated as “ugly characters” by Bookworm. The fact that all the texts used for a bookworm are UTF-8 encoded makes the situation even more mysterious. Though I could think of a couple of ways of circumventing the problem, in order to use the Culturomics site I needed to strip all the accented characters from my texts. In other words, it is useless to use accents when searching the José Martí bookworm.
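For readers who need the same workaround, here is a minimal sketch of one way to strip accents before uploading texts. It is written in Python with the standard `unicodedata` module; the function name `strip_accents` is hypothetical, and this is not necessarily the procedure used for the Martí bookworm.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining diacritical marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("José Martí"))
print(strip_accents("españa"))
```

Note that this approach also turns “ñ” into “n,” so distinct words can collapse into the same form; that loss is the price of the workaround.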

Stop Anthropomorphizing Literary Periods, or Why the Most Frequent Words Don’t Matter.

Looking for a method to trace the evolution of Gutiérrez Nájera’s poetry from one period to another, I came across an article published by David Hoover earlier this year. In the essay, Hoover contrasts the word frequencies in three of Henry James’s novels, each one written at a different stage in the writer’s career, to analyze their stylistic changes. The article, “A Conversation Among Himselves: Change and the Styles of Henry James” (Chapter 5, in Hoover, Culpeper and O’Halloran. APPROACHES TO CORPUS STYLISTICS, Routledge, 2014), employs an interesting system for comparing the frequencies of the three periods. Hoover assigns a pattern to each word depending on whether its frequency is different in the three periods, the same in two of the three periods, or the same in all periods. For example, a pattern like LMH (“L” = low, “M” = medium, “H” = high) indicates that a word increased in frequency from the first period to the last. The number of possible patterns is thirteen:

HML, HLM, MLH, MHL, LHM, LMH if the frequencies are different; HLL, LLH, MLL, LLM, HMM, MHH if the word has the same frequency in two of the periods; and LLL (or HHH) if the frequency is the same for all three periods.

Loosely employing the periodization presented in my previous post, I divided Gutiérrez Nájera’s poetry into three major periods: 1875 to 1879, 1880 to 1887, and 1888 to 1895. I combined all the poems from each period to create a “text” and then, following Hoover’s example, I reduced two of the samples to the size of the smallest one by simply eliminating the part of the “text” exceeding that size. (It occurs to me now that selecting a random sample of the text is possibly a better approach.) The “texts” are then tokenized and, according to their frequency in each period, the tokens are classified into one of the thirteen possible patterns. Hoover is, of course, known for his use of Excel spreadsheets to perform his text analyses, but the idea behind this technique is simple enough that a few lines of code in R can easily assign a pattern to each token.
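The classification step can be sketched in a few lines. The version below is in Python rather than R and is not the script used for this post. It assumes a simple rank-based labelling: three distinct counts get L, M and H; when two counts tie, only L and H are used; identical counts give LLL. That also yields thirteen patterns, but the labels for ties differ from the list above, so treat it as an approximation of Hoover’s scheme. The names `tokenize`, `pattern` and `classify` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (accented ones included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def pattern(counts):
    """Label three counts by rank; tied counts share a letter (assumption)."""
    levels = sorted(set(counts))
    if len(levels) == 1:
        return "LLL"
    letters = "LH" if len(levels) == 2 else "LMH"
    return "".join(letters[levels.index(c)] for c in counts)


def classify(early, middle, late):
    """Truncate the three texts to the shortest one and label every token."""
    samples = [tokenize(t) for t in (early, middle, late)]
    size = min(len(s) for s in samples)
    # Cut off whatever exceeds the size of the smallest sample.
    counters = [Counter(s[:size]) for s in samples]
    vocabulary = set().union(*counters)
    return {w: pattern(tuple(c[w] for c in counters)) for w in vocabulary}


print(pattern((0, 0, 5)))
print(classify("rosa rosa azul", "rosa azul azul ninfas", "ninfas ninfas ninfas"))
```

Replacing the truncation `s[:size]` with `random.sample(s, size)` would implement the random-sample alternative mentioned above.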

Here is an RStudio image of part of the matrix resulting from applying this classification technique.


Employing these patterns to study the changes in a writer’s style works quite well, producing interesting insights, as Hoover himself shows in his article. In Henry James’s case, Hoover is not only interested in words that “show substantial change” across the periods, but also in the MFWs within each pattern. (He develops an interesting alternative for determining frequencies, which I will not address here, but it is thoroughly explained in his article.) The uniqueness of Gutiérrez Nájera’s poetic corpus, however, led me in a different direction.

The dates of Gutiérrez Nájera’s “late” stylistic period, 1888-1895, coincide with the beginning of the modernista movement. This of course means that the changes in his style towards the end of his life are not only “personal” changes; they could also be signs of the advent of a new literary period in Spanish American letters. Patterns such as LLH, LMH and even MLH, which identify words with higher frequencies in the late period, are also possibly pointing to modernista words that became influential in the late 19th century.

I suppose that until now I have been guilty of anthropomorphizing literary periods. I have assumed that strategies for analyzing a writer’s style can be used to understand the “style” of a literary period. Assembling MFW lists of modernista words has so far led me to frustrating results, and perhaps I should be focusing on words that either emerge or increase in frequency in a literary period in relation to the previous literary period. Ideally, for an analysis of this kind, I would need really “big data,” which I do not have at the moment.

Unlike Hoover, I am not interested in the top words belonging to a specific pattern. Any word that appears overrepresented in the late period in relation to the other periods is interesting because it might indicate the emergence of a new language. Thus, a modernista word like “ninfas,” which follows the pattern LLH, would not appear in the MFW list because its frequency in the text is not high enough. But if we consider that “ninfas” went from not appearing in the first two periods to appearing five times in the third one (0-0-5), one must acknowledge this change as a significant one (especially in poetry). In contrast, a token that goes from 11 and 11 to 13 is less relevant for determining the style of a period, but it would probably appear among the MFWs because of its high frequency.
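The contrast between the two cases can be made concrete with a toy calculation. This Python sketch ranks the two examples first by raw frequency in the late period and then by the share of each word’s occurrences that falls in the late period. The “late share” measure and the placeholder name `token_x` are assumptions introduced only for illustration; they are not part of Hoover’s method.

```python
def late_share(counts):
    """Fraction of a word's occurrences that fall in the last period."""
    early, middle, late = counts
    total = early + middle + late
    return late / total if total else 0.0


# Counts per period (early, middle, late) taken from the two cases above;
# "token_x" is a placeholder for the unnamed 11-11-13 token.
counts = {"ninfas": (0, 0, 5), "token_x": (11, 11, 13)}

by_frequency = sorted(counts, key=lambda w: counts[w][2], reverse=True)
by_late_share = sorted(counts, key=lambda w: late_share(counts[w]), reverse=True)

print(by_frequency)   # ranking by raw count in the late period
print(by_late_share)  # ranking by concentration in the late period
```

An MFW list behaves like the first ranking and puts the 11-11-13 token on top, while the second ranking brings “ninfas” forward.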

To notice the difference between the two methods for obtaining the most significant words in a literary period (MFWs vs. pattern analysis), let’s take a look at the top 150 MFWs for Gutiérrez Nájera’s 1888-1895 period:

 [1] "la"        "de"        "y"         "el"        "en"       
  [6] "que"       "a"         "las"       "los"       "no"       
 [11] "se"        "con"       "qué"       "es"        "mi"       
 [16] "al"        "del"       "su"        "por"       "tu"       
 [21] "ya"        "como"      "me"        "un"        "para"     
 [26] "lo"        "te"        "si"        "sus"       "muy"      
 [31] "mis"       "tus"       "todo"      "pero"      "alma"     
 [36] "amor"      "más"       "ni"        "una"       "oh"       
 [41] "yo"        "cuando"    "vida"      "dios"      "son"      
 [46] "tan"       "tú"        "flores"    "está"      "sin"      
 [51] "le"        "mar"       "noche"     "luz"       "entre"    
 [56] "esa"       "sombra"    "blanca"    "ha"        "porque"   
 [61] "hay"       "o"         "ojos"      "triste"    "mañana"   
 [66] "nos"       "ser"       "así"       "casa"      "cielo"    
 [71] "quién"     "rosas"     "va"        "cual"      "alas"     
 [76] "hasta"     "poeta"     "brazos"    "siempre"   "también"  
 [81] "versos"    "azul"      "cómo"      "ella"      "fin"      
 [86] "fué"       "labios"    "madre"     "amores"    "blancas"  
 [91] "mas"       "sólo"      "amante"    "bien"      "dos"      
 [96] "era"       "hermosa"   "primavera" "sé"        "sobre"    
[101] "sueño"     "tal"       "tiene"     "tristes"   "día"      
[106] "dolor"     "ese"       "espera"    "esperanza" "muerte"   
[111] "nada"      "pues"      "quien"     "rosa"      "señor"    
[116] "aquí"      "ay"        "blanco"    "bueno"     "mientras" 
[121] "musa"      "nadie"     "nunca"     "ondas"     "parece"   
[126] "queda"     "ti"        "tierra"    "todas"     "todos"    
[131] "vez"       "viene"     "aire"      "ama"       "beso"     
[136] "buena"     "coro"      "él"        "eres"      "hoy"      
[141] "luego"     "poco"      "voz"       "acaso"     "almas"    
[146] "altar"     "belleza"   "busca"     "busco"     "cuán" 

The following are some of the words with the patterns LLH, LMH, MLH for the same period, listed alphabetically, without taking into consideration the frequencies.


  [1] "acero"      "acude"      "acuerdo"    "afrodita"   "algunos"   
  [6] "alto"       "amada"      "ambiente"   "ancha"      "apaga"     
 [11] "apagados"   "aparece"    "arena"      "arte"       "azahares"  
 [16] "bajar"      "bonito"     "bosque"     "bote"       "botones"   
 [21] "brillan"    "brillante"  "brillantes" "buenos"     "calla"     
 [26] "callada"    "calles"     "cauda"      "cerca"      "cisnes"    
 [31] "copa"       "copas"      "correr"     "cristo"     "cuanto"    
 [36] "daré"       "déjame"     "dejan"      "dejemos"    "dí"        
 [41] "día"        "dichoso"    "digno"      "dió"        "dioses"    
 [46] "dura"       "edad"       "encaje"     "encanto"    "enciende"  
 [51] "entreabre"  "envuelto"   "escalera"   "esposo"     "estatua"   
 [56] "fronda"     "fue"        "fuerza"     "gardenia"   "gracia"    
 [61] "grecia"     "griega"     "guerrero"   "haber"      "hadas"     
 [66] "heladas"    "hizo"       "hombros"    "id"         "ideas"     
 [71] "iras"       "licor"      "lirios"     "mala"       "mayor"     
 [76] "mire"       "modo"       "muñeca"     "naranjos"   "naturaleza"
 [81] "ninfas"     "nota"       "nuestra"    "obscuras"   "olvides"   
 [86] "pasar"      "perezoso"   "peso"       "pide"       "piensa"    
 [91] "pierde"     "plantas"    "plumaje"    "prometida"  "puerto"    
 [96] "puñal"      "querido"    "quita"      "raudos"     "regatas"   
[101] "riqueza"    "roban"      "roca"       "rocas"      "saben"     
[106] "secas"      "señores"    "senos"      "sentí"      "sentir"    
[111] "sigue"      "subir"      "tener"      "tengo"      "tiembla"   
[116] "tocar"      "toda"       "toma"       "última"     "venid"     
[121] "verde"      "vestidos"   "ví"         "viendo"     "vivo"      
[126] "volcán"     "vuelva"


  [1] "abandona"    "abrir"       "acaso"       "ah"          "alameda"
  [6] "alas"        "álbum"       "alguno"      "alondra"     "altar"
 [11] "amado"       "amantes"     "amiga"       "amigos"      "apacible"
 [16] "aprisa"      "arco"        "áureo"       "baja"        "barca"
 [21] "barranco"    "blanco"      "bocas"       "breves"      "buena"
 [26] "buenas"      "bueno"       "busca"       "busco"       "cae"
 [31] "caja"        "calle"       "campo"       "cantan"      "cantando"
 [36] "cariños"     "casa"        "cautiva"     "cirios"      "ciudad"
 [41] "claridad"    "compasión"   "conchas"     "corales"     "coro"
 [46] "cosas"       "cristal"     "cuánta"      "cuántas"     "cuánto"
 [51] "cuántos"     "cuentos"     "da"          "débil"       "decir"
 [56] "descansa"    "desnuda"     "desnudo"     "dicen"       "dije"
 [61] "dijo"        "dónde"       "é"           "en"          "enlutada"
 [66] "entré"       "esas"        "escuela"     "espera"      "esta"
 [71] "están"       "estás"       "estremece"   "fiesta"      "fin"
 [76] "follaje"     "frescas"     "gallardo"    "gran"        "guantes"
 [81] "guardó"      "hablan"      "hablar"      "hace"        "hada"
 [86] "hago"        "hay"         "hermana"     "hermosa"     "hondo"
 [91] "húmedas"     "huyeron"     "i"           "impacientes" "inmensa"
 [96] "iracundo"    "juega"       "jugando"     "juguetona"   "laurel"
[101] "ligera"      "lo"          "luego"       "malo"        "mamá"       
[106] "mañana"      "mar"         "marfil"      "margarita"   "mariposas"  
[111] "metal"       "mías"        "misa"        "muda"        "mudas"      
[116] "mueren"      "musa"        "muy"         "nadie"       "negras"     
[121] "ni"          "nieve"       "niños"       "no"          "noches"     
[126] "novia"       "nuestro"     "nuevas"      "oh"          "oís"        
[131] "otra"        "padres"      "paje"        "papá"        "para"       
[136] "pasa"        "pedestal"    "pensando"    "pero"        "piernas"    
[141] "plata"       "plumas"      "pobres"      "poco"        "poesía"     
[146] "poeta"       "primavera"   "primero"     "príncipe"    "pues"       
[151] "qué"         "queda"       "quedan"      "quieren"     "quiso"      
[156] "recuerdos"   "risa"        "risas"       "rojas"       "rojos"      
[161] "rosa"        "rosas"       "rubias"      "rumor"       "sabe"       
[166] "salid"       "sangre"      "sí"          "silencio"    "sino"       
[171] "solas"       "sollozando"  "solo"        "sombra"      "son"        
[176] "soñadora"    "sonriendo"   "suelto"      "tenue"       "tienden"    
[181] "tienen"      "tímida"      "todo"        "trémulas"    "trenzas"    
[186] "tristezas"   "túnica"      "tuve"        "unos"        "vamos"      
[191] "vela"        "versos"      "viaje"       "visto"       "vivos"      
[196] "volar"       "vuelan"      "ya" 


  [1] "a"           "allá"        "alzarse"     "ama"         "amante"
  [6] "amorosa"     "años"        "ay"          "azul"        "belleza"
 [11] "beso"        "besos"       "blancas"     "brazos"      "brotó"
 [16] "brumas"      "caer"        "canción"     "cayó"        "cierto"
 [21] "conoce"      "cosa"        "creo"        "cuadro"      "cuán"
 [26] "dan"         "dar"         "dicho"       "dios"        "donde"
 [31] "es"          "esa"         "ese"         "esposa"      "espumas"
 [36] "eternamente" "existe"      "fresca"      "fué"         "fueron"
 [41] "gentil"      "gigantes"    "gracias"     "grana"       "grito"
 [46] "ha"          "hacer"       "haré"        "haz"         "herido"
 [51] "hermosas"    "hermoso"     "hermosura"   "hijas"       "hijo"
 [56] "huerto"      "huyen"       "infeliz"     "infinita"    "instante"
 [61] "joven"       "las"         "le"          "leve"        "llama"
 [66] "llena"       "manto"       "meses"       "mío"         "muchas"
 [71] "muerto"      "muñecas"     "muros"       "océano"      "olor"
 [76] "ondas"       "orillas"     "otras"       "pálida"      "palidez"
 [81] "palomas"     "parecen"     "patria"      "pecado"      "pechos"
 [86] "piadosa"     "pido"        "pluma"       "porque"      "prosa"
 [91] "quién"       "quien"       "quieras"     "rabia"       "razón"
 [96] "responde"    "retozan"     "risueño"     "rompe"       "salud"

A quick look at the modernista vocabulary present in the MFWs reveals that many of the typical modernista words appear in it (azul, flores, rosas, primavera), as well as in the groups selected by patterns. In the MFWs, as is expected of any list for stylistic analysis, too many of the top words are not useful for detecting a modernista vocabulary (la, de, y, el, en). In contrast, employing the patterns one finds interesting words that do not appear among the top MFWs: grecia, griega, pálidas, venus, marfil, and many others.

Periodizing Modernista Poetry

I. Intro

Gutiérrez Nájera never published his poems in the form of a book. They appeared in the numerous newspapers for which he worked and, after he passed away (1895), his friends collected them in a single volume, along with a preface written by Justo Sierra. The book (you can find a copy of it in archive.org) included 158 poems, to which modernista scholars such as Mapes, Boyd C Carter, González Guerrero and others have continued to add new texts throughout the years. At the moment, according to Angel Muñoz Fernández, there are 235 poems attributed to the Mexican poet (13). It is usually assumed that Nájera, as a poet, had a “youthful” and a “mature” artistic period, but there is no clear consensus about when one period ends and the other begins. The closest thing we have to a periodization of his poetry is the grouping of the poems introduced by González Guerrero in his 1953 edition of Poesías completas. Even though he divides Nájera’s poetic work into several chronological periods, González Guerrero also groups the poems according to themes and poetic forms. The critic’s seemingly chaotic periodization goes as follows: Under the general heading of “Primeras Poesías,” he adds two subdivisions, “La fe de mi infancia” (1875-1881) and “Trovas de amor” (1875-1880). The rest of the poems are placed in the following sections: “Otros poemas juveniles” (1877-1881), “Caminos del viento” (1880-1883), “Ala y abismo” (1884-1887), “Elegías” (1887-1890), “Nuevas canciones” (1888-1895), “Odas breves” (no dates given), “Poesías varias” (1876-1891), “Versiones” (1880-1884). The last group contains Nájera’s translations of French poems, some of which, at one point, were mistaken for original creations. One could argue that González Guerrero divides Nájera’s poetic trajectory into a youthful period that goes from 1875 to 1881, a transitional period from 1880 to 1883, a middle period from 1884 to 1887, and a mature period that goes from 1888 to 1895.

My objective was to apply a stylometric analysis to Nájera’s poetry with the purpose of creating a new periodization. In the next two sections of this post, I will summarize the problems I had with preparing the data and with some of the technical aspects of the analytical process. If you prefer, you can jump to the last section of the post, in which I contrast my results with González Guerrero’s and propose a new periodization of Nájera’s poetic work.

II. The 1896 edition and its afterlife

Although a total of 235 poems are recognized as forming Nájera’s poetic corpus, that number also includes poems translated from French literature and at least one poem written entirely in French. I excluded those from my analysis, bringing down the total to 220 poems. The biggest problem in classifying the poems, however, had to do with the dates of composition and/or publication. The 1896 posthumous edition was supposed to be organized chronologically, but many of the texts do not follow that order, and many others have no date assigned to them. None of the scholars in charge of the later editions of Nájera’s poems fixed the problem; they often simply reproduced the composition/publication dates found in the 1896 edition. Angel Muñoz Fernández’s comments in his preface to the 2000 edition of Nájera’s poetry (which contains a facsimile of the 1896 edition, of course) describe the complexity of the problem: “Revisando algunos diarios de la época, encontré que el célebre ‘Francia y México’, con fecha 1882 en la edición de 1896, fue publicado en El Nacional el 5 de mayo de 1881, apareciendo junto al título la fecha 1879, que pudiera corresponder al año en que el poema fue escrito” (17).

I was unable to determine the date of a total of 34 poems, bringing down the number of poems I could use for my analysis to 186.

III. Length, etc

The technical side of the project created additional problems. Initially, I envisioned grouping Nájera’s poems by year, and treating each year as if it were a single text. I would then tokenize the poems and get the word counts and frequencies in relation to that year alone. However, Nájera had a very uneven poetic production and some periods were more productive than others. Some years he wrote so few poems that it became impossible to get an accurate author signal because there were not enough tokens per year of production. In his paper, “Does Size Matter? Authorship Attribution, Small Samples, Big Problem,” Maciej Eder argues that the current methods for doing stylometric analysis do not allow the study of very short texts: “using 2,000-word samples will hardly provide a reliable result, to say nothing of shorter texts.” The number of words needed to get an accurate authorship signal in a text varies. With regard to poetry, Eder explains that in his experiment “the results for the three poetic corpora (Greek, Latin, English) proved ambiguous, suggesting that some 3,000 words or so would be usually enough, but significant misclassification would also occur occasionally.” To analyze Nájera’s poetic corpus, I combined the texts from adjacent years in order to create two-year periods with around 4000 words. Only a few of the years surpassed the 4000-token mark, and I left those by themselves. I was forced to create a multi-year period for the last years of Nájera’s life because of his extremely low production during that time.

DATE                         SIZE OF THE “TEXT”

1875-1876 ------------------ 5461
1877-1878 ------------------ 8594
1879 ----------------------- 8611
1880 ----------------------- 5896
1881 ----------------------- 4352
1882-1883 ------------------ 3995
1884-1885 ------------------ 6073
1886-1887 ------------------ 8648
1888-1889 ------------------ 7335
1890-1895 ------------------ 9541
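The grouping in this post was done by hand, but the underlying idea (merge adjacent years until a sample reaches roughly 4000 tokens) can be sketched as a simple greedy rule. In the Python sketch below, the function `group_years`, the threshold handling and the per-year token counts are all hypothetical; the counts are invented for illustration and are not Nájera’s actual figures.

```python
def group_years(tokens_per_year, minimum=4000):
    """Greedily merge consecutive years until each group has enough tokens."""
    groups = []
    years, size = [], 0
    for year in sorted(tokens_per_year):
        years.append(year)
        size += tokens_per_year[year]
        if size >= minimum:
            groups.append((years, size))
            years, size = [], 0
    if years:
        # Leftover years at the end are too small on their own,
        # so they are folded into the previous group when there is one.
        if groups:
            last_years, last_size = groups.pop()
            groups.append((last_years + years, last_size + size))
        else:
            groups.append((years, size))
    return groups


# Invented token counts, for illustration only.
example = {1888: 5200, 1889: 2100, 1890: 2300, 1891: 900, 1892: 700}
for years, size in group_years(example):
    print(years, size)
```

A rule like this would not necessarily reproduce the table above, since a human can also weigh where the dates of the poems are uncertain.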

After combining the years to obtain a higher token number, I compared the style of each time period, employing Burrows’s Delta with z-scores as my classification method. The following images show the results, employing 150 of the most frequent words


and with 300 MFWs


I did not eliminate any pronouns or overrepresented words. I have yet to apply other methods (such as SVM and PCA) to this data.
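For readers unfamiliar with the measure, here is a minimal Python sketch of Burrows’s Delta as it is usually described: take the most frequent words of the whole corpus, convert each text’s relative frequencies to z-scores, and average the absolute z-score differences between two texts. It is not the software used for the figures above. The function name and input format are assumptions, and it uses the population standard deviation where some implementations use the sample one.

```python
from collections import Counter
from statistics import mean, pstdev


def burrows_delta(texts, n_mfw=150):
    """texts maps a label to a non-empty list of tokens; returns pairwise Deltas."""
    counts = {name: Counter(tokens) for name, tokens in texts.items()}
    sizes = {name: len(tokens) for name, tokens in texts.items()}

    # Most frequent words across the whole corpus; nothing is filtered out.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    mfw = [word for word, _ in total.most_common(n_mfw)]

    # z-score of each word's relative frequency across the texts.
    z = {name: [] for name in texts}
    for word in mfw:
        column = {name: counts[name][word] / sizes[name] for name in texts}
        mu, sigma = mean(column.values()), pstdev(column.values())
        if sigma == 0:
            continue  # a word used at the same rate everywhere tells us nothing
        for name in texts:
            z[name].append((column[name] - mu) / sigma)

    names = list(texts)
    deltas = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diffs = [abs(x - y) for x, y in zip(z[a], z[b])]
            deltas[(a, b)] = mean(diffs) if diffs else 0.0
    return deltas


sample = {
    "a": "la rosa la rosa azul".split(),
    "b": "la rosa la rosa azul".split(),
    "c": "la la la la muerte".split(),
}
print(burrows_delta(sample, n_mfw=4))
```

Lower values mean more similar samples; the resulting pairwise distances are what a clustering or dendrogram step would then take as input.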

IV. Periodization.

In spite of all the problems related to dating the poems and the length of the samples, the stylometric analysis I performed makes it possible to propose a new periodization of Nájera’s poetic work (however provisional it might be). Looking at the following visualization of the classification resulting from the Burrows’s Delta method, the first thing one notices is how a cluster formed with the poetry from 1875 to around 1878/1879 (1879 often appears completely disconnected from the periods coming before and after). In González Guerrero’s view, Nájera’s youthful period lasts until 1881, but in the stylometric analysis the years from 1880 to around 1887 show strong similarities among themselves and are almost always grouped together.

I was obviously concerned about having influenced the periodization by creating two-year periods to obtain a higher number of tokens. Addressing this problem was especially important for determining when the transition from the middle to the mature period took place. González Guerrero employed 1888 as the year marking the beginning of Nájera’s last poetic period. When I tried combining 88-89 and 90-95, these two groups tended to move closer to each other than to the other 1880s groups. I then left 1888 by itself (there were enough tokens in that year to do that, over 5000) and created two more groups, 89-90 and 91-95. On this occasion 1888 moved toward 89-90, but not as close to 91-95 as I expected. The higher the number of MFWs used, the more 91-95 distanced itself from the late 1880s. In other words, Nájera’s style definitely underwent a change towards 1888 (possibly marking the beginning of a transitional period that goes until 1890?), but it is not clear that the last period of his poetry began as early as 1888.


To Do:

  • use of other classification methods such as SVM or PCA
  • analysis of the change of vocabulary from the 1870s to the 1890s (topic modeling needed?)
  • adding prose documents to the corpora; establishing the publication date of those appears to be easier (should I assume that the difference between Nájera’s poetic style and his prose style is not significant?)

Modernista Vocabulary

Anyone who has studied modernista poetry has an image of what modernista language looks like. There are certain words that one automatically associates with this movement: “cisnes,” “princesas,” “estatuas,” “céfiro” and so on. But how many of the so-called “modernista” words are also the most frequent words (MFWs) employed by Darío and his followers? For example, I do not know why I always think of “nácar” as one of the quintessential modernista words. For me, it was surprising to discover that Darío does not use it in any of his poems in my database (I don’t know if he uses it in his prose work).

The following is a list of twenty of Darío’s MFWs (excluding articles, prepositions and common verbs):

oh           0.2890521
vida         0.2870448
luz          0.2609499
oro          0.2288330
amor         0.2268256
alma         0.1987234
sol          0.1906941
azul         0.1666064
esperanza    0.1605845
canto        0.1565699
tierra       0.1505480
rosa         0.1485407
día          0.1465334
dios         0.1445261
gloria       0.1445261
rosas        0.1445261
sangre       0.1425188
ojos         0.1364968

And then there are those two words one usually associates with Darío’s poetry. They appear much less frequently in his texts:

cisne        0.05821189 
princesa     0.04616805

The gap between these two classical modernista words and the MFWs in Darío’s poetry is clearly shown in the following image.
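Figures like the ones listed above can be obtained with a short script. The Python sketch below assumes that the values are percentages of all tokens in the corpus, which the post does not state, and it uses a tiny hypothetical stopword list in place of the real filter for articles, prepositions and common verbs.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"la", "de", "y", "el", "en", "que", "a", "los", "las",
             "un", "una", "del", "al"}


def relative_frequencies(text, stopwords=STOPWORDS):
    """Percentage of all tokens for each non-stopword, most frequent first."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    total = len(tokens)
    counts = Counter(t for t in tokens if t not in stopwords)
    return {word: 100 * n / total for word, n in counts.most_common()}


freqs = relative_frequencies("el cisne y la rosa de la vida la rosa")
for word, value in freqs.items():
    print(f"{word:<12} {value:.4f}")
```

Looking up `freqs.get("cisne")` and `freqs.get("princesa")` against the top of the list is then enough to see the gap described above.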

I know it is not right for me to take Darío’s vocabulary as representative of modernista poetry in general, but I am not sure that adding corpora from other writers and compiling a list of their shared MFWs would solve the problem. And is this “gap” simply a mismatch between perception and reality?

I suppose that in order to contrast the public perception of modernista vocabulary with the actual MFWs in their texts, one needs to find a way to compile a list of modernista words as understood by readers and critics.

Topic Tree

Many of the techniques I have been applying to my corpora were specifically designed for use with “Big Data,” which I do not have, unless one considers eighty-something poems “Big” data. I believe that it is possible to apply quantitative analysis to smaller groups of texts and obtain meaningful results. Ted Underwood would probably disagree with me on this, but I think even he would be surprised at the results I am getting from applying his Topic Tree method to Nájera’s poetry.

Topic modeling is very popular at the moment, but when I started working with modernista poetry I had my doubts about that approach. I didn’t know if it was the right one for my data, as topic modeling tends to emphasize the isolation of key themes in a single text. Because I was treating each poem as an individual work, I needed a technique that could help me establish links among the poems based on the recurrence of similar words/topics. Underwood’s technique (an alternative to topic modeling, which he calls Topic Tree) does exactly that. He applied it to a huge collection of 18th century documents and produced this dendrogram tree. He also has a post explaining his technique in detail. Underwood uses a vector space model to compare words among corpora, but instead of employing the tf-idf scores normally used by search engines, he has developed his own formula, which he explains in this Tech Note. In the same Tech Note, he has released his R code (as well as a very handy script to divide large trees into manageable sections).
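To give a rough idea of what a vector space model of words looks like, here is a minimal Python sketch. It is not Underwood’s method: it swaps his weighting formula for raw counts and compares words by plain cosine similarity, and it stops before the clustering step that produces the tree. The function names and the toy “poems” are hypothetical.

```python
import math
from collections import Counter


def word_vectors(poems, vocabulary):
    """Represent each word by its counts across the poems (one slot per poem)."""
    counts = [Counter(poem) for poem in poems]
    return {word: [c[word] for c in counts] for word in vocabulary}


def cosine(u, v):
    """Cosine similarity between two count vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy poems, already tokenized; invented for illustration.
poems = [
    ["muerte", "tumba"],
    ["muerte", "tumba", "sombra"],
    ["ave", "alas"],
    ["ave", "alas", "amor"],
]
vectors = word_vectors(poems, ["muerte", "tumba", "ave"])
print(cosine(vectors["muerte"], vectors["tumba"]))
print(cosine(vectors["muerte"], vectors["ave"]))
```

Words that keep appearing in the same poems end up with similar vectors, and a hierarchical clustering of those similarities is what yields branches of related words.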

Let me now show the results I got using the Topic Tree technique.  I fed my script 250 common words and that produced four main branches. As expected the branches of my tree reflect Nájera’s main topics.

In branch one, the topics are poetry, faith and childhood.

In branch two, all the words are related to the experience of death.

In the third branch the focus is on natural elements, with strong emphasis on flying animals and love.

The fourth and last branch is also the most interesting one. In the top part the main topic is poetry and its representation of beauty/nature, towards the middle the dominant topic is sexual attraction, and the lower part of the branch shows the poetic representation of women in modernista poetry (notice the words that cluster around “white”).

The experiment in my opinion was very successful and, of course, I am now curious about what a topic tree of modernista poetry would look like.

More on Whiteness

I was surprised to discover that the ratio between white and blue in the first one thousand lines of Nájera’s poetry I scanned (blanca/blanco 0.26328016 vs. azul/celeste 0.06194827) was incredibly consistent with the results I got using about four thousand lines (see previous post). The first 1,000 were from poems written between 1882 and 1886.

I decided to convert Nájera’s corpus from plain text to a TEI format. So far I have been mining the poems as if they were one long text, not as individual poems. Adding a <date> tag to each poem allows me to group texts by year. I was hoping to discover an interesting pattern in Nájera’s use of colors. However, the number of tokens varied greatly from year to year, not only because I have scanned just one third of his poems but also because, possibly due to the demands of his journalistic duties, Nájera’s poetic output from around 1888 to 1895 was sparse. This is, of course, a common problem with such a small data set.

In the graphic below, azul (including azul, azules, azur, celeste) appears at the beginning of his career (1877) and again towards the end (1895). Blanco (including blanco, blanca, blancos, blancas, blancura), on the other hand, consistently appears throughout the years.
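Counting the two color groups per year from dated poems could look like the following Python sketch. The TEI fragment is a simplified, invented sample (no namespace, one `<div>` per poem, a `<date when="...">` element and `<l>` lines), so the element layout is an assumption rather than the encoding of my actual files. The word lists are the ones given above.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

COLORS = {
    "azul": {"azul", "azules", "azur", "celeste"},
    "blanco": {"blanco", "blanca", "blancos", "blancas", "blancura"},
}

# Invented sample; real TEI files declare a namespace and a fuller header.
SAMPLE = """<text>
  <div type="poem"><date when="1877"/><l>el cielo azul</l><l>la blanca luna</l></div>
  <div type="poem"><date when="1884"/><l>blancas rosas, blanco velo</l></div>
</text>"""


def colors_by_year(xml_string):
    """Count the color word groups in each poem, keyed by the poem's year."""
    root = ET.fromstring(xml_string)
    result = defaultdict(Counter)
    for poem in root.iter("div"):
        date = poem.find("date")
        if date is None:
            continue  # undated poems cannot be grouped by year
        year = date.get("when")
        words = re.findall(r"[^\W\d_]+", " ".join(poem.itertext()).lower())
        for word in words:
            for color, forms in COLORS.items():
                if word in forms:
                    result[year][color] += 1
    return result


for year, counts in sorted(colors_by_year(SAMPLE).items()):
    print(year, dict(counts))
```

Dividing each count by the number of tokens for that year would make the sparse years comparable with the productive ones.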

Grouping poems by date will also become useful in the future as I try to periodize Nájera’s poetry. Unlike Darío, Nájera never published a single book of poetry and most attempts at organizing his poetry in periods seem arbitrary (see González Guerrero’s preface to Poesías completas [1966], for example).