"Fightin' Words" in Python
24 Jun 2013The Python script below implements the “Fightin’ Words” algorithm (see Monroe, B., Colaresi, M., Quinn, K. Fightin’ words: lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4), pp. 372-403). It takes as inputs word-frequency matrices. These matrices must be in CSV format. The first column must contain the words and the second column must contain the absolute frequencies (if there are additional columns they will simply be ignored). These matrices need to be in the appropriate folder: change ‘rpath’ as needed (see line 11).
The script lets you choose between an uninformative prior (alpha = 0.01 for all words) and an informative prior (alpha = relative frequency of the word in the language). To choose the uninformative prior all you need are the word-frequency matrices mentioned before. To choose the latter you need also a matrix of global relative frequencies, i.e., a matrix containing the relative frequency of each word in the language as a whole. This matrix must be in CSV format as well. The first column must contain the words and the second column must contain the relative frequencies. This matrix must be saved as ‘corpus.csv’ (or you can change the name declared in lines 16, 26, and 52 of the script) and it must be in the same folder as the other matrices (‘rpath’). (The Laplace prior, also proposed in the ‘Fightin’ Words’ article, is not implemented here – I’m working on that).
The output is a set of Z scores, which is saved to a CSV file (see lines 111-116).
Because this script processes one file at a time, it can handle corpora that are too large to fit in memory.