new resources

I just added some programming-related entries to the resources page. It’s a bunch of books, online courses, and Python modules that I’ve discovered over the last few months and found useful.

"Fightin' Words" in Python

The Python script below implements the “Fightin’ Words” algorithm (see Monroe, B., Colaresi, M., Quinn, K. Fightin’ words: lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4), pp. 372-403). It takes as inputs word-frequency matrices. These matrices must be in CSV format. The first column must contain the words and the second column must contain the absolute frequencies (if there are additional columns they will simply be ignored). These matrices need to be in the appropriate folder: change ‘rpath’ as needed (see line 11).

The script lets you choose between an uninformative prior (alpha = 0.01 for all words) and an informative prior (alpha = relative frequency of the word in the language). To choose the uninformative prior all you need are the word-frequency matrices mentioned before. To choose the latter you need also a matrix of global relative frequencies, i.e., a matrix containing the relative frequency of each word in the language as a whole. This matrix must be in CSV format as well. The first column must contain the words and the second column must contain the relative frequencies. This matrix must be saved as ‘corpus.csv’ (or you can change the name declared in lines 16, 26, and 52 of the script) and it must be in the same folder as the other matrices (‘rpath’). (The Laplace prior, also proposed in the ‘Fightin’ Words’ article, is not implemented here – I’m working on that).

The output is a set of Z scores, which is saved to a CSV file (see lines 111-116).

Because this script processes one file at a time, it can handle corpora that are too large to fit in memory.

Wordscores in Python

The Python script below implements the ‘wordscores’ algorithm (see Laver, M., Benoit, K., Garry, J. Extracting policy positions from political texts using words as data. American Political Science Review, 97(2), 2003, pp. 311-331). It takes as inputs word-frequency matrices. These matrices must be in CSV format. The first column must contain the words, the second column must contain the absolute frequencies, and the third column must contain the relative frequencies (if there are additional columns they will simply be ignored). These matrices need to be in the specified input folder - change ‘ipath’ as needed (line 9). The output is a set of word scores, which is saved to a CSV file (line 52), and a set of document scores (and corresponding confidence intervals), which are printed on the screen (lines 102-115) and saved to a CSV file (line 118). These CSV files are saved to the specified output folder - change ‘opath’ as needed (line 10).

Because this script processes one file at a time, it can handle corpora that are too large to fit in memory (unlike the R and Stata versions).