papers

published:

Deep Learning Anomaly Detection as Support Fraud Investigation in Brazilian Exports and Anti-Money Laundering. 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016. Here we use deep learning to detect fake Brazilian exports. Data and code. Sorry, it’s company-level data and therefore protected by Brazilian privacy laws (I only had access to it because a co-author works at Brazil’s tax authority).
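
For illustration only, here is a minimal sketch of the general idea - reconstruction-error anomaly detection with an autoencoder-style network - using synthetic data and scikit-learn’s MLPRegressor. This is not the architecture or the features we actually used (which I can’t share), just the gist of the approach:

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    # synthetic stand-in for the (protected) export records:
    # mostly "normal" rows plus a few anomalous ones
    rng = np.random.RandomState(0)
    normal = rng.normal(0, 1, size=(1000, 10))
    anomalies = rng.normal(4, 1, size=(10, 10))
    X = StandardScaler().fit_transform(np.vstack([normal, anomalies]))

    # a tiny autoencoder: the network is trained to reconstruct its own input
    # through a narrow hidden layer (MLPRegressor stands in for a deeper net)
    ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
    ae.fit(X, X)

    # rows the model reconstructs poorly are flagged as potential fraud
    errors = np.mean((ae.predict(X) - X) ** 2, axis=1)
    threshold = np.percentile(errors, 99)
    print("flagged rows:", np.where(errors > threshold)[0])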

A dimensão geográfica das eleições brasileiras (“The spatial dimension of Brazilian elections”). Opinião Pública (Public Opinion), 19(2), 270-290, 2013. Here I use spatial econometrics and the Brazilian election of 2010 to understand why neighboring counties tend to vote similarly. Data and code. I used a mix of Stata (here) and R (here) code. The dataset is here (it’s in Stata format; convert it to CSV format to run the R code). The list of missing observations is here. (To produce the plots I used GeoDa and ArcGIS, using the respective GUIs, so there’s no code for those.)
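
The spatial-dependence question at the heart of the paper can be illustrated with a toy Moran’s I computation. This is just a hand-rolled sketch with made-up numbers, not the Stata/R code or the GeoDa/ArcGIS workflow from the paper:

    import numpy as np

    def morans_i(y, W):
        """Moran's I: global spatial autocorrelation of y, given a
        row-standardized spatial weights matrix W."""
        z = np.asarray(y, dtype=float) - np.mean(y)
        n = len(z)
        return (n / W.sum()) * (z @ W @ z) / (z @ z)

    # toy example: 4 counties on a line, each one neighboring the adjacent ones
    W = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    W = W / W.sum(axis=1, keepdims=True)   # row-standardize

    vote_share = [0.30, 0.35, 0.60, 0.65]  # neighbors look alike -> positive I
    print(morans_i(vote_share, W))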

Lobby e protecionismo no Brasil contemporâneo (“Lobby and protectionism in contemporary Brazil”). Revista Brasileira de Economia (Brazilian Review of Economics), 62(3), 263-278, 2008. Here I regress tariffs on industry-level indicators of political power (economic concentration, number of workers, etc.). Data and code. I ran everything almost a decade ago, and back then I used Excel spreadsheets to store the data (I know, I know…) and clicked buttons instead of writing code (I didn’t know any better), so I don’t have much to offer here. The spreadsheets are all in this zipped folder.
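
If you want to redo the analysis from those spreadsheets, the regression itself is straightforward. Here is a hypothetical sketch with pandas and statsmodels - the column names are invented and this is not the original specification:

    import pandas as pd
    import statsmodels.formula.api as smf

    # in practice: df = pd.read_excel("<one of the spreadsheets in the zipped folder>")
    # synthetic stand-in so the sketch runs on its own (column names are invented):
    df = pd.DataFrame({
        "tariff":        [12.0, 8.5, 20.0, 5.0, 15.5, 9.0],
        "concentration": [0.40, 0.25, 0.70, 0.10, 0.55, 0.30],
        "workers":       [120, 80, 300, 40, 210, 95],
    })

    # regress tariffs on industry-level indicators of political power
    model = smf.ols("tariff ~ concentration + workers", data=df).fit()
    print(model.summary())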

still in progress:

Using SVM to pre-classify government expenditures. Here I use support vector machines (SVM) to create an app that could reduce the misclassification of government purchases in Brazil. The app suggests likely categories based on the description of the good being purchased. Data and code. Download and decompress the CSV files and save them all in the same folder. Then use the scripts parseX.py and parseY.py to create X.pkl and Y.pkl, respectively. (I know, I could simply let you download X.pkl and Y.pkl directly, but you should not trust Python pickles you didn’t create yourself. Also, the pickles take up a lot more space than the CSVs.) Then use the catmat_svm.py script to train and validate the classifier. As for the web app, it’s open source.
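
In case it helps, here is roughly what the training/validation step looks like. This is a simplified sketch, not the actual catmat_svm.py, and it assumes X.pkl already holds a numeric feature matrix and Y.pkl the category labels (the real parse scripts may produce something different):

    import pickle
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    # load the features and labels produced by parseX.py / parseY.py
    # (only unpickle files you created yourself)
    with open("X.pkl", "rb") as f:
        X = pickle.load(f)
    with open("Y.pkl", "rb") as f:
        y = pickle.load(f)

    # hold out 20% of the purchases for validation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # a linear SVM tends to work well with high-dimensional text features
    clf = LinearSVC()
    clf.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_test, clf.predict(X_test)))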

Ideological bias in democracy measures. Here I use Monte Carlo simulations to reassess some studies on the biases behind Freedom House, Polity IV, and other democracy measures. I find that the evidence of bias is robust, but that we can’t know which measures are biased or in what direction (e.g., for all we know Freedom House may well have a leftist bias, contrary to popular belief). Data and code. I used a mix of Stata (here and here) and R (here) code. Here’s the data in Stata format and here’s the same data in CSV format (for the R code).
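
To see why the direction of the bias isn’t identified, here is a toy Monte Carlo (not the simulations from the paper): two scenarios with opposite bias assignments yield the same relative difference between the measures, and since the latent trait has no natural zero the levels can’t break the tie either.

    import numpy as np

    rng = np.random.RandomState(42)
    n = 100_000
    latent = rng.normal(0, 1, n)          # "true" democracy levels (unobserved)
    noise_a = rng.normal(0, 0.5, n)
    noise_b = rng.normal(0, 0.5, n)

    # scenario 1: measure A has a rightist bias, measure B is unbiased
    a1, b1 = latent + 0.3 + noise_a, latent + noise_b
    # scenario 2: measure A is unbiased, measure B has a leftist bias
    a2, b2 = latent + noise_a, latent - 0.3 + noise_b

    # comparing the measures only identifies their *relative* bias ...
    print((a1 - b1).mean(), (a2 - b2).mean())   # ~0.3 in both scenarios
    # ... and because the latent scale is arbitrary, re-centering it makes
    # the two scenarios observationally identical: the data can't tell us
    # which measure is the biased one, or in which direction.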

Using NLP to measure democracy. In this paper I use natural language processing to create the first machine-coded democracy index, which I call the Automated Democracy Scores (ADS). I base the ADS on 42 million news articles from 6,043 different sources. The ADS cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable and have standard errors small enough to actually distinguish between cases. To produce the ADS I relied on supervised learning. I tried three different approaches, compared the results, and picked the one that worked best. More specifically, I tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperformed the alternatives. I created a web application where anyone can tweak the training data and see how the results change (no coding required).

Data and code. The two corpora (A and B) are available in MatrixMarket format. Each corpus is accompanied by other files: an internal index; a Python pickle with a dictionary mapping word IDs to words; and a Python pickle with a dictionary mapping words to word IDs. Here are the links: Corpus A (index, id2token, token2id) and Corpus B (index, id2token, token2id). There is also a cleaned-up version of the UDS dataset - here.

As for the code, I used Python 2.7.6. For LSA and LDA I used the gensim package (v. 0.8.9) and for the tree methods I used the scikit-learn package (v. 0.14.1). All the LSA/LDA/trees scripts are available online (change the “num_topics”, “ipath”, “opath”, and “udsfile” variables as needed): LSA, LDA, tree-based predictions, and list of country-years (must be in the same folder as the LSA and LDA scripts).

To run Wordscores I had to implement it in Python, as the existing implementations (in R and Stata) do not handle out-of-core data. The code is not pretty though, so if you want to replicate the Wordscores part it may be easier to write your own code (a bare-bones sketch of the algorithm follows below). If you do want to use my code, first you’ll need to convert the term-frequency matrix from sparse to dense format, split it twice (once row-wise into chunks of 106,241 rows each and once column-wise into chunks of 49 columns each), compute the relative frequencies, split the matrix of relative frequencies (column-wise, into chunks of 49 columns each), save all the chunks in HDF5 format, name the chunks in very specific ways (see code), and download a cleaned-up version of the UDS dataset from here. If you’re willing to endure all that then my code should work - you can find it here.

Running LSA, LDA, and Wordscores required high-performance computers. To run LSA, LDA, and Wordscores on corpus A I used a cluster of memory-optimized servers from Amazon EC2. Each server had an 8-core Intel Xeon E5-2670 v2 (Ivy Bridge) CPU and 61GB of RAM. To run LSA and LDA on corpus B I used a cluster of nodes from the Ohio Supercomputer Center. Each node had a 12-core Intel Xeon X5650 CPU and 48GB of RAM. Most LSA and LDA specifications took about a day to run, but a few (especially LDA with 300 topics) took almost a week. Total computing time was 1,512 hours. The tree-based methods only took a few seconds for each batch and did not require high-performance computers.
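
For what it’s worth, the core of Wordscores (Laver, Benoit & Garry 2003) fits in a few lines. This is a bare-bones, in-memory sketch with toy data, not my out-of-core implementation:

    import numpy as np

    def wordscores(ref_tf, ref_scores, virgin_tf):
        """Bare-bones Wordscores (Laver, Benoit & Garry 2003).
        ref_tf: (n_ref_docs, n_words) raw term counts of the reference documents
        ref_scores: (n_ref_docs,) known scores of the reference documents
        virgin_tf: (n_virgin_docs, n_words) raw term counts of the virgin documents
        """
        # drop words that never appear in the reference documents
        keep = ref_tf.sum(axis=0) > 0
        ref_tf, virgin_tf = ref_tf[:, keep], virgin_tf[:, keep]
        # relative frequency of each word within each reference document
        f_ref = ref_tf / ref_tf.sum(axis=1, keepdims=True)
        # P(reference document | word)
        p_ref_given_word = f_ref / f_ref.sum(axis=0, keepdims=True)
        # a word's score is the expected reference score given that word
        word_scores = p_ref_given_word.T @ ref_scores
        # a virgin document's score is the frequency-weighted mean of its word scores
        f_virgin = virgin_tf / virgin_tf.sum(axis=1, keepdims=True)
        return f_virgin @ word_scores

    # toy example: two reference documents scored -1 and +1, one virgin document
    ref_tf = np.array([[5., 1., 0.],
                       [1., 4., 2.]])
    ref_scores = np.array([-1., 1.])
    virgin_tf = np.array([[2., 2., 1.]])
    print(wordscores(ref_tf, ref_scores, virgin_tf))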

Why is democracy declining in Latin America? Here I argue that Latin America’s “left turn” in the 2000s was accompanied by democratic erosion, as the new governments that came to power relied on constituencies that did not value democracy (which in turn reduced the electoral cost of suppressing press freedom, violating term limits, etc).

newspaper articles:

O terceiro fracasso do Mercosul (“The third failure of Mercosur”). O Estado de São Paulo, 2/5/2011. Here I discuss why Mercosur failed to lock in the trade liberalization of the 1990s.

O preço de aceitar a Venezuela (“The price of accepting Venezuela”). O Estado de São Paulo, 5/28/2009. Here I discuss the policy consequences of Venezuela’s entry into Mercosur (a trade bloc comprising Brazil, Argentina, Paraguay, Uruguay, and Venezuela).