doing data science in the government

Today it’s been three years since I first started working as a data scientist in the Brazilian government. Overall it’s been a great experience and I think this is a good time to reflect upon what I’ve learned so far. I’ll start with the ugly and the bad and then I’ll move on to the good.

bureaucrats won’t share data unless physically compelled to do so

There is no shortage of decrees saying that government agencies will give their data to other government agencies when requested to do so (here’s the latest one). But between the pretty text of the law and what really goes on in the intestines of the bureaucracy there is a huge gap. Every government agency is in favor of data sharing - except when it comes to its own data.

Excuses abound. I can’t share my data because it’s too sensitive. I can’t share my data because extracting it would be too costly. I can’t share my data because we’re in the middle of a major IT restructuring and things are too messy right now. I can’t share my data because we’ve lost its documentation. I can’t share my data because the IT guy is on vacation. I can’t share my data because you might misinterpret it (this is my favorite).

The actual reasons are not always easy to pinpoint. Sometimes there are legitimate concerns about privacy, as in the case of fiscal data. But this is rare. More often than not the alleged privacy concerns are just a convenient excuse. In some cases the privacy excuse is used even when the data is already public. For instance, everything the government buys (other than, say, spy gear) goes in the official bulletin, which anyone can read. The equivalent of the IRS in Brazil - Receita Federal - has all the corresponding tax invoices neatly collected and organized in a database. That database would make it much easier to compute, say, the average price the Brazilian government pays for ballpoint pens. You’d think that database would be readily available not just for the entire government but for the citizenry as well.

You’d be wrong. Government agencies have been trying - and failing - to get their hands on that database for years. Our IRS says it’s protected by privacy laws. But the law says that government purchases must be public. And they already are, it’s all in the official bulletin - but that’s unstructured data that would require lots of scraping, OCRing, and parsing. The data is already public but not in a machine-readable format. That’s why the IRS database is so valuable. But the IRS legal folks haven’t met their match yet.

(Meanwhile data that is actually sensitive - like people’s addresses and tax returns - can be bought for $10 from shady vendors not too hard to find; all it takes is a walk along Rua 25 de Março, in São Paulo. In Brazil if the government has any data on you then you can be sure it’s for sale at Rua 25 de Março.)

Sometimes bureaucrats don’t share data because they worry that the data will make them look bad. What if my data shows that I’m paying too much for office paper? What if my data shows that I have way more people than I need? What if my data shows that my agency is a complete waste of taxpayers’ money and should not exist? I suppose I understand the survival instinct behind such concerns, but those are precisely the cases where we absolutely must get the data, for that’s where the ugliest inefficiencies are. In general the more an agency refuses to share its data the more important it is to get it.

(It doesn’t help that I work for the Office of the Comptroller-General, whose main goal is to find and punish irregularities involving taxpayers’ money. People sometimes assume that our asking for their data is the prelude to bad things. We need to explain that inside the Office of the Comptroller-General there is this little unit called the Observatory of Public Spending and that our goal at the OPS is often to help other agencies by extracting insights from their data.)

government data is a mess

The idea that you should document your databases is something that hasn’t taken root in the Brazilian government. Nine times out of ten all you get is an MSSQL dump with no accompanying data dictionary. You try to guess the contents based on the names of the tables and columns, but these are usually uninformative (like D_CNTR_IN_APOSTILAMENTO or SF_TB_SP_TAB_REM_GDM_PST - both real-life examples). So you end up guessing based on the contents themselves. As in: if it’s an 11-digit numeric field then that’s probably the Brazilian equivalent of the Social Security Number.

As you might imagine, sometimes you guess wrong. You mistake quantity for total price and vice-versa and the resulting unit prices don’t make any sense. You think that a time field is in years but it’s actually in months. You mistake an ID field for a numeric field. Etc etc. You’re lucky when you catch the errors before whatever you’re doing becomes policy. And some fields you just leave unused because they are too mysterious and you can’t come up with any reasonable guesses. Like when it’s probably a dummy variable because the values are all 0s and 1s but you have no clue what the 0s and 1s mean.
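The content-based guessing described above can be sketched as a small heuristic. This is only an illustration: the function name, the labels it returns, and the exact rules are assumptions, not the checks actually used at the OPS, and a heuristic like this will sometimes guess wrong in exactly the ways described.

```python
import re

def guess_column_kind(values):
    """Guess what a column holds from its non-null values alone (heuristic)."""
    # Normalize everything to stripped strings and drop blanks/nulls.
    cleaned = [str(v).strip() for v in values
               if v is not None and str(v).strip() != ""]
    if not cleaned:
        return "empty"
    # Only 0s and 1s: probably a dummy variable, meaning unknown.
    if all(v in {"0", "1"} for v in cleaned):
        return "dummy (meaning unknown)"
    # Exactly 11 digits in every row: probably a personal ID number.
    if all(re.fullmatch(r"\d{11}", v) for v in cleaned):
        return "probably a personal ID number"
    # Digits with an optional decimal part (dot or comma).
    if all(re.fullmatch(r"-?\d+([.,]\d+)?", v) for v in cleaned):
        return "numeric"
    return "text"
```

Note that an ID stored as an integer loses its leading zeros, so an 11-digit rule like this one only works when the field was kept as text.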

Besides the lack of documentation there are also many errors. Null names, misclassification, missing data, typos (according to one database the Brazilian government signed an IT contract of US$ 1 quadrillion back in 2014 - that’s hundreds of times the budget of the entire US government). To give you an idea, half the government purchases classified as “wheeled vehicles” are under R$ 1,000 (roughly US$ 300); when we inspect the product descriptions we see that they are not actually vehicles but spare parts, which have a different code and should have been classified elsewhere.

The problem begins in the data generation process, i.e., in the systems bureaucrats use to enter the data. These systems are too permissive; they lack basic validation like checking input type (numeric, text, date, etc), input length (does the state code have more than two characters?), and the like. And there is no punishment for the bureaucrat who enters incorrect data.
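To make the point concrete, here is a minimal sketch of the kind of entry-time validation those systems lack. The record layout and the field names (state_code, quantity, unit_price, purchase_date) are hypothetical, chosen only for illustration.

```python
from datetime import datetime

def validate_purchase(record):
    """Return a list of problems found in one purchase record (empty if none)."""
    problems = []

    # Input length and type: a state code is exactly two letters.
    state = record.get("state_code", "")
    if not (isinstance(state, str) and len(state) == 2 and state.isalpha()):
        problems.append("state_code must be exactly two letters")

    # Input type: quantity is a positive integer (bool is excluded on purpose).
    quantity = record.get("quantity")
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity <= 0:
        problems.append("quantity must be a positive integer")

    # Input type: unit price is a positive number.
    price = record.get("unit_price")
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        problems.append("unit_price must be a positive number")

    # Input format: dates must parse as YYYY-MM-DD.
    try:
        datetime.strptime(str(record.get("purchase_date", "")), "%Y-%m-%d")
    except ValueError:
        problems.append("purchase_date must be in YYYY-MM-DD format")

    return problems
```

A system that refused to save any record with a non-empty problem list would catch a good share of the errors before they ever reached the database.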

The most frequent result is absurd averages. You try to compute, say, spending per employee, and what you get back is a bazillion dollars or something close to zero. That means many hours of data cleaning before we can really get started. You have to do all the sanity checks that the government systems fail to do - or else it’s garbage in, garbage out. After you’ve filtered out the US$ 1 quadrillion IT contracts and the US$ 0 hospitals and schools you are left with lots of missing data - and that’s often not missing at random, which poses additional problems. It’s no wonder that most of our academic work is about cleaning up data (like here and here).
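A minimal sketch of one such sanity check: average only the values that fall inside plausible bounds and report how many were thrown away. The function name and the bounds are assumptions; picking sensible bounds is the hard, domain-specific part.

```python
def mean_within_bounds(values, lower, upper):
    """Average only the values inside [lower, upper].

    Returns (mean, dropped), where dropped counts the nulls and
    out-of-range values that were excluded from the average.
    """
    kept = [v for v in values if v is not None and lower <= v <= upper]
    dropped = len(values) - len(kept)
    mean = sum(kept) / len(kept) if kept else None
    return mean, dropped
```

Returning the number of dropped values matters because, as noted above, what gets filtered out is often not missing at random; a large count is a warning that the cleaned average may be biased.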

recruiting is a different ball game

The Brazilian government does not deliberately recruit data scientists. Data scientists come in through a bunch of miscellaneous doors and then we find each other - by word-of-mouth or Twitter or conferences - and come up with ways to work together. By now there are a few well-known “data science places” in the government - like the OPS (where I work) and the TCU - and data-curious government employees have been flocking to them, by various means; but there isn’t a clear policy to attract data talent.

In order to enter the Brazilian government you usually have to pass a public exam and such exams do not cover any content even remotely related to data. What you do need to learn to pass these exams is a large number of arcane pieces of legislation, mostly relating to government procedures - like the three phases of a procurement process, what they are called, what the deadlines are, what the many exceptions are, and so on. As you may imagine, that doesn’t usually attract people interested in data. A few of us slip through the cracks somehow, but that’s largely by accident.

That makes recruiting for your team a lot harder than in the private sector, where you can simply post a job ad on LinkedIn and wait for applications. In the government you normally can’t recruit people that are not already in the government. It goes more or less like this: You start by identifying the person you want to bring to your team. He or she will usually be in another government agency, not in your own. You will need to negotiate with his or her agency so that they okay their coming to work for you. That may require you to find someone in your agency willing to go work for them. Such negotiations can last for months and they look a lot like an exchange of war prisoners (I was “traded” myself once and it’s not fun). If the head of your agency (your minister or whatever) has sufficient political clout (and knows that you exist), he or she may try to prevail over his or her counterpart at the other agency. Either way, there’s no guarantee that you’ll succeed.

If it’s hard for the recruiter it’s even harder for the person being recruited. They need to tell their current agency that they no longer want to work there, but they have no guarantee that they will get transferred. Imagine telling your significant other that you want to break up but then being somehow legally compelled to stay in the relationship. It can be awkward. You’re unlikely to get a promotion if your bosses know that you don’t want to work there. They may start giving you less important tasks. Such possibilities often kill any recruitment before it even begins.

There is also the issue of salary negotiations. The issue being that they are not possible. When you work for the government the law determines your salary - there is no room for negotiation. Sometimes you can offer a potential candidate a little extra money if they agree to take up some administrative responsibilities but this is usually under US$ 1000/month and most data scientists prefer to avoid having any responsibilities that involve paperwork. So whoever you are trying to lure must be really excited by the work you and your team do because the work itself is pretty much all you have to offer.

But enough with the bad and the ugly.

you have a lot of freedom to experiment

Paradoxically, in the midst of this red tape jungle we have a lot of leeway to play around and try new things. Data science is a highly technical subject, one that takes at least a few Coursera courses to even begin to grasp, and that helps keep it unregulated. We have to fill out the same stupid forms everyone else does when we want to buy stuff, but whether we use Hadoop or not, whether we adopt Python or R for a project, whether we go with an SVM or a neural network or both, and whether we think any given project is worth pursuing is all entirely up to us. Legal doesn’t have a say in any of that. The budget folks don’t have a say in any of that. The minister himself - heck, the president himself - doesn’t have a say in any of that. They wouldn’t even know where to start. So, thanks to the highly specialized nature of our trade we don’t have higher-ups trying to micromanage what we do.

There is also the tenure factor. You see, in Brazil once you enter the civil service you get automatically tenured after three years. And the constitution says that, once tenured, you need to do something really outrageous to get fired - and even then there are several appeal instances and oftentimes nothing happens in the end. I bet that if I showed up naked for work tomorrow I still wouldn’t get fired; I might get a written warning or something along these lines, and I’d probably appear in the local news, but I would still have my job. It takes something outright criminal to get a government employee fired. Like, they need to catch you taking a bribe or not showing up for work for months. And even then sometimes people don’t get fired.

Overall tenure is bad: too many lazy idiots spend their days browsing Facebook and entertaining themselves with hallway gossip. But for experimenting purposes tenure is great. It makes “move fast and break things” possible. Some bureaucrats want our assistance and happily give us their data and collaborate with us, helping us understand their problems and needs. But other bureaucrats get upset that you’re even daring to ask for their data. And they worry that you might disrupt the way they work or that you might automate them altogether. If we had to worry about our jobs at every step of the way we wouldn’t accomplish much. Without tenure heads might roll.

you can help taxpayers; a lot

The Brazilian government is humongous - it takes up ~40% of the country’s GDP. Most of that money goes down the toilet: the government overpays for everything it buys; it contracts suppliers that do not deliver the goods and services they promised; corruption is generalized. But data science can help. For instance, at the OPS we have trained a model that predicts whether a supplier is likely to become a headache (say, because it won’t deliver or because it will shut down). (Here’s a talk I gave about it earlier this year.) We’re now refining that model so that we can later appify it and plug it into the government’s procurement system. That way the government will be alerted of potential problems before it signs the contract.

That project has taken a lot of time and effort - the first version of the model was the master’s research of my colleague Leonardo and since then he and other people have put a lot more work into it. That’s a lot of salary-hours. But if a single problematic contract of small size - say, US$ 100k or so - is prevented because of the model then all that effort will have been worth it. And given the size of the government’s budget - around US$ 1 trillion a year - we should be able to save a lot more money than US$ 100k. That’s money that could go back to taxpayers.

is it worth it?

If you can stomach the ugly and the bad and are excited about the good, then yes. :-)

liberating Sci-Hub

If you do academic research but are not affiliated with an academic institution you probably know Sci-Hub. It gives you access to over 60 million research papers - for free (no ads, no malware, no scams). Alexandra Elbakyan, its creator, has deservedly been ranked by Nature one of the top ten most relevant people in science and we independent researchers owe her a lot.

You’d think that such an invention would be welcomed by most people who are not Elsevier executives. You’d think that such an invention would be particularly welcomed at organizations that do not have an Elsevier subscription. You’d be wrong. In the Brazilian government, where I work, Sci-Hub is not only not welcomed, it is actively blocked. The firewall doesn’t let me access it.

What I get instead is a block page, which is Portuguese for “Blocked content! Science is illegal/unethical, so screw yourself.” (Sort of.)

This week I finally got tired of that nonsense - dammit, I’m a data scientist, I need academic papers not only for the research I do on the side but also, and mainly, for my day job. So I decided to build an interface to Sci-Hub - an app that takes my search string, gives it to Sci-Hub, and retrieves the results. Much like I did before in order to use Telegram.

Writing the code was easy enough, it’s a simple web app that does just one thing. I wrote it on Thursday evening and I was confident that the next morning I would just fire up a new project on Google App Engine, deploy the code, and be done with it in less than an hour. Oh, the hubris. I ended up working on it all Friday and all Saturday morning; only at Saturday 12:43pm did the damned thing go live.

What follows is an account of those 36 hours, largely for my own benefit in case I run into the same issues again in the future, but also in case it may be helpful to other people also looking to unblock Sci-Hub. I’m also writing this because I think those 36 hours are a good illustration of the difference between programming, on the one hand, and software development, on the other, which is something I struggled to understand when I first started writing code. Finally, I’m writing this because those 36 hours are a good example of the inefficiencies introduced when sysadmins (or their bosses) decide to block useful resources.

le code

If you inspect the HTML code behind Sci-Hub you can see it’s really easy to scrape:

<div id="input"><form method="POST" action="/"><input type="hidden" id="sci-hub-plugin-check" name="sci-hub-plugin-check" value=""><input type="textbox" name="request"  placeholder="enter URL, PMID / DOI or search string" autocomplete="off" autofocus></form></div>

All you have to do is send a POST request. If Sci-Hub’s repository has the paper you are looking for, you get it in a PDF file.

So I built this minimal web app that sends a POST request to Sci-Hub and then emails me back the PDF. I chose email because getting and returning each paper takes several seconds and I didn’t want the app blocked by each request. With email I can have a background process do the heavy work; that way I can send several POST requests in a row without having to wait in-between.

To achieve that I used Python’s subprocess module. I wrote two scripts. One is the frontend, which simply takes the user’s input. I didn’t want any boilerplate, so I used cherrypy as my web framework. As for the HTML code I just put it all in the frontend.py file, as a bunch of concatenated strings (#sorrynotsorry). And I used CDNs to get the CSS code (and no JavaScript whatsoever).

I gave my app the grandiose name of Sci-Hub Liberator.

(Sci-Hub Liberator’s front-end. This is what happens when data scientists do web development.)

The other script is the backend. It is launched by the frontend with a call to subprocess.Popen. That way all requests are independent and run on separate background processes. The backend uses Python’s requests package to send the POST request to Sci-Hub, then BeautifulSoup to comb the response and find the link to the paper’s PDF, then requests again to fetch the PDF.

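import requests
from bs4 import BeautifulSoup

def get_pdf(user_input):
    '''
    search string -> paper in PDF format
    '''
    # ask Sci-Hub for the paper
    response_1 = requests.post('http://sci-hub.cc/', data = {'request': user_input})
    # the PDF sits in an iframe whose src lacks the scheme
    soup = BeautifulSoup(response_1.text, 'html.parser')
    url_to_pdf = 'http:' + soup.find_all('iframe')[0].get('src')
    # fetch the PDF itself and return its raw bytes
    response_2 = requests.get(url_to_pdf)
    return response_2.content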

The backend then uses Python’s own email package to email me the PDF.

import smtplib
from email.header import Header
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def send_pdf(pdf):
    '''
    sends PDF to user
    '''
    sender = 'some_gmail_account_I_created_just_for_this@gmail.com'
    text = 'Your paper is attached. Thanks for using Sci-Hub Liberator! :-)'
    body = MIMEText(text, _charset = 'UTF-8')
    message = MIMEMultipart()
    message['Subject'] = Header('your paper is attached', 'utf-8')
    message['From'] = 'Sci-Hub Liberator'
    message['To'] = 'my_email_account@gmail.com'
    message.attach(body)

    # attach the PDF bytes as a file named paper.pdf
    part = MIMEApplication(pdf, _subtype = 'pdf')
    part.add_header('Content-Disposition', 'attachment; filename = "paper.pdf"')
    message.attach(part)

    # connect to Gmail, upgrade to TLS, then log in and send
    smtp_server = smtplib.SMTP('smtp.gmail.com:587')
    smtp_server.ehlo()
    smtp_server.starttls()
    smtp_server.ehlo()
    smtp_server.login(sender, 'emails_password')
    smtp_server.sendmail(sender, 'my_email_account@gmail.com', message.as_string())
    smtp_server.quit()
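
The handoff from frontend to backend mentioned earlier, one subprocess.Popen call per request, might look roughly like the sketch below. The function name launch_backend and the script name backend.py are assumptions for illustration; the original frontend code is not shown here.

```python
import subprocess
import sys

def launch_backend(search_string, backend_script="backend.py"):
    """Start the backend in its own process and return immediately."""
    # Popen does not wait for the child to finish, so the frontend can
    # accept the next search while this one is still being fetched.
    # Passing the arguments as a list avoids shell quoting problems.
    return subprocess.Popen([sys.executable, backend_script, search_string])
```

The backend script would then read the search string from sys.argv[1], fetch the PDF, and email it.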

Both scripts combined had 151 lines of code. Not exactly a “Hello, World!” application but not too far from it either.

a word of caution

Before I proceed I must ask you not to abuse Sci-Hub’s easily scrapable interface. That’s an amazing service they’re providing to the world and if you send thousands of requests in a row you may disrupt their operations. I trust that they have defenses against that (or else Elsevier would have taken them down long ago), but still, please don’t fuck up.

things change

Code written and tested, I turned to Google App Engine for hosting the app. With only 151 lines of code and two scripts I thought that launching the app would be a breeze. Silly me.

I wanted to use Python 3, but Google App Engine Launcher is only compatible with Python 2. I googled around and it seems that they are deprecating GAE Launcher in favor of the Google Cloud SDK. Pity. GAE Launcher was a nifty little app that made deployment really easy. I had been using it since 2013 and it allowed me to focus on my app and not on deployment nonsense.

Resigned to my fate, I downloaded the Google Cloud SDK installer and… installation failed due to an SSL-related problem. It took some half an hour of googling and debugging before I could get it to work.

things don’t change

GAE’s standard environment only allows Python 2. You can only use Python 3 in GAE’s flexible environment. And the flexible environment is a different ball game.

I had never used the flexible environment before (I think it only became generally available early this year), but I decided to give it a try. To make a long story short, I couldn’t make it work. The exact same code that works fine on my machine returns a mysterious Application startup error when I try to deploy the app. The deploy attempt generates a log file but it is equally uninformative, it only says Deployment failed. Attempting to cleanup deployment artifacts.

Despite hours of tinkering and googling I couldn’t find out what the problem was. I declared all my dependencies in my requirements.txt file (and I pointed to the same versions I was using locally); I configured my app.yaml file; I made sure that all of my dependencies’ dependencies were allowed. I didn’t know what else to look into.

Eventually I gave up in despair and decided to fall back on GAE’s standard environment, which meant reverting to Python 2. That was a bummer - it’s 2017, if GAE’s standard environment needs to choose between 2 and 3 then it’s probably time to pick 3 (assuming there is a way to do that without killing all existing Python 2 projects).

pip issues

Vendoring didn’t work for BeautifulSoup. Even though I used pip install and not pip3 install what got installed was BeautifulSoup’s Python 3 version. That resulted in from bs4 import BeautifulSoup raising ImportError: No module named html.entities.

After several unsuccessful attempts to point pip install to a specific source file I gave up on pip. I tested my Mac’s system-wide Python 2 installation and BeautifulSoup was working just fine there. So I went to my Mac’s site-packages and just copied the damned bs4 folder into my app’s lib folder. That did the trick. It’s ugly and it doesn’t shed any light on the causes of the problem but by then it was Friday afternoon and I was beginning to worry this deployment might take the whole day (if only!).

sheer dumbness

GAE has long been my default choice for hosting applications and I’ve always known that it doesn’t allow calls to the operating system. It’s a “serverless” platform; you don’t need to mess with the OS, which means you also don’t get to mess with the OS. So I can’t really explain why I based the frontend-backend communication on a call to subprocess.Popen, which is a call to the OS. That’s just not allowed on GAE. Somehow that synapse simply didn’t happen in my brain.

back to the code

GAE has its own utilities for background tasks - that’s what the Task Queue API is for. It looks great and one day I want to sit down and learn how to use it. But by the time I got to this point I was entering the wee hours of Saturday. My hopes of getting it all done on Friday were long gone and I just wanted a quick fix that would let me go to bed.

So I rewrote my app to have it show the PDF on the screen instead of emailing it. That meant I would have to wait for one paper to come through before requesting another one. At that hour I was tired enough to accept it.

The change was pretty easy - it involved a lot more code deletion than code writing. It also obviated the need for a backend, so I put everything into a single script. But the wait for the PDF to be rendered was a little too much and I thought that a loading animation of sorts was required. I couldn’t find a way to do that using only cherrypy/HTML/CSS, so I ended up resorting to jQuery, which made my app a lot less lean.

Sci-Hub is smart

After getting rid of the OS calls I finally managed to deploy. I then noticed a requests-related error message. After some quick googling I found out that GAE doesn’t play well with requests and that you need to monkey-patch it. Easy enough, it seemed.

After the patching requests seemed to work (as in: not raising an exception) but all the responses from Sci-Hub came back empty. The responses came through, and with status code 200, so the communication was happening. But there was no content - no HTML, no nothing.

I thought that it might be some problem with the monkey-patching, so I commented out requests and switched to urllib2 instead. No good: same empty responses. I commented out urllib2 and tried urlfetch. Same result. As per the official documentation I had run out of packages to try.

I thought it might have to do with the size of the response - maybe it was too large for GAE’s limits. But no, the papers I was requesting were under 10MB and the limit for the response is 32MB.

I had briefly considered the possibility of this being a user-agent issue: maybe Sci-Hub just doesn’t deal with bots. But everything worked fine on my machine, so that couldn’t be it.

Then it hit me: maybe the user-agent string on GAE is different from the user-agent string on my machine. I took a closer look at the documentation and found that GAE appends its own identifier to the user-agent string of outgoing requests.

Ha.

To test my hypothesis I re-ran the app on my machine but appending +http://code.google.com/appengine; appid: MY_APP_ID to my user-agent string. Sure enough, Sci-Hub didn’t respond with the PDF. Oddly though, I did get a non-empty response - some HTML code with Russian text about Sci-Hub (its mission, etc; or so Google Translate tells me). Perhaps Sci-Hub checks not only the request’s user-agent but also some other attribute like IP address or geographical location. One way or the other, I was not going to get my PDF if I sent the request from GAE.
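The local test described above can be sketched as follows. The suffix is the one quoted in the text; the helper name build_headers and the single space used as separator are assumptions.

```python
# Suffix that GAE adds to the user-agent, as quoted above.
GAE_SUFFIX = '+http://code.google.com/appengine; appid: {app_id}'

def build_headers(base_user_agent, app_id=None):
    """Return request headers, optionally imitating a request sent from GAE."""
    user_agent = base_user_agent
    if app_id is not None:
        # Assumption: the suffix is separated from the base string by a space.
        user_agent = base_user_agent + ' ' + GAE_SUFFIX.format(app_id=app_id)
    return {'user-agent': user_agent}

# Usage with requests (not executed here):
#   requests.post('http://sci-hub.cc/', data={'request': 'some search string'},
#                 headers=build_headers('Mozilla/5.0', app_id='MY_APP_ID'))
```

Sending the same search once with and once without app_id, from the same machine, isolates the user-agent as the only thing that changed between the two requests.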

At that point it was around 3am and I should probably have gone to bed. But I was in the zone. The world disappeared around me and I didn’t care about sleeping or eating or anything else. I was one with the code.

So instead of going to bed I googled around looking for ways to fool GAE and keep my user-agent string intact. I didn’t find anything of the kind, but I found Tom Tasche.

back to the code (again)

I decided to steal Tom’s idea. Turns out GCE has a micro-instance that you can use for free indefinitely (unlike AWS’s micro instance, which ceases to be free after a year). It’s not much to look at - 0.6GB of RAM - but hey, have I mentioned it’s free?

I rewrote my code (again). I went back to having the frontend and backend in separate scripts. But now instead of having the backend be a Python script called by subprocess.Popen I had it be an API. It received the user input and returned the corresponding PDF.

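from io import BytesIO

import cherrypy
import requests
from bs4 import BeautifulSoup

# (method of the backend's cherrypy application class)
@cherrypy.expose
def get_pdf(self, pattern):
    '''
    search string -> paper in PDF format
    '''
    # send a browser-like user-agent so that Sci-Hub answers
    scihub_html = requests.post(
        'http://sci-hub.cc/',
        data = {'request': pattern},
        headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'}
        )
    soup = BeautifulSoup(scihub_html.text, 'html.parser')
    url_to_pdf = 'http:' + soup.find_all('iframe')[0].get('src')
    scihub_bytes = requests.get(url_to_pdf)
    # .content holds the raw bytes of the PDF (.text would decode them)
    return BytesIO(scihub_bytes.content)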

I put this new backend in a GCE micro-instance and kept the frontend at GAE. I also promoted my backend’s IP from ephemeral to static, lest my app stop working out of the blue.

I was confident that this was it. I was finally going to bed. Just a quick test to confirm that this would work and then I’d switch off.

I tested the new architecture and… it failed. It takes a long time for the GCE instance to send the PDF to the GAE frontend and that raises a DeadlineExceededError. You can tweak the timeout limit by using urlfetch.set_default_fetch_deadline(60) but GAE imposes a hard limit of 60 seconds - if you choose a higher number your choice is just ignored. And I needed more than 60 seconds.

back to the code (yet again)

At that point I had an epiphany: I was already using a GCE instance anyway, so why not have the backend write the PDF to disk in a subprocess - so as not to block or delay anything - and have it return just the link to the PDF? That sounded genius and if it weren’t 6am I might have screamed in triumph.

That only required a minor tweak to the code:

from random import randint

@cherrypy.expose
def get_pdf(self, pattern):
    '''
    search string -> id of the PDF saved to disk
    '''
    scihub_html = requests.post(
        'http://sci-hub.cc/',
        data = {'request': pattern},
        headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'}
        )
    soup = BeautifulSoup(scihub_html.text, 'html.parser')
    url_to_pdf = 'http:' + soup.find_all('iframe')[0].get('src')
    scihub_bytes = requests.get(url_to_pdf)
    # random id so that each paper gets its own file name
    paper_id = str(randint(0,60000000))
    # the file is opened in binary mode, so write the raw bytes (.content)
    with open('static/paper{}.pdf'.format(paper_id), mode = 'wb') as fbuffer:
        fbuffer.write(scihub_bytes.content)
    return paper_id

No DeadlineExceededError this time. Instead I got a MemoryError. It seems that 0.6GB of RAM is not enough to handle 10MB objects (10MB is the space the PDF occupies on disk; things usually take up more space in memory than on disk). So much for my brilliant workaround.
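One common way to keep memory use down in a situation like this is to stream the download to disk in chunks instead of holding the whole PDF in memory. The sketch below is not what the app ended up doing, and whether it would have avoided the MemoryError on the micro-instance is untested; the helper name save_in_chunks is an assumption.

```python
def save_in_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the bytes written."""
    total = 0
    with open(path, 'wb') as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                total += len(chunk)
    return total

# With requests, the chunks could come from a streamed response
# (not executed here):
#   response = requests.get(url_to_pdf, stream=True)
#   save_in_chunks(response.iter_content(chunk_size=64 * 1024),
#                  'static/paper.pdf')
```

With stream=True, requests defers downloading the body until it is iterated, so only one chunk at a time needs to sit in memory.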

the end of fiscal responsibility

The cheapest non-free GCE instance has 1.7GB of RAM and costs ~$14.97 a month. I got bold and launched it (I looked into AWS EC2’s roughly equivalent instance and it wasn’t any cheaper: $34.96). At last, after a painful all-nighter, my app was alive.

I mean, I still haven’t added any error checking, but that’s deliberate - I want to see what happens when Sci-Hub can’t find the paper I requested or is temporarily down or whatnot. I’ll add the error checks as the errors happen.

I’ll hate paying these $14.97 but it beats not having access to a resource that is critical for my work. The only alternative I see is to rescue my old Lenovo from semi-retirement and that would be annoying on several grounds (I don’t have a static IP address at home, I would need to leave it up and running all day, it would take up physical space, and so on). So for now $14.97 a month is reasonable. At least that money is not going to Elsevier.

Now that I’m paying for a GCE instance anyway I could move all my code to it (and maybe go back to having a single script) and be done with GAE for this project. But I have this vague goal of making this app public some day, so that other people in my situation can have access to Sci-Hub. And with GAE it’s easy to scale things up if necessary. That isn’t happening any time soon though.

things I learned

It’s not so fun to pull an all-nighter when you are no longer in grad school - we get used to having a stable schedule. But I don’t regret having gone through all these steps in those 36 hours. I used Google Compute Engine for the first time and I liked it. I’m used to AWS EC2’s interface and GCE’s looked a lot more intuitive to me (and I found out GCE has a free micro-instance; even though I ended up not using it for this project it may come in handy in the future). I also familiarized myself with gcloud, which I would have to do anyway at some point. And I also learned a thing or two about cherrypy (like the serve_fileobj method, which makes it really easy to serve static files from memory).

Those 36 hours were also a useful reminder of the difference between programming and software development. Programming is about learning all the things you can do with the tools your languages provide. Software development is largely about learning all the things you cannot do because your runtime environment won’t let you. Our Courseras and Udacities do a great job of teaching the former but we must learn the latter by ourselves, by trial and error and by reading the documentation. I’m not sure that it could be otherwise: loops and lambdas are fundamental concepts that have been with us for decades, but the quirks of GAE’s flexible environment will probably have changed completely in a year or less. Any course built around GAE (or GCP in general or AWS) would be obsolete too soon to make it worth it.

hey, where’s the code?

It’s all here. Have fun!

Automated Democracy Scores

New paper. Abstract:

This paper uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS is based on 42 million news articles from 6,043 different sources and covers all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS is replicable and has standard errors small enough to actually distinguish between cases.

5 reasons why academics should read Anathem

I just read Neal Stephenson’s 2008 novel Anathem and now I walk around pestering everyone I know telling them to read it too. Well, not everyone: just people who are or have been in academia. Judging from Goodreads reviews everyone else finds the novel too long and theoretical, full of made up words, and full of characters who are too detached from the real world to be believable. Pay no heed to the haters - here’s why you academic types should read Anathem:

1. You will feel right at home even though the story is set in an alien world (I).

The planet is called Arbre and its history and society are not radically different from Earth’s. Except that at some point (thousands of years before the story begins) the people of Arbre revolted against science and confined their intellectuals to monasteries where the development and use of technology are severely limited (no computers, no cell phones, no internet, no cameras, etc.), as is any contact with the outside world. Inside these monasteries (“Maths”) the intellectuals (the “avout”) dedicate themselves to the study and development of mathematics, physics, and philosophy. As the use of technology is restricted, all that research is purely theoretical.

Arbre’s Maths are therefore an allegory for Earth’s universities. How many of our papers and dissertations end up having any (non-academic) impact? Maybe 1% of them? Fewer than that? In (Earth’s) academia the metric of success is usually peer-reviewed publications, not real-world usefulness. Even what we call, say, “applied econometrics” or “applied statistics” is more often than not “applied” only in a limited, strictly academic sense; when you apply econometrics to investigate the effect of economic growth on democracy that is unlikely to have any detectable effect on economic growth or democracy.

So, in Anathem you find this bizarre alien world where intellectuals are physically confined, isolated from the rest of the world, and barred from using technology. And yet that world feels familiar: as a (current or former) scholar you won’t react to it the way other people do. If you go check the reviews on Goodreads you’ll see lots of people complaining that the Maths are unrealistic. To you, however, Maths will sound eerily natural; Anathem would be more alien to you if the Maths were, say, engineering schools.

(Needless to say, the allegory only goes so far, as Arbre’s avout are legally forbidden from having any real-world impact; having no choice in the matter, they don’t lose any sleep over the purely academic nature of their work. And of course people do produce lots of useful research at Earth’s universities.)

2. You will feel right at home even though the story is set in an alien world (II).

The way an Arbran avout progresses in his or her mathic career is entirely different from the way an Earthly scholar progresses in his or her academic career - and yet way too familiar. In Arbre you start by being collected at around age 10. That makes you a “fid” and you will be mentored and taught by the more senior avout, each of whom you will respectfully address as “pa” or “ma”. When you reach your early twenties you choose - and are chosen by - a specific mathic order. There are many such orders, each named after the avout who founded it - there are the Edharians, the Lorites, the Matharrites, and so on, each with specific liturgies and beliefs.

The avout are not allowed to have any contact with the outside world (the “extramuros”) except at certain regular intervals: one year (the Unarian maths), ten years (the Decenarian maths), one hundred years (the Centenarian maths), or one thousand years (the Millenarian maths). And only for ten days (those days are called “Apert”). You can get collected by any math - Unarian, Decenarian, Centenarian, or Millenarian. If you get collected, say, at a Unarian math, and you show a lot of skill and promise, you can get upgraded (“Graduated”) to a Decenarian math. If you keep showing skill and promise you can get Graduated to a Centenarian math. And so on. The filter gets progressively stricter; only very few ever get Graduated to the Millenarian maths.

So, the reward for being isolated from the outside world and focusing intensely on your research is… getting even more isolated from the outside world so that you can focus even more intensely on your research. Sounds familiar?

3. Anathem gives you vocabulary for all things academia.

Think back to your Ph.D. years and remember the times you went out with your fellow fids for drinks (well, if you were actual fids you wouldn’t be able to leave your math - you could, but then you wouldn’t be able to go back, except during Apert - but never mind that). Weird conversations (from the point of view of those overhearing them) ensued and you got curious looks from waiters and from other customers.

Why? Because you spoke in the jargon of your field - you used non-ordinary words and you used ordinary words in non-ordinary ways. Like “instrumental” or “endogeneity” or “functional programming”. Not only that: the conversations were speculative and obeyed certain unwritten rules, like Occam’s razor. Clearly these were not the same conversations you have with non-avout - your college friends, your family, your Tinder dates. And yet you call all of them “conversations”. Well, not anymore; Anathem gives you a word for inter-avout conversation about mathic subjects: Dialog. Neal Stephenson goes as far as creating a taxonomy of Dialog types:

Dialog, Peregrin: A Dialog in which two participants of roughly equal knowledge and intelligence develop an idea by talking to each other, typically while out walking around.

Dialog, Periklynian: A competitive Dialog in which each participant seeks to destroy the other’s position (see Plane).

Dialog, Suvinian: A Dialog in which a mentor instructs a fid, usually by asking the fid questions, as opposed to speaking discursively.

Dialog: A discourse, usually in formal style, between theors. “To be in Dialog” is to participate in such a discussion extemporaneously. The term may also apply to a written record of a historical Dialog; such documents are the cornerstone of the mathic literary tradition and are studied, re-enacted, and memorized by fids. In the classic format, a Dialog involves two principals and some number of onlookers who participate sporadically. Another common format is the Triangular, featuring a savant, an ordinary person who seeks knowledge, and an imbecile. There are countless other classifications, including the suvinian, the Periklynian, and the peregrin.

(Anathem, pp. 960-961)

(Yes, there is a glossary in Anathem.)

You can’t get much more precise than that without being summoned to a Millenarian math.

Dialog is just one example. You left academia? You went Feral.

Feral: A literate and theorically minded person who dwells in the Sæculum, cut off from contact with the mathic world. Typically an ex-avout who has renounced his or her vows or been Thrown Back, though the term is also technically applicable to autodidacts who have never been avout.

(Anathem, p. 963)

You left academia to go work for the government? You got Evoked.

Voco: A rarely celebrated aut by which the Sæcular Power Evokes (calls forth from the math) an avout whose talents are needed in the Sæcular world. Except in very unusual cases, the one Evoked never returns to the mathic world.

(Anathem, p. 976)

Reviewer #2 says your argument is not original? He’s a Lorite.

Lorite: A member of an Order founded by Saunt Lora, who believed that all of the ideas that the human mind was capable of coming up with had already been come up with. Lorites are, therefore, historians of thought who assist other avout in their work by making them aware of others who have thought similar things in the past, and thereby preventing them from re-inventing the wheel.

(Anathem, p. 967)

Got friends or family who are not academics? Well, ok, J. K. Rowling has already given us a word for that - muggles. But in some languages that word gets super offensive translations - in Brazilian Portuguese, for instance, they made it “trouxas”, which means “idiots”. Not cool, Harry Potter translators. But worry not, Neal Stephenson gives us an alternative that’s only a tiny bit offensive: “extras” (from “extramuros” - everything outside the maths).

Extra: Slightly disparaging term used by avout to refer to Sæcular people.

(Anathem, p. 963)

That cousin of yours who believes the Earth is flat? He is a sline.

Sline: An extramuros person with no special education, skills, aspirations, or hope of acquiring same, generally construed as belonging to the lowest social class.

(Anathem, p. 973)

And of course, what happens to a scholar who gets expelled from academia? He gets anathematized.

Anathem: (1) In Proto-Orth, a poetic or musical invocation of Our Mother Hylaea, used in the aut of Provener, or (2) an aut by which an incorrigible fraa or suur is ejected from the mathic world.

(Anathem, pp. 956-957)

And so on and so forth. Frankly, it’s amazing that academics manage to have any Dialogs whatsoever without having read Anathem.

(I must note that Neal Stephenson not only puts these words in the book’s glossary, he uses them extensively throughout the book - there are 40 occurrences of “evoked”, 90 occurrences of “Dialog”, and 57 occurrences of “sline”, for instance. And because there is a glossary at the end he doesn’t bother to define these words in the main text, he just uses them. Which can make your life difficult if, like me, you didn’t bother to skim the book before reading it and only found out about the glossary after you had finished. Damn Kindle.)

4. Anathem might be the push you need to quit social media for good.

I’ve been reading Cal Newport’s Deep Work, about the importance of focusing hard and getting “in the zone” in order to be productive. (Well, “reading” is inaccurate. I bought the audio version and I’ve been listening to it while driving - which is not without irony.) There isn’t a whole lot of novelty there - it’s mostly common sense advice about “unplugging” for at least a couple of hours each day so you can get meaningful work done (meaningful work being work that imposes some mental strain, as opposed to replying to emails or attending meetings). The thing is, at a certain point, much to my amusement and surprise, Cal Newport mentions Neal Stephenson.

As Cal Newport tells us, Neal Stephenson is a known recluse. He doesn’t answer emails and he is absent from social media. To Newport, that helps explain Stephenson’s productivity and success. (No, I won’t engage you in a long Periklynian Dialog about how we can’t establish causality based on anecdotal evidence. That’s not the point and in any case Cal Newport, despite being an avout himself - he’s a computer science professor at Georgetown - is trying to reach an audience of extras and Ferals.) I had read other Neal Stephenson books before (Cryptonomicon, Snow Crash, The Diamond Age, REAMDE, Seveneves), but I had never bothered to google the man, so I had no idea how he lived. After Cal Newport’s mention, though, I think Anathem is a lot more personal than it looks. Among its many messages maybe there is Neal Stephenson telling us “see? this is what can be achieved when smart people are locked up and cut off from the world”. “What can be achieved” being, in Neal Stephenson’s case (and brilliantly recursively), a great novel about what can be achieved when smart people are locked up and cut off from access to the world.

5. Anathem may be an extreme version of what happens when people turn against science.

Flat-Earthers and anti-vaxxers are back. People who don’t know what a standard deviation is pontificate freely and publicly about the scientific evidence of climate change. Violent gangs openly oppose free speech at universities. I’m not saying these slines are about to lock up Earth’s scientists in monasteries, but perhaps the Temnestrian Iconography is getting more popular.

“[…] Fid Erasmas, what are the Iconographies and why do we concern ourselves with them?” […]

“Well, the extras—”

“The Sæculars,” Tamura corrected me.

“The Sæculars know that we exist. They don’t know quite what to make of us. The truth is too complicated for them to keep in their heads. Instead of the truth, they have simplified representations— caricatures— of us. Those come and go, and have done since the days of Thelenes. But if you stand back and look at them, you see certain patterns that recur again and again, like, like— attractors in a chaotic system.”

“Spare me the poetry,” said Grandsuur Tamura with a roll of the eyes. There was a lot of tittering, and I had to force myself not to glance in Tulia’s direction. I went on, “Well, long ago those patterns were identified and written down in a systematic way by avout who make a study of extramuros. They are called Iconographies. They are important because if we know which iconography a given extra— pardon me, a given Sæcular— is carrying around in his head, we’ll have a good idea what they think of us and how they might react to us.”

Grandsuur Tamura gave no sign of whether she liked my answer or not. But she turned her eyes away from me, which was the most I could hope for. “Fid Ostabon,” she said, staring now at a twenty-one-year-old fraa with a ragged beard. “What is the Temnestrian Iconography?”

“It is the oldest,” he began.

“I didn’t ask how old it was.”

“It’s from an ancient comedy,” he tried.

“I didn’t ask where it was from.”

“The Temnestrian Iconography…” he rebegan.

“I know what it’s called. What is it?”

“It depicts us as clowns,” Fraa Ostabon said, a little brusquely. “But… clowns with a sinister aspect. It is a two-phase iconography: at the beginning, we are shown, say, prancing around with butterfly nets or looking at shapes in the clouds…”

“Talking to spiders,” someone put in. Then, when no reprimand came from Grandsuur Tamura, someone else said: “Reading books upside-down.” Another: “Putting our urine up in test tubes.”

“So at first it seems only comical,” said Fraa Ostabon, regaining the floor. “But then in the second phase, a dark side is shown— an impressionable youngster is seduced, a responsible mother lured into insanity, a political leader led into decisions that are pure folly.”

“It’s a way of blaming the degeneracy of society on us— making us the original degenerates,” said Grandsuur Tamura. “Its origins? Fid Dulien?”

“The Cloud-weaver, a satirical play by the Ethran playwright Temnestra that mocks Thelenes by name and that was used as evidence in his trial.”

“How to know if someone you meet is a subscriber to this iconography? Fid Olph?”

“Probably they will be civil as long as the conversation is limited to what they understand, but they’ll become strangely hostile if we begin speaking of abstractions…?”

(Anathem, pp. 71-72)


This is it. Go read Anathem and tell your fellow avout and Ferals about it. See you at Apert.

doing data science when you live in a failed state

Brazil is the undisputed world leader in homicides: over 50 thousand a year, which is more than Europe, Oceania, the United States, Russia, and China combined. Yes, combined. Yes, the whole of freaking Europe. Yes, the supposedly gun-loving United States. Yes, China with its 1.3 billion people. Brazil beats these continents and countries by 4,473 homicides, which is roughly equivalent to Uganda or to ten Canadas. No, I’m not making these numbers up. Take a moment to let that sink in.

As you might guess, a country with lots of homicides also tends to have lots of robbery. I’d love to take my MacBook Pro to a coffee shop and work there all day like I used to when I was in grad school - back when I lived in lovely, safe, Columbus, Ohio. But if I do that in Brasília I’ll probably come back home empty handed (if I come back home at all). You can’t parade Apple gear around when you live in a failed state.

I finally got tired of working from home all weekend, so I decided to enable SSH and HTTP connections into my home network, so I can use my Mac remotely as if it were an AWS server. That way I can go to the coffee shop with my old, cheap Lenovo - or even a tablet or smartphone - and use it to connect to my MacBook, which will remain safe and sound back home. It took some doing and I imagine others may be going through the same problem (i.e., wanting to work at a coffee shop but living in an episode of The Walking Dead), so here’s a how-to.

My setup is: Humax HG100R-L2 modem (that’s what most clients of NET - Brazil’s largest cable company - have), AirPort Extreme Base Station router, MacBook Pro. Your setup will likely differ, but you can probably tweak the instructions here to fit whatever you have.

step 1: your modem

If you have both a modem and a router then the easiest way to go about this is to put your modem in ‘bridge mode’. That means disabling your modem’s advanced settings and delegating them to your router. That way you only need to worry about router settings. You won’t need to worry about complex interactions between your modem settings and router settings.

Head to http://192.168.0.1/ on your browser. You should see the page below.

If you’ve never changed them, your id and password are ‘admin’ and ‘password’ respectively. Sign in. You should see the following, except with your WiFi network name and password shown under “SSID(2.4GHz)” and “Senha” respectively. (Your password will be shown in plain characters, not as a bunch of dots, so don’t let your neighbors peek.) (Yes, Humax’s settings are in a mix of Portuguese and English. It beats me too.)

Click “Advanced Network Settings” (lower right corner). You should see something like this:

Click on “Definir” (between “Status” and “Back Up”, second column from the left). You should see a page with a few more options than the following one (that’s because your modem is not in bridge mode yet).

On the “Modo Switch” menu, choose “Bridge”, then click “Aplicar”. Click “ok” on whatever confirmation pop-up appears. This will make you go offline for a couple of minutes, as your modem resets itself. Wait until it’s back online and voilà, your modem is now in bridge mode.

(If you ever need to tweak your modem settings again, it’s no longer http://192.168.0.1/ but http://192.168.100.1)

step 2: your router

On to your router now. We need to tell it to accept incoming SSH and HTTP connections. In order to do that we need to tell your router to map those types of connections to specific ports.

On your Mac, open the AirPort Utility app.

Click on the AirPort Extreme picture to go into your router’s settings and go to the ‘Network’ tab. You should see something like this:

We’ll make a lot of changes here. First, on the “Router Mode” dropdown menu, choose “DHCP and NAT” if that’s not the chosen value already. Then click the “+” button near “DHCP Reservations”. That will open a small page. You’ll make it look like the one below by selecting the exact same choices. (To do that you’ll need to know your MAC address, which you can find on your Mac by going into “System Preferences”, “Network”, “Advanced”; it’s the combination of digits you see right next to “Wi-Fi Address”.) When everything matches, click “Save”.

Now you’re back to this:

Click the “+” button near “Port Settings”. A small page will pop up. Tweak all the fields so that it looks exactly like this:

Click “Save”. Then click the “+” button near “Port Settings” again. The same small page will pop up. Make it look exactly like this:

Click “Save”. Then click “Update”. Your router will go crazy for a moment as it does its magic. Wait until it comes back up online and voilà, you have allowed SSH and HTTP connections into your home network. SSH connections will be forwarded to port 22 and HTTP connections will be forwarded to port 8080.
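If you want to confirm that the two port mappings actually work, a tiny script can try to open a TCP connection to each port. What follows is only a sketch under a few assumptions: port_is_open is a name I made up, the check only succeeds once something on the Mac is listening (Remote Login for port 22, a server such as Jupyter for port 8080, both covered in the steps that follow), and it is best run from outside your home network, since many routers won’t loop a connection to their own public address back in.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection performs the TCP handshake; we close the socket right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts alike.
        return False
```

To use it, call port_is_open(myipaddress, 22) and port_is_open(myipaddress, 8080), where myipaddress is your home network’s public IP address. A False result cannot tell you whether the router, the firewall, or the Mac itself is the culprit; it only tells you the connection did not go through.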

step 3: your Mac

This part is simple. Go to “System Preferences”, “Sharing”, and enable Remote Login:

If your firewall is active then you need to tell it to allow incoming traffic through ports 22 and 8080. This can be a bit tricky and it depends on your OS version. This may help. Alternatively, you can take the lazy and insecure path of simply disabling your firewall altogether (“System Preferences”, “Security and Privacy”, “Firewall”).

step 4: your IP address

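You need to know your home network’s public IP address (the one your router presents to the outside world, which the port mappings from step 2 then forward to your MacBook) so you can access the Mac from the outside. This should tell you. Write it down.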

My experience with NET in Brazil (and with TimeWarnerCable in the US) is that IP addresses don’t change that often. But they do sometimes. If that bothers you, you may ask that your cable provider give you a static IP address (they may charge a small fee for that). (EDIT: alternatively, you can use a Dynamic DNS service - like this; h/t Thompson Marzagão.)
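If you’d rather not pay for a static IP or set up Dynamic DNS, a poor man’s alternative is to have the Mac check its public address every now and then and tell you when it changes. Here is a rough sketch, not a finished tool. The function names and the state file are hypothetical, and api.ipify.org is just one public service that returns the caller’s IP address as plain text (an assumption on my part; any equivalent service would do).

```python
import ipaddress
from pathlib import Path
from urllib.request import urlopen

# Assumption: this service answers with the caller's public IP as plain text.
IP_SERVICE_URL = "https://api.ipify.org"


def fetch_public_ip(url: str = IP_SERVICE_URL) -> str:
    """Ask an external service which public IP address our requests come from."""
    with urlopen(url, timeout=10) as response:
        text = response.read().decode("ascii").strip()
    # Raises ValueError if the service returned something that is not an IP address.
    return str(ipaddress.ip_address(text))


def ip_changed(current_ip: str, state_file: Path) -> bool:
    """Store current_ip in state_file; return True if it differs from the stored value."""
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(current_ip + "\n")
    # The very first run has nothing to compare against, so it reports no change.
    return previous is not None and previous != current_ip
```

You could run something like ip_changed(fetch_public_ip(), Path.home() / ".last_public_ip") from a scheduled job on the Mac and have it email you (or write to a shared note) whenever it returns True. How you get notified is up to you; the sketch only does the detection.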

step 5: your coffee shop

Take whatever cheap, inconspicuous piece of hardware you have at hand to your favorite coffee shop. Launch a terminal and do ssh myusername@myipaddress, where myusername is the username you normally use to log into your Mac and myipaddress is the IP address you wrote down in step 4. Enter your password and that’s it, you are now inside your Mac. You can cd into different directories, run code, do whatever you want.

If your coffee shop hardware is a tablet or smartphone, Termius is a terrific SSH client for mobile devices.

step 6 (optional): your data science

Wondering why I made you enable HTTP connections? Well, here comes the really fun part: Jupyter notebooks. You can start a Jupyter server on your Mac and then, with your coffee shop cheapoware, use your browser to write code interactively and have it run on your Mac. Jupyter’s default language is Python but you can install kernels for an increasingly large number of languages, like R and Julia.

On your Mac, do pip install jupyter to install Jupyter and then do jupyter notebook --ip='0.0.0.0' --port='8080' --no-browser to start the Jupyter server. You’ll be given a url. Something like http://0.0.0.0:8080/?token=sfdsfs90809809s8dfs0df8sdf. Replace 0.0.0.0 with myipaddress (see step 4). That’s the address you’ll use at the coffee shop to launch Jupyter notebooks.
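If you don’t trust yourself to edit that url by hand on a tiny screen, the swap can be scripted. This is just a sketch: remote_jupyter_url is a made-up helper, and 203.0.113.10 is a placeholder address standing in for myipaddress.

```python
from urllib.parse import urlsplit, urlunsplit


def remote_jupyter_url(local_url: str, my_ip_address: str) -> str:
    """Swap the host in the url Jupyter prints, keeping the port, path, and token."""
    parts = urlsplit(local_url)
    # Keep the original port (8080 here) if the url has one.
    netloc = f"{my_ip_address}:{parts.port}" if parts.port else my_ip_address
    return urlunsplit(parts._replace(netloc=netloc))


# Placeholder IP address; use the one you wrote down in step 4.
print(remote_jupyter_url(
    "http://0.0.0.0:8080/?token=sfdsfs90809809s8dfs0df8sdf", "203.0.113.10"
))
# prints http://203.0.113.10:8080/?token=sfdsfs90809809s8dfs0df8sdf
```

Keep in mind that the token travels in the url over plain HTTP, so treat it like a password and don’t paste it anywhere public.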

(If your cheapoware is a laptop things should work right out-of-the-box. If it’s an iOS device then you have some additional steps to take - see here.)

step 7: your venti caramel macchiato

That’s it! You have now reduced your likelihood of getting mugged and minimized your losses in case you do get mugged. Time to grab your katana and go mingle with the locals.