I started working remote a few weeks ago. It’s sheer awesomeness - no distractions, no commute, no shabby cafeteria lunch. But there was one minor glitch: I was having trouble accessing my organization’s proprietary data. Being a government agency, we hoard tons of sensitive data on people - addresses, taxpayer IDs, personal income, etc. So, naturally, we restrict access to our databases. There is no URL I can go to when I need to run some database query; I need to be inside my organization’s network to run database queries.
I can use a VPN to log into my office PC from any machine. And once I’m logged into my office PC I can access whatever I want. But that means being forced to use my lame, old office PC instead of my sleek, fast MacBook Pro. Using the VPN also means enduring a maddening lag. It’s only a fraction of a second, but it’s noticeable and it can drive you insane. Finally, using the VPN means I have no local internet access - while I’m in the VPN my laptop has no other contact with the outside world, which results in my not being able to use Spotify and a bunch of other stuff. And I need my 90s Eurodance playlists to be in the proper mindset for writing code.
After enduring all that (I know, tiny violin…) for a couple of weeks I decided to do something about it. I realized that I needed a “bridge” between the data and my laptop. Some web service of sorts that could receive data requests and respond to them. Now, I’d never trust myself to build something like that. I don’t know nearly enough about information security to go about building that sort of tool. I don’t wanna be the guy who caused every Brazilian’s monthly income to be exposed to the world.
Then it hit me: I don’t need to build anything like that, these tools have already been built and we already use them to exchange sensitive data. I’m talking about messaging apps like Slack, Telegram, and the like. My office PC can access Slack. My personal laptop can access Slack. Slack is what my team uses for communication, which means sensitive information already circulates through it. In sum, the middleman I needed was already in place. All I had to do was repurpose it. And that’s what I did. What follows below is a brief account of how I did it, in case other people are in the same situation.
step 1: stuff that doesn’t involve code
The first thing you need to create are the Slack channels that will handle the data requests. I chose to create two - #incoming, to receive the data requests, and #outgoing, to send the requested data. I made them private, so as not to annoy my teammates with notifications and messages they don’t need to see. Alternatively, you could create an entirely new Slack workspace; that way you isolate human messaging from bot messaging.
Once you’ve created the channels you’ll need to create a Slack bot. It’s this bot that will: a) read the data requests that arrive on #incoming; and b) post the requested data to #outgoing. Slack lets you choose between two types of bots: “app bots” and “custom bots”. They nudge you towards the former but the latter is a lot more straightforward to set up: just click here, click “Add Configuration”, and follow the instructions. When you’re done, write down your bot’s API token (it’s the string that starts with xoxb-) and, in your Slack workspace, invite your bot to join #incoming and #outgoing.
step 2: testing your bot
We need to make sure that your Slack bot can read from #incoming and post to #outgoing.
Let’s start with reading. There are a number of ways to go about this - Slack has a number of APIs. I think the Web API is the best pick for the impatient. Now, the documentation doesn’t have a quickstart or many useful examples. The explanations are verbose and frustratingly unhelpful if you just want to “get it done” quickly. So instead of making you read the docs I’ll just give you what you need to know: make a GET request to https://slack.com/api/groups.history?token=xoxb-your-bot-token&channel=id_of_incoming, where xoxb-your-bot-token is the token you wrote down in step 1 and id_of_incoming is the ID of the #incoming channel (it’s the last segment of the channel’s URL). That will return to you the channel’s messages (up to 100 messages). If there are no messages in #incoming you won’t get anything interesting back. If that’s the case, just post anything to the channel first.
In real life you won’t be using Terminal/cmd for this, you’ll be using a Python or R script or something along these lines. Here’s how to do that in Python:
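A minimal sketch of that GET request, using only Python's standard library (that's my assumption here; the requests package would do just as well). The `fetch` argument is a hypothetical hook, not part of Slack's API, that makes the function easy to try out without a network connection. Note that Slack has since deprecated groups.history in favor of conversations.history, so treat the endpoint as the one named in this post rather than the current recommendation.

```python
import json
import urllib.parse
import urllib.request

HISTORY_URL = "https://slack.com/api/groups.history"


def read_messages(token, channel_id, oldest=None, fetch=None):
    """Return the list of message dicts found in a private channel."""
    params = {"token": token, "channel": channel_id}
    if oldest is not None:
        # Only ask for messages newer than this Slack timestamp.
        params["oldest"] = oldest
    url = HISTORY_URL + "?" + urllib.parse.urlencode(params)
    if fetch is None:
        # Default behavior: a real GET request to Slack.
        def fetch(target):
            with urllib.request.urlopen(target) as reply:
                return reply.read()
    response = json.loads(fetch(url))
    if "messages" in response:
        return response["messages"]
    return []
```

You would call it as `read_messages("xoxb-your-bot-token", "id_of_incoming")`, with your own token and channel ID in place of the placeholders.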
What you get back is a Python dict which should have a key named ‘messages’. So, if 'messages' in response, then inside response['messages'] you’ll find a list of dicts, each dict being a message, each dict’s key being an attribute of said message (timestamp, text, user who posted it, etc).
Now, you don’t want to access #incoming’s entire history every time you poll it. You can include a parameter named oldest in the params dict and assign a timestamp to it. Then read_messages won’t return messages older than the specified timestamp.
(A little gotcha: what you pass as channel is not the channel’s name but the channel’s ID, which you can get from its URL. Some Slack methods do accept the channel’s name but I never remember which ones, so it’s easier to just use the channel’s ID for everything.)
(Because you went with a custom bot instead of an app bot you won’t have to deal with a bunch of error messages having to do with something Slack calls “scope”. You won’t waste two days in a mad loop of trying to get the scopes right, failing, cursing Slack, refusing to read the API documentation, failing, cursing Slack, refusing to read the API documentation. I envy you.)
Alright then, let’s move on to posting. Here’s how you do it: make a POST request to https://slack.com/api/chat.postMessage, using your bot’s token, your channel’s ID, and the text of the message as payload. Like this:
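A minimal sketch of that POST request, again with the standard library only (my assumption, not necessarily how the original script did it). Building the request is split from sending it so that the payload can be inspected without touching the network; `build_post_request` is a helper name I made up for this sketch.

```python
import json
import urllib.parse
import urllib.request

POST_URL = "https://slack.com/api/chat.postMessage"


def build_post_request(token, channel_id, text):
    """Assemble the POST request: token, channel ID and text as form payload."""
    payload = urllib.parse.urlencode(
        {"token": token, "channel": channel_id, "text": text}
    ).encode("utf-8")
    return urllib.request.Request(POST_URL, data=payload, method="POST")


def post_to_outgoing(token, channel_id, text):
    """Send the message and return Slack's JSON reply as a dict."""
    request = build_post_request(token, channel_id, text)
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())
```

Calling `post_to_outgoing("xoxb-your-bot-token", "id_of_outgoing", "hey macarena")` with your real token and the ID of #outgoing sends the message.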
There. Once you run this code you should see the message “hey macarena” appear in #outgoing.
step 3: receiving #incoming messages
Ok, now you need a server-side program that will check #incoming for new messages - say, every five seconds or so. By server-side I mean it will run inside your company’s network; it needs to run from a machine that has access to your company’s databases. Here’s an example:
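A minimal sketch of such a listener. It assumes you hand it two callables of your own: `read_messages`, which accepts an `oldest` timestamp and returns a list of message dicts, and `handle`, which does something with each message. The `max_polls` argument is a hypothetical convenience for trying the loop out; leave it as None to poll forever.

```python
import time


def listen(read_messages, handle, interval=5, max_polls=None):
    """Poll for new messages every `interval` seconds and pass each to `handle`."""
    last_seen = "0"  # Slack timestamps are strings like "1500000000.000123"
    polls = 0
    while max_polls is None or polls < max_polls:
        messages = read_messages(oldest=last_seen)
        # Sort by timestamp so older messages are handled first,
        # whatever order they arrive in.
        for message in sorted(messages, key=lambda m: float(m["ts"])):
            handle(message)
            last_seen = message["ts"]
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

For a first test `handle` can simply be `print`. Remembering the last timestamp seen is what keeps the listener from re-reading the channel's whole history on every poll.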
Now, you probably want this “listener” to run in the background, so that you can log off without killing it. If you’re running it on a Linux machine the simplest solution is to use tmux. It lets you create multiple “sessions” and run each session in the background. If you’re doing it on a Windows machine you can use cygwin or, if that’s Windows 10, you can use tmux with the native Ubuntu binaries.
step 4: processing #incoming messages
Receiving messages is not enough, your script needs to do something about them. The simple, quick-and-dirty solution is to have your #incoming messages be the very database queries you want to run. An #incoming message could be, say, SELECT [some_column] FROM [some].[table] WHERE [some_other_column] = 42. Then the listener (the server-side program we created before) would read the query and use an ODBC package - like pyodbc or RODBC - to run it. If that works for you, here’s how you’d amend the listener we created before to have it handle SQL queries:
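A minimal sketch of the query-running part. To keep it runnable anywhere it takes a `connect` callable (a name I'm introducing for this sketch) that returns a DB-API connection, and it uses sqlite3 as a stand-in. With pyodbc you would pass something like `lambda: pyodbc.connect("DSN=name_of_your_database")` instead, assuming that DSN exists on your machine.

```python
import sqlite3


def run_query(query, connect):
    """Run a SELECT and return the result set as a list of dicts."""
    connection = connect()
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        # cursor.description holds one entry per column; the name comes first.
        # This assumes the query is a SELECT (otherwise description is None).
        columns = [column[0] for column in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        connection.close()


def connect_to_stand_in():
    # Stand-in for the real database: an empty in-memory SQLite database.
    return sqlite3.connect(":memory:")


print(run_query("SELECT 42 AS answer", connect_to_stand_in))
```

In the listener, the function that handles each message would then call `run_query` on the message's text and print whatever comes back.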
Ok, I’m glossing over a bunch of details here. First you’ll need to set up an ODBC driver, which isn’t always easy to get right the first time - it depends on what SQL engine you have (SQL Server, MySQL, etc), on whether your script is running on Linux or Windows, and on what credentials you’re using to connect to your SQL engine. I can’t really help you out on this, you’ll have to google your way around. If you’ve never set up an ODBC connection before this is probably the part that’s going to take up most of your time.
Once the ODBC part is taken care of, leave the script above running and post some SQL query on #incoming. You should see the result set of the query. Well done then, everything is working so far.
step 5: replying to #incoming messages
So you have a script that receives queries and executes them. Now your script needs to post the result sets to #outgoing. There really isn’t much mystery here - we already wrote post_to_outgoing above. The only thing left is to convert our result set into a string, so that Slack can accept it. In Python the json module handles that for us: json.dumps(your_data) takes a list or dict (or list of dicts, or dict of lists) and turns it into a string. It’s all below.
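A minimal sketch of that last step. It assumes `run_query` and `post_to_outgoing` are passed in as callables that take just the query string and just the text, respectively (say, with the connection details and the Slack token already bound). The try/except and the `default=str` argument are my additions: the first reports a failed query back on Slack instead of killing the listener, the second lets json.dumps cope with dates and decimals.

```python
import json


def reply_to_query(message, run_query, post_to_outgoing):
    """Run the query found in a Slack message and post the result as JSON text."""
    try:
        rows = run_query(message["text"])
        # default=str turns values json can't serialize (dates, decimals) into strings.
        text = json.dumps(rows, default=str)
    except Exception as error:
        # Report the problem on #outgoing rather than crashing.
        text = json.dumps({"error": str(error)})
    post_to_outgoing(text)
    return text
```

On the laptop side, `json.loads` applied to the message text gives you the list of dicts back.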
Ta-da. As long as this script is running continuously inside your company’s network you no longer need a VPN to query name_of_your_database. If you want more flexibility you can tweak run_query so that it takes the name of the database as a second argument. And you should sprinkle try/except statements here and there to capture database errors and the like.
You’ve taken remote work one step further. It’s not only you who can work remote now: the applications you develop no longer need to live inside your company’s network. You can develop on whatever machine and environment you choose and have your applications post their queries to #incoming and retrieve the result sets from #outgoing.
One gotcha here: Slack automatically breaks up long messages, so if your query exceeds Slack’s maximum length it will be truncated and run_query will probably (hopefully) raise an error. Keep it short.
step 6: make it neat
Alright, you have a functioning “bridge” between you and your company’s databases. But that’s still a crude tool, especially if you will develop applications on top of it. You don’t want your apps to post raw SQL queries on Slack - that’s a lot of unnecessary characters being passed back and forth. Instead of a run_query function you should have a get_data function that stores a “template” of the query and only adds to it, say, the part that comes after the WHERE [something] = . Something like this:
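A minimal sketch of the idea, with sqlite3 standing in for the real database and with made-up table and column names. The template lives on the server side and the Slack message only carries the value that goes after the `=`. Passing that value as a query parameter (the `?` placeholder, which pyodbc also accepts) instead of gluing strings together is my choice here, not something the setup requires.

```python
import sqlite3

# The fixed part of the query; only the value after "=" comes from Slack.
QUERY_TEMPLATE = "SELECT some_column FROM some_table WHERE some_other_column = ?"


def get_data(connection, value):
    """Fill the template with `value` and return the matching column values."""
    cursor = connection.cursor()
    cursor.execute(QUERY_TEMPLATE, (value,))
    return [row[0] for row in cursor.fetchall()]


# Stand-in database with made-up contents, just to exercise get_data.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE some_table (some_column TEXT, some_other_column INTEGER)"
)
connection.executemany(
    "INSERT INTO some_table VALUES (?, ?)", [("a", 42), ("b", 7)]
)
print(get_data(connection, 42))
```

With this in place an #incoming message can be as short as `42`.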
This is still too crude to be called an API but it’s a first step in that direction. The idea is to make the Slack-your_app interface as tight as possible, so as to minimize the types of errors you will encounter and to minimize the exchange of unnecessary strings. If you know exactly the sort of stuff that will be passed to get_data it’s easier to reason about what the code is doing.
Today it’s been three years since I first started working as a data scientist in the Brazilian government. Overall it’s been a great experience and I think this is a good time to reflect upon what I’ve learned so far. I’ll start with the ugly and the bad and then I’ll move on to the good.
bureaucrats won’t share data unless physically compelled to do so
There is no shortage of decrees saying that government agencies will give their data to other government agencies when requested to do so (here’s the latest one). But between the pretty text of the law and what really goes on in the intestines of the bureaucracy there is a huge gap. Every government agency is in favor of data sharing - except when it comes to its own data.
Excuses abound. I can’t share my data because it’s too sensitive. I can’t share my data because extracting it would be too costly. I can’t share my data because we’re in the middle of a major IT restructuring and things are too messy right now. I can’t share my data because we’ve lost its documentation. I can’t share my data because the IT guy is on vacation. I can’t share my data because you might misinterpret it (this is my favorite).
The actual reasons are not always easy to pinpoint. Sometimes there are legitimate concerns about privacy, as in the case of fiscal data. But this is rare. More often than not the alleged privacy concerns are just a convenient excuse. In some cases the privacy excuse is used even when the data is already public. For instance, everything the government buys (other than, say, spy gear) goes in the official bulletin, which anyone can read. The equivalent of the IRS in Brazil - Receita Federal - has all the corresponding tax invoices neatly collected and organized in a database. That database would make it much easier to compute, say, the average price the Brazilian government pays for ballpoint pens. You’d think that database would be readily available not just for the entire government but for the citizenry as well.
You’d be wrong. Government agencies have been trying - and failing - to get their hands on that database for years. Our IRS says it’s protected by privacy laws. But the law says that government purchases must be public. And they already are, it’s all in the official bulletin - but that’s unstructured data that would require lots of scraping, OCRing, and parsing. The data is already public but not in a machine-readable format. That’s why the IRS database is so valuable. But the IRS legal folks haven’t found their match yet.
(Meanwhile data that is actually sensitive - like people’s addresses and tax returns - can be bought for $10 from shady vendors not too hard to find; all it takes is a walk along Rua 25 de Março, in São Paulo. In Brazil if the government has any data on you then you can be sure it’s for sale at Rua 25 de Março.)
Sometimes bureaucrats don’t share data because they worry that the data will make them look bad. What if my data shows that I’m paying too much for office paper? What if my data shows that I have way more people than I need? What if my data shows that my agency is a complete waste of taxpayers’ money and should not exist? I suppose I understand the survival instinct behind such concerns, but those are precisely the cases where we absolutely must get the data, for that’s where the ugliest inefficiencies are. In general the more an agency refuses to share its data the more important it is to get it.
(It doesn’t help that I work for the Office of the Comptroller-General, whose main goal is to find and punish irregularities involving taxpayers’ money. People sometimes assume that our asking for their data is the prelude of bad things. We need to explain that inside the Office of the Comptroller-General there is this little unit called the Observatory of Public Spending and that our goal at the OPS is often to help other agencies by extracting insights from their data.)
government data is a mess
The idea that you should document your databases is something that hasn’t taken root in the Brazilian government. Nine times out of ten all you get is a MSSQL dump with no accompanying data dictionary. You try to guess the contents based on the names of the tables and columns, but these are usually uninformative (like D_CNTR_IN_APOSTILAMENTO or SF_TB_SP_TAB_REM_GDM_PST - both real-life examples). So you end up guessing based on the contents themselves. As in: if it’s an 11-digit numeric field then that’s probably the Brazilian equivalent of the Social Security Number.
As you might imagine, sometimes you guess wrong. You mistake quantity for total price and vice-versa and the resulting unit prices don’t make any sense. You think that a time field is in years but it’s actually in months. You mistake an ID field for a numeric field. Etc etc. You’re lucky when you catch the errors before whatever you’re doing becomes policy. And some fields you just leave unused because they are too mysterious and you can’t come up with any reasonable guesses. Like when it’s probably a dummy variable because the values are all 0s and 1s but you have no clue what the 0s and 1s mean.
Besides the lack of documentation there are also many errors. Null names, misclassification, missing data, typos (according to one database the Brazilian government signed an IT contract of US$ 1 quadrillion back in 2014 - that’s 26 times the budget of the entire US government). To give you an idea, half the government purchases classified as “wheeled vehicles” are under R$ 1,000 (roughly US$ 300); when we inspect the product descriptions we see that they are not actually vehicles but spare parts, which have a different code and should have been classified elsewhere.
The problem begins in the data generation process, i.e., in the systems bureaucrats use to enter the data. These systems are too permissive; they lack basic validation like checking input type (numeric, text, date, etc), input length (does the state code have more than two characters?), and the like. And there is no punishment for the bureaucrat who enters incorrect data.
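To make "basic validation" concrete, here is a minimal sketch of the kind of checks I mean. The field names (state_code, total_price, purchase_date) and the date format are hypothetical; they don't come from any actual government system.

```python
from datetime import datetime


def validate_purchase(record):
    """Return a list of problems found in one purchase record (empty if none)."""
    errors = []

    # Input length: a state code should be exactly two letters.
    state = record.get("state_code", "")
    if not (isinstance(state, str) and len(state) == 2 and state.isalpha()):
        errors.append("state_code must be exactly two letters")

    # Input type: the price must be a positive number.
    try:
        if float(record.get("total_price")) <= 0:
            errors.append("total_price must be positive")
    except (TypeError, ValueError):
        errors.append("total_price must be numeric")

    # Input type: the date must actually parse as a date.
    try:
        datetime.strptime(record.get("purchase_date", ""), "%Y-%m-%d")
    except (TypeError, ValueError):
        errors.append("purchase_date must be a YYYY-MM-DD date")

    return errors
```

A system that refused to save any record for which this list is non-empty would already catch a good share of the problems described here.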
The most frequent result is absurd averages. You try to compute, say, spending per employee, and what you get back is a bazillion dollars or something close to zero. That means many hours of data cleaning before we can really get started. You have to do all the sanity checks that the government systems fail to do - or else it’s garbage in, garbage out. After you’ve filtered out the US$ 1 quadrillion IT contracts and the US$ 0 hospitals and schools you are left with lots of missing data - and that’s often not missing at random, which poses additional problems. It’s no wonder that most of our academic work is about cleaning up data (like here and here).
recruiting is a different ball game
The Brazilian government does not deliberately recruit data scientists. Data scientists come in through a bunch of miscellaneous doors and then we find each other - by word-of-mouth or Twitter or conferences - and come up with ways to work together. By now there are a few well-known “data science places” in the government - like the OPS (where I work) and the TCU - and data-curious government employees have been flocking to them, by various means; but there isn’t a clear policy to attract data talent.
In order to enter the Brazilian government you usually have to pass a public exam and such exams do not cover any content even remotely related to data. What you do need to learn to pass these exams is a large number of arcane pieces of legislation, mostly relating to government procedures - like the three phases of a procurement process, what they are called, what the deadlines are, what the many exceptions are, and so on. As you may imagine, that doesn’t usually attract people interested in data. A few of us slip through the cracks somehow, but that’s largely by accident.
That makes recruiting for your team a lot harder than in the private sector, where you can simply post a job ad on LinkedIn and wait for applications. In the government you normally can’t recruit people that are not already in the government. It goes more or less like this: You start by identifying the person you want to bring to your team. He or she will usually be in another government agency, not in your own. You will need to negotiate with his or her agency so that they okay their coming to work for you. That may require you to find someone in your agency willing to go work for them. Such negotiations can last for months and they look a lot like an exchange of war prisoners (I was “traded” myself once and it’s not fun). If the head of your agency (your minister or whatever) has sufficient political clout (and knows that you exist), he or she may try to prevail over his or her counterpart at the other agency. Either way, there’s no guarantee that you’ll succeed.
If it’s hard for the recruiter it’s even harder for the person being recruited. They need to tell their current agency that they no longer want to work there, but they have no guarantee that they will get transferred. Imagine telling your significant other that you want to break up but then being somehow legally compelled to stay in the relationship. It can be awkward. You’re unlikely to get a promotion if your bosses know that you don’t want to work there. They may start giving you less important tasks. Such possibilities often kill any recruitment before it even begins.
There is also the issue of salary negotiations. The issue being that they are not possible. When you work for the government the law determines your salary - there is no room for negotiation. Sometimes you can offer a potential candidate a little extra money if they agree to take up some administrative responsibilities but this is usually under US$ 1000/month and most data scientists prefer to avoid having any responsibilities that involve paperwork. So whoever you are trying to lure must be really excited by the work you and your team do because the work itself is pretty much all you have to offer.
But enough with the bad and the ugly.
you have a lot of freedom to experiment
Paradoxically, in the midst of this red tape jungle we have a lot of leeway to play around and try new things. Data science is a highly technical subject, one that takes at least a few Coursera courses to even begin to grasp, and that helps keep it unregulated. We have to fill out the same stupid forms everyone else does when we want to buy stuff, but whether we use Hadoop or not, whether we adopt Python or R for a project, whether we go with an SVM or a neural network or both, and whether we think any given project is worth pursuing is all entirely up to us. Legal doesn’t have a say in any of that. The budget folks don’t have a say in any of that. The minister himself - heck, the president himself - doesn’t have a say in any of that. They wouldn’t even know where to start. So, thanks to the highly specialized nature of our trade we don’t have higher-ups trying to micromanage what we do.
There is also the tenure factor. You see, in Brazil once you enter the civil service you get automatically tenured after three years. And the constitution says that, once tenured, you need to do something really outrageous to get fired - and even then there are several appeal instances and often times nothing happens in the end. I bet that if I showed up naked for work tomorrow I still wouldn’t get fired; I might get a written warning or something along these lines, and I’d probably appear in the local news, but I would still have my job. It takes something outright criminal to get a government employee fired. Like, they need to catch you taking a bribe or not showing up for work for months. And even then sometimes people don’t get fired.
Overall tenure is bad: too many lazy idiots spend their days browsing Facebook and entertaining themselves with hallway gossip. But for experimenting purposes tenure is great. It makes “move fast and break things” possible. Some bureaucrats want our assistance and happily give us their data and collaborate with us, helping us understand their problems and needs. But other bureaucrats get upset that you’re even daring to ask for their data. And they worry that you might disrupt the way they work or that you might automate them altogether. If we had to worry about our jobs at every step of the way we wouldn’t accomplish much. Without tenure heads might roll.
you can help taxpayers; a lot
The Brazilian government is humongous - it takes up ~40% of the country’s GDP. Most of that money goes down the toilet: the government overpays for everything it buys; it contracts suppliers that do not deliver the goods and services they promised; corruption is generalized. But data science can help. For instance, at the OPS we have trained a model that predicts whether a supplier is likely to become a headache (say, because it won’t deliver or because it will shut down). (Here’s a talk I gave about it earlier this year.) We’re now refining that model so that we can later appify it and plug it into the government’s procurement system. That way the government will be alerted of potential problems before it signs the contract.
That project has taken a lot of time and effort - the first version of the model was the master’s research of my colleague Leonardo and since then he and other people have put a lot more work into it. That’s a lot of salary-hours. But if a single problematic contract of small size - say, US$ 100k or so - is prevented because of the model then all that effort will have been worth it. And given the size of the government’s budget - around US$ 1 trillion a year - we should be able to save a lot more money than US$ 100k. That’s money that could go back to taxpayers.
is it worth it?
If you can stomach the ugly and the bad and are excited about the good, then yes. :-)
If you do academic research but are not affiliated with an academic institution you probably know Sci-Hub. It gives you access to over 60 million research papers - for free (no ads, no malware, no scams). Alexandra Elbakyan, its creator, has deservedly been ranked by Nature as one of the top ten most relevant people in science and we independent researchers owe her a lot.
You’d think that such an invention would be welcomed by most people who are not Elsevier executives. You’d think that such an invention would be particularly welcomed at organizations that do not have an Elsevier subscription. You’d be wrong. In the Brazilian government, where I work, Sci-Hub is not only not welcomed, it is actively blocked. The firewall doesn’t let me access it.
That’s Portuguese for “Blocked content! Science is illegal/unethical, so screw yourself.” (Sort of.)
This week I finally got tired of that nonsense - dammit, I’m a data scientist, I need academic papers not only for the research I do on the side but also, and mainly, for my day job. So I decided to build an interface to Sci-Hub - an app that takes my search string, gives it to Sci-Hub, and retrieves the results. Much like I did before in order to use Telegram.
Writing the code was easy enough, it’s a simple web app that does just one thing. I wrote it on Thursday evening and I was confident that the next morning I would just fire up a new project on Google App Engine, deploy the code, and be done with it in less than an hour. Oh, the hubris. I ended up working on it all Friday and all Saturday morning; only at 12:43pm on Saturday did the damned thing go live.
What follows is an account of those 36 hours, largely for my own benefit in case I run into the same issues again in the future, but also in case it may be helpful to other people also looking to unblock Sci-Hub. I’m also writing this because I think those 36 hours are a good illustration of the difference between programming, on the one hand, and software development, on the other, which is something I struggled to understand when I first started writing code. Finally, I’m writing this because those 36 hours are a good example of the inefficiencies introduced when sysadmins (or their bosses) decide to block useful resources.
If you inspect the HTML code behind Sci-Hub you can see it’s really easy to scrape:
All you have to do is send a POST request. If Sci-Hub’s repository has the paper you are looking for, you get it in a PDF file.
So I built this minimal web app that sends a POST request to Sci-Hub and then emails me back the PDF. I chose email because getting and returning each paper takes several seconds and I didn’t want the app blocked by each request. With email I can have a background process do the heavy work; that way I can send several POST requests in a row without having to wait in-between.
I gave my app the grandiose name of Sci-Hub Liberator.
(Sci-Hub Liberator’s front-end. This is what happens when data scientists do web development.)
The other script is the backend. It is launched by the frontend with a call to subprocess.Popen. That way all requests are independent and run on separate background processes. The backend uses Python’s requests package to send the POST request to Sci-Hub, then BeautifulSoup to comb the response and find the link to the paper’s PDF, then requests again to fetch the PDF.
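The handoff from frontend to backend can be sketched roughly like this. The script name backend.py and the idea of passing the search string as a command-line argument are assumptions of mine for illustration; the only thing taken from the app itself is the use of subprocess.Popen, which returns immediately instead of waiting for the child process to finish.

```python
import subprocess
import sys


def launch_backend(search_string, backend_command=None):
    """Start the backend in its own process and return without waiting for it."""
    if backend_command is None:
        # Hypothetical: the backend is a script sitting next to the frontend.
        backend_command = [sys.executable, "backend.py"]
    # Popen does not block, so the frontend can take the next request right away.
    return subprocess.Popen(backend_command + [search_string])
```

Each call spawns a separate process, which is why several requests can be in flight at once.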
The backend then uses Python’s own email package to email me the PDF.
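A minimal sketch of the emailing step with the standard library's email package (the Python 3 API). The subject line, body text, and function name are made up for illustration, and the sketch only builds the message; actually sending it would take something like smtplib's `send_message` and a mail server of your own.

```python
from email.message import EmailMessage


def build_paper_email(sender, recipient, pdf_bytes, filename="paper.pdf"):
    """Build an email that carries the PDF as an attachment."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Sci-Hub Liberator: " + filename
    message.set_content("The paper you requested is attached.")
    # Attach the raw PDF bytes with the right MIME type.
    message.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return message
```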
Both scripts combined had 151 lines of code. Not exactly a “Hello, World!” application but not too far from it either.
a word of caution
Before I proceed I must ask you not to abuse Sci-Hub’s easily scrapable interface. That’s an amazing service they’re providing to the world and if you send thousands of requests in a row you may disrupt their operations. I trust that they have defenses against that (or else Elsevier would have taken them down long ago), but still, please don’t fuck up.
Code written and tested, I turned to Google App Engine for hosting the app. With only 151 lines of code and two scripts I thought that launching the app would be a breeze. Silly me.
I wanted to use Python 3, but Google App Engine Launcher is only compatible with Python 2. I google around and it seems that they are deprecating GAE Launcher in favor of the Google Cloud SDK. Pity. GAE Launcher was a nifty little app that made deployment really easy. I had been using it since 2013 and it allowed me to focus on my app and not on deployment nonsense.
Resigned to my fate, I downloaded the Google Cloud SDK installer and… installation failed due to an SSL-related problem. It took some half an hour of googling and debugging before I could get it to work.
things don’t change
GAE’s standard environment only allows Python 2. You can only use Python 3 in GAE’s flexible environment. And the flexible environment is a different ball game.
I had never used the flexible environment before (I think it only became generally available early this year), but I decided to give it a try. To make a long story short, I couldn’t make it work. The exact same code that works fine on my machine returns a mysterious Application startup error when I try to deploy the app. The deploy attempt generates a log file but it is equally uninformative, it only says Deployment failed. Attempting to cleanup deployment artifacts.
Despite hours of tinkering and googling I couldn’t find out what the problem is. I declared all my dependencies in my requirements.txt file (and I pointed to the same versions I was using locally); I configured my app.yaml file; I made sure that all of my dependencies’ dependencies were allowed. I didn’t know what else to look into.
Eventually I gave up in despair and decided to fall back on GAE’s standard environment, which meant reverting to Python 2. That was a bummer - it’s 2017, if GAE’s standard environment needs to choose between 2 and 3 then it’s probably time to pick 3 (assuming there is a way to do that without killing all existing Python 2 projects).
Vendoring didn’t work for BeautifulSoup. Even though I used pip install and not pip3 install what got installed was BeautifulSoup’s Python 3 version. That resulted in from bs4 import BeautifulSoup raising ImportError: No module named html.entities.
After several unsuccessful attempts to point pip install to a specific source file I gave up on pip. I tested my Mac’s system-wide Python 2 installation and BeautifulSoup was working just fine there. So I went to my Mac’s site-packages and just copied the damned bs4 folder into my app’s lib folder. That did the trick. It’s ugly and it doesn’t shed any light on the causes of the problem but by then it was Friday afternoon and I was beginning to worry this deployment might take the whole day (if only!).
GAE has long been my default choice for hosting applications and I’ve always known that it doesn’t allow calls to the operating system. It’s a “serverless” platform; you don’t need to mess with the OS, which means you also don’t get to mess with the OS. So I can’t really explain why I based the frontend-backend communication on a call to subprocess.Popen, which is a call to the OS. That’s just not allowed on GAE. Somehow that synapse simply didn’t happen in my brain.
back to the code
GAE has its own utilities for background tasks - that’s what the Task Queue API is for. It looks great and one day I want to sit down and learn how to use it. But by the time I got to this point I was entering the wee hours of Saturday. My hopes of getting it all done on Friday were long gone and I just wanted a quick fix that would let me go to bed.
So I rewrote my app to have it show the PDF on the screen instead of emailing it. That meant I would have to wait for one paper to come through before requesting another one. At that hour I was tired enough to accept it.
The change was pretty easy - it involved a lot more code deletion than code writing. It also obviated the need for a backend, so I put everything into a single script. But the wait for the PDF to be rendered was a little too much and I thought that a loading animation of sorts was required. I couldn’t find a way to do that using only cherrypy/HTML/CSS, so I ended up resorting to jQuery, which made my app a lot less lean.
Sci-Hub is smart
After getting rid of the OS calls I finally managed to deploy. I then noticed a requests-related error message. After some quick googling I found out that GAE doesn’t play well with requests and that you need to monkey-patch it. Easy enough, it seemed.
After the patching, requests seemed to work (as in: not raising an exception) but all the responses from Sci-Hub came back empty. The responses came through, and with status code 200, so the communication was happening. But there was no content - no HTML, no nothing.
I thought that it might be some problem with the monkey-patching, so I commented out requests and switched to urllib2 instead. No good: same empty responses. I commented out urllib2 and tried urlfetch. Same result. As per the official documentation I had run out of packages to try.
I thought it might have to do with the size of the response - maybe it was too large for GAE’s limits. But no, the papers I was requesting were under 10MB and the limit for the response is 32MB:
I had briefly considered the possibility of this being a user-agent issue: maybe Sci-Hub just doesn’t deal with bots. But everything worked fine on my machine, so that couldn’t be it.
Then it hit me: maybe the user-agent string on GAE is different from the user-agent string on my machine. I got a closer look at the documentation and found this:
To test my hypothesis I re-ran the app on my machine but appending +http://code.google.com/appengine; appid: MY_APP_ID to my user-agent string. Sure enough, Sci-Hub didn’t respond with the PDF. Oddly though, I did get a non-empty response - some HTML code with Russian text about Sci-Hub (its mission, etc; or so Google Translate tells me). Perhaps Sci-Hub checks not only the request’s user-agent but also some other attribute like IP address or geographical location. One way or the other, I was not going to get my PDF if I sent the request from GAE.
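For anyone who wants to repeat that kind of check, here is a minimal sketch of the idea, not the code I actually ran. It is written for Python 3 with only the standard library (the app itself was on Python 2), it builds the request without sending it, and build_request, the example URL, the base user-agent, and the single space used as a separator are all assumptions for illustration.

```python
import urllib.request

# Suffix quoted above; MY_APP_ID is a placeholder, not a real app ID.
GAE_SUFFIX = "+http://code.google.com/appengine; appid: MY_APP_ID"


def build_request(url, base_user_agent, mimic_gae=False):
    """Build (but do not send) a request, optionally mimicking GAE's user-agent."""
    user_agent = base_user_agent
    if mimic_gae:
        # Assumption: the suffix is appended after a single space.
        user_agent = "{} {}".format(base_user_agent, GAE_SUFFIX)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL and base user-agent, used only to show the two variants.
plain = build_request("https://example.org/paper", "Mozilla/5.0")
gae_like = build_request("https://example.org/paper", "Mozilla/5.0", mimic_gae=True)

# urllib stores header names capitalized as "User-agent".
print(plain.get_header("User-agent"))
print(gae_like.get_header("User-agent"))
```

Sending both requests from the same machine and comparing the responses isolates the user-agent string from everything else (IP address, location) that differs between a laptop and GAE.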
At that point it was around 3am and I should probably have gone to bed. But I was in the zone. The world disappeared around me and I didn’t care about sleeping or eating or anything else. I was one with the code.
So instead of going to bed I googled around looking for ways to fool GAE and keep my user-agent string intact. I didn’t find anything of the kind, but I found Tom Tasche.
back to the code (again)
I decided to steal Tom’s idea. Turns out GCE has a micro-instance that you can use for free indefinitely (unlike AWS’s micro instance, which ceases to be free after a year). It’s not much to look at - 0.6GB of RAM - but hey, have I mentioned it’s free?
I rewrote my code (again). I went back to having the frontend and backend in separate scripts. But now instead of having the backend be a Python script called by subprocess.Popen I had it be an API. It received the user input and returned the corresponding PDF.
I put this new backend in a GCE micro-instance and kept the frontend at GAE. I also promoted my backend’s IP from ephemeral to static, lest my app stop working out of the blue.
I was confident that this was it. I was finally going to bed. Just a quick test to confirm that this would work and then I’d switch off.
I tested the new architecture and… it failed. It takes a long time for the GCE instance to send the PDF to the GAE frontend and that raises a DeadlineExceededError. You can tweak the timeout limit by using urlfetch.set_default_fetch_deadline(60) but GAE imposes a hard limit of 60 seconds - if you choose a higher number your choice is just ignored. And I needed more than 60 seconds.
back to the code (yet again)
At that point I had an epiphany: I was already using a GCE instance anyway, so why not have the backend write the PDF to disk in a subprocess - so as not to block or delay anything - and have it return just the link to the PDF? That sounded genius and if it weren’t 6am I might have screamed in triumph.
That only required a minor tweak to the code:
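To give an idea of the shape of that tweak, here is a minimal sketch rather than the app’s actual code. It is written for Python 3 (the app itself ran on Python 2), and PDF_DIR, BASE_URL, WORKER, and request_paper are all hypothetical names; the worker just writes placeholder bytes where the real one would fetch the paper.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical values: the directory the web server exposes and its public URL.
PDF_DIR = tempfile.mkdtemp()
BASE_URL = "http://203.0.113.10/pdfs"  # placeholder address from a documentation range

# Stand-in for the real fetching script: writes placeholder bytes to the given path.
WORKER = (
    "import sys\n"
    "with open(sys.argv[1], 'wb') as f:\n"
    "    f.write(b'%PDF-1.4 placeholder')\n"
)


def request_paper(paper_id):
    """Start the slow work in a child process and return the link right away."""
    filename = "{}.pdf".format(paper_id)
    path = os.path.join(PDF_DIR, filename)
    # Popen returns immediately; the child keeps writing the file in the background.
    proc = subprocess.Popen([sys.executable, "-c", WORKER, path])
    return "{}/{}".format(BASE_URL, filename), proc


link, proc = request_paper("example-paper")
print(link)   # the caller gets the link without waiting for the file
proc.wait()   # only needed here so the example exits cleanly
```

The response now carries a short string instead of a 10MB file, so the frontend’s 60-second deadline only has to cover the time it takes to start the child process.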
No DeadlineExceededError this time. Instead I got a MemoryError. It seems that 0.6GB of RAM is not enough to handle 10MB objects (10MB is the space the PDF occupies on disk; things usually take up more space in memory than on disk). So much for my brilliant workaround.
the end of fiscal responsibility
The cheapest non-free GCE instance has 1.7GB of RAM and costs ~$14.97 a month. I got bold and launched it (I looked into AWS EC2’s roughly equivalent instance and it wasn’t any cheaper: $34.96). At last, after a painful all-nighter, my app was alive.
I mean, I still haven’t added any error checking, but that’s deliberate - I want to see what happens when Sci-Hub can’t find the paper I requested or is temporarily down or whatnot. I’ll add the error checks as the errors happen.
I’ll hate paying that $14.97 but it beats not having access to a resource that is critical for my work. The only alternative I see is to rescue my old Lenovo from semi-retirement and that would be annoying on several grounds (I don’t have a static IP address at home, I would need to leave it up and running all day, it would take up physical space, and so on). So for now $14.97 a month is reasonable. At least that money is not going to Elsevier.
Now that I’m paying for a GCE instance anyway I could move all my code to it (and maybe go back to having a single script) and be done with GAE for this project. But I have this vague goal of making this app public some day, so that other people in my situation can have access to Sci-Hub. And with GAE it’s easy to scale things up if necessary. That isn’t happening any time soon though.
things I learned
It’s not so fun to pull an all-nighter when you are no longer in grad school - we get used to having a stable schedule. But I don’t regret having gone through all these steps in those 36 hours. I used Google Compute Engine for the first time and I liked it. I’m used to AWS EC2’s interface and GCE’s looked a lot more intuitive to me (and I found out GCE has a free micro-instance; even though I ended up not using it for this project it may come in handy in the future). I also familiarized myself with gcloud, which I would have to do anyway at some point. And I also learned a thing or two about cherrypy (like the serve_fileobj method, which makes it really easy to serve static files from memory).
Those 36 hours were also a useful reminder of the difference between programming and software development. Programming is about learning all the things you can do with the tools your languages provide. Software development is largely about learning all the things you cannot do because your runtime environment won’t let you. Our Courseras and Udacities do a great job of teaching the former but we must learn the latter by ourselves, by trial and error and by reading the documentation. I’m not sure that it could be otherwise: loops and lambdas are fundamental concepts that have been with us for decades, but the quirks of GAE’s flexible environment will probably have changed completely in a year or less. Any course built around GAE (or GCP in general or AWS) would be obsolete too soon to make it worth it.
This paper uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS is based on 42 million news articles from 6,043 different sources and covers all independent countries in the 1993-2012 period. Unlike the democracy indices we have today the ADS is replicable and has standard errors small enough to actually distinguish between cases.
I just read Neal Stephenson’s 2008 novel Anathem and now I walk around pestering everyone I know telling them to read it too. Well, not everyone: just people who are or have been in academia. Judging from Goodreads reviews everyone else finds the novel too long and theoretical, full of made up words, and full of characters who are too detached from the real world to be believable. Pay no heed to the haters - here’s why you academic types should read Anathem:
1. You will feel right at home even though the story is set in an alien world (I).
The planet is called Arbre and its history and society are not radically different from Earth’s. Except that at some point (thousands of years before the story begins) the people of Arbre revolted against science and confined their intellectuals to monasteries where the development and use of technology is severely limited - no computers, no cell phones, no internet, no cameras, etc -, as is any contact with the outside world. Inside these monasteries (“Maths”) the intellectuals (the “avout”) dedicate themselves to the study and development of mathematics, physics, and philosophy. As the use of technology is restricted, all that research is purely theoretical.
Arbre’s Maths are therefore an allegory for Earth’s universities. How many of our papers and dissertations end up having any (non-academic) impact? Maybe 1% of them? Fewer than that? In (Earth’s) academia the metric of success is usually peer-reviewed publications, not real-world usefulness. Even what we call, say, “applied econometrics” or “applied statistics” is more often than not “applied” only in a limited, strictly academic sense; when you apply econometrics to investigate the effect of economic growth on democracy that is unlikely to have any detectable effect on economic growth or democracy.
So, in Anathem you find this bizarre alien world where intellectuals are physically confined and isolated from the rest of the world and can’t use technology and yet that world feels familiar and as a (current or former) scholar you won’t react to that in the same way other people do. If you go check the reviews on Goodreads you’ll see lots of people complaining that the Maths are unrealistic. To you, however, Maths will sound eerily natural; Anathem would be more alien to you if the Maths were, say, engineering schools.
(Needless to say, the allegory only goes so far, as Arbre’s avout are legally forbidden from having any real-world impact; having no choice in the matter, they don’t lose any sleep over the purely academic nature of their work. And of course people do produce lots of useful research at Earth’s universities.)
2. You will feel right at home even though the story is set in an alien world (II).
The way an Arbran avout progresses in his or her mathic career is entirely different from the way an Earthly scholar progresses in his or her academic career - and yet way too familiar. In Arbre you start by being collected at around age 10. That makes you a “fid” and you will be mentored and taught by the more senior avout, each of whom you will respectfully address as “pa” or “ma”. When you reach your early twenties you choose - and are chosen by - a specific mathic order. There are many such orders, each named after the avout who founded it - there are the Edharians, the Lorites, the Matharrites, and so on, each with specific liturgies and beliefs.
The avout are not allowed to have any contact with the outside world (the “extramuros”) except at certain regular intervals: one year (the Unarian maths), ten years (the Decenarian maths), one hundred years (the Centenarian maths), or one thousand years (the Millenarian maths). And only for ten days (those days are called “Apert”). You can get collected by any math - Unarian, Decenarian, Centenarian, or Millenarian. If you get collected, say, at a Unarian math, and you show a lot of skill and promise, you can get upgraded (“Graduated”) to a Decenarian math. If you keep showing skill and promise you can get Graduated to a Centenarian math. And so on. The filter gets progressively stricter; only very few ever get Graduated to the Millenarian maths.
So, the reward for being isolated from the outside world and focusing intensely on your research is… getting even more isolated from the outside world so that you can focus even more intensely on your research. Sounds familiar?
3. Anathem gives you vocabulary for all things academia.
Think back to your Ph.D. years and remember the times you went out with your fellow fids for drinks (well, if you were actual fids you wouldn’t be able to leave your math - you could, but then you wouldn’t be able to go back, except during Apert - but never mind that). Weird conversations (from the point of view of those overhearing them) ensued and you got curious looks from waiters and from other customers.
Why? Because you spoke in the jargon of your field - you used non-ordinary words and you used ordinary words in non-ordinary ways. Like “instrumental” or “endogeneity” or “functional programming”. Not only that: the conversations were speculative and obeyed certain unwritten rules, like Occam’s razor. Clearly these were not the same conversations you have with non-avout - your college friends, your family, your Tinder dates. And yet you call all of them “conversations”. Well, not anymore; Anathem gives you a word for inter-avout conversation about mathic subjects: Dialog. Neal Stephenson goes as far as creating a taxonomy of Dialog types:
Dialog, Peregrin: A Dialog in which two participants of roughly equal knowledge and intelligence develop an idea by talking to each other, typically while out walking around.
Dialog, Periklynian: A competitive Dialog in which each participant seeks to destroy the other’s position (see Plane).
Dialog, Suvinian: A Dialog in which a mentor instructs a fid, usually by asking the fid questions, as opposed to speaking discursively.
Dialog: A discourse, usually in formal style, between theors. “To be in Dialog” is to participate in such a discussion extemporaneously. The term may also apply to a written record of a historical Dialog; such documents are the cornerstone of the mathic literary tradition and are studied, re-enacted, and memorized by fids. In the classic format, a Dialog involves two principals and some number of onlookers who participate sporadically. Another common format is the Triangular, featuring a savant, an ordinary person who seeks knowledge, and an imbecile. There are countless other classifications, including the suvinian, the Periklynian, and the peregrin.
(Anathem, pp. 960-961)
(Yes, there is a glossary in Anathem.)
You can’t get much more precise than that without being summoned to a Millenarian math.
Dialog is just one example. You left academia? You went Feral.
Feral: A literate and theorically minded person who dwells in the Sæculum, cut off from contact with the mathic world. Typically an ex-avout who has renounced his or her vows or been Thrown Back, though the term is also technically applicable to autodidacts who have never been avout.
(Anathem, p. 963)
You left academia to go work for the government? You got Evoked.
Voco: A rarely celebrated aut by which the Sæcular Power Evokes (calls forth from the math) an avout whose talents are needed in the Sæcular world. Except in very unusual cases, the one Evoked never returns to the mathic world.
(Anathem, p. 976)
Reviewer #2 says your argument is not original? He’s a Lorite.
Lorite: A member of an Order founded by Saunt Lora, who believed that all of the ideas that the human mind was capable of coming up with had already been come up with. Lorites are, therefore, historians of thought who assist other avout in their work by making them aware of others who have thought similar things in the past, and thereby preventing them from re-inventing the wheel.
(Anathem, p. 967)
Got friends or family who are not academics? Well, ok, J. K. Rowling has already given us a word for that - muggles. But in some languages that word gets super offensive translations - in Brazilian Portuguese, for instance, they made it “trouxas”, which means “idiots”. Not cool, Harry Potter translators. But worry not, Neal Stephenson gives us an alternative that’s only a tiny bit offensive: “extras” (from “extramuros” - everything outside the maths).
Extra: Slightly disparaging term used by avout to refer to Sæcular people.
(Anathem, p. 963)
That cousin of yours who believes the Earth is flat? He is a sline.
Sline: An extramuros person with no special education, skills, aspirations, or hope of acquiring same, generally construed as belonging to the lowest social class.
(Anathem, p. 973)
And of course, what happens to a scholar who gets expelled from academia? He gets anathematized.
Anathem: (1) In Proto-Orth, a poetic or musical invocation of Our Mother Hylaea, used in the aut of Provener, or (2) an aut by which an incorrigible fraa or suur is ejected from the mathic world.
(Anathem, pp. 956-957)
And so on and so forth. Frankly, it’s amazing that academics manage to have any Dialogs whatsoever without having read Anathem.
(I must note that Neal Stephenson not only puts these words in the book’s glossary, he uses them extensively throughout the book - there are 40 occurrences of “evoked”, 90 occurrences of “Dialog”, and 57 occurrences of “sline”, for instance. And because there is a glossary at the end he doesn’t bother to define these words in the main text, he just uses them. Which can make your life difficult if, like me, you didn’t bother to skim the book before reading it and only found out about the glossary after you had finished. Damn Kindle.)
4. Anathem might be the push you need to quit social media for good.
I’ve been reading Cal Newport’s Deep Work, about the importance of focusing hard and getting “in the zone” in order to be productive. (Well, “reading” is inaccurate. I bought the audio version and I’ve been listening to it while driving - which is not without irony.) There isn’t a whole lot of novelty there - it’s mostly common sense advice about “unplugging” for at least a couple of hours each day so you can get meaningful work done (meaningful work being work that imposes some mental strain, as opposed to replying to emails or attending meetings). The thing is, at a certain point, much to my amusement and surprise, Cal Newport mentions Neal Stephenson.
As Cal Newport tells us, Neal Stephenson is a known recluse. He doesn’t answer emails and he is absent from social media. To Newport, that helps explain Stephenson’s productivity and success. (No, I won’t engage you in a long Periklynian Dialog about how we can’t establish causality based on anecdotal evidence. That’s not the point and in any case Cal Newport, despite being an avout himself - he’s a computer science professor at Georgetown - is trying to reach an audience of extras and Ferals.) I had read other Neal Stephenson books before - Cryptonomicon, Snow Crash, The Diamond Age, REAMDE, Seveneves - but I had never bothered to google the man, so I had no idea how he lived. After Cal Newport’s mention, though, I think Anathem is a lot more personal than it looks. Among its many messages maybe there is Neal Stephenson telling us “see? this is what can be achieved when smart people are locked up and cut off from the world”. “What can be achieved” being, in Neal Stephenson’s case (and brilliantly recursively), a great novel about what can be achieved when smart people are locked up and cut off from access to the world.
5. Anathem may be an extreme version of what happens when people turn against science.
Flat-Earthers and anti-vaxxers are back. People who don’t know what a standard deviation is pontificate freely and publicly about the scientific evidence of climate change. Violent gangs openly oppose free speech at universities. I’m not saying these slines are about to lock up Earth’s scientists in monasteries, but perhaps the Temnestrian Iconography is getting more popular.
“[…] Fid Erasmas, what are the Iconographies and why do we concern ourselves with them?” […]
“Well, the extras—”
“The Sæculars,” Tamura corrected me.
“The Sæculars know that we exist. They don’t know quite what to make of us. The truth is too complicated for them to keep in their heads. Instead of the truth, they have simplified representations— caricatures— of us. Those come and go, and have done since the days of Thelenes. But if you stand back and look at them, you see certain patterns that recur again and again, like, like— attractors in a chaotic system.”
“Spare me the poetry,” said Grandsuur Tamura with a roll of the eyes. There was a lot of tittering, and I had to force myself not to glance in Tulia’s direction.
I went on, “Well, long ago those patterns were identified and written down in a systematic way by avout who make a study of extramuros. They are called Iconographies. They are important because if we know which iconography a given extra— pardon me, a given Sæcular— is carrying around in his head, we’ll have a good idea what they think of us and how they might react to us.”
Grandsuur Tamura gave no sign of whether she liked my answer or not. But she turned her eyes away from me, which was the most I could hope for. “Fid Ostabon,” she said, staring now at a twenty-one-year-old fraa with a ragged beard. “What is the Temnestrian Iconography?”
“It is the oldest,” he began.
“I didn’t ask how old it was.”
“It’s from an ancient comedy,” he tried.
“I didn’t ask where it was from.”
“The Temnestrian Iconography…” he rebegan.
“I know what it’s called. What is it?”
“It depicts us as clowns,” Fraa Ostabon said, a little brusquely. “But… clowns with a sinister aspect. It is a two-phase iconography: at the beginning, we are shown, say, prancing around with butterfly nets or looking at shapes in the clouds…”
“Talking to spiders,” someone put in. Then, when no reprimand came from Grandsuur Tamura, someone else said: “Reading books upside-down.” Another: “Putting our urine up in test tubes.”
“So at first it seems only comical,” said Fraa Ostabon, regaining the floor. “But then in the second phase, a dark side is shown— an impressionable youngster is seduced, a responsible mother lured into insanity, a political leader led into decisions that are pure folly.”
“It’s a way of blaming the degeneracy of society on us— making us the original degenerates,” said Grandsuur Tamura. “Its origins? Fid Dulien?”
“The Cloud-weaver, a satirical play by the Ethran playwright Temnestra that mocks Thelenes by name and that was used as evidence in his trial.”
“How to know if someone you meet is a subscriber to this iconography? Fid Olph?”
“Probably they will be civil as long as the conversation is limited to what they understand, but they’ll become strangely hostile if we begin speaking of abstractions…?”
(Anathem, pp. 71-72)
This is it. Go read Anathem and tell your fellow avout and Ferals about it. See you at Apert.