finding books to read

Despite all the hype about AI, product recommendations still suck. Amazon offers me the exact same Bose headphones I bought last month, plus two hundred close substitutes. Spotify thinks that because I enjoy Eurodance from the 1990s I will also enjoy Brazilian music from the 1990s. Netflix thinks that because I enjoy Black Mirror I will also enjoy Dear White People - and that’s despite Netflix knowing that I’m a fan of Rick & Morty, South Park, and any show that has Ricky Gervais. Goodreads thinks that because I enjoyed Ready Player One I will also enjoy Spell or High Water even though I’ve already read Spell or High Water and rated it only one star on Goodreads itself. No wonder there is zero peer-reviewed evidence that Cambridge Analytica swayed even a single voter in 2016. The singularity is nowhere near.

I could write to each of those companies or annoy them on Twitter or whatnot. But I believe in the DIY ethic.

I chose to start with Goodreads. It has by far the most egregious recommender system. And you hear people talk about their favorite TV shows or self-tracking wristbands all the time but you don’t always get to hear people talk about their favorite books.

“do the simple thing first”

I used something called user-based collaborative filtering, which is fancy language for “find other people who like the sort of stuff you like and then check what else they also like.”

There are lots of ways to do user-based collaborative filtering. Some algorithms only require grade school math while others rely on deep learning. (Aggarwal’s Recommender Systems is by far the best textbook on the subject that I know of, in case you’re interested.) Here I used something that only requires grade school math and can’t even be called an “algorithm”: cosine similarity.

But before I talk about cosine similarity let me talk about the data we need to collect in order to use it.

scraping

You start by collecting, for every item you ever rated, all other ratings it has received. And you also collect the identity (name or user ID) of all the other raters - all the people who have rated items in common with you. So I had to collect, for each book I ever rated on Goodreads, all other ratings that book received on Goodreads, along with the user IDs of the other raters.

Unfortunately, Goodreads doesn’t let us see all the ratings each book has received. Goodreads lets us see the full distribution of ratings: forty thousand 5-star ratings, twenty thousand 4-star ratings, and so on (see below). But it doesn’t let us see all the ratings themselves. And if I can’t see a rating I can’t know whose rating it is, so for our purposes here it might as well not exist.

How many ratings does Goodreads let us see? If you visit a book’s page - say, Ender’s Game page - you immediately see up to 30 ratings (fewer if it’s an obscure book that not even 30 people have bothered to rate). If the book has more than 30 ratings you can click “next” on the bottom of the page and that shows you 30 more ratings. You can keep clicking “next” until you get to page ten, which means you can get up to 300 ratings in total. That’s regardless of the book’s popularity: it can be Pride and Prejudice, which has 2.3 million ratings, and you still won’t be able to see beyond page ten.

Now, we can bypass that limit - to a point. We can filter the contents and ask to see just 1-star ratings or just 2-star ratings and so on. And each time we will get back up to ten pages (of 1-star ratings or 2-star ratings or whatever else we chose). Goodreads ratings go from 1 to 5 stars, integers only, so in practice this filtering lets us multiply five-fold the total number of ratings we can collect for each book: we increase it from 300 to 1500.

We can even go beyond 1500. See the “Sort” button in the above picture? We can sort the ratings from older to newer or from newer to older. Either way we still get back up to ten pages. So we can get 1500 ratings sorted from older to newer and then get 1500 ratings sorted from newer to older. That way we end up with 3000 ratings (some - or all - of which will be duplicates when the book is not popular enough to have received more than 3000 ratings in total).

I didn’t do the sorting thing though. It only occurred to me after I had done everything and I’m in no mood to go back and redo everything. So what I have is only up to 1500 ratings per book. Sorry, reviewer #2.

Alrigh then, enough talk. Here is the Python code I wrote to get all the ratings I ever gave on Goodreads. I used Selenium and Chrome. If you’re new to scraping then here’s a tutorial. You don’t need to be logged in to scrape Goodreads (though that reduces the size of the data a little: some Goodreads users choose to make their profiles available only to people who are logged in).

import time
import pickle
from selenium import webdriver
from bs4 import BeautifulSoup

# initialize browser
path_to_executable = '/path/to/chromedriver'
browser = webdriver.Chrome(path_to_executable)

# get my own reviews
my_url = 'https://www.goodreads.com/review/list/123456789-your-user-id-here?print=true'
browser.get(my_url)
time.sleep(5)
source = browser.page_source
soup = BeautifulSoup(source, 'lxml')

# get total number of pages
pagination_div = soup.find('div', {'id': 'reviewPagination'})
page_links = pagination_div.find_all('a')
last_page = page_links[-2] # page_links[-1] is "next >>"
total_pages = int(last_page.text)
print('total_pages:', total_pages)

my_reviews = {}
for page in range(1, total_pages + 1):

    # get page and extract table that contains the reviews
    new_page = my_url + '&page=' + str(page)
    browser.get(new_page)
    time.sleep(5)
    source = browser.page_source
    soup = BeautifulSoup(source, 'lxml')
    table = soup.find('table', {'id': 'books'})
    rows = table.find_all('tr', class_ = 'bookalike review')

    # for each review get book title, link, and my rating
    for i, row in enumerate(rows):
        title_column = row.find('td', class_ = 'field title')
        title = title_column.find('a').text
        href = title_column.find('a')['href']
        rating_column = row.find('td', class_ = 'field rating')
        rating_div = rating_column.find('div', class_ = 'value')
        rating_span = rating_div.find('span', class_ = 'staticStars')
        if rating_span.has_attr('title'):
            my_rating = rating_span['title']
            my_reviews[str(page) + '_' + str(i)] = (title, href, my_rating)

with open('my_reviews.pkl', mode = 'wb') as f:
    pickle.dump(my_reviews, f)

And here is the code I wrote to scrape all the ratings of every book I’ve ever rated. Now, this code takes a while to run - it’s getting every rating of every book I ever read -, so I used Chrome in headless mode.

import time
import pickle
from selenium import webdriver
from bs4 import BeautifulSoup

# load my own reviews
with open('my_reviews.pkl', mode = 'rb') as f:
    my_reviews = pickle.load(f)

# get book URLs
book_endpoints = [e[1] for e in my_reviews.values()]

# initialize browser
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('headless')
chrome_options.add_argument('log-level=3')
chrome_options.add_experimental_option('prefs', {'profile.default_content_setting_values.automatic_downloads': 1})
executable_path = '/path/to/chromedriver'
browser = webdriver.Chrome(executable_path = executable_path, chrome_options = chrome_options)

# get book reviews
all_reviews = {}
for i, book_endpoint in enumerate(book_endpoints):
    print(' ')
    print(i, book_endpoint)

    # get book id
    if '.' in book_endpoint:
        book_id = int(book_endpoint[11:book_endpoint.index('.')])
    elif '-' in book_endpoint:
        book_id = int(book_endpoint[11:book_endpoint.index('-')])
    else:
        book_id = book_endpoint[11:]

    # create empty list to store book's reviews
    all_reviews[book_id] = []

    source = ''

    # iterate through all possible ratings (1-5)
    for rating in range(1, 6):
        print('rating:', rating)

        # get source code of book's first page and soupify it
        browser.get('https://www.goodreads.com' + book_endpoint + '?rating=' + str(rating))
        time.sleep(5)
        source = browser.page_source
        soup = BeautifulSoup(source, 'lxml')

        page = 1
        while True:

            # get divs that contain the reviews
            reviews = soup.find_all('div', class_ = 'review')

            for review in reviews:

                # get endoint of reviewer's URL
                reviewer = review.find('a', class_ = 'user')['href']

                # get text
                text_div = review.find('div', class_ = 'reviewText stacked')
                if text_div:
                    text = text_div.find('span', class_ = 'readable').text
                else:
                    text = None

                # store review
                all_reviews[book_id].append((reviewer, rating, text))

            print('page:', page, 'book reviews:', len(set(all_reviews[book_id])))

            # go to next page
            if '<a class="next_page"' in source:
                try:
                    element = browser.find_element_by_partial_link_text('next ')
                    browser.execute_script("arguments[0].click();", element)
                    while browser.page_source == source:
                        time.sleep(1)
                    source = browser.page_source
                    soup = BeautifulSoup(source, 'lxml')
                    page += 1
                except:
                    print('"next" present but couldnt click it')
                    break
            else:
                print('no "next" element')
                break

with open('raters.pkl', mode = 'wb') as f:
    pickle.dump(all_reviews, f)

On Goodreads you can give a star-based rating, as we’ve discussed, but you can also write a review about the book. The code above gets these reviews too. I won’t use them here, but I may in a future blog post (it might be fun to inspect the adjectives most frequently associated with each book).

les data

In total I got 292,043 ratings from 197,548 different Goodreads users. That’s all from 357 books, which is the number of books I had rated by the time I did this. That’s all the data I need to find people who like the same sort of stuff I like.

In fact, that’s actually more data than I need. I have to discard everyone with whom I have only one book in common (more about that in a minute). After doing that I’m left with 43,825 raters and the following distribution:

As you see, the vast majority of these users have only a couple of books in common with me: they are concentrated in the 2-10 range of the X axis. Only a few users have rated more than 10 books in common with me. The maximum is 68 books. That’s Stephen.

Turns out Stephen and I have a similar sci-fi taste. We’re both into the classics - Heinlein, Bradbury, Asimov - and into more recent authors like Neal Stephenson, Charles Stross, and Ted Chiang. We’re both also into economic and moral philosophy: Friedman, Hayek, Emerson, Nietzsche.

But no, I can’t just go after the books Stephen has read and I haven’t. Life is not so simple. His interests are far more diverse than mine: he’s into horror, fantasy, comics, crime, and dozens of other genres, whereas I can go a whole year reading nothing but sci-fi. I’ve only read 2% of the books he’s read, but he’s read 24% of the books I’ve read. So it’s not like Stephen and I are “book twins”, it’s just that Stephen reads a lot, so inevitably he will look like many people’s “book twin” at first glance.

Not to mention that our ratings don’t necessarily coincide. For instance, Stephen is much more indulgent with slow-paced novels than I am. He 4-starred Olaf Stapledon’s Last and First Men and C.J. Cherryh’s Downbelow Station, both of which I abandoned after just a few chapters. I like books that hook you from the very first paragraph.

The moon blew up without warning and for no apparent reason. (Seveneves, Neal Stephenson)

I did two things on my seventy-fifth birthday. I visited my wife’s grave. Then I joined the army. (Old Man’s War, John Scalzi)

Lolita, light of my life, fire of my loins. (Lolita, Vladimir Nabokov)

There are other complications but I’m getting ahead of myself. For now I just want to find potentially similar people, not evaluate each of them. Stephen is a potential neighbor - to use the jargon of user-based collaborative filtering -, but we won’t know for sure until much later. Keep reading.

There’s one complication I need to mention right now though. When I check Stephen’s profile on Goodreads I can see that we actually have rated 93 books in common, not 68. My dataset is lying to me. Why? Well, remember how I only got up to 1500 ratings for each book because Goodreads doesn’t let us see all the ratings? Turns out that for 25 of those 93 books Stephen’s ratings were not in the (up to) 1500 ratings I scraped for each book. Hence those 25 ratings didn’t make it into the dataset.

I could fix that by scraping each rater’s profile and checking all the books we actually have in common. But with 43,825 raters that would take forever. I could parallelize the scraping - have, say, 43k bots scrape ~1k raters each. But Goodreads would surely understand that as an attack and block my IP address. I could get around that by using proxies or remote servers but that’s a lot more effort than I’m willing to put into this. Sorry again, reviewer #2.

Alright, time to go back to the question I left hanging before: why did I have to drop users with only one book in common with me? To answer that I need to (finally) explain cosine similarity.

cosine similarity

Say we arrange the scraped data in a matrix where each column represents a book, each row represents a user, and each cell represents how many stars that user rated that book. Here I have 43,825 users and 357 books, so that’s a 43,825 by 357 matrix.

(Most users won’t have rated most books, so most of the cells are NULL or None or whatever null-like representation you prefer. It can’t be zero though - that would later be interpreted as a rating of zero stars -, so this isn’t really a sparse matrix, it’s something else. I don’t know that it has a specific name.)

Now suppose I pluck two arbitrary (row) vectors from that matrix and, in these vectors, I keep only the columns corresponding to books that both users have rated (i.e., I drop all columns corresponding to books that only one or neither user has rated). If you project the resulting two vectors onto an Euclidian space there will be an angle between them.

Here’s the thing: the cosine of that angle is a measure of how similar the two vectors are. Ta-da. Now you know what cosine similarity means.


A concrete example might help.

Let’s pluck two users from our matrix: myself and someone else. Say, Geneva. We have rated a total of five books in common according to the dataset. Here are our ratings for those five books:

book Geneva Thiago
The Moon is a Harsh Mistress 4 2
Neuromancer 3 3
Snow Crash 4 5
The End of Eternity 4 5
The Last Policeman 4 5

These are our two vectors then: [4, 3, 4, 4, 4] (Genevas’ ratings) and [2, 3, 5, 5, 5] (my ratings).

The first step here is to compute the Euclidean norm of each vector. It’s not hard: for each vector you square every element, you sum up all these squares, then you take the square root of that sum. So for Geneva’s vector the norm is \(\sqrt{4^2 + 3^2 + 4^2 + 4^2 + 4^2}\approx 8.54\). For my own vector the norm is \(\sqrt{2^2 + 3^2 + 5^2 + 5^2 + 5^2}\approx 9.38\).

Next we divide each vector’s element by the vector’s norm. Hence Geneva’s vector becomes \([\frac{4}{8.54}, \frac{3}{8.54}, \frac{4}{8.54}, \frac{4}{8.54}, \frac{4}{8.54}]\) = [0.46, 0.35, 0.46, 0.46, 0.46] and mine becomes \([\frac{2}{9.38}, \frac{3}{9.38}, \frac{5}{9.38}, \frac{5}{9.38}, \frac{5}{9.38}]\) = [0.21, 0.31, 0.53, 0.53, 0.53].

Finally, the cosine itself. It’s just the dot product of those two vectors: 0.46(0.21) + 0.35(0.31) + 0.46(0.53) + 0.46(0.53) + 0.46(0.53) = 0.09 + 0.10 + 0.24 + 0.24 + 0.24 \(\approx\) 0.93.

How similar are the two vectors if the cosine of their angle is 0.93? They are about as similar as they can be: for any two vectors of our Goodreads matrix the cosine will be in the 0-1 range. The closer to 0, the more different the two vectors are. The closer to 1, the more similar.

(Before the math police come after me: “in abstract” a cosine can vary from -1 to +1. But in our Goodreads matrix we only have nonnegative values; Goodreads doesn’t let you rate a book negative stars. So here the cosine between any of our two vectors has to be in the 0 to +1 range.)

If I compute the cosine between my own vector and every other vector in our matrix then I’ll know which users are most similar to me, book-wise.

Perhaps now you can see why I dropped users who only had one book in common with me. An angle cannot exist between two points, only between two vectors. And I need at least two points to form one vector.

Note that I have only considered the five books that Geneva and I have both rated. We can’t compute the cosine between two vectors that are not the same size. If I add a sixth book that’ll be a book that either only Geneva has rated or only I have rated or neither of us has rated. We’d have one vector with six elements and one vector with five elements (no, we can’t input zero or any other value to fill the gap). That wouldn’t work.

Another thing to keep in mind is that, as I explained before, I only got up to 1500 ratings for each book, so the number of books I have in common with each of the other Goodreads users is actually higher than what’s in the dataset. I checked Geneva’s profile and it turns out that we share 21 books, not five. So by computing the cosines between my ratings and everyone else’s ratings I get a rough idea about who my closest neighbors probably are; I don’t get a precise, 100% accurate ranking of my closest neighbors.

too many book twins

So I sat down and computed the cosine similarity between me and every other person on my Goodreads dataset. (Well, actually I sat down and wrote code to do that for me.)

Turns out there are lots of people for whom cosine=1, the maximum possible value. But no, I don’t have lots of book twins out there. The problem is: a cosine of 1 is pretty common when the two vectors are short.

Take user Ajay, for instance. We have two books in common: How to Lie with Statistics and Skin in the Game. He rated both 5 stars. I rated both 3 stars. If you do the math you’ll find a cosine of 1. Or take Derek. We also have two books in common: The Signal and the Noise and Neuromancer. He rated them 3 and 2 stars respectively. I rated them 4 and 3 stars respectively. Cosine is also 1.

Let’s put this in perspective. Remember Stephen? Our cosine similarity is 0.93. Now that’s a guy with whom I share 68 books. Should Ajay and Derek rank higher than Stephen on my similarity ranking? I don’t think so. I’d much rather read a book suggested by Stephen than a book suggested by Ajay or Derek.

Alright then, should I drop everyone with fewer than, say, ten books in common with me? Nope. Not all books are equally important. The two books I share with Ajay and the two books I share with Derek are all pretty popular. Lots of people have read them, most of which are certainly not in my tribe. Now what if I had only two books in common with someone but those two books were, say, David Wingrove’s The Middle Kingdom and John Ringo’s Live Free or Die?

That would be a different story. Those two books are excellent but little known. The Middle Kingdom, published in 1989 - back when China’s GDP was 3.8% of what it is today -, shows a future where history has been rewritten and schoolchildren learn that the Chinese have been ruling the world since general Ban Chao (an actual person) destroyed the Roman Empire in the first century AD. The Middle Kingdom makes Lord of the Rings and Game of Thrones seem juvenile and unimaginative. And yet it has only 1.3k ratings on Goodreads. Live Free or Die, in turn, is about a guy named Tyler Vernon. Earth has been conquered by an alien race - the Horvath. Tyler sells maple syrup to other alien races (it works as a mixture of cocaine and aphrodisiac for them). He then uses the money to buy alien military technology and fight back against the Horvath. It’s great, unapologetic, libertarian sci-fi - government fails to fight off hostile aliens, private sector steps up; it’s Robert Heinlein meets Ayn Rand. And yet only 7.3k brave souls have rated it on Goodreads.

My point is: if I only have two books in common with you and these two books are The Middle Kingdom and Live Free or Die then I definitely want to hear your book recommendations. Those two books are too good and too obscure. Reading them certifies you as the sort of weirdo I want to get book recommendations from. That’s why I can’t simply drop people with whom I have only two or three books in common. They may be the right books.

We can think about it from a probabilistic perspective. The fact you’ve read The Middle Kingdom and Live Free or Die tells me more about you than the fact that you’ve read, say, Ender’s Game. Don’t get me wrong, I love Ender’s Game - it’s on my top 5 list of all-time favorites. But nearly a million people have rated Ender’s Game on Goodreads. The more popular the book, the more diverse its readers. Surely there are Ender’s Game readers who enjoy boring novels like Anne of Green Gables or The Handmaid’s Tale. You telling me that you’ve read Ender’s Game doesn’t tell me much about what type of reader you are. Same with Ready Player One, Snow Crash, and World War Z: excellent books, but so popular that they attract all sorts of people, including lots of people I absolutately don’t want to get book recommendations from. In probabilistic terms, P(good recommendations|has read The Middle Kingdom) > P(good recommendations|has read World War Z).

What to do then?

measuring how weird you are

The solution is to use something called Inverse User Frequency (IUF). Here is how it works. Goodreads has ~65 million users. To Kill a Mockingbird has been rated by 3.6 million users. So its IUF is log(65/3.6) \(\approx\) 2.8. The Middle Kingdom, on the other hand, has been rated by only 1.3k users. So its UDF is log(65/0.0013) \(\approx\) 10.8. So, The Middle Kingdom is almost four times weirder than To Kill a Mockingbird (without taking the log the differences would be too extreme; we want to take the book’s popularity into account but not too much). (If you’re familiar with natural language processing you will notice this is similar to Inverse Document Frequency.)

So instead of computing raw cosine similarities we are going to compute weighted cosine similarities, with each rating being weighted by the IUF of the corresponding book.

To give a concrete example, let’s go back to Geneva. Here are the IUFs of the books we have in common.

book raters IUF
The Moon is a Harsh Mistress 91k 6.5
Neuromancer 216k 5.7
Snow Crash 200k 5.7
The End of Eternity 34k 7.5
The Last Policeman 20k 8

Now that we have the IUFs we incorporate them in the cosine computation.

That’s more like it. The books I have in common with Geneva are pretty popular and that’s reflected in our similarity going down from 0.93 to 0.15. Good. Now, the decrease is not as drastic as it seems. Weighting by IUF makes all cosine similarities smaller. Let’s go back to the hypothetical example of someone with whom I share two books and they are The Middle Kingdom and Live Free or Die. If that other person has rated both 4 and I have rated both 5 then our raw cosine similarity is 1 but our weighted cosine similarity is only 0.5.

still doesn’t quite feel like my tribe…

I adjusted the cosine similarities to account for the IUF, looked into the people with the highest similarity scores and… things were still a bit odd. When I looked into the top 10 folks and the books they’ve rated - and what their ratings were - I just couldn’t quite picture myself taking book recommendations from them in real life (like, if I had met them at a café). I know, this is super wishy-washy, but I’m sure you get the idea.

Turns out I had neglected to take into consideration that some people are just more generous and forgiving than others. Some people have a mean rating of 4.5 while others have a mean rating of 2.5. So we need to mean-center people’s ratings before we compute the similarity scores. As in: if your mean rating is 2.5 and you rated Snow Crash 4 stars then I’ll actually consider it 1.5 stars when computing our cosine similarity.

So, I redid everything but mean-centering the ratings.

did it work?

Well, we can’t answer that question just yet. First we need to answer another question: how do we know how good (or bad) the results are?

Turns out there is a simple metric for that. I can use the similarities I generated to “predict” my own ratings for books I have already rated, then compare the predicted ratings to my actual ratings.

Say I take Stephen and Geneva, which we’ve encountered before. Stephen rated Neal Stephenson’s Snow Crash five stars. Geneva rated it four stars. My similarity with Stephen is 0.93. My similarity with Geneva is 0.15. So if I wanted to use only Stephen and Geneva to produce a prediction, I could make it . In other words, I could use a weighted average of Stephen’s and Geneva’s ratings to “predict” what my own rating of Snow Crash would be (the weights being our respective similarities). My actual rating of Snow Crash is five stars, which is pretty close to 4.86.

I can do this operation for all books I ever rated. And then I can check how close my predicted ratings and my actual ratings are. How can we measure that? Hhmm, if only there was a way to compare the similarity of two vectors… Oh wait, there is: cosine similarity. So I use cosine similarity twice: first to compute user-user similarities, then to assess the quality of the predictions those similarities produce. What a handy tool.

Now, here I don’t want to use all Goodreads users I scraped. The whole point of this exercise is to find my tribe, i.e., the people whose literary tastes are most similar to mine. That’s the folks I want to get book recommendations from. So I have to define a threshold - say, my top 100 or my top 10000 most similar users. And then I predict the ratings for the books I have rated in common with these top k people - my top k “neighbors”, as textbooks refer to them.

There is a trade off in choosing k. The higher the k the more neighbors you have to get predictions from, so the more books you have predictions for. But the higher the k the more you get predictions from neighbors who are not so close to you. Here I tried a bunch of different values for k and it turns out that k=1000 seemed like a good compromise.

Now we’re ready to the answer the original question: how good are the results? Turns out the cosine similarity between my predicted ratings and my actual ratings, using my top 1000 neighbors, is 0.91 (after adding back the mean, which we removed before - remember?). Since cosines vary from 0 to 1, that seems pretty good. Looks like I’ve finally found my tribe.

so what?

Ok, so how do we generate book recommendations? Well, I just “predicted” what my ratings would be for books I have already rated. That’s how I assessed the quality of the similarities. (That’s how we assess any predictive model: we “predict” things that have already happened, then check how right or wrong those “predictions” are.) Now, to get actual book recommendations I simply predict what my ratings will be for books I have not rated and I pick the books which I’m predicted to rate highly. So this time I will predict without quotes: what I’m predicting hasn’t happened yet.

I went back to my top 1000 neighbors and scraped their ratings for every book they have ever rated (and that I never did). Then I predicted what my own rating would be for each one of those books.

That resulted in a ranking of 29,335 books. Here are the top 12:

  • Jim Butcher’s Storm Front. This is the first of a series of books, several of which were in the top 100. I had never heard of it before. I’m not really into the fantasy genre, but I guess I’ll give it a try.

  • Brandon Sanderson’s The Way of Kings. Another first book of a fantasy series I had never heard of.

  • Carl Sagan’s Cosmos. I’ve seen the Neil deGrasse Tyson show and I loved it, so I’m surely going to read this.

  • Brian Vaughan and Fiona Staples’ Saga. I’ve never read a graphic novel before and I never thought I would, but I guess there is a first time for everything.

  • Terry Pratchett’s The Color of Magic. Yet more fantasy. This one I had seen around but the book description hadn’t appealed to me.

  • Mario Puzo’s The Godfather. Ok, this probably should have been in my want-to-read shelf already.

  • Yuval Harari’s Sapiens. At one point I had this book in my want-to-read shelf but then I read Harari’s other book, Homo Deus, and I didn’t really enjoy it, so I gave up on Sapiens. I guess I’ll put it back there now.

  • Brandon Sanderson’s The Final Empire. Yet more fantasy I had never heard of.

  • David Weber’s On Basilisk Station. Now, that’s what I’m talking about! Finally some sci-fi in my recommendations. This is probably the one I’ll read first.

  • Patrick Rothfuss’ The Name of the Wind. More. Freaking. Fantasy. This one I had seen around before but the title is so uninspiring that I never even bothered to read the book’s description.

  • Dostoyevsky’s The Brothers Karamazov. Not sure what to do about this one. I endured Tolstoy’s Anna Karenina and War and Peace and from what I’ve heard Dostoyevsky is similarly soporific. I may not have the fortitude to go through this one.

  • J.J. Benítez’ Jerusalén. This one I read before - some 25 years ago, as a kid. (But I rated a different edition on Goodreads, so it ended up on my recommendations here.) This was the first book about time travel I ever read and I absolutely loved it. Now, my thirteen-year-old self was not a particularly discriminating reader. So I think I’ll pass. I don’t want to ruin the good memories I have of it.

And voilà, I have built and used my own book recommender system.

These results are less exciting than I expected. I don’t understand all the fantasy books. I’ve read a fantasy novel here and there but I’ve read all of Asimov’s Robot, Empire, and Foundation novels, most Neal Stephenson books, a lot of Robert Heinlein and John Scalzi - so why the heck is there only one sci-fi book in my top 12 recommendations?

I guess some of the oddities may be due to my including non-fiction books when computing the cosine similarities. The fact that you and I have both read Daron Acemoglu’s Economic origins of dictatorship and democracy and William Greene’s Econometric analysis and Luciano Ramalho’s Fluent Python may result in a high similarity score but that doesn’t mean we like the same stuff when it comes to fiction. Dropping non-fiction books might improve the results. That would require going through each of my books and flagging them as either fiction and non-fiction, then redoing everythig I’ve done here, so I’m not doing it. You and I will just have to live with that, reviewer #2.

But there is a lot more room for improvement here - it’s not just about dropping non-fiction books. Let’s talk about some possibilities.

what next?

If you’re building a recommender system on top of cosine similarity or some other such metric the way to go about it is to tweak the similarity formula in all sorts of ways - give more weight to this or that, etc -, check how good or bad the resulting predictions are in each case, then pick the formula that yields the best predictions. I did a little tweaking here (IUF, mean-centering), but if you’re serious about it you should try a lot more.

Besides, there are other ways to build recommender systems that don’t rely on hard-coded similarity formulas. For instance, we could use linear regression to learn the weight each neighbor should have. We’d still end up with a weighted average for the predictions, but the weights would have been learned from the data.

Alternatively we could use algorithms like random forest, support vector machine, or neural network. These algorithms have one big advantage over a linear regression: they learn weird relationships between the variables you have. It could be, for instance, that I’m unlikely to enjoy books that are recommended by both Stephen and Geneva but likely to enjoy books that are recommended only by Geneva and not by Stephen or vice-versa. Or that I’m super likely to enjoy a book recommended by Peter only if his rating is exactly five stars and all other ratings are exactly one star. Linear regression won’t capture any of that unless you know beforehand what these oddities are and model them explicitly. But random forest, SVM, and neural networks can learn those oddities from the data and incorporate them into their predictions. The drawback is that if you don’t have a lot of data these algorithms can easily mistake noise for signal and incorrectly pick up idiosyncrasies.

You could also use clustering algorithms, like k-means or DBSCAN. They work by grouping (clustering) samples (in our case, those would be Goodreads users) together according to their similarities.

Now, using linear regression, random forest, SVM, neural network, or clustering requires overcoming a problem: missing data. Here my top 1000 neighbors don’t have a single book that they all have rated in common. There are too many missing data points to run a regression or do anything of the kind. So before doing any of that you’ll need to either impute the missing data points or use matrix factorization to reduce the dimensionality of your dataset. The exception is k-means, which is relatively straightforward to adapt for use with massive missing data (see the Aggarwal book I mentioned before, chapter 12).

An alternatively that bypasses the missing data issue entirely is to model the data as a big graph, which is just fancy data science jargon for a network, like Facebook or Twitter: you have nodes (people and pages) and links between them (A follows B, A is friends with B, etc). Here we could model each Goodreads user as a node and each rating that two users have in common as a link between them. Then we could use a class of algorithms known as community detection algorithms to find what the “cliques” are.

Here I was just building a recommender system for myself, for the fun of it, so I’m not going to do any of that. But if you’re doing something serious with it then you probably should try at least some of these alternatives and see which one yields the best predictions.


This is it. Happy reading!

Outside of a dog, a book is man’s best friend. Inside of a dog it’s too dark to read. (Groucho Marx)

using Slack as middleware

I started working remote a few weeks ago. It’s sheer awesomeness - no distractions, no commute, no shabby cafeteria lunch. But there was one minor glitch: I was having trouble accessing my organization’s proprietary data. Being a government agency, we hoard tons of sensitive data on people - addresses, taxpayer IDs, personal income, etc. So, naturally, we restrict access to our databases. There is no URL I can go to when I need to run some database query; I need to be inside my organization’s network to run database queries.

I can use a VPN to log into my office PC from any machine. And once I’m logged into my office PC I can access whatever I want. But that means being forced to use my lame, old office PC instead of my sleak, fast MacBook Pro. Using the VPN also means enduring a maddening lag. It’s only a fraction of a second, but it’s noticeable and it can drive you insane. Finally, using the VPN means I have no local internet access - while I’m in the VPN my laptop has no other contact with the outside world, which results in my not being able to use Spotify and a bunch of other stuff. And I need my 90s Eurodance playlists to be in the proper mindset for writing code.

After enduring all that (I know, tiny violin…) for a couple of weeks I decided to do something about it. I realized that I needed a “bridge” between the data and my laptop. Some web service of sorts that could receive data requests and respond to them. Now, I’d never trust myself to build something like that. I don’t know nearly enough about information security to go about building that sort of tool. I don’t wanna be the guy who caused every Brazilian’s monthly income to be exposed to the world.

Then it hit me: I don’t need to build anything like that, these tools have already been built and we already use them to exchange sensitive data. I’m talking about messaging apps like Slack, Telegram, and the like. My office PC can access Slack. My personal laptop can access Slack. Slack is what my team uses for communication, which means sensitive information already circulates through it. In sum, the middleman I needed was already in place. All I had to do was to repurpose it. And that’s what I did. What follows bellow is a brief account of how I did it, in case other people may be in the same situation.

step 1: stuff that doesn’t involve code

The first thing you need to create are the Slack channels that will handle the data requests. I chose to create two - #incoming, to receive the data requests, and #outgoing, to send the requested data. I made them private, so as not to annoy my teammates with notifications and messages they don’t need to see. Alternatively, you could create an entirely new Slack workspace; that way you isolate human messaging from bot messaging.

Once you’ve created the channels you’ll need to create a Slack bot. It’s this bot that will: a) read the data requests that arrive on #incoming; and b) post the requested data to #outgoing. Slack lets you choose between two types of bots: “app bots” and “custom bots”. They nudge you towards the former but the latter is a lot more straightforward to set up: just click here, click “Add Configuration”, and follow the instructions. When you’re done, write down your bot’s API token - it’s the string that starts with xoxb- -, and, on your Slack workspace, invite your bot to join #incoming and #outgoing.

step 2: testing your bot

We need to make sure that your Slack bot can read from #incoming and post to #outgoing.

Let’s start with reading. There are a number of ways to go about this - Slack has a number of APIs. I think the Web API is the best pick for the impatient. Now, the documentation doesn’t have a quickstart or many useful examples. The explanations are verbose and frustratingly unhelpful if you just want to “get it done” quick. So instead of making you read the docs I’ll just give you what you need to know: make a GET request to https://slack.com/api/groups.history?token=xoxb-your-bot-token&channel=id_of_incoming, where xoxb-your-bot-token is the token you wrote down in step 1 and id_of_incoming is the ID of the #incoming channel (it’s the endpoint of the channel’s URL). That will return to you the channel’s messages (up to 100 messages). If there are no messages in #incoming you won’t get anything interesting back. If that’s the case, just post anything to the channel first.

In real life you won’t be using Terminal/cmd for this, you’ll be using a Python or R script or something along these lines. Here’s how to do that in Python:

import requests

def read_messages():
    slack_token = 'xoxb-your-bot-token'
    slack_url = 'https://slack.com/api/groups.history'
    params = {
        'token': slack_token,
        'channel': 'id_of_incoming'
        }
    response = requests.get(slack_url, params = params, verify = False)
    if response.status_code == 200:
        return response.json()

response = read_messages()

What you get back is a Python dict which should have a key named ‘messages’. So, if 'messages' in 'response', then inside response['messages'] you’ll find a list of dicts, each dict being a message, each dict’s key being an attribute of said message (timestamp, text, user who posted it, etc).

Now, you don’t want to access #incoming’s entire history every time you poll it. You can include a parameter named oldest in the params dict and assign a timestamp to it. Then read_messages won’t return messages older than the specified timestamp.

(A little gotcha: what you pass as channel is not the channel’s name but the channel’s ID, which you can get from its URL. Some Slack methods do accept the channel’s name but I never remember which ones, so it’s easier to just use the channel’s ID for everything.)

(Because you went with a custom bot instead of an app bot you won’t have to deal with a bunch of error messages having to do with something Slack calls “scope”. You won’t waste two days in a mad loop of trying to get the scopes right, failing, cursing Slack, refusing to read the API documentation, failing, cursing Slack, refusing to read the API documentation. I envy you.)

Alright then, let’s move on to posting. Here’s how you do it: make a POST request to https://slack.com/api/chat.postMessage, using your bot’s token, your channel’s ID, and the message of the text as payload. Like this:

import requests

def post_to_outgoing(message):
    slack_token = 'xoxb-your-bot-token'
    slack_url = 'https://slack.com/api/chat.postMessage'
    payload = {
        'token': slack_token,
        'channel': 'id_of_outgoing',
        'text': message
        }
    requests.post(slack_url, data = payload, verify = False)

post_to_outgoing('hey macarena')

There. Once you run this code you should see the message “hey macarena” appear in #outgoing.

step 3: receiving #incoming messages

Ok, now you need a server-side program that will check #incoming for new messages - say, every five seconds or so. By server-side I mean it will run inside your company’s network; it needs to run from a machine that has access to your company’s databases. Here’s an example:

import time
import requests

def read_messages(timestamp):
    slack_token = 'xoxb-your-bot-token'
    slack_url = 'https://slack.com/api/groups.history'
    params = {
        'token': slack_token,
        'channel': 'id_of_incoming',
        'oldest': timestamp
        }
    response = requests.get(slack_url, params = params, verify = False)
    if response.status_code == 200:
        return response.json()

while True:
    timestamp = time.time() - 5
    new_msg = read_request(timestamp)
    if new_msg:
        print('new message(s) received:', new_msg['messages'])
    time.sleep(5)

Now, you probably want this “listener” to run in the background, so that you can log off without killing it. If you’re running it on a Linux machine the simplest solution is to use tmux. It lets you create multiple “sessions” and run each session in the background. If you’re doing it on a Windows machine you can use cygwin or, if that’s Windows 10, you can use tmux with the native Ubuntu binaries.

step 4: processing #incoming messages

Receiving messages is not enough, your script needs to do something about them. The simple, quick-and-dirty solution is to have your #incoming messages be the very database queries you want to run. An #incoming message could be, say, SELECT [some_column] FROM [some].[table] WHERE [some_other_column] = 42. Then the listener (the server-side program we created before) would read the query and use an ODBC package - like pyodbc or rodbc - to run it. If that works for you, here’s how you’d amend the listener we created before to have it handle SQL queries:

import time
import pyodbc
import requests

def read_messages(timestamp):
    slack_token = 'xoxb-your-bot-token'
    slack_url = 'https://slack.com/api/groups.history'
    params = {
        'token': slack_token,
        'channel': 'id_of_incoming',
        'oldest': timestamp
        }
    response = requests.get(slack_url, params = params, verify = False)
    if response.status_code == 200:
        return response.json()

def run_query(query):
    cnxn = pyodbc.connect(
        driver = 'name of your ODBC driver',
        server = 'path-to-your-database-server',
        database = 'name_of_your_database',
        uid = 'uid',
        pwd = 'pwd'
        )
    cursor = cnxn.cursor()
    cursor.execute(query)
    resultset = cursor.fetchall()
    return resultset

while True:
    timestamp = time.time() - 5
    r = read_request(timestamp)
    if r:
        for message in r['messages']:
            resultset = run_query(message['text'])
            print('query:', message['text'])
            print('results:', resultset)
    time.sleep(5)

Ok, I’m glossing over a bunch of details here. First you’ll need to set up an ODBC driver, which isn’t always easy to get right the first time - it depends on what SQL engine you have (SQL Server, MySQL, etc), on whether your script is running on Linux or Windows, and on what credentials you’re using to connect to your SQL engine. I can’t really help you out on this, you’ll have to google your way around. If you’ve never set up an ODBC connection before this is probably the part that’s going to take up most of your time.

Once the ODBC part is taken care of, leave the script above running and post some SQL query on #incoming. You should see the the result set of the query. Well done then, everything is working so far.

step 5: replying to #incoming messages

So you have a script that receives queries and executes them. Now your script needs to post the result sets to #outgoing. There really isn’t much mystery here - we already wrote post_to_outgoing above. The only thing left is to convert our result set into a string, so that Slack can accept it. In Python the json module handles that for us: json.dumps(your_data) takes a list or dict (or list of dicts, or dict of lists) and turns it into a string. It’s all below.

import json
import time
import pyodbc
import requests

def read_messages(timestamp):
    slack_token = 'xoxb-your-bot-token'
    slack_url = 'https://slack.com/api/groups.history'
    params = {
        'token': slack_token,
        'channel': 'id_of_incoming',
        'oldest': timestamp
        }
    response = requests.get(slack_url, params = params, verify = False)
    if response.status_code == 200:
        return response.json()

def run_query(query):
    cnxn = pyodbc.connect(
        driver = 'name of your ODBC driver',
        server = 'path-to-your-database-server',
        database = 'name_of_your_database',
        uid = 'uid',
        pwd = 'pwd'
        )
    cursor = cnxn.cursor()
    cursor.execute(query)
    resultset = cursor.fetchall()
    return resultset

def post_to_outgoing(message):
    slack_token = 'xoxb-your-bot-token'
    slack_url = 'https://slack.com/api/chat.postMessage'
    payload = {
        'token': slack_token,
        'channel': 'id_of_outgoing',
        'text': message
        }
    requests.post(slack_url, data = payload, verify = False)

while True:
    timestamp = time.time() - 5
    r = read_request(timestamp)
    if r:
        for message in r['messages']:
            resultset = run_query(message['text'])
            for result in resultset:
                output = json.dumps(list(result))
                post_to_outgoing(output)
    time.sleep(5)

Ta-da. As long as this script is running continuously inside your company’s network you no longer need a VPN to query name_of_your_database. If you want more flexibility you can tweak run_query so that it takes the name of the database as a second argument. And you should sprinkle try/except statements here and there to capture database errors and the like.

You’ve taken remote work one step further. It’s not only you who can work remote now: the applications you develop no longer need to live inside your company’s network. You can develop on whatever machine and environment you choose and have your applications post their queries to #incoming and retrieve the result sets from #outgoing.

One gotcha here: Slack automatically breaks up long messages, so if your query exceeds Slack’s maximum length it will be truncated and run_query will probably (hopefully) raise an error. Keep it short.

step 6: make it neat

Alright, you have a functioning “bridge” between you and your company’s databases. But that’s still a crude tool, especially if you will develop applications on top of it. You don’t want your apps to post raw SQL queries on Slack - that’s a lot of unnecessary characters being passed back and forth. Instead of a run_query function you should have a get_data function that stores a “template” of the query and only adds to it, say, the part that comes after the WHERE [something] = . Something like this:

def get_data(input):
    query = 'SELECT [some_column] FROM [some_database].[some_table] WHERE [other_column] = ' + input
    cnxn = pyodbc.connect(
        driver = 'name of your ODBC driver',
        server = 'path-to-your-database-server',
        database = 'name_of_your_database',
        uid = 'uid',
        pwd = 'pwd'
        )
    cursor = cnxn.cursor()
    cursor.execute(query)
    resultset = cursor.fetchall()
    return resultset

This is still too crude to be called an API but it’s a first step in that direction. The idea is to make the Slack-your_app interface as tight as possible, so as to minimize the types of errors you will encounter and to minimize the exchange of unnecessary strings. If you know exactly the sort of stuff that will be passed to get_data it’s easier to reason about what the code is doing.

I think this is it. Happy remoting!

doing data science in the government

Today it’s been three years since I first started working as a data scientist in the Brazilian government. Overall it’s been a great experience and I think this is a good time to reflect upon what I’ve learned so far. I’ll start with the ugly and the bad and then I’ll move on to the good.

bureaucrats won’t share data unless physically compelled to do so

There is no shortage of decrees saying that government agencies will give their data to other government agencies when requested to do so (here’s the latest one). But between the pretty text of the law and what really goes on in the intestines of the bureaucracy there is a huge gap. Every government agency is in favor of data sharing - except when it comes to its own data.

Excuses abound. I can’t share my data because it’s too sensitive. I can’t share my data because extracting it would be too costly. I can’t share my data because we’re in the middle of a major IT restructuring and things are too messy right now. I can’t share my data because we’ve lost its documentation. I can’t share my data because the IT guy is on vacation. I can’t share my data because you might misinterpret it (this is my favorite).

The actual reasons are not always easy to pinpoint. Sometimes there are legitimate concerns about privacy, as in the case of fiscal data. But this is rare. More often than not the alleged privacy concerns are just a convenient excuse. In some cases the privacy excuse is used even when the data is already public. For instance, everything the government buys (other than, say, spy gear) goes in the official bulletin, which anyone can read. The equivalent of the IRS in Brazil - Receita Federal - has all the corresponding tax invoices neatly collected and organized in a database. That database would make it much easier to compute, say, the average price the Brazilian government pays for ballpoint pens. You’d think that database would be readily available not just for the entire government but for the citizenry as well.

You’d be wrong. Government agencies have been trying - and failing - to put their hands in that database for years. Our IRS says it’s protected by privacy laws. But the law says that government purchases must be public. And they already are, it’s all in the official bulletin - but that’s unstructured data that would require lots of scraping, OCRing, and parsing. The data is already public but not in a machine-readable format. That’s why the IRS database is so valuable. But the IRS legal folks haven’t found their match yet.

(Meanwhile data that is actually sensitive - like people’s addresses and tax returns - can be bought for $10 from shady vendors not too hard to find; all it takes is a walk along Rua 25 de Março, in São Paulo. In Brazil if the government has any data on you then you can be sure it’s for sale at Rua 25 de Março.)

Sometimes bureaucrats don’t share data because they worry that the data will make them look bad. What if my data shows that I’m paying too much for office paper? What if my data shows that I have way more people than I need? What if my data shows that my agency is a complete waste of taxpayers’ money and should not exist? I suppose I understand the survival instinct behind such concerns, but those are precisely the cases where we absolutely must get the data, for that’s where the ugliest inefficiencies are. In general the more an agency refuses to share its data the more important it is to get it.

(It doesn’t help that I work for the Office of the Comptroller-General, whose main goal is to find and punish irregularities involving taxpayers’ money. People sometimes assume that our asking for their data is the prelude of bad things. We need to explain that inside the Office of the Comptroller-General there is this little unit called the Observatory of Public Spending and that our goal at the OPS is often to help other agencies by extracting insights from their data.)

government data is a mess

The idea that you should document your databases is something that hasn’t taken root in the Brazilian government. Nine times out of ten all you get is a MSSQL dump with no accompanying data dictionary. You try to guess the contents based on the names of the tables and columns, but these are usually uninformative (like D_CNTR_IN_APOSTILAMENTO or SF_TB_SP_TAB_REM_GDM_PST - both real-life examples). So you end up guessing based on the contents themselves. As in: if it’s an 11-digit numeric field then that’s probably the Brazilian equivalent of the Social Security Number.

As you might imagine, sometimes you guess wrong. You mistake quantity for total price and vice-versa and the resulting unit prices don’t make any sense. You think that a time field is in years but it’s actually in months. You mistake an ID field for a numeric field. Etc etc. You’re lucky when you catch the errors before whatever you’re doing becomes policy. And some fields you just leave unused because they are too mysterious and you can’t come up with any reasonable guesses. Like when it’s probably a dummy variable because the values are all 0s and 1s but you have no clue what the 0s and 1s mean.

Besides the lack of documentation there are also many errors. Null names, misclassification, missing data, typos (according to one database the Brazilian government signed an IT contract of US$ 1 quadrillion back in 2014 - that’s 26 times the budget of the entire US government). To give you an idea, half the government purchases classified as “wheeled vehicles” are under R$ 1,000 (roughly US$ 300); when we inspect the product descriptions we see that they are not actually vehicles but spare parts, which have a different code and should have been classified elsewhere.

The problem begins in the data generation process, i.e., in the systems bureaucrats use to enter the data. These systems are too permissive; they lack basic validation like checking input type (numeric, text, date, etc), input length (does the state code have more than two characters?), and the like. And there is no punishment for the bureaucrat who enters incorrect data.

The most frequent result is absurd averages. You try to compute, say, spending per employee, and what you get back is a bazillion dollars or something close to zero. That means many hours of data cleaning before we can really get started. You have to do all the sanity checks that the government systems fail to do - or else it’s garbage in, garbage out. After you’ve filtered out the US$ 1 quadrillion IT contracts and the US$ 0 hospitals and schools you are left with lots of missing data - and that’s often not missing at random, which poses additional problems. It’s no wonder that most of our academic work is about cleaning up data (like here and here).

recruiting is a different ball game

The Brazilian government does not deliberately recruit data scientists. Data scientists come in through a bunch of miscellaneous doors and then we find each other - by word-of-mouth or Twitter or conferences - and come up with ways to work together. By now there are a few well-known “data science places” in the government - like the OPS (where I work) and the TCU - and data-curious government employees have been flocking to them, by various means; but there isn’t a clear policy to attract data talent.

In order to enter the Brazilian government you usually have to pass a public exam and such exams do not cover any content even remotely related to data. What you do need to learn to pass these exams is a large number of arcane pieces of legislation, mostly relating to government procedures - like the three phases of a procurement process, what they are called, what the deadlines are, what the many exceptions are, and so on. As you may imagine, that doesn’t usually attract people interested in data. A few of us slip through the cracks somehow, but that’s largely by accident.

That makes recruiting for your team a lot harder than in the private sector, where you can simply post a job ad on LinkedIN and wait for applications. In the government you normally can’t recruit people that are not already in the government. It goes more or less like this: You start by identifying the person you want to bring to your team. He or she will usually be in another government agency, not in your own. You will need to negotiate with his or her agency so that they okay their coming to work for you. That may require you to find someone in your agency willing to go work for them. Such negotiations can last for months and they look a lot like an exchange of war prisoners (I was “traded” myself once and it’s not fun). If the head of your agency (your minister or whatever) has sufficient political clout (and know that you exist), he or she may try to prevail over his or her counterpart at the other agency. Either way, there’s no guarantee that you’ll succeed.

If it’s hard for the recruiter it’s even harder for the person being recruited. They need to tell their current agency that they no longer want to work there, but they have no guarantee that they will get transferred. Imagine telling your significant other that you want to break up but then being somehow legally compelled to stay in the relationship. It can be awkward. You’re unlikely get a promotion if your bosses know that you don’t want to work there. They may start giving you less important tasks. Such possibilities often kill any recruitment before it even begins.

There is also the issue of salary negotiations. The issue being that they are not possible. When you work for the government the law determines your salary - there is no room for negotiation. Sometimes you can offer a potential candidate a little extra money if they agree to take up some administrative responsibilities but this is usually under US$ 1000/month and most data scientists prefer to avoid having any responsibilities that involve paperwork. So whoever you are trying to lure must be really excited by the work you and your team do because the work itself is pretty much all you have to offer.

But enough with the bad and the ugly.

you have a lot of freedom to experiment

Paradoxically, in the midst of this red tape jungle we have a lot of leeway to play around and try new things. Data science is a highly technical subject, one that takes at least a few Coursera courses to even begin to grasp, and that helps keep it unregulated. We have to fill out the same stupid forms everyone else does when we want to buy stuff, but whether we use Hadoop or not, whether we adopt Python or R for a project, whether we go with an SVM or a neural network or both, and whether we think any given project is worth pursuing is all entirely up to us. Legal doesn’t have a say in any of that. The budget folks don’t have a say in any of that. The minister himself - heck, the president himself - doesn’t have a say in any of that. They wouldn’t even know where to start. So, thanks to the highly specialized nature of our trade we don’t have higher-ups trying to micromanage what we do.

There is also the tenure factor. You see, in Brazil once you enter the civil service you get automatically tenured after three years. And the constitution says that, once tenured, you need to do something really outrageous to get fired - and even then there are several appeal instances and often times nothing happens in the end. I bet that if I showed up naked for work tomorrow I still wouldn’t get fired; I might get a written warning or something along these lines, and I’d probably appear in the local news, but I would still have my job. It takes something outright criminal to get a government employee fired. Like, they need to catch you taking a bribe or not showing up for work for months. And even then sometimes people don’t get fired.

Overall tenure is bad: too many lazy idiots spend their days browsing Facebook and entertaining themselves with hallway gossip. But for experimenting purposes tenure is great. It makes “move fast and break things” possible. Some bureaucrats want our assistance and happily give us their data and collaborate with us, helping us understand their problems and needs. But other bureaucrats get upset that you’re even daring to ask for their data. And they worry that you might disrupt the way they work or that you might automate them altogether. If we had to worry about our jobs at every step of the way we wouldn’t accomplish much. Without tenure heads might roll.

you can help taxpayers; a lot

The Brazilian government is humongous - it takes up ~40% of the country’s GDP. Most of that money goes down the toilet: the government overpays for everything it buys; it contracts suppliers that do not deliver the goods and services they promised; corruption is generalized. But data science can help. For instance, at the OPS we have trained a model that predicts whether a supplier is likely to become a headache (say, because it won’t deliver or because it will shut down). (Here’s a talk I gave about it earlier this year.) We’re now refining that model so that we can later appify it and plug it into the government’s procurement system. That way the government will be alerted of potential problems before it signs the contract.

That project has taken a lot of time and effort - the first version of the model was the master’s research of my colleague Leonardo and since then he and other people have put a lot more work into it. That’s a lot of salary-hours. But if a single problematic contract of small size - say, US$ 100k or so - is prevented because of the model then all that effort will have been worth it. And given the size of the government’s budget - around US$ 1 trillion a year - we should be able to save a lot more money than US$ 100k. That’s money that could go back to taxpayers.

is it worth it?

If you can stomach the ugly and the bad and are excited about the good, then yes. :-)

liberating Sci-Hub

If you do academic research but are not affiliated with an academic institution you probably know Sci-Hub. It gives you access to over 60 million research papers - for free (no ads, no malware, no scams). Alexandra Elbakyan, its creator, has deservedly been ranked by Nature one of the top ten most relevant people in science and we independent researchers owe her a lot.

You’d think that such an invention would be welcomed by most people who are not Elsevier executives. You’d think that such an invention would be particularly welcomed at organizations that do not have an Elsevier subscription. You’d be wrong. In the Brazilian government, where I work, Sci-Hub is not only not welcomed, it is actively blocked. The firewall doesn’t let me access it.

That’s Portuguese for “Blocked content! Science is illegal/unethical, so screw yourself.” (Sort of.)

This week I finally got tired of that nonsense - dammit, I’m a data scientist, I need academic papers not only for the research I do on the side but also, and mainly, for my day job. So I decided to build an interface to Sci-Hub - an app that takes my search string, gives it to Sci-Hub, and retrieves the results. Much like I did before in order to use Telegram.

Writing the code was easy enough, it’s a simple web app that does just one thing. I wrote it on Thursday evening and I was confident that the next morning I would just fire app a new project on Google App Engine, deploy the code, and be done with it in less than an hour. Oh, the hubris. I ended up working on it all Friday and all Saturday morning; only at Saturday 12:43pm the damned thing went alive.

What follows is an account of those 36 hours, largely for my own benefit in case I run into the same issues again in the future, but also in case it may be helpful to other people also looking to unblock Sci-Hub. I’m also writing this because I think those 36 hours are a good illustration of the difference between programming, on the one hand, and software development, on the other, which is something I struggled to understand when I first started writing code. Finally, I’m writing this because those 36 hours are a good example of the inefficiencies introduced when sysadmins (or their bosses) decide to block useful resources.

le code

If you inspect the HTML code behind Sci-Hub you can see it’s really easy to scrape:

<div id="input"><form method="POST" action="/"><input type="hidden" id="sci-hub-plugin-check" name="sci-hub-plugin-check" value=""><input type="textbox" name="request"  placeholder="enter URL, PMID / DOI or search string" autocomplete="off" autofocus></form></div>

All you have to do is send a POST request. If Sci-Hub’s repository has the paper you are looking for, you get it in a PDF file.

So I built this minimal web app that sends a POST request to Sci-Hub and then emails me back the PDF. I chose email because getting and returning each paper takes several seconds and I didn’t want the app blocked by each request. With email I can have a background process do the heavy work; that way I can send several POST requests in a row without having to wait in-between.

To achieve that I used Python’s subprocess module. I wrote two scripts. One is the frontend, which simply takes the user’s input. I didn’t want any boilerplate, so I used cherrypy as my web framework. As for the HTML code I just put it all in the frontend.py file, as a bunch of concatenated strings (#sorrynotsorry). And I used CDNs to get the CSS code (and no JavaScript whatsoever).

I gave my app the grandiose name of Sci-Hub Liberator.

(Sci-Hub Liberator’s front-end. This is what happens when data scientists do web development.)

The other script is the backend. It is launched by the frontend with a call to subprocess.Popen. That way all requests are independent and run on separate background processes. The backend uses Python’s requests package to send the POST request to Sci-Hub, then BeautifulSoup to comb the response and find the link to the paper’s PDF, then requests again to fetch the PDF.

def get_pdf(user_input):
    '''
    search string -> paper in PDF format
    '''
    response_1 = requests.post('http://sci-hub.cc/', data = {'request': user_input})
    soup = BeautifulSoup(response.text)
    url_to_pdf = 'http:' + soup.find_all('iframe')[0].get('src')
    response_2 = requests.get(url_to_pdf)
    return response_2.content

The backend then uses Python’s own email package to email me the PDF.

def send_pdf(pdf):
    '''
    sends PDF to user
    '''
    sender = 'some_gmail_account_I_created_just_for_this@gmail.com'
    text = 'Your paper is attached. Thanks for using Sci-Hub Liberator! :-)'
    body = MIMEText(text, _charset = 'UTF-8')
    message = MIMEMultipart()
    message['Subject'] = Header('your paper is attached', 'utf-8')
    message['From'] = 'Sci-Hub Liberator'
    message['To'] = 'my_email_account@gmail.com'
    message.attach(body)

    part = MIMEApplication(pdf)
    part.add_header('Content-Disposition', 'attachment; filename = "paper.pdf"')
    message.attach(part)

    smtp_server = smtplib.SMTP('smtp.gmail.com:587')
    smtp_server.ehlo()
    smtp_server.starttls()
    smtp_server.ehlo
    smtp_server.login(sender, 'emails_password')
    smtp_server.sendmail(sender, 'my_email_account@gmail.com', message.as_string())
    smtp_server.quit()

Both scripts combined had 151 lines of code. Not exactly a “Hello, World!” application but not too far from it either.

a word of caution

Before I proceed I must ask you not to abuse Sci-Hub’s easily scrapable interface. That’s an amazing service they’re providing to the world and if you send thousands of requests in a row you may disrupt their operations. I trust that they have defenses against that (or else Elsevier would have taken them down long ago), but still, please don’t fuck up.

things change

Code written and tested, I turned to Google App Engine for hosting the app. With only 151 lines of code and two scripts I thought that launching the app would be a breeze. Silly me.

I wanted to use Python 3, but Google App Engine Launcher is only compatible with Python 2. I google around and it seems that they are deprecating GAE Launcher in favor of the Google Cloud SDK. Pity. GAE Launcher was a nifty little app that made deployment really easy. I had been using it since 2013 and it allowed me to focus on my app and not on deployment nonsense.

Resigned to my fate, I downloaded the Google Cloud SDK installer and… installation failed due to an SSL-related problem. It took some half an hour of googling and debugging before I could get it to work.

things don’t change

GAE’s standard environment only allows Python 2. You can only use Python 3 in GAE’s flexible environment. And the flexible environment is a different ball game.

I had never used the flexible environment before (I think it only became generally available early this year), but I decided to give it a try. To make a long story short, I couldn’t make it work. The exact same code that works fine on my machine returns a mysterious Application startup error when I try to deploy the app. The deploy attempt generates a log file but it is equally uninformative, it only says Deployment failed. Attempting to cleanup deployment artifacts.

Despite hours of tinkering and googling I couldn’t find out what the problem is. I declared all my dependencies in my requirements.txt file (and I pointed to the same versions I was using locally); I configured my app.yaml file; I made sure that all of my dependencies’ dependencies were allowed. I didn’t know what else to look into.

Eventually I gave up in despair and decided to fall back on GAE’s standard environment, which meant reverting to Python 2. That was a bummer - it’s 2017, if GAE’s standard environment needs to choose between 2 and 3 then it’s probably time to pick 3 (assuming there is a way to do that without killing all existing Python 2 projects).

pip issues

Vendoring didn’t work for BeautifulSoup. Even though I used pip install and not pip3 install what got installed was BeautifulSoup’s Python 3 version. That resulted in from bs4 import BeautifulSoup raising ImportError: No module named html.entities.

After several unsuccessful attempts to point pip install to a specific source file I gave up on pip. I tested my Mac’s system-wide Python 2 installation and BeautifulSoup was working just fine there. So I went to my Mac’s site-packages and just copied the damned bs4 folder into my app’s lib folder. That did the trick. It’s ugly and it doesn’t shed any light on the causes of the problem but by then it was Friday afternoon and I was beginning to worry this deployment might take the whole day (if only!).

sheer dumbness

GAE has long been my default choice for hosting applications and I’ve always known that it doesn’t allow calls to the operating system. It’s a “serverless” platform; you don’t need to mess with the OS, which means you also don’t get to mess with the OS. So I can’t really explain why I based the frontend-backend communication on a call to subprocess.Popen, which is a call to the OS. That’s just not allowed on GAE. Somehow that synapse simply didn’t happen in my brain.

back to the code

GAE has its own utilities for background tasks - that’s what the Task Queue API is for. It looks great and one day I want to sit down and learn how to use it. But by the time I got to this point I was entering the wee hours of Saturday. My hopes of getting it all done on Friday were long gone and I just wanted a quick fix that would let me go to bed.

So I rewrote my app to have it show the PDF on the screen instead of emailing it. That meant I would have to wait for one paper to come through before requesting another one. At that hour I was tired enough to accept it.

The change was pretty easy - it involved a lot more code deletion than code writing. It also obviated the need for a backend, so I put everything into a single script. But the wait for the PDF to be rendered was a little too much and I thought that a loading animation of sorts was required. I couldn’t find a way to do that using only cherrypy/HTML/CSS, so I ended up resorting to jQuery, which made my app a lot less lean.

Sci-Hub is smart

After getting rid of the OS calls I finally managed to deploy. I then noticed a requests-related error message. After some quick googling I found out that GAE doesn’t play well with requests and that you need to monkey-patch it. Easy enough, it seemed.

After the patching requests seemed to work (as in: not raising an exception) but all the responses from Sci-Hub came back empty. The responses came through, and with status code 200, so the communication was happening. But there was no content - no HTML, no nothing.

I thought that it might be some problem with the monkey-patching, so I commented out requests and switched to urrlib2 instead. No good: same empty responses. I commented out urllib2 and tried urlfetch. Same result. As per the official documentation I had run out of packages to try.

I thought it might have to do with the size of the response - maybe it was too large for GAE’s limits. But no, the papers I was requesting were under 10MB and the limit for the response is 32MB:

I had briefly considered the possibility of this being an user-agent issue: maybe Sci-Hub just doesn’t deal with bots. But everything worked fine on my machine, so that couldn’t be it.

Then it hit me: maybe the user-agent string on GAE is different from the user-agent string on my machine. I got a closer look at the documentation and found this:

Ha.

To test my hypothesis I re-ran the app on my machine but appending +http://code.google.com/appengine; appid: MY_APP_ID to my user-agent string. Sure enough, Sci-Hub didn’t respond with the PDF. Oddly though, I did get a non-empty response - some HTML code with Russian text about Sci-Hub (its mission, etc; or so Google Translate tells me). Perhaps Sci-Hub checks not only the request’s user-agent but also some other attribute like IP address or geographical location. One way or the other, I was not going to get my PDF if I sent the request from GAE.

At that point it was around 3am and I should probably have gone to bed. But I was in the zone. The world disappeared around me and I didn’t care about sleeping or eating or anything else. I was one with the code.

So instead of going to bed I googled around looking for ways to fool GAE and keep my user-agent string intact. I didn’t find anything of the kind, but I found Tom Tasche.

back to the code (again)

I decided to steal Tom’s idea. Turns out GAE has a micro-instance that you can use for free indefinitely (unlike AWS’s micro instance, which ceases to be free after a year). It’s not much to look at - 0.6GB of RAM - but hey, have I mentioned it’s free?

I rewrote my code (again). I went back to having the frontend and backend in separate scripts. But now instead of having the backend be a Python script called by subprocess.Popen I had it be an API. It received the user input and returned the corresponding PDF.

@cherrypy.expose
def get_pdf(self, pattern):
    '''
    search string -> paper in PDF format
    '''
    scihub_html = requests.post(
        'http://sci-hub.cc/', 
        data = {'request': pattern},
        headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'}
        )
    soup = BeautifulSoup(scihub_html.text)
    url_to_pdf = 'http:' + soup.find_all('iframe')[0].get('src')
    scihub_bytes = requests.get(ungated_url)
    return BytesIO(scihub_bytes.text)

I put this new backend in a GCE micro-instance and kept the frontend at GAE. I also promoted my backend’s IP from ephemeral to static, lest my app stop working out of the blue.

I was confident that this was it. I was finally going to bed. Just a quick test to confirm that this would work and then I’d switch off.

I tested the new architecture and… it failed. It takes a long time for the GCE instance to send the PDF to the GAE frontend and that raises a DeadlineExceededError. You can tweak the time out limit by using urlfetch.set_default_fetch_deadline(60) but GAE imposes a hard limit of 60 seconds - if you choose any other number your choice is just ignored. And I needed more than 60 seconds.

back to the code (yet again)

At that point I had an epiphany: I was already using a GCE instance anyway, so why not have the backend write the PDF to disk in a subprocess - so as not to block or dealy anything - and have it return just the link to the PDF? That sounded genius and if it weren’t 6am I might have screamed in triumph.

That only required a minor tweak to the code:

@cherrypy.expose
def get_pdf(self, pattern):
    '''
    search string -> paper in PDF format
    '''
    scihub_html = requests.post(
        'http://sci-hub.cc/', 
        data = {'request': pattern},
        headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'}
        )
    soup = BeautifulSoup(scihub_html.text)
    url_to_pdf = 'http:' + soup.find_all('iframe')[0].get('src')
    scihub_bytes = requests.get(url_to_pdf)
    paper_id = str(randint(0,60000000))
    with open('static/paper{}.pdf'.format(paper_id), mode = 'wb') as fbuffer:
        fbuffer.write(scihub_bytes.text)
    return paper_id

No DeadlineExceededError this time. Instead I got a MemoryError. It seems that 0.6GB of RAM is not enough to handle 10MB objects (10MB is the space the PDF occupies on disk; things usually take up more space in memory than on disk). So much for my brilliant workaround.

the end of fiscal responsibility

The cheapest non-free GCE instance has 1.7GB of RAM and costs ~$14.97 a month. I got bold and launched it (I looked into AWS EC2’s roughly equivalent instance and it wasn’t any cheaper: $34.96.). At last, after a painful all-nighter, my app was alive.

I mean, I still haven’t added any error checking, but that’s deliberate - I want to see what happens when Sci-Hub can’t find the paper I requested or is temporarily down or whatnot. I’ll add the error checks as the errors happen.

I’ll hate paying these $14.97 but it beats not having access to a resource that is critical for my work. The only alternative I see is to rescue my old Lenovo from semi-retirement and that would be annoying on several grounds (I don’t have a static IP address at home, I would need to leave it up and running all day, it would take up physical space, and so on). So for now $14.97 a month is reasonable. At least that money is not going to Elsevier.

Now that I’m paying for a GCE instance anyway I could move all my code to it (and maybe go back to having a single script) and be done with GAE for this project. But I have this vague goal of making this app public some day, so that other people in my situation can have access to Sci-Hub. And with GAE it’s easy to scale things up if necessary. That isn’t happening any time soon though.

things I learned

It’s not so fun to pull an all-nighter when you are no longer in grad school - we get used to having a stable schedule. But I don’t regret having gone through all these steps in those 36 hours. I used Google Compute Engine for the first time and I liked it. I’m used to AWS EC2’s interface and GCE’s looked a lot more intuitive to me (and I found out GCE has a free micro-instance; even though I ended up not using it for this project it may come in handy in the future). I also familiarized myself with gcloud, which I would have to do anyway at some point. And I also learned a thing or two about cherrypy (like the serve_fileobj method, which makes it really easy to serve static files from memory).

Those 36 hours were also a useful reminder of the difference between programming and software development. Programming is about learning all the things you can do with the tools your languages provide. Software development is largely about learning all the things you cannot do because your runtime environment won’t let you. Our Courseras and Udacities do a great job of teaching the former but we must learn the latter by ourselves, by trial and error and by reading the documentation. I’m not sure that it could be otherwise: loops and lambdas are fundamental concepts that have been with us for decades, but the quirks of GAE’s flexible environment will probably have changed completely in a year or less. Any course built around GAE (or GCP in general or AWS) would be obsolete too soon to make it worth it.

hey, where’s the code?

It’s all here. Have fun!

Automated Democracy Scores

New paper. Abstract:

This papers uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS is based on 42 million news articles from 6,043 different sources and cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today the ADS is replicable and has standard errors small enough to actually distinguish between cases.