# how much is your apartment worth? (part 2)

So, Brazilian banks are using predictive models to do property valuation but they are doing it wrong. It’s time for us data folks to step in and cause some disruption.

our hypothetical property

Let’s assume that we want to find the market value of an apartment in Brasília. It’s 120m2, it’s in the Noroeste sector (Brasília is a planned city, so it has sectors instead of neighborhoods), it has three bedrooms, three bathrooms, and two parking spots. The condo fee is R$580/month and the property tax is R$ 745/month.

getting data

It’s 2019, so there’s no need to go knocking on doors asking people how much they paid for their apartments, how many bedrooms they have, etc. We can simply scrape online listing sites like ZAP, wimoveis, or Viva Real. Any of the three would do (and scraping all three of them would be best). Here I’ll scrape wimoveis for the simple reason that it is the easiest one to scrape.

Granted, that just gives us asking prices, not transaction prices. But the alternative is to go knocking on doors asking people about their apartments. (Unless of course you are a bank and you have tons of mortgage data.)

Here’s the (Python) code I wrote to scrape the result pages:

Every page of results contains up to 20 listings. But it only has summary information for each listing. The full data is in each listing’s own URL. So we need to get the URL of every listing from every page. I use BeautifulSoup for that:

Now we’re finally ready to scrape the listings themselves, with all the data we need (price, m2, pictures, etc).

In the end we’ll have over 15k samples. That’s up from the current 25-250 samples that real estate appraisers are using.

parsing data

Ok, what we have now is a huge mess of HTML and JPG files. The data we need is all buried in those files. We need to extract it.

For now I’ll ignore the JPG files and only use the HTML files.

Hooray, now we’ve put all the (non-image) data in a CSV file with proper column names and everything.

throwing data away

Now that we have all that data it’s time to throw some of it away.

You see, people are lazy. When they list their properties on wimoveis they don’t bother to tick all the boxes - “pool”, “playground”, “A/C”, etc. Whatever they consider relevant they’ll write down in the text field (often with lots of adjectives and exclamation marks). The result is that our CSV file is mostly empty: most of its cells are missing data. This varies according to the feature we’re talking about. But the vast majority of the features have simply too many missing data points to be useful. So let’s clean up a bit.

There! Now we have a clean, usable dataset.

train the model

I tried a few different algorithms to estimate the properties’ asking prices: linear regression, SVM, random forest, boosted trees, neural networks. For each of these algorithms (except linear regression) I tweaked the corresponding parameters a bunch of times (and for neural networks I tried lots of different architectures). The clear winner was boosted trees (which won’t be so surprising to Kaggle competitors).

Just a quick note: we discarded lots of features in the previous step because of missing data. Here we’ll add some of them back. People don’t always tick the “barbecue” box when filling out their listings, but they usually mention it in the text field. So the code below scans the text field looking for certain words.

This gives me a median absolute error of R$46k. In proportional terms (i.e., error / asking price) we have a median absolute error of 10% and a mean absolute error of 23%. Which is line with previous work (see here), where the mean absolute error is 25%-30%, and here, where the mean absolute error is 22%.) We’re not capturing everything here. Say, maybe the property is next door to a police station or to a church or to a loud bar. Maybe there was a murder in the premises. Etc etc. My point is not that these estimates should be final. My point is simply that these estimates are probably closer to the truth than the ones being produced today by professional appraisers all over Brazil. appraise! Alright then, time to appraise our Noroeste apartment. Just append the following lines to the previous code block and run it. And voilà, we have our point estimate: R$ 978k (That’s about US$254k). We also have a prediction interval with lower and upper bounds: [788k, 1060k]. To produce this interval I used something similar to quantile regression. The lower bound is an estimate of the 25th percentile of the distribution. The upper bound is an estimate of the 75th percentile. The point estimate is an estimate of the 50th percentile (i.e., the median). As we have three different models, the lower and upper bounds are not centered around the point estimate (we actually have three point estimates). More details here. text Here I’m scanning the property descriptions for words like “A/C”, “barbecue”, etc, and featurizing them as dummies. But you can use the texts themselves in the model. Just insert the following code between the line where you shuffle the samples and the line where you instantiate the model: The TFIDF matrix is too big - we end up with more columns than samples. So we don’t use it directly, we use LSA to reduce the TFIDF matrix to a documentsXtopics matrix of 400 topics. This improves the performance of the model a bit. But it’s ultimately nonsensical: when you’re appraising a new property you could keep tweaking the text until you get the price you want. So I did this just to inspect which topic vectors would be more relevant (see here), and then which words had more weight in these topics. This helped me decide which words to look for in the text fields (sauna, pool, etc). images I’m still figuring out the best way to use the pictures. I tried using the metadata first: number of pictures, height and width of the pictures, etc. That didn’t improve the model. (I know, shocking. But I like to try the simple things first.) I also checked whether the dominant colors help us predict the price. To do that I clustered the pixels of every picture of the property. Each pixel is defined by three values: R, G, B, each of which can vary from 0 to 255 and represents the intensity of each of the three primary colors (red, green, and blue). So the pixels exist in the same tridimensional space and therefore we can cluster them. The centroid of each cluster is a dominant color. Ideally we’d use DBSCAN for this, as the clusters may have widely different sizes and shapes and we don’t even know a priori how many clusters each picture has. But DBSCAN just takes forever to run. So I used k-means instead. I used the elbow technique to find the ideal number of clusters and it turns out that for most images that number was two or three. That was a massive waste of time. K-means is faster but it still took almost a week to run. And in the end those centroids didn’t improve the model one bit. A friend who knows a lot more about images than I do suggested that I try something along these lines. I.e., having a branched neural network where I can input both structured features (m2, location, etc) and image features. So that’s what I’m trying right now. It’s tricky though because the number of pictures varies across samples, the pictures are of different sizes, and the pictures aren’t standardized in any way (some listings have five pictures of the kitchen and none of the bedrooms, others have no pictures of kitchen, etc). The same friend also suggested that I use some pre-trained neural network capable of identifying objects like “window”, “A/C unit”, and so on, and then use the identified objects as features. That’s the next item on my to-do list. All that said, the truth is that I’m not sure the images will be useful in the end. It’s like with the texts: you could keep taking new pictures of the property until you get the “right” price. I think that’s harder to do with pictures than with texts, but who knows. I need to think more about it. Suggestions are welcome. “someone must be doing it already!” You bet. ZAP is doing it. Which makes sense: they have tons of data, so why not use it? In fact just last month they announced the next step: they’ll start buying property themselves, picking the ones that their model suggests are underpriced. In the US Zillow is doing it. I bet that there are plenty of similar initiatives all over the world. So I’m not proposing anything new here. Which makes it hard to understand why the heck Brazilian banks are not doing it yet. They have billions at stake. incentives matter I know better than to second guess the choices of people with skin in the game. But 70% of all mortgages in Brazil are concentrated in a state-owned bank - Caixa Econômica Federal (CEF). And when the state is involved the incentives are different. It’s not about delivering results but about broadening your mandate, blame-shifting, and securing resources. (If you want to have an idea of how government works, watch HBO’s Chernobyl.) So it’s not a market failure that we have here, but by and large a state failure. CEF’s bureaucrats do not get punished for using the wrong tool to do property valuation. In a state-owned company there is a lot of noise between your actions and your punishments/rewards. Which helps explain CEF’s long history of incompetence and corruption. Not to mention that an entire ecosystem has evolved around the status quo. You see, to be contracted by CEF as an appraiser you need to be certified by the Federal Council of Realtors (yes, that exists). There’s an exam for that and there are schools that charge good money to help prepare you for that exam. So, there are lots of people who benefit from the current system, which makes it harder to change. so long I guess this is it. I hope this post has inspired you to roll up your sleeves and do your own valuations from now on. Happy appraisals! # how much is your apartment worth? I’m buying an apartment and in the process I’ve learned a lot about property valuation. I’ve learned that there are people who do that for a living, that there are academic researchers who specialize in the subject, and that there is even regulation and licensing requirements. In particular, I’ve learned that people are using statistical models - like linear regression - to appraise properties. I thought that was super cool, so I got curious and read a bunch of model-based appraisal reports. And that, dear reader, was a fascinating lesson in the misuse of quantitative tools. And by the entities we’d expect to be most proficient in their use: banks. In what follows I show how Brazilian banks are doing it today and why it’s seriously messed up. Then in part 2 (forthcoming) I’ll show a better way to do it. (Trigger warning: if you’ve recently bought or sold property in Brazil, and if you paid those obscene appraisal fees that Brazilian banks charge, you may want to save yourself some aggravation and stop reading here.) regression and Rio The folks using statistical models to appraise real estate in Brazil are usually banks. If you don’t pay your mortgage the bank gets your property, so the bank needs to know beforehand whether the property is worth enough$.

In Brazil 70% of all mortgages are concentrated in one bank: Caixa Econômica Federal (CEF), which is owned by the state. You’d think that the bank with the most mortgages would be pretty good at estimating the value of real estate. You’d be wrong.

I downloaded dozens of property valuation reports by CEF. Well, they are not actually made by CEF: CEF outsources the appraisals to other companies. But CEF reviews and approves every appraisal report. And in any case the ultimate responsibility rests with CEF.

Let’s look at this report, which is pretty typical.

The property here is a plot of 41.695m2 in a small town not far from Rio de Janeiro (about 100km away). The appraiser started by gathering data on 38 other plots, all in the same small town. For each of the plots he collected four variables: area (in m2), whether the lot is paved, average family income of the area, and price (in R$) per m2. Then he dropped 13 of the 38 samples and used the remaining 25 to run a linear regression: price per m2 ~ area + paved + income. He then used the resulting model to estimate the price of the original plot. The resulting estimate was R$ 1.056.866,78, with an 80% confidence interval of [R$898.423,01, R$ 1.215.513,48]. The appraiser saw fit to manually adjust the estimated value to R$990.000,00 because, well, there’s some stuff that the model doesn’t capture. There is a lot that’s wrong here, but the main thing is: the appraiser doesn’t test the model. Normally, in machine learning tasks like this, we train the model using only a subset of the samples, ask the model to produce estimates for the samples that were left out, then check how close these estimates are to the actual values. Ideally we repeat this several times over, rotating which samples are left out at each iteration. But here there is no separation between training and testing. The appraiser used 100% of the 25 samples to train the model. There were no samples left for testing the model. So we have absolutely no idea how good or bad the model is. Maybe the actual market value of the plot is indeed close to R$ 1.056.866,78. But maybe it’s R$2 million. Or R$ 500k. We have no idea. Since the model isn’t tested, its performance is a complete mystery.

In response to which the appraiser may direct your attention to page 11 of the report, where you see this scatter plot of observed vs estimated values:

Clearly the model has a tremendously good fit: all estimates are super close to the actual values.

Except that that’s cheating: the appraiser used the same 25 samples for both training and testing the model. You don’t know a model’s performance until it has been subjected to samples it hasn’t seen before.

Not that it would make much sense to split training and testing samples with n=25. But if n=25 the thing to do is get more samples (more on this later). A small sample size doesn’t give you a license to simply not test your model.

“but that’s just one report”

Nope, that’s how every single model-based appraisal report I downloaded does it. Every. Single. One. No exceptions. At all. Google ‘CEF laudo imóvel’ or ‘Caixa Econômica laudo imóvel’ or ‘laudo avaliação imóvel’ and check for yourself. The only appraisals that are not like that are the ones that don’t use any statistical model whatsoever.

In other words: billions of R$in real estate transactions are based on property valuations that are completely worthless. let’s worry about all the wrong things If you ask the appraiser why he didn’t test the model he’ll be genuinely shocked and answer that he did test it. And he did run a bunch of tests, as it’s clear on the report: he tested whether the residuals are normally distributed, whether each coefficient is statistically significant, whether the model as a whole is statistically significant, and so on. Other reports go as for as testing for heteroskedasticity and autocorrelation. None of which makes the slightest sense here. This is not an econometrics problem. We’re not interested in the effect of each additional m2 on the price of the property. This is a machine learning problem. We’re interested in producing price estimates as close as possible to the actual prices. We don’t care about the assumptions behind linear regression in this context. In fact we don’t care about linear regression at all. The appraiser could have (and probably should have) used boosted trees or any other machine learning algorithm. The way to go about these things is to try different algorithms and pick the one that produced the best estimates. There is no reason to limit yourself to linear regression. regulation and its unintended consequences To be fair, this is not entirely the appraiser’s fault. It turns out that there is a set of semi-official guidelines for how to use linear regression to do property valuation. That’s regulation NBR-14653-2. It is not legally binding - you don’t go to jail or lose your license if you violate it. But it ends up being enforced anyway. CEF won’t subcontract your company to do appraisals if you don’t follow it. Regulation NBR-14653-2 tells appraisers to check for normality of the residuals, heteroskedasticity, autocorrelation, etc. It doesn’t say a word about testing the performance of the model. It’s completely silent on the topic of training vs testing samples, cross validation, accuracy, etc. In other words, regulation NBR-14653-2 recommends an econometric approach to a machine learning problem, which is bonkers. more wrongness (and a lesson in public policy) Suppose for a moment that the econometric approach were the right one here. Even then the current appraisals would be worthless. Take the report we discussed before. The sample size is 25. That’s just not good enough. “Oh, but it’s a small town, there probably aren’t that many plots for sale.” Yes, but Brazil has over five thousand small towns. Your samples don’t need to be in the same town where the property you’re appraising is. Yes, different towns will differ in GDP per capita, homicide rate, etc. But we have data on all that, so we can include those indicators in our models. And/or dummify “name of town”. Such a small sample is particularly egregious here, since CEF has 70% of all mortgages in Brazil, so they surely have a ton of data they could have used (or shared with the company they contracted to do the job). Imagine having millions of samples and then using only 25. (I suspect this has to do with appraisers’ lack of programming skills. They use software with graphical interfaces and probably just type every data point manually. So 1 million samples would be of no use to them. But I’m just guessing here.) Also, the appraiser doesn’t tell us which 25 - out of the original 38 - samples he actually used in the regression. I considered trying to replicate his results but there are 5.414.950.296 possible combinations of 25 elements out of 38, so that might take a while to run. The appraiser also doesn’t tell us why he dropped 13 of the 38 original samples. Were they outliers? Or maybe dropping them helped produce that incredible R2 of 0.98 we see on page 9…? At times it feels like the appraiser doesn’t really understand what he is doing. Like when he reports a p-value for the dependent variable (R$ per m2). Only independent variables have coefficients. Maybe he is confusing the constant and the dependent variable?

I also don’t know what the variable “average family income in the area” means or where it comes from. What’s “area” here? The neighborhood? The block? The zip code? What’s the source? He says it comes from “senso”, a misspelling of “censo” - census. But I have no idea which census he is talking about.

It’s also weird that he codes “is paved?” as no=1 and yes=2, instead of no=0 and yes=1.

So, just like in the other reports I read, I get the sense that the appraiser doesn’t have any quantitative training. It looks like all he can do is operate the software that produces the estimates (appraisers seem to like SisDEA). You input the data, you press this and that button, the software spits out content that mostly looks Greek to you (some of it is actual Greek), plus an estimate for the property you’re supposed to appraise. You copy and paste everything into a Word document, save it as a PDF file and email it to your client. That’s the job.

The stuff you copied and pasted contains, among other things, tests of the assumptions behind linear regression. You don’t understand any of that, but you see words that also appear on regulation NBR-14653-2 - normality, heteroskedasticity, autocorrelation -, so clearly you’re following the rules. No one can yell at you or fire you, right?

In other words, regulation substitutes for actual knowledge.

(Let this be a lesson to the perpetually outraged online mobs demanding that algorithms be regulated. They think they’ll get Andrew Ng to write the regulations. In the real world regulations are produced by a mix of bureaucrats with no skin in the game and politicians with too much skin in the game.)

“well, criticizing is easy! why don’t you do something about it instead?”

Fair enough. In part 2 I’ll show how we can improve things.

Despite all the hype about AI, product recommendations still suck. Amazon offers me the exact same Bose headphones I bought last month, plus two hundred close substitutes. Spotify thinks that because I enjoy Eurodance from the 1990s I will also enjoy Brazilian music from the 1990s. Netflix thinks that because I enjoy Black Mirror I will also enjoy Dear White People - and that’s despite Netflix knowing that I’m a fan of Rick & Morty, South Park, and any show that has Ricky Gervais. Goodreads thinks that because I enjoyed Ready Player One I will also enjoy Spell or High Water even though I’ve already read Spell or High Water and rated it only one star on Goodreads itself. No wonder there is zero peer-reviewed evidence that Cambridge Analytica swayed even a single voter in 2016. The singularity is nowhere near.

I could write to each of those companies or annoy them on Twitter or whatnot. But I believe in the DIY ethic.

I chose to start with Goodreads. It has by far the most egregious recommender system. And you hear people talk about their favorite TV shows or self-tracking wristbands all the time but you don’t always get to hear people talk about their favorite books.

“do the simple thing first”

I used something called user-based collaborative filtering, which is fancy language for “find other people who like the sort of stuff you like and then check what else they also like.”

There are lots of ways to do user-based collaborative filtering. Some algorithms only require grade school math while others rely on deep learning. (Aggarwal’s Recommender Systems is by far the best textbook on the subject that I know of, in case you’re interested.) Here I used something that only requires grade school math and can’t even be called an “algorithm”: cosine similarity.

But before I talk about cosine similarity let me talk about the data we need to collect in order to use it.

scraping

You start by collecting, for every item you ever rated, all other ratings it has received. And you also collect the identity (name or user ID) of all the other raters - all the people who have rated items in common with you. So I had to collect, for each book I ever rated on Goodreads, all other ratings that book received on Goodreads, along with the user IDs of the other raters.

Unfortunately, Goodreads doesn’t let us see all the ratings each book has received. Goodreads lets us see the full distribution of ratings: forty thousand 5-star ratings, twenty thousand 4-star ratings, and so on (see below). But it doesn’t let us see all the ratings themselves. And if I can’t see a rating I can’t know whose rating it is, so for our purposes here it might as well not exist.

How many ratings does Goodreads let us see? If you visit a book’s page - say, Ender’s Game page - you immediately see up to 30 ratings (fewer if it’s an obscure book that not even 30 people have bothered to rate). If the book has more than 30 ratings you can click “next” on the bottom of the page and that shows you 30 more ratings. You can keep clicking “next” until you get to page ten, which means you can get up to 300 ratings in total. That’s regardless of the book’s popularity: it can be Pride and Prejudice, which has 2.3 million ratings, and you still won’t be able to see beyond page ten.

Now, we can bypass that limit - to a point. We can filter the contents and ask to see just 1-star ratings or just 2-star ratings and so on. And each time we will get back up to ten pages (of 1-star ratings or 2-star ratings or whatever else we chose). Goodreads ratings go from 1 to 5 stars, integers only, so in practice this filtering lets us multiply five-fold the total number of ratings we can collect for each book: we increase it from 300 to 1500.

We can even go beyond 1500. See the “Sort” button in the above picture? We can sort the ratings from older to newer or from newer to older. Either way we still get back up to ten pages. So we can get 1500 ratings sorted from older to newer and then get 1500 ratings sorted from newer to older. That way we end up with 3000 ratings (some - or all - of which will be duplicates when the book is not popular enough to have received more than 3000 ratings in total).

I didn’t do the sorting thing though. It only occurred to me after I had done everything and I’m in no mood to go back and redo everything. So what I have is only up to 1500 ratings per book. Sorry, reviewer #2.

Alrigh then, enough talk. Here is the Python code I wrote to get all the ratings I ever gave on Goodreads. I used Selenium and Chrome. If you’re new to scraping then here’s a tutorial. You don’t need to be logged in to scrape Goodreads (though that reduces the size of the data a little: some Goodreads users choose to make their profiles available only to people who are logged in).

And here is the code I wrote to scrape all the ratings of every book I’ve ever rated. Now, this code takes a while to run - it’s getting every rating of every book I ever read -, so I used Chrome in headless mode.

On Goodreads you can give a star-based rating, as we’ve discussed, but you can also write a review about the book. The code above gets these reviews too. I won’t use them here, but I may in a future blog post (it might be fun to inspect the adjectives most frequently associated with each book).

les data

In total I got 292,043 ratings from 197,548 different Goodreads users. That’s all from 357 books, which is the number of books I had rated by the time I did this. That’s all the data I need to find people who like the same sort of stuff I like.

In fact, that’s actually more data than I need. I have to discard everyone with whom I have only one book in common (more about that in a minute). After doing that I’m left with 43,825 raters and the following distribution:

As you see, the vast majority of these users have only a couple of books in common with me: they are concentrated in the 2-10 range of the X axis. Only a few users have rated more than 10 books in common with me. The maximum is 68 books. That’s Stephen.

Turns out Stephen and I have a similar sci-fi taste. We’re both into the classics - Heinlein, Bradbury, Asimov - and into more recent authors like Neal Stephenson, Charles Stross, and Ted Chiang. We’re both also into economic and moral philosophy: Friedman, Hayek, Emerson, Nietzsche.

But no, I can’t just go after the books Stephen has read and I haven’t. Life is not so simple. His interests are far more diverse than mine: he’s into horror, fantasy, comics, crime, and dozens of other genres, whereas I can go a whole year reading nothing but sci-fi. I’ve only read 2% of the books he’s read, but he’s read 24% of the books I’ve read. So it’s not like Stephen and I are “book twins”, it’s just that Stephen reads a lot, so inevitably he will look like many people’s “book twin” at first glance.

Not to mention that our ratings don’t necessarily coincide. For instance, Stephen is much more indulgent with slow-paced novels than I am. He 4-starred Olaf Stapledon’s Last and First Men and C.J. Cherryh’s Downbelow Station, both of which I abandoned after just a few chapters. I like books that hook you from the very first paragraph.

The moon blew up without warning and for no apparent reason. (Seveneves, Neal Stephenson)

I did two things on my seventy-fifth birthday. I visited my wife’s grave. Then I joined the army. (Old Man’s War, John Scalzi)

Lolita, light of my life, fire of my loins. (Lolita, Vladimir Nabokov)

There are other complications but I’m getting ahead of myself. For now I just want to find potentially similar people, not evaluate each of them. Stephen is a potential neighbor - to use the jargon of user-based collaborative filtering -, but we won’t know for sure until much later. Keep reading.

There’s one complication I need to mention right now though. When I check Stephen’s profile on Goodreads I can see that we actually have rated 93 books in common, not 68. My dataset is lying to me. Why? Well, remember how I only got up to 1500 ratings for each book because Goodreads doesn’t let us see all the ratings? Turns out that for 25 of those 93 books Stephen’s ratings were not in the (up to) 1500 ratings I scraped for each book. Hence those 25 ratings didn’t make it into the dataset.

I could fix that by scraping each rater’s profile and checking all the books we actually have in common. But with 43,825 raters that would take forever. I could parallelize the scraping - have, say, 43k bots scrape ~1k raters each. But Goodreads would surely understand that as an attack and block my IP address. I could get around that by using proxies or remote servers but that’s a lot more effort than I’m willing to put into this. Sorry again, reviewer #2.

Alright, time to go back to the question I left hanging before: why did I have to drop users with only one book in common with me? To answer that I need to (finally) explain cosine similarity.

cosine similarity

Say we arrange the scraped data in a matrix where each column represents a book, each row represents a user, and each cell represents how many stars that user rated that book. Here I have 43,825 users and 357 books, so that’s a 43,825 by 357 matrix.

(Most users won’t have rated most books, so most of the cells are NULL or None or whatever null-like representation you prefer. It can’t be zero though - that would later be interpreted as a rating of zero stars -, so this isn’t really a sparse matrix, it’s something else. I don’t know that it has a specific name.)

Now suppose I pluck two arbitrary (row) vectors from that matrix and, in these vectors, I keep only the columns corresponding to books that both users have rated (i.e., I drop all columns corresponding to books that only one or neither user has rated). If you project the resulting two vectors onto an Euclidian space there will be an angle between them.

Here’s the thing: the cosine of that angle is a measure of how similar the two vectors are. Ta-da. Now you know what cosine similarity means.

A concrete example might help.

Let’s pluck two users from our matrix: myself and someone else. Say, Geneva. We have rated a total of five books in common according to the dataset. Here are our ratings for those five books:

 book Geneva Thiago The Moon is a Harsh Mistress 4 2 Neuromancer 3 3 Snow Crash 4 5 The End of Eternity 4 5 The Last Policeman 4 5

These are our two vectors then: [4, 3, 4, 4, 4] (Genevas’ ratings) and [2, 3, 5, 5, 5] (my ratings).

The first step here is to compute the Euclidean norm of each vector. It’s not hard: for each vector you square every element, you sum up all these squares, then you take the square root of that sum. So for Geneva’s vector the norm is $$\sqrt{4^2 + 3^2 + 4^2 + 4^2 + 4^2}\approx 8.54$$. For my own vector the norm is $$\sqrt{2^2 + 3^2 + 5^2 + 5^2 + 5^2}\approx 9.38$$.

Next we divide each vector’s element by the vector’s norm. Hence Geneva’s vector becomes $$[\frac{4}{8.54}, \frac{3}{8.54}, \frac{4}{8.54}, \frac{4}{8.54}, \frac{4}{8.54}]$$ = [0.46, 0.35, 0.46, 0.46, 0.46] and mine becomes $$[\frac{2}{9.38}, \frac{3}{9.38}, \frac{5}{9.38}, \frac{5}{9.38}, \frac{5}{9.38}]$$ = [0.21, 0.31, 0.53, 0.53, 0.53].

Finally, the cosine itself. It’s just the dot product of those two vectors: 0.46(0.21) + 0.35(0.31) + 0.46(0.53) + 0.46(0.53) + 0.46(0.53) = 0.09 + 0.10 + 0.24 + 0.24 + 0.24 $$\approx$$ 0.93.

How similar are the two vectors if the cosine of their angle is 0.93? They are about as similar as they can be: for any two vectors of our Goodreads matrix the cosine will be in the 0-1 range. The closer to 0, the more different the two vectors are. The closer to 1, the more similar.

(Before the math police come after me: “in abstract” a cosine can vary from -1 to +1. But in our Goodreads matrix we only have nonnegative values; Goodreads doesn’t let you rate a book negative stars. So here the cosine between any of our two vectors has to be in the 0 to +1 range.)

If I compute the cosine between my own vector and every other vector in our matrix then I’ll know which users are most similar to me, book-wise.

Perhaps now you can see why I dropped users who only had one book in common with me. An angle cannot exist between two points, only between two vectors. And I need at least two points to form one vector.

Note that I have only considered the five books that Geneva and I have both rated. We can’t compute the cosine between two vectors that are not the same size. If I add a sixth book that’ll be a book that either only Geneva has rated or only I have rated or neither of us has rated. We’d have one vector with six elements and one vector with five elements (no, we can’t input zero or any other value to fill the gap). That wouldn’t work.

Another thing to keep in mind is that, as I explained before, I only got up to 1500 ratings for each book, so the number of books I have in common with each of the other Goodreads users is actually higher than what’s in the dataset. I checked Geneva’s profile and it turns out that we share 21 books, not five. So by computing the cosines between my ratings and everyone else’s ratings I get a rough idea about who my closest neighbors probably are; I don’t get a precise, 100% accurate ranking of my closest neighbors.

too many book twins

So I sat down and computed the cosine similarity between me and every other person on my Goodreads dataset. (Well, actually I sat down and wrote code to do that for me.)

Turns out there are lots of people for whom cosine=1, the maximum possible value. But no, I don’t have lots of book twins out there. The problem is: a cosine of 1 is pretty common when the two vectors are short.

Take user Ajay, for instance. We have two books in common: How to Lie with Statistics and Skin in the Game. He rated both 5 stars. I rated both 3 stars. If you do the math you’ll find a cosine of 1. Or take Derek. We also have two books in common: The Signal and the Noise and Neuromancer. He rated them 3 and 2 stars respectively. I rated them 4 and 3 stars respectively. Cosine is also 1.

Let’s put this in perspective. Remember Stephen? Our cosine similarity is 0.93. Now that’s a guy with whom I share 68 books. Should Ajay and Derek rank higher than Stephen on my similarity ranking? I don’t think so. I’d much rather read a book suggested by Stephen than a book suggested by Ajay or Derek.

Alright then, should I drop everyone with fewer than, say, ten books in common with me? Nope. Not all books are equally important. The two books I share with Ajay and the two books I share with Derek are all pretty popular. Lots of people have read them, most of which are certainly not in my tribe. Now what if I had only two books in common with someone but those two books were, say, David Wingrove’s The Middle Kingdom and John Ringo’s Live Free or Die?

That would be a different story. Those two books are excellent but little known. The Middle Kingdom, published in 1989 - back when China’s GDP was 3.8% of what it is today -, shows a future where history has been rewritten and schoolchildren learn that the Chinese have been ruling the world since general Ban Chao (an actual person) destroyed the Roman Empire in the first century AD. The Middle Kingdom makes Lord of the Rings and Game of Thrones seem juvenile and unimaginative. And yet it has only 1.3k ratings on Goodreads. Live Free or Die, in turn, is about a guy named Tyler Vernon. Earth has been conquered by an alien race - the Horvath. Tyler sells maple syrup to other alien races (it works as a mixture of cocaine and aphrodisiac for them). He then uses the money to buy alien military technology and fight back against the Horvath. It’s great, unapologetic, libertarian sci-fi - government fails to fight off hostile aliens, private sector steps up; it’s Robert Heinlein meets Ayn Rand. And yet only 7.3k brave souls have rated it on Goodreads.

My point is: if I only have two books in common with you and these two books are The Middle Kingdom and Live Free or Die then I definitely want to hear your book recommendations. Those two books are too good and too obscure. Reading them certifies you as the sort of weirdo I want to get book recommendations from. That’s why I can’t simply drop people with whom I have only two or three books in common. They may be the right books.

What to do then?

measuring how weird you are

The solution is to use something called Inverse User Frequency (IUF). Here is how it works. Goodreads has ~65 million users. To Kill a Mockingbird has been rated by 3.6 million users. So its IUF is log(65/3.6) $$\approx$$ 2.8. The Middle Kingdom, on the other hand, has been rated by only 1.3k users. So its UDF is log(65/0.0013) $$\approx$$ 10.8. So, The Middle Kingdom is almost four times weirder than To Kill a Mockingbird (without taking the log the differences would be too extreme; we want to take the book’s popularity into account but not too much). (If you’re familiar with natural language processing you will notice this is similar to Inverse Document Frequency.)

So instead of computing raw cosine similarities we are going to compute weighted cosine similarities, with each rating being weighted by the IUF of the corresponding book.

To give a concrete example, let’s go back to Geneva. Here are the IUFs of the books we have in common.

 book raters IUF The Moon is a Harsh Mistress 91k 6.5 Neuromancer 216k 5.7 Snow Crash 200k 5.7 The End of Eternity 34k 7.5 The Last Policeman 20k 8

Now that we have the IUFs we incorporate them in the cosine computation.

That’s more like it. The books I have in common with Geneva are pretty popular and that’s reflected in our similarity going down from 0.93 to 0.15. Good. Now, the decrease is not as drastic as it seems. Weighting by IUF makes all cosine similarities smaller. Let’s go back to the hypothetical example of someone with whom I share two books and they are The Middle Kingdom and Live Free or Die. If that other person has rated both 4 and I have rated both 5 then our raw cosine similarity is 1 but our weighted cosine similarity is only 0.5.

still doesn’t quite feel like my tribe…

I adjusted the cosine similarities to account for the IUF, looked into the people with the highest similarity scores and… things were still a bit odd. When I looked into the top 10 folks and the books they’ve rated - and what their ratings were - I just couldn’t quite picture myself taking book recommendations from them in real life (like, if I had met them at a café). I know, this is super wishy-washy, but I’m sure you get the idea.

Turns out I had neglected to take into consideration that some people are just more generous and forgiving than others. Some people have a mean rating of 4.5 while others have a mean rating of 2.5. So we need to mean-center people’s ratings before we compute the similarity scores. As in: if your mean rating is 2.5 and you rated Snow Crash 4 stars then I’ll actually consider it 1.5 stars when computing our cosine similarity.

So, I redid everything but mean-centering the ratings.

did it work?

Well, we can’t answer that question just yet. First we need to answer another question: how do we know how good (or bad) the results are?

Turns out there is a simple metric for that. I can use the similarities I generated to “predict” my own ratings for books I have already rated, then compare the predicted ratings to my actual ratings.

Say I take Stephen and Geneva, which we’ve encountered before. Stephen rated Neal Stephenson’s Snow Crash five stars. Geneva rated it four stars. My similarity with Stephen is 0.93. My similarity with Geneva is 0.15. So if I wanted to use only Stephen and Geneva to produce a prediction, I could make it $\dfrac{5(0.93) + 4 (0.15)}{0.93 + 0.15} = 4.86$. In other words, I could use a weighted average of Stephen’s and Geneva’s ratings to “predict” what my own rating of Snow Crash would be (the weights being our respective similarities). My actual rating of Snow Crash is five stars, which is pretty close to 4.86.

I can do this operation for all books I ever rated. And then I can check how close my predicted ratings and my actual ratings are. How can we measure that? Hhmm, if only there was a way to compare the similarity of two vectors… Oh wait, there is: cosine similarity. So I use cosine similarity twice: first to compute user-user similarities, then to assess the quality of the predictions those similarities produce. What a handy tool.

Now, here I don’t want to use all Goodreads users I scraped. The whole point of this exercise is to find my tribe, i.e., the people whose literary tastes are most similar to mine. That’s the folks I want to get book recommendations from. So I have to define a threshold - say, my top 100 or my top 10000 most similar users. And then I predict the ratings for the books I have rated in common with these top k people - my top k “neighbors”, as textbooks refer to them.

There is a trade off in choosing k. The higher the k the more neighbors you have to get predictions from, so the more books you have predictions for. But the higher the k the more you get predictions from neighbors who are not so close to you. Here I tried a bunch of different values for k and it turns out that k=1000 seemed like a good compromise.

Now we’re ready to the answer the original question: how good are the results? Turns out the cosine similarity between my predicted ratings and my actual ratings, using my top 1000 neighbors, is 0.91 (after adding back the mean, which we removed before - remember?). Since cosines vary from 0 to 1, that seems pretty good. Looks like I’ve finally found my tribe.

so what?

Ok, so how do we generate book recommendations? Well, I just “predicted” what my ratings would be for books I have already rated. That’s how I assessed the quality of the similarities. (That’s how we assess any predictive model: we “predict” things that have already happened, then check how right or wrong those “predictions” are.) Now, to get actual book recommendations I simply predict what my ratings will be for books I have not rated and I pick the books which I’m predicted to rate highly. So this time I will predict without quotes: what I’m predicting hasn’t happened yet.

I went back to my top 1000 neighbors and scraped their ratings for every book they have ever rated (and that I never did). Then I predicted what my own rating would be for each one of those books.

That resulted in a ranking of 29,335 books. Here are the top 12:

• Jim Butcher’s Storm Front. This is the first of a series of books, several of which were in the top 100. I had never heard of it before. I’m not really into the fantasy genre, but I guess I’ll give it a try.

• Brandon Sanderson’s The Way of Kings. Another first book of a fantasy series I had never heard of.

• Carl Sagan’s Cosmos. I’ve seen the Neil deGrasse Tyson show and I loved it, so I’m surely going to read this.

• Brian Vaughan and Fiona Staples’ Saga. I’ve never read a graphic novel before and I never thought I would, but I guess there is a first time for everything.

• Terry Pratchett’s The Color of Magic. Yet more fantasy. This one I had seen around but the book description hadn’t appealed to me.

• Mario Puzo’s The Godfather. Ok, this probably should have been in my want-to-read shelf already.

• Yuval Harari’s Sapiens. At one point I had this book in my want-to-read shelf but then I read Harari’s other book, Homo Deus, and I didn’t really enjoy it, so I gave up on Sapiens. I guess I’ll put it back there now.

• Brandon Sanderson’s The Final Empire. Yet more fantasy I had never heard of.

• David Weber’s On Basilisk Station. Now, that’s what I’m talking about! Finally some sci-fi in my recommendations. This is probably the one I’ll read first.

• Patrick Rothfuss’ The Name of the Wind. More. Freaking. Fantasy. This one I had seen around before but the title is so uninspiring that I never even bothered to read the book’s description.

• Dostoyevsky’s The Brothers Karamazov. Not sure what to do about this one. I endured Tolstoy’s Anna Karenina and War and Peace and from what I’ve heard Dostoyevsky is similarly soporific. I may not have the fortitude to go through this one.

• J.J. Benítez’ Jerusalén. This one I read before - some 25 years ago, as a kid. (But I rated a different edition on Goodreads, so it ended up on my recommendations here.) This was the first book about time travel I ever read and I absolutely loved it. Now, my thirteen-year-old self was not a particularly discriminating reader. So I think I’ll pass. I don’t want to ruin the good memories I have of it.

And voilà, I have built and used my own book recommender system.

These results are less exciting than I expected. I don’t understand all the fantasy books. I’ve read a fantasy novel here and there but I’ve read all of Asimov’s Robot, Empire, and Foundation novels, most Neal Stephenson books, a lot of Robert Heinlein and John Scalzi - so why the heck is there only one sci-fi book in my top 12 recommendations?

I guess some of the oddities may be due to my including non-fiction books when computing the cosine similarities. The fact that you and I have both read Daron Acemoglu’s Economic origins of dictatorship and democracy and William Greene’s Econometric analysis and Luciano Ramalho’s Fluent Python may result in a high similarity score but that doesn’t mean we like the same stuff when it comes to fiction. Dropping non-fiction books might improve the results. That would require going through each of my books and flagging them as either fiction and non-fiction, then redoing everythig I’ve done here, so I’m not doing it. You and I will just have to live with that, reviewer #2.

But there is a lot more room for improvement here - it’s not just about dropping non-fiction books. Let’s talk about some possibilities.

what next?

If you’re building a recommender system on top of cosine similarity or some other such metric the way to go about it is to tweak the similarity formula in all sorts of ways - give more weight to this or that, etc -, check how good or bad the resulting predictions are in each case, then pick the formula that yields the best predictions. I did a little tweaking here (IUF, mean-centering), but if you’re serious about it you should try a lot more.

Besides, there are other ways to build recommender systems that don’t rely on hard-coded similarity formulas. For instance, we could use linear regression to learn the weight each neighbor should have. We’d still end up with a weighted average for the predictions, but the weights would have been learned from the data.

Alternatively we could use algorithms like random forest, support vector machine, or neural network. These algorithms have one big advantage over a linear regression: they learn weird relationships between the variables you have. It could be, for instance, that I’m unlikely to enjoy books that are recommended by both Stephen and Geneva but likely to enjoy books that are recommended only by Geneva and not by Stephen or vice-versa. Or that I’m super likely to enjoy a book recommended by Peter only if his rating is exactly five stars and all other ratings are exactly one star. Linear regression won’t capture any of that unless you know beforehand what these oddities are and model them explicitly. But random forest, SVM, and neural networks can learn those oddities from the data and incorporate them into their predictions. The drawback is that if you don’t have a lot of data these algorithms can easily mistake noise for signal and incorrectly pick up idiosyncrasies.

You could also use clustering algorithms, like k-means or DBSCAN. They work by grouping (clustering) samples (in our case, those would be Goodreads users) together according to their similarities.

Now, using linear regression, random forest, SVM, neural network, or clustering requires overcoming a problem: missing data. Here my top 1000 neighbors don’t have a single book that they all have rated in common. There are too many missing data points to run a regression or do anything of the kind. So before doing any of that you’ll need to either impute the missing data points or use matrix factorization to reduce the dimensionality of your dataset. The exception is k-means, which is relatively straightforward to adapt for use with massive missing data (see the Aggarwal book I mentioned before, chapter 12).

An alternatively that bypasses the missing data issue entirely is to model the data as a big graph, which is just fancy data science jargon for a network, like Facebook or Twitter: you have nodes (people and pages) and links between them (A follows B, A is friends with B, etc). Here we could model each Goodreads user as a node and each rating that two users have in common as a link between them. Then we could use a class of algorithms known as community detection algorithms to find what the “cliques” are.

Here I was just building a recommender system for myself, for the fun of it, so I’m not going to do any of that. But if you’re doing something serious with it then you probably should try at least some of these alternatives and see which one yields the best predictions.

Outside of a dog, a book is man’s best friend. Inside of a dog it’s too dark to read. (Groucho Marx)

# using Slack as middleware

I started working remote a few weeks ago. It’s sheer awesomeness - no distractions, no commute, no shabby cafeteria lunch. But there was one minor glitch: I was having trouble accessing my organization’s proprietary data. Being a government agency, we hoard tons of sensitive data on people - addresses, taxpayer IDs, personal income, etc. So, naturally, we restrict access to our databases. There is no URL I can go to when I need to run some database query; I need to be inside my organization’s network to run database queries.

I can use a VPN to log into my office PC from any machine. And once I’m logged into my office PC I can access whatever I want. But that means being forced to use my lame, old office PC instead of my sleak, fast MacBook Pro. Using the VPN also means enduring a maddening lag. It’s only a fraction of a second, but it’s noticeable and it can drive you insane. Finally, using the VPN means I have no local internet access - while I’m in the VPN my laptop has no other contact with the outside world, which results in my not being able to use Spotify and a bunch of other stuff. And I need my 90s Eurodance playlists to be in the proper mindset for writing code.

After enduring all that (I know, tiny violin…) for a couple of weeks I decided to do something about it. I realized that I needed a “bridge” between the data and my laptop. Some web service of sorts that could receive data requests and respond to them. Now, I’d never trust myself to build something like that. I don’t know nearly enough about information security to go about building that sort of tool. I don’t wanna be the guy who caused every Brazilian’s monthly income to be exposed to the world.

Then it hit me: I don’t need to build anything like that, these tools have already been built and we already use them to exchange sensitive data. I’m talking about messaging apps like Slack, Telegram, and the like. My office PC can access Slack. My personal laptop can access Slack. Slack is what my team uses for communication, which means sensitive information already circulates through it. In sum, the middleman I needed was already in place. All I had to do was to repurpose it. And that’s what I did. What follows bellow is a brief account of how I did it, in case other people may be in the same situation.

step 1: stuff that doesn’t involve code

The first thing you need to create are the Slack channels that will handle the data requests. I chose to create two - #incoming, to receive the data requests, and #outgoing, to send the requested data. I made them private, so as not to annoy my teammates with notifications and messages they don’t need to see. Alternatively, you could create an entirely new Slack workspace; that way you isolate human messaging from bot messaging.

Once you’ve created the channels you’ll need to create a Slack bot. It’s this bot that will: a) read the data requests that arrive on #incoming; and b) post the requested data to #outgoing. Slack lets you choose between two types of bots: “app bots” and “custom bots”. They nudge you towards the former but the latter is a lot more straightforward to set up: just click here, click “Add Configuration”, and follow the instructions. When you’re done, write down your bot’s API token - it’s the string that starts with xoxb- -, and, on your Slack workspace, invite your bot to join #incoming and #outgoing.

We need to make sure that your Slack bot can read from #incoming and post to #outgoing.

Let’s start with reading. There are a number of ways to go about this - Slack has a number of APIs. I think the Web API is the best pick for the impatient. Now, the documentation doesn’t have a quickstart or many useful examples. The explanations are verbose and frustratingly unhelpful if you just want to “get it done” quick. So instead of making you read the docs I’ll just give you what you need to know: make a GET request to https://slack.com/api/groups.history?token=xoxb-your-bot-token&channel=id_of_incoming, where xoxb-your-bot-token is the token you wrote down in step 1 and id_of_incoming is the ID of the #incoming channel (it’s the endpoint of the channel’s URL). That will return to you the channel’s messages (up to 100 messages). If there are no messages in #incoming you won’t get anything interesting back. If that’s the case, just post anything to the channel first.

In real life you won’t be using Terminal/cmd for this, you’ll be using a Python or R script or something along these lines. Here’s how to do that in Python:

What you get back is a Python dict which should have a key named ‘messages’. So, if 'messages' in 'response', then inside response['messages'] you’ll find a list of dicts, each dict being a message, each dict’s key being an attribute of said message (timestamp, text, user who posted it, etc).

Now, you don’t want to access #incoming’s entire history every time you poll it. You can include a parameter named oldest in the params dict and assign a timestamp to it. Then read_messages won’t return messages older than the specified timestamp.

(A little gotcha: what you pass as channel is not the channel’s name but the channel’s ID, which you can get from its URL. Some Slack methods do accept the channel’s name but I never remember which ones, so it’s easier to just use the channel’s ID for everything.)

(Because you went with a custom bot instead of an app bot you won’t have to deal with a bunch of error messages having to do with something Slack calls “scope”. You won’t waste two days in a mad loop of trying to get the scopes right, failing, cursing Slack, refusing to read the API documentation, failing, cursing Slack, refusing to read the API documentation. I envy you.)

Alright then, let’s move on to posting. Here’s how you do it: make a POST request to https://slack.com/api/chat.postMessage, using your bot’s token, your channel’s ID, and the message of the text as payload. Like this:

There. Once you run this code you should see the message “hey macarena” appear in #outgoing.

step 3: receiving #incoming messages

Ok, now you need a server-side program that will check #incoming for new messages - say, every five seconds or so. By server-side I mean it will run inside your company’s network; it needs to run from a machine that has access to your company’s databases. Here’s an example:

Now, you probably want this “listener” to run in the background, so that you can log off without killing it. If you’re running it on a Linux machine the simplest solution is to use tmux. It lets you create multiple “sessions” and run each session in the background. If you’re doing it on a Windows machine you can use cygwin or, if that’s Windows 10, you can use tmux with the native Ubuntu binaries.

step 4: processing #incoming messages

Receiving messages is not enough, your script needs to do something about them. The simple, quick-and-dirty solution is to have your #incoming messages be the very database queries you want to run. An #incoming message could be, say, SELECT [some_column] FROM [some].[table] WHERE [some_other_column] = 42. Then the listener (the server-side program we created before) would read the query and use an ODBC package - like pyodbc or rodbc - to run it. If that works for you, here’s how you’d amend the listener we created before to have it handle SQL queries:

Ok, I’m glossing over a bunch of details here. First you’ll need to set up an ODBC driver, which isn’t always easy to get right the first time - it depends on what SQL engine you have (SQL Server, MySQL, etc), on whether your script is running on Linux or Windows, and on what credentials you’re using to connect to your SQL engine. I can’t really help you out on this, you’ll have to google your way around. If you’ve never set up an ODBC connection before this is probably the part that’s going to take up most of your time.

Once the ODBC part is taken care of, leave the script above running and post some SQL query on #incoming. You should see the the result set of the query. Well done then, everything is working so far.

step 5: replying to #incoming messages

So you have a script that receives queries and executes them. Now your script needs to post the result sets to #outgoing. There really isn’t much mystery here - we already wrote post_to_outgoing above. The only thing left is to convert our result set into a string, so that Slack can accept it. In Python the json module handles that for us: json.dumps(your_data) takes a list or dict (or list of dicts, or dict of lists) and turns it into a string. It’s all below.

Ta-da. As long as this script is running continuously inside your company’s network you no longer need a VPN to query name_of_your_database. If you want more flexibility you can tweak run_query so that it takes the name of the database as a second argument. And you should sprinkle try/except statements here and there to capture database errors and the like.

You’ve taken remote work one step further. It’s not only you who can work remote now: the applications you develop no longer need to live inside your company’s network. You can develop on whatever machine and environment you choose and have your applications post their queries to #incoming and retrieve the result sets from #outgoing.

One gotcha here: Slack automatically breaks up long messages, so if your query exceeds Slack’s maximum length it will be truncated and run_query will probably (hopefully) raise an error. Keep it short.

step 6: make it neat

Alright, you have a functioning “bridge” between you and your company’s databases. But that’s still a crude tool, especially if you will develop applications on top of it. You don’t want your apps to post raw SQL queries on Slack - that’s a lot of unnecessary characters being passed back and forth. Instead of a run_query function you should have a get_data function that stores a “template” of the query and only adds to it, say, the part that comes after the WHERE [something] = . Something like this:

This is still too crude to be called an API but it’s a first step in that direction. The idea is to make the Slack-your_app interface as tight as possible, so as to minimize the types of errors you will encounter and to minimize the exchange of unnecessary strings. If you know exactly the sort of stuff that will be passed to get_data it’s easier to reason about what the code is doing.

I think this is it. Happy remoting!

# doing data science in the government

Today it’s been three years since I first started working as a data scientist in the Brazilian government. Overall it’s been a great experience and I think this is a good time to reflect upon what I’ve learned so far. I’ll start with the ugly and the bad and then I’ll move on to the good.

bureaucrats won’t share data unless physically compelled to do so

There is no shortage of decrees saying that government agencies will give their data to other government agencies when requested to do so (here’s the latest one). But between the pretty text of the law and what really goes on in the intestines of the bureaucracy there is a huge gap. Every government agency is in favor of data sharing - except when it comes to its own data.

Excuses abound. I can’t share my data because it’s too sensitive. I can’t share my data because extracting it would be too costly. I can’t share my data because we’re in the middle of a major IT restructuring and things are too messy right now. I can’t share my data because we’ve lost its documentation. I can’t share my data because the IT guy is on vacation. I can’t share my data because you might misinterpret it (this is my favorite).

The actual reasons are not always easy to pinpoint. Sometimes there are legitimate concerns about privacy, as in the case of fiscal data. But this is rare. More often than not the alleged privacy concerns are just a convenient excuse. In some cases the privacy excuse is used even when the data is already public. For instance, everything the government buys (other than, say, spy gear) goes in the official bulletin, which anyone can read. The equivalent of the IRS in Brazil - Receita Federal - has all the corresponding tax invoices neatly collected and organized in a database. That database would make it much easier to compute, say, the average price the Brazilian government pays for ballpoint pens. You’d think that database would be readily available not just for the entire government but for the citizenry as well.

You’d be wrong. Government agencies have been trying - and failing - to put their hands in that database for years. Our IRS says it’s protected by privacy laws. But the law says that government purchases must be public. And they already are, it’s all in the official bulletin - but that’s unstructured data that would require lots of scraping, OCRing, and parsing. The data is already public but not in a machine-readable format. That’s why the IRS database is so valuable. But the IRS legal folks haven’t found their match yet.

(Meanwhile data that is actually sensitive - like people’s addresses and tax returns - can be bought for $10 from shady vendors not too hard to find; all it takes is a walk along Rua 25 de Março, in São Paulo. In Brazil if the government has any data on you then you can be sure it’s for sale at Rua 25 de Março.) Sometimes bureaucrats don’t share data because they worry that the data will make them look bad. What if my data shows that I’m paying too much for office paper? What if my data shows that I have way more people than I need? What if my data shows that my agency is a complete waste of taxpayers’ money and should not exist? I suppose I understand the survival instinct behind such concerns, but those are precisely the cases where we absolutely must get the data, for that’s where the ugliest inefficiencies are. In general the more an agency refuses to share its data the more important it is to get it. (It doesn’t help that I work for the Office of the Comptroller-General, whose main goal is to find and punish irregularities involving taxpayers’ money. People sometimes assume that our asking for their data is the prelude of bad things. We need to explain that inside the Office of the Comptroller-General there is this little unit called the Observatory of Public Spending and that our goal at the OPS is often to help other agencies by extracting insights from their data.) government data is a mess The idea that you should document your databases is something that hasn’t taken root in the Brazilian government. Nine times out of ten all you get is a MSSQL dump with no accompanying data dictionary. You try to guess the contents based on the names of the tables and columns, but these are usually uninformative (like D_CNTR_IN_APOSTILAMENTO or SF_TB_SP_TAB_REM_GDM_PST - both real-life examples). So you end up guessing based on the contents themselves. As in: if it’s an 11-digit numeric field then that’s probably the Brazilian equivalent of the Social Security Number. As you might imagine, sometimes you guess wrong. You mistake quantity for total price and vice-versa and the resulting unit prices don’t make any sense. You think that a time field is in years but it’s actually in months. You mistake an ID field for a numeric field. Etc etc. You’re lucky when you catch the errors before whatever you’re doing becomes policy. And some fields you just leave unused because they are too mysterious and you can’t come up with any reasonable guesses. Like when it’s probably a dummy variable because the values are all 0s and 1s but you have no clue what the 0s and 1s mean. Besides the lack of documentation there are also many errors. Null names, misclassification, missing data, typos (according to one database the Brazilian government signed an IT contract of US$ 1 quadrillion back in 2014 - that’s 26 times the budget of the entire US government). To give you an idea, half the government purchases classified as “wheeled vehicles” are under R$1,000 (roughly US$ 300); when we inspect the product descriptions we see that they are not actually vehicles but spare parts, which have a different code and should have been classified elsewhere.

The problem begins in the data generation process, i.e., in the systems bureaucrats use to enter the data. These systems are too permissive; they lack basic validation like checking input type (numeric, text, date, etc), input length (does the state code have more than two characters?), and the like. And there is no punishment for the bureaucrat who enters incorrect data.

The most frequent result is absurd averages. You try to compute, say, spending per employee, and what you get back is a bazillion dollars or something close to zero. That means many hours of data cleaning before we can really get started. You have to do all the sanity checks that the government systems fail to do - or else it’s garbage in, garbage out. After you’ve filtered out the US$1 quadrillion IT contracts and the US$ 0 hospitals and schools you are left with lots of missing data - and that’s often not missing at random, which poses additional problems. It’s no wonder that most of our academic work is about cleaning up data (like here and here).

recruiting is a different ball game

The Brazilian government does not deliberately recruit data scientists. Data scientists come in through a bunch of miscellaneous doors and then we find each other - by word-of-mouth or Twitter or conferences - and come up with ways to work together. By now there are a few well-known “data science places” in the government - like the OPS (where I work) and the TCU - and data-curious government employees have been flocking to them, by various means; but there isn’t a clear policy to attract data talent.

In order to enter the Brazilian government you usually have to pass a public exam and such exams do not cover any content even remotely related to data. What you do need to learn to pass these exams is a large number of arcane pieces of legislation, mostly relating to government procedures - like the three phases of a procurement process, what they are called, what the deadlines are, what the many exceptions are, and so on. As you may imagine, that doesn’t usually attract people interested in data. A few of us slip through the cracks somehow, but that’s largely by accident.

There is also the issue of salary negotiations. The issue being that they are not possible. When you work for the government the law determines your salary - there is no room for negotiation. Sometimes you can offer a potential candidate a little extra money if they agree to take up some administrative responsibilities but this is usually under US$1000/month and most data scientists prefer to avoid having any responsibilities that involve paperwork. So whoever you are trying to lure must be really excited by the work you and your team do because the work itself is pretty much all you have to offer. But enough with the bad and the ugly. you have a lot of freedom to experiment Paradoxically, in the midst of this red tape jungle we have a lot of leeway to play around and try new things. Data science is a highly technical subject, one that takes at least a few Coursera courses to even begin to grasp, and that helps keep it unregulated. We have to fill out the same stupid forms everyone else does when we want to buy stuff, but whether we use Hadoop or not, whether we adopt Python or R for a project, whether we go with an SVM or a neural network or both, and whether we think any given project is worth pursuing is all entirely up to us. Legal doesn’t have a say in any of that. The budget folks don’t have a say in any of that. The minister himself - heck, the president himself - doesn’t have a say in any of that. They wouldn’t even know where to start. So, thanks to the highly specialized nature of our trade we don’t have higher-ups trying to micromanage what we do. There is also the tenure factor. You see, in Brazil once you enter the civil service you get automatically tenured after three years. And the constitution says that, once tenured, you need to do something really outrageous to get fired - and even then there are several appeal instances and often times nothing happens in the end. I bet that if I showed up naked for work tomorrow I still wouldn’t get fired; I might get a written warning or something along these lines, and I’d probably appear in the local news, but I would still have my job. It takes something outright criminal to get a government employee fired. Like, they need to catch you taking a bribe or not showing up for work for months. And even then sometimes people don’t get fired. Overall tenure is bad: too many lazy idiots spend their days browsing Facebook and entertaining themselves with hallway gossip. But for experimenting purposes tenure is great. It makes “move fast and break things” possible. Some bureaucrats want our assistance and happily give us their data and collaborate with us, helping us understand their problems and needs. But other bureaucrats get upset that you’re even daring to ask for their data. And they worry that you might disrupt the way they work or that you might automate them altogether. If we had to worry about our jobs at every step of the way we wouldn’t accomplish much. Without tenure heads might roll. you can help taxpayers; a lot The Brazilian government is humongous - it takes up ~40% of the country’s GDP. Most of that money goes down the toilet: the government overpays for everything it buys; it contracts suppliers that do not deliver the goods and services they promised; corruption is generalized. But data science can help. For instance, at the OPS we have trained a model that predicts whether a supplier is likely to become a headache (say, because it won’t deliver or because it will shut down). (Here’s a talk I gave about it earlier this year.) We’re now refining that model so that we can later appify it and plug it into the government’s procurement system. That way the government will be alerted of potential problems before it signs the contract. That project has taken a lot of time and effort - the first version of the model was the master’s research of my colleague Leonardo and since then he and other people have put a lot more work into it. That’s a lot of salary-hours. But if a single problematic contract of small size - say, US$ 100k or so - is prevented because of the model then all that effort will have been worth it. And given the size of the government’s budget - around US$1 trillion a year - we should be able to save a lot more money than US$ 100k. That’s money that could go back to taxpayers.