measuring statism

When it comes to economic freedom, Brazil ranks a shameful 144th - behind former Soviet republics like Ukraine (134th) and Uzbekistan (114th). Down here the recently announced PlayStation 5 is going to sell for twice its US price, because decades ago some academics and bureaucrats decided that heavy taxes on videogames would make industrialists manufacture “useful” stuff instead (cars and whatnot). We even try to regulate the work of people who watch parked cars for a living (I kid you not).

But how does that interventionist appetite vary across economic activities?

People try to answer that question in all sorts of ways. They look at how much money each industry spends on lobbying. They look at how often lobbyists travel to the country’s capital. They look at how well-funded each regulatory agency is. They count the number of words in each industry’s regulations (RegData is a cool example; Letícia Valle is doing something similar for Brazil).

Here I’ll try something else: I’ll look at the companies that get mentioned in the Diário Oficial da União, which is the official gazette where all the acts of the Brazilian government are published. I’ll look up what each company does (mining? banking? retail? etc) so that I can aggregate company mentions by economic sector. That will let me know how much government attention each sector receives.

(Yes, there are lots of caveats. We’ll get to them in due time - chill out, reviewer #2.)

mining the Diário Oficial da União

Each company in Brazil has a unique identifier. It’s like a Social Security Number, but for organizations. It’s a 14-digit number; we call it CNPJ (Cadastro Nacional de Pessoas Jurídicas). Here is an example: 20.631.209/0001-60. When a company is mentioned in the Diário Oficial da União, that company’s CNPJ is there.

I had already downloaded the Diário Oficial da União for something else. Well, not all of it: the Diário was launched in 1862, but the online version only goes back to 2002. So here we have about 18 years of government publications.

To get all CNPJs mentioned in the Diário I used this regular expression:

'([0-9]{2}[\.]?[0-9]{3}[\.]?[0-9]{3}[\/]?[0-9]{4}[-]?[0-9]{2})'
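
Combined with the standard CNPJ check-digit rule (to drop false positives like random 14-digit numbers), the extraction looks roughly like this - a sketch of my reconstruction, not necessarily the exact code I used, with `text` standing in for the contents of one issue:

import re

cnpj_regex = r'([0-9]{2}[\.]?[0-9]{3}[\.]?[0-9]{3}[\/]?[0-9]{4}[-]?[0-9]{2})'

def valid_cnpj(match):
    '''checks the two verification digits of a CNPJ (the usual mod-11 rule)'''
    digits = [int(c) for c in match if c.isdigit()]
    if len(digits) != 14:
        return False
    for n, weights in ((12, [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]),
                       (13, [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2])):
        total = sum(d * w for d, w in zip(digits[:n], weights))
        expected = 0 if total % 11 < 2 else 11 - (total % 11)
        if digits[n] != expected:
            return False
    return True

# `text` stands in for the text of one issue of the Diário
text = '... a empresa XYZ, CNPJ 20.631.209/0001-60, fica autorizada a ...'
cnpjs = [m for m in re.findall(cnpj_regex, text) if valid_cnpj(m)]
print(cnpjs) # ['20.631.209/0001-60']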

I didn’t mine the whole Diário. The Diário is divided into three sections: section 1, which has laws and regulations; section 2, which has personnel-related publications (promotions, appointments, etc); and section 3, which has procurement-related publications (invitations for bids, contracts, etc). I only mined section 1, as the other sections would merely add noise to our analysis. Just because your company won a contract to supply toilet paper to some government agency doesn’t mean your industry is regulated.

That regular expression resulted in 1.5 million matches (after dropping matches that were not valid CNPJs). In other words, section 1 of the Diário contains 1.5 million CNPJ mentions between 2002 and mid-2020 (I scraped the Diário back in July and I was too lazy to scrape the rest of it now).

The result was a table that looks like this:

date cnpj
2010-09-28 39302369000194
2010-09-28 39405063000163
2010-09-28 60960994000110
2010-09-29 31376361000160
2010-09-29 76507706000106
2010-09-29 08388911000140

That’s it for the Diário part. Now on to economic sectors.

from CNPJs to CNAEs

The Brazilian government has a big list of economic activities, and assigns to each of them a 5-digit numeric code. Hence 35140 is “distribution of electric energy”, 64212 is “commercial banks”, and so on. We call that taxonomy the CNAE (Classificação Nacional de Atividades Econômicas). There are some 700 CNAE codes in total.

(The CNAE is similar to the International Standard Industrial Classification of All Economic Activities - ISIC. You can find CNAE-ISIC correspondence tables here).

When you start a company in Brazil you’re required to inform the government what type of economic activity your company will be doing, and you do that by informing the appropriate CNAE code. That means each CNPJ is associated with a CNAE code. Fortunately, that data is publicly available.

I parsed that data to create a big CNPJ->CNAE table. If you want to do that yourself you need to download all 20 zip files, unzip each of them, then run something like this:

import os
import pandas as pd

# path to the unzipped data files
path_to_files = '/Users/thiagomarzagao/Desktop/data/cnaes/'

# position of the CNPJ and CNAE fields
colspecs = [
    (4, 17), # CNPJ
    (376, 382), # CNAE
    ]

# fix indexing
colspecs = [(t[0] - 1, t[1]) for t in colspecs]

# set field names and types
names = {
    'CNPJ': str,
    'CNAE': str,
}

# load and merge datasets
df = pd.DataFrame([])
for fname in os.listdir(path_to_files):
    if ('.csv' in fname) or ('.zip' in fname):
        continue
    df_new = pd.read_fwf(
        path_to_files + fname, 
        skiprows = 1, 
        skipfooter = 1, 
        header = None, 
        colspecs = colspecs, 
        names = list(names.keys()), 
        dtype = names
        )
    df = df.append(df_new)

# drop CNPJS that don't have a valid CNAE
df = df.dropna()
df = df[df['CNAE'] != '0000000']

# drop duplicates
df = df.drop_duplicates()

# save to csv
df.to_csv(path_to_files + 'cnpjs_to_cnaes.csv', index = False)

I joined the table produced by that script with the table I had created before (with dates and CNPJs - see above). The result was something like this:

date cnpj cnae
2010-09-28 39302369000194 85996
2010-09-28 39405063000163 46192
2010-09-28 60960994000110 58115
2010-09-29 31376361000160 80111
2010-09-29 76507706000106 32302
2010-09-29 08388911000140 80111

That’s all we need to finally learn which sectors get the most government attention!
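
To make that last step concrete, here is a minimal sketch of the join and the per-sector count (the file name diario_cnpjs.csv is a placeholder for wherever you saved the date/CNPJ table; cnpjs_to_cnaes.csv is the file produced by the script above):

import pandas as pd

# date/CNPJ table extracted from the Diário ('diario_cnpjs.csv' is a placeholder name)
mentions = pd.read_csv('diario_cnpjs.csv', dtype = {'cnpj': str})

# CNPJ -> CNAE lookup table produced by the previous script
cnaes = pd.read_csv('cnpjs_to_cnaes.csv', dtype = {'CNPJ': str, 'CNAE': str})

# join the two tables, then count mentions by economic sector
merged = mentions.merge(cnaes, left_on = 'cnpj', right_on = 'CNPJ', how = 'inner')
counts = merged['CNAE'].value_counts()
print(counts.head(30)) # the 30 sectors that show up the most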

But first a word from my inner reviewer #2.

caveats

Perhaps pet shops appear in the Diário Oficial da União a lot simply because there are a lot of pet shops - and not because pet shops are heavily regulated. Also, the Diário Oficial da União only covers the federal government - which means that I am ignoring all state and municipal regulations/interventions. (Some nice folks are trying to standardize non-federal publications; they could use your help if you can spare the time.) Finally, each CNPJ can be associated with multiple CNAE codes; one of them has to be picked as the “primary” one, and that’s the one I’m using here, but it’s possible that using each CNPJ’s secondary CNAE codes might change the results.

This whole idea could be completely bonkers - please let me know what you think.

statism across economic sectors

Here are the 30 economic sectors whose companies most often show up in the Diário. Mouse over each bar to see the corresponding count.

(The description of some sectors is shortened/simplified to fit the chart. Sometimes the full description includes lots of footnotes and exceptions - “retail sale of X except of the A, B, and C types”, that sort of thing.)

Wow. Lots to unpack here.

drugs

The top result - “retail sale of pharmaceutical goods” - is a big surprise to me. I mean, yes, I know that selling drugs to people is a heavily regulated activity. But I thought it was the job of states or municipalities to authorize/inspect each individual drugstore. I thought the federal government only laid out general guidelines.

Well, I was wrong. Turns out if you want to open a drugstore in Brazil, you need the permission of the federal government. And that permission, if granted, is published in the Diário. Also, it falls to the federal government to punish your little pharmacy when you do something wrong - irregular advertising, improper storage of medicine, and a myriad of other offenses. Those punishments also go in the Diário.

Now, we need to keep in mind that there are a lot more drugstores than, say, nuclear power plants. I’m sure that nuclear plants are under super intense, minute regulation, but because they are rare they don’t show up in the Diário very often. So we can’t conclude that selling drugs to people is the most heavily regulated sector of the Brazilian economy.

We could normalize the counts by the number of CNPJs in each economic sector. But I think the raw counts tell an interesting story. The raw counts give us a sense of how “busy” the state gets with each economic sector. I’m sure that nuclear plants are more heavily regulated than drugstores, but I bet that a lot more bureaucrat-hours go into regulating drugstores than into regulating nuclear plants.

NGOs (sort of)

The second most frequent category - “social rights organizations” - corresponds to what most people call non-governmental organizations. Except that this is Brazil, where NGOs are not really NGOs: they receive a lot of funding from the state and they perform all kinds of activities on behalf of the state. Which explains why CNPJs belonging to NGOs (well, “NGOs”) have appeared over 90 thousand times in the Diário since 2002.

I looked into some of the cases. There are NGOs receiving state funding to provide healthcare in remote areas; to provide computer classes to kids in poor neighborhoods; to fight for the interests of disabled people; just about anything you can think of. Brazil has turned NGOs into government agencies. Our non-governmental organizations are independent from the Brazilian government in the same way that the Hitlerjugend was independent from the Reich.

Needless to say, NGOs are often at the center of corruption scandals. People come up with a pretty name and a CNPJ, apply for funding to do some social work, and then just pocket the money.

arts

As if government-funded NGOs weren’t embarrassing enough, a good chunk of the performing arts in Brazil is also government-funded. Hence the third and fifth most frequent categories here: “performing arts, concerts, dancing” and “movies, videos, and TV shows”.

You don’t have to be a good producer. As long as you are in the good graces of the government, you can apply for funding from the Ministry of Culture and use it to do your play/movie/concert. That’s why CNPJs belonging to those two categories have appeared in the Diário over 90 thousand times since 2002. I did the math and that’s about fifteen CNPJs a day receiving funding to have people dancing or acting or whatnot. Mind you, that’s just the federal funding.

And that’s just the official funding. Sometimes taxpayers end up funding “cultural” productions through less-than-transparent means. For instance, back in 2009 there was a biopic about Lula da Silva, who happened to be the president at the time. Well, turns out that 12 of the 17 companies that invested in the production of the movie had received hundreds of millions of dollars in government contracts. Neat, right?

Every now and then a good movie or play comes out of it. If you haven’t seen Tropa de Elite yet you should stop whatever you’re doing and go watch it (it’s on Netflix). But nearly all productions are flops. For every Tropa de Elite there are thirty Lula, Filho do Brasil.

If you want to have a taste of how bad most productions are, here is a teaser of Xuxa e os Duendes, which pocketed a few million bucks in taxpayers’ money. Trust me, you don’t need to understand Portuguese to assess the merits of the thing:

Meanwhile, we have over 40 thousand homicides a year, only a tiny fraction of which get solved. But what do I know, maybe Xuxa e os Duendes is the sort of thing Albert Hirschman had in mind when he talked about backward and forward linkages.

I leave the analysis of the other categories as an exercise for the reader. If you want to see the full results, it’s here.

to do

It would be interesting to see how these counts vary by state or municipality; how these counts correlate with other measures of statism; how they change over the years; and so on.

That’s it. Remember, folks: these are the people in charge of public policy.

word embeddings for bureaucratese

You can find pre-trained word embeddings for hundreds of different languages - FastText alone has pre-trained embeddings for 157 languages (see here). But a single language can come in multiple “flavors”. I’m not talking about dialects, but about the different vocabulary and writing styles you find in news articles vs social media vs academic papers, etc. Most word embeddings come from a limited number of sources, with Wikipedia and news articles being the most common ones. If you have, say, social media texts, using word embeddings trained on Wikipedia entries may not yield good results.

So I decided to train my own Brazilian Portuguese word embeddings on the source that interests me the most: government publications. Decrees, invitations for bids, contracts, appointments, all that mind-numbingly boring stuff that makes up the day-to-day life of the public sector. Those embeddings might help me in future text-related tasks, like classifying government decrees and identifying certain types of contracts. I imagine it could be useful for other folks working with Brazilian government publications, so here’s how I did that.

I started by scraping the official gazette where all the acts of the Brazilian government are published: the Diário Oficial da União. To give you an idea of how much text that is, the Diário’s 2020-07-06 issue has a total of 344 pages - with a tiny font and single spaces. (The Brazilian state is humongous and the size of the Diário reflects that.) The Diário is available online going as far back as 2002-01-01 and I scraped all of it. That amounted to about 8GB of zip files. Here is how to scrape it yourself (I used Python for everything):

import os
import requests
from bs4 import BeautifulSoup

months = [
    # month names in Portuguese
    'janeiro',
    'fevereiro',
    'marco',
    'abril',
    'maio',
    'junho',
    'julho',
    'agosto',
    'setembro',
    'outubro',
    'novembro',
    'dezembro'
]

# path to the folder that will store the zip files
basepath = '/path/to/diario/' # create the folder first

# loop through years and months
for year in range(2002, 2021): # change end year if you're in the future
    for month in months:
        print(year, month)
        url = 'http://www.in.gov.br/dados-abertos/base-de-dados/publicacoes-do-dou/{}/{}'.format(str(year), month)
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        tags = soup.find_all('a', class_ = 'link-arquivo')
        urls = ['http://www.in.gov.br' + e['href'] for e in tags]
        fnames = [e.text for e in tags]
        for url, fname in zip(urls, fnames):
            if not os.path.isfile(basepath + fname):
                try:
                    r = requests.get(url)
                    with open(basepath + fname, mode = 'wb') as f:
                        f.write(r.content)
                except:
                    continue

After the scraping is done you can unzip each of those 400+ files manually or you can automate the job:

import os
import zipfile

ipath = '/path/to/diario/'
opath = '/path/to/diario_xml/' # create folder first
for fname in os.listdir(ipath):
    if '.zip' in fname:
        year = fname[5:9]
        month = fname[3:5]
        section = fname[2:3]
        print(year, month, section, fname)
        destination = opath + year + '/' + month + '/' + section + '/'
        os.makedirs(destination)
        try:
            with zipfile.ZipFile(ipath + fname) as zip_ref:
                    zip_ref.extractall(destination)
        except:
            print('error; moving on') # some zip files are corrupted

This won’t give you all the text in the Diário Oficial da União since 2002-01-01. Some zip files are corrupted and most issues are incomplete. For 2016, for instance, only the May issues are available. And for all years except 2019 and 2020 one of the sections (section 3) is missing entirely (the Diário is divided into three sections - 1, 2, and 3). Also, after you unzip the files you find out that in many cases the text is not in XML but in JPEG format. I wasn’t in the mood to do OCR so I just ignored the JPEG files.

If you want to get in touch with the Diário’s publisher to discuss those problems be my guest. Here I don’t care much about those problems because all I need to train my word embeddings is a ton of data, not all of the data. And with the XML files that I got I have over 4 million government acts, which is probably way more than I need here.

After unzipping everything I trained my word embeddings. I chose to go with gensim’s implementation of word2vec. The beauty of gensim’s implementation is that you can stream the texts one by one straight from disk, without having to keep everything in memory. Now, that’s a little tricky to accomplish. Gensim’s documentation says that instead of a list of documents you can use a generator, but I found that not to be the case. I got this error: TypeError: You can't pass a generator as the sentences argument. Try an iterator. But I googled around and found a nifty workaround that tricks gensim into using a generator by wrapping it inside an iterator. So here I have a generator (yield_docs) that yields one document at a time and then I wrap it inside an iterator (SentencesIterator) so that gensim won’t complain.

About the documents, I have some 4.2 million XML files in total. In theory all these XML files should be easily parsable - they have tags with metadata, main content, etc. But in reality many are invalid. They have unclosed quotation marks and other problems that trip BeautifulSoup’s parser. So I ignored all the metadata and just focused on the stuff inside the <Texto> (text) tags, which is always a collection of <p> tags. Now, different paragraphs of the same publication can talk about entirely different issues, so instead of treating each publication (i.e., each XML file) as a document I’m treating each <p> content as a document. That should yield more coherent word associations. So while I have 4.2 million XML files, in the end I have 72 million documents, one corresponding to each <p> tag. That’s… a lot of text.

Back to word2vec. I don’t really know the ideal number of dimensions here. I found a nice paper that provides a way to estimate the ideal number of dimensions for any dimensionality reduction algorithm. But it’s too computationally expensive: you need to create a graph where each unique token is a node and the co-occurrences are represented by edges. I tried it but the thing got impossibly slow at around 200k nodes - and I have over 1M unique tokens. By my estimates it would take about half a year for the nodes to be created, and then I would need to find the graph’s maximum clique, which is also computationally expensive. So… no. If I had a specific text classification task in mind I would just try different numbers of dimensions and see what works best, but that’s not what I’m doing right now. So instead of relying on any theoretical or empirical approaches I just went with 300 dimensions because that’s a nice round number and I’ve seen it used in other word embeddings.

I’m discarding all words that appear in fewer than 1000 paragraphs (probably too rare to matter) and I’m using a window of 10 (the maximum distance between the current and the predicted word in a sentence).

Here’s the code:

import os
from bs4 import BeautifulSoup
from string import punctuation
from gensim.models import Word2Vec

def tokenize(raw_text):
    '''
    'Hey, dogs are awesome!' -> ['hey', 'dogs', 'are', 'awesome']

    using `re` would probably make it run faster but I got lazy
    '''

    # lowercase everything
    text = raw_text.lower()

    # remove punctuation
    for p in punctuation:
        text = text.replace(p, ' ')

    # remove extra whitespaces
    while '  ' in text:
        text = text.replace('  ', ' ')

    # tokenize
    tokens = text.strip().split()

    # remove digits
    tokens = [e for e in tokens if not e.isdigit()]

    return tokens

def yield_docs():
    '''
    crawl XML files, split each one in paragraphs
    and yield one tokenized paragraph at a time
    '''
    n = 0
    path = '/path/to/diario_xml/'
    for root, dirnames, fnames in os.walk(path):
        if not dirnames: # leaf folder (no subdirectories), i.e. where the XML files live
            for fname in fnames:
                if '.xml' in fname:
                    filepath = root + '/' + fname
                    with open(filepath, mode = 'r') as f:
                        content = f.read()
                    soup = BeautifulSoup(content, features = 'lxml')
                    paragraphs = soup.find_all('p')
                    for p in paragraphs:
                        print(n)
                        n += 1
                        tokens = tokenize(p.text)
                        if len(tokens) > 1:
                            yield tokens

class SentencesIterator():
    '''
    this tricks gensim into using a generator,
    so that I can stream the documents from disk
    and not run out of memory; I stole this code
    from here: 

    https://jacopofarina.eu/posts/gensim-generator-is-not-iterator/
    '''
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        # reset the generator
        self.generator = self.generator_function()
        return self

    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        else:
            return result

# train word2vec
model = Word2Vec(
    SentencesIterator(yield_docs), 
    size = 300, 
    window = 10, 
    min_count = 1000, 
    workers = 6
    )

# save to disk
model.save('word2vec.model')

And voilà, we have our word embeddings. We have a total of 27198 unique tokens (remember, we ignored any tokens that appeared in fewer than 1000 paragraphs) and 300 dimensions, so our word embeddings are a 27198x300 matrix. If you’re not familiar with word2vec Andrew Ng explains it here. The TL;DR is that word2vec’s output is a matrix where each unique token is represented as a vector - in our case, a 300-dimensional vector. That allows us to do a bunch of interesting stuff with that vocabulary - for instance, we can compute the cosine similarity between any two words to see how related they are. In gensim there is a neat method for that. For instance, suppose we want to find the words most related to “fraude” (fraud):

model.wv.most_similar(positive = ['fraude'])
>>> [('fraudes', 0.5694327354431152),
>>> ('conluio', 0.5639076232910156),
>>> ('superfaturamento', 0.5263874530792236),
>>> ('irregularidade', 0.4860353469848633),
>>> ('dolo', 0.47721606492996216),
>>> ('falsidade', 0.47426754236221313),
>>> ('suspeita', 0.47147220373153687),
>>> ('favorecimento', 0.4686395227909088),
>>> ('ilícito', 0.4681907892227173),
>>> ('falha', 0.4664713442325592)]

We can see that bid rigging (“conluio”) and overpricing (“superfaturamento”) are the two most fraud-related words in government publications (“fraudes” is just the plural form of “fraude”). Kinda cool to see it. You can also cluster the word embeddings to find groups of inter-related words; use t-SNE to reduce dimensionality to 2 so you can plot the embeddings on an XY plot; and try a bunch of other fun ideas.
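
For instance, here is a minimal t-SNE sketch, using scikit-learn and matplotlib and only the 500 most frequent tokens (projecting all 27k of them takes a while; also, the vocabulary attribute is called index2word in gensim 3.x and index_to_key in gensim 4.x):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.models import Word2Vec

model = Word2Vec.load('word2vec.model')

# take the 500 most frequent tokens and their 300-dimensional vectors
words = model.wv.index2word[:500] # `index_to_key` in gensim 4.x
vectors = np.array([model.wv[w] for w in words])

# squash 300 dimensions down to 2
coords = TSNE(n_components = 2, random_state = 42).fit_transform(vectors)

# scatter plot with one label per token
plt.figure(figsize = (12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s = 3)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize = 6)
plt.savefig('tsne_diario.png')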

Here I trained the word embeddings from scratch but you could also take pre-trained Brazilian Portuguese embeddings and use the Diário to fine-tune them. You could also tweak the parameters, changing the window (10 here) and the number of dimensions (300). Whatever works best for the task you have at hand.

That’s all for today! And remember, bureaucratese is bad writing - don’t spend too long reading those texts, lest you start emulating them. The best antidote to bureaucratese (or to any bad writing) that I know is William Zinsser.

how much is your apartment worth? (part 2)

So, Brazilian banks are using predictive models to do property valuation but they are doing it wrong. It’s time for us data folks to step in and cause some disruption.

our hypothetical property

Let’s assume that we want to find the market value of an apartment in Brasília. It’s 120m2, it’s in the Noroeste sector (Brasília is a planned city, so it has sectors instead of neighborhoods), it has three bedrooms, three bathrooms, and two parking spots. The condo fee is R$ 580/month and the property tax is R$ 745/month.

getting data

It’s 2019, so there’s no need to go knocking on doors asking people how much they paid for their apartments, how many bedrooms they have, etc. We can simply scrape online listing sites like ZAP, wimoveis, or Viva Real. Any of the three would do (and scraping all three of them would be best). Here I’ll scrape wimoveis for the simple reason that it is the easiest one to scrape.

Granted, that just gives us asking prices, not transaction prices. But the alternative is to go knocking on doors asking people about their apartments. (Unless of course you are a bank and you have tons of mortgage data.)

Here’s the (Python) code I wrote to scrape the result pages:

import time
import requests

destination = '/Volumes/UNTITLED/wimoveis/paginas/'
base_url = 'https://www.wimoveis.com.br/'
num_pages = 1557 # number of results pages
for i in range(1, num_pages + 1): # +1 so the last page is included
    print('page', i)
    query_url = base_url + 'apartamentos-venda-distrito-federal-goias-pagina-{}.html'.format(i)
    response = requests.get(query_url)
    if response.status_code == 200:
        # save source code of the page 
        with open(destination + 'pagina_{}.html'.format(i), mode = 'w') as f:
            f.write(response.text)
            time.sleep(2)

Every page of results contains up to 20 listings. But it only has summary information for each listing. The full data is in each listing’s own URL. So we need to get the URL of every listing from every page. I use BeautifulSoup for that:

import os
import pandas as pd
from bs4 import BeautifulSoup

hrefs = []
path = '/Volumes/UNTITLED/wimoveis/paginas/'
for fname in os.listdir(path):
    print(fname)
    if ('.html' in fname) and ('._' not in fname):
        with open(path + fname, mode = 'r') as f:
            html = f.read()
            soup = BeautifulSoup(html)
            h4 = soup.find_all('h4', class_ = 'aviso-data-title')
            href = [e.find('a')['href'] for e in h4]
            hrefs += href

print(len(hrefs))
df = pd.DataFrame(hrefs, columns = ['href']) # name the column so we can read it back as hrefs['href'] later
df.to_csv('hrefs.csv', index = False)

Now we’re finally ready to scrape the listings themselves, with all the data we need (price, m2, pictures, etc).

import os
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

basepath = '/Volumes/UNTITLED/wimoveis/anuncios/'
hrefs = pd.read_csv('hrefs.csv') # get URLs
hrefs = set(hrefs['href']) # remove duplicate URLs
for i, href in enumerate(hrefs):

    # get ID of the listing
    id_anuncio = re.findall(r'[0-9]{1,20}\.html', href)[0].replace('.html', '')

    # if listing has been downloaded before, ignore
    path = basepath + id_anuncio + '/'
    if os.path.exists(path):
        continue

    # get the source code of the listing;
    # doesn't always work on the first try, so
    # wait for 60s and try again if necessary;
    # looks like this risks infinite loops, but
    # somehow that didn't happen
    url = 'https://www.wimoveis.com.br' + href    
    while True:
        try:
            response = requests.get(url)
            break
        except:
            print('error; waiting')
            time.sleep(60)

    # if it worked, move on
    if response.status_code == 200:
        print(i, path)
        os.mkdir(path) # create destination directory
        html = response.text # get source code

        # save source code to file
        with open(path + 'anuncio_' + str(i) + '.html', mode = 'w') as f:
            f.write(html)

        # now the time-consuming part: getting the
        # pictures of the listing
        pic_path = path + 'pics/'
        os.mkdir(pic_path) # create destination directory

        # find URLs of the pictures
        soup = BeautifulSoup(html)
        figures = soup.find_all('figure', class_ = 'slide-content')
        links = [e.find('img')['data-flickity-lazyload'] for e in figures]

        # try downloading each picture
        for n, link in enumerate(links):
            while True:
                try:
                    response = requests.get(link, stream = True)
                    break
                except:
                    print('conn error; waiting')
                    time.sleep(60)

            # if it worked, save picture to file
            if response.status_code == 200:
                with open(pic_path + str(n) + '.jpg', mode = 'wb') as f:
                    for chunk in response:
                        f.write(chunk)

This will take a couple of days to run, because of all the pictures you’re downloading.

In the end we’ll have over 15k samples. That’s up from the current 25-250 samples that real estate appraisers are using.

parsing data

Ok, what we have now is a huge mess of HTML and JPG files. The data we need is all buried in those files. We need to extract it.

For now I’ll ignore the JPG files and only use the HTML files.

import os
import pandas as pd
from bs4 import BeautifulSoup

dados = []
basepath = '/Volumes/UNTITLED/wimoveis/anuncios/'
for i, anuncio_id in enumerate(os.listdir(basepath)):

    # Dropbox creates an annoying hidden folder,
    # let's ignore it
    if '.' in anuncio_id: 
        continue

    anuncio_path = basepath + anuncio_id + '/'
    for fname in os.listdir(anuncio_path):
        if '.html' in fname:
            print(i, fname)

            # empty dict to store the listing data
            dados_anuncio = {}

            # read source code of the listing
            with open(anuncio_path + fname, mode = 'r') as f:
                source = f.read()

            # soupify source code
            soup = BeautifulSoup(source, 'lxml')

            # get ID of the listing
            dados_anuncio['anuncio_id'] = anuncio_id

            # get title of the listing
            title = soup.find('h2', class_ = 'title-type-sup')
            if title:
                title = title.text.strip()
            try:
                title2 = soup.find_all('div', class_ = 'section-title')[1]
            except:
                continue
            if title2:
                title2 = title2.text.strip()
            dados_anuncio['titulo'] = title
            dados_anuncio['titulo_compl'] = title2

            # get location of the property
            local = soup.find('h2', class_ = 'title-location')
            if local:
                local = local.text.strip()
            local2 = soup.find('div', class_ = 'section-location')
            if local2:
                local2 = local2.text.strip()
            dados_anuncio['local'] = local
            dados_anuncio['local_compl'] = local2

            # get asking price (and rent price, if available)
            price_block = soup.find('div', class_ = 'block-price-container')
            try:
                price_list = price_block.find_all('div', class_ = 'block-price block-row')
            except:
                continue
            for e in price_list:
                operation = e.find('div', class_ = 'price-operation').text.strip()
                price = e.find('div', class_ = 'price-items').text.strip()
                dados_anuncio[operation] = price

            # get condo fee and property tax
            expense_list = price_block.find_all('div', class_ = 'block-expensas block-row')
            for e in expense_list:
                full_text = e.text.strip()
                idx = full_text.index('R$')
                expense = full_text[:idx]
                price = full_text[idx:]
                dados_anuncio[expense] = price

            # get ID of the seller
            anunciante = soup.find('h3', class_ = 'publisher-subtitle').text.strip()
            dados_anuncio['anunciante'] = anunciante

            # get text description of the property
            descricao = soup.find('div', {'id': 'verDatosDescripcion'}).text.strip()
            dados_anuncio['descricao'] = descricao

            # get structured features of the property
            # (those that have accompanying icons)
            features_block = soup.find('ul', class_ = 'section-icon-features')
            lis = features_block.find_all('li', class_ = 'icon-feature')
            for e in lis:
                label = e.find('span').text.strip()
                value = e.find('b').text.strip()
                dados_anuncio[label] = value

            # get other features of the property
            # (those that do not have accompanying icons)
            areas = soup.find_all('ul', class_ = 'section-bullets')
            for area in areas:
                lis = area.find_all('li')
                for e in lis:
                    if e.find('b'):
                        area_feature_label = e.find('h4').text.strip()
                        area_feature_value = e.find('b').text.strip()
                        dados_anuncio[area_feature_label] = area_feature_value
                    else:
                        area_feature = e.string
                        dados_anuncio[area_feature] = '1'

            # clean some garbage from the data
            for key in dados_anuncio.keys():
                v = dados_anuncio[key]
                if v:
                    v = v.replace('\n', ' ')
                    v = v.replace('\r', ' ')
                    v = v.replace('\t', ' ')
                    while '  ' in v:
                        v = v.replace('  ', ' ')
                    dados_anuncio[key] = v.strip()

            # append row of data
            dados.append(dados_anuncio)

# save everything as a CSV file
df = pd.DataFrame.from_dict(dados)
df.to_csv('listings_data.csv', index = False)

Hooray, now we’ve put all the (non-image) data in a CSV file with proper column names and everything.

throwing data away

Now that we have all that data it’s time to throw some of it away.

You see, people are lazy. When they list their properties on wimoveis they don’t bother to tick all the boxes - “pool”, “playground”, “A/C”, etc. Whatever they consider relevant they’ll write down in the text field (often with lots of adjectives and exclamation marks). The result is that our CSV file is mostly empty: most of its cells are missing data. This varies according to the feature we’re talking about. But the vast majority of the features have simply too many missing data points to be useful. So let’s clean up a bit.

import numpy as np
import pandas as pd

# column names to load (and their new names)
colnames = {
    'anuncio_id': 'anuncio_id',
    'Venda': 'preco_total',
    'titulo': 'titulo',
    'titulo_compl': 'titulo_compl',
    'local': 'local',
    'local_compl': 'local_compl',
    'Área útil': 'area_util',
    'Vagas': 'vagas',
    'descricao': 'descricao',
    'Andares': 'andares',
    'Banheiros': 'banheiros',
    'Brinquedoteca': 'brinquedoteca',
    'Churrasqueira': 'churrasqueira',
    'Condomínio ': 'condominio',
    'Elevador': 'elevador',
    'IPTU ': 'iptu',
    'Piscina': 'piscina',
    'Circuito de TV': 'tv',
    'Fitness/Sala de Ginástica': 'academia',
    'Idade do imóvel': 'idade',
    'Playground': 'playground',
    'Portaria 24 horas': 'portaria',
    'Quartos': 'quartos',
    'Salão de Jogos': 'jogos',
    'Salão de festas': 'festas',
    'Sauna': 'sauna',
    'Suítes': 'suites',
    }

# load data
df = pd.read_csv('listings_data.csv', usecols = colnames)

# rename columns
old_names = list(df.columns)
new_names = [colnames[k] for k in old_names]
df.columns = new_names

# merge location columns
df['local'] = df['local'].fillna('')
df['local_compl'] = df['local_compl'].fillna('')
df['local'] = df['local'] + ' ' + df['local_compl']
del df['local_compl']

# clean up location
def fix_local(v):
    v_new = v.replace('\n', ' ')
    v_new = v_new.replace('\r', ' ')
    v_new = v_new.replace('\t', ' ')
    while '  ' in v_new:
        v_new = v_new.replace('  ', ' ')
    v_new = v_new.strip()
    return v_new

df['local'] = df['local'].map(lambda x: fix_local(x))

# drop sample if no location
df = df[df['local'] != '']

# drop sample if no price or m2
df = df.dropna(axis = 0, how = 'any', subset = ['preco_total', 'area_util'])

# merge title columns
df['titulo'] = df['titulo'].fillna('')
df['titulo_compl'] = df['titulo_compl'].fillna('')
df['titulo'] = df['titulo'] + ' ' + df['titulo_compl']
del df['titulo_compl']

# transform some float variables into int
for var in ['vagas', 'andares', 'banheiros', 'quartos', 'suites']:
    df[var] = df[var].fillna(0)
    df[var] = df[var].map(lambda x: int(x))

df['idade'] = df['idade'].map(lambda x: int(x) if str(x).isdigit() else 0)

# convert money columns from str to float
def fix_money_value(v):
    try:
        v_new = v.replace('R$', '')
        v_new = v_new.replace('.', '')
        v_new = v_new.replace(' ', '')
        v_new = float(v_new)
        return v_new
    except:
        return None

for var in ['preco_total', 'iptu', 'condominio']:
    df[var] = df[var].map(lambda x: fix_money_value(x))

# convert m2 from string to float
def fix_area_value(v):
    v_new = v.replace('m2', '').replace('m²', '')
    v_new = v_new.replace('.', '')
    v_new = float(v_new)
    return v_new

df['area_util'] = df['area_util'].map(lambda x: fix_area_value(x))

# drop absurd values
df = df[df['preco_total'] > 1000]
df = df[df['area_util'] > 1]

# recode location
satelites = [
    'Gama',
    'Taguatinga',
    'Brazlândia',
    'Sobradinho',
    'Planaltina',
    'Paranoá',
    'Núcleo Bandeirante',
    'Ceilândia',
    'Guará',
    'Cruzeiro',
    'Samambaia',
    'Santa Maria',
    'São Sebastião',
    'Recanto das Emas',
    'Lago Sul',
    'Riacho Fundo',
    'Lago Norte',
    'Candangolândia',
    'Águas Claras',
    'Sudoeste',
    'Octogonal',
    'Varjão',
    'Park Way',
    'SCIA',
    'Jardim Botânico',
    'Itapoã',
    'SIA',
    'Vicente Pires',
    'Fercal'
    ]

def get_local(v):
    splitted = v.split(',')
    setor = splitted[-2].strip()
    cidade = splitted[-1].strip()
    if cidade == 'Brasília':
        return setor
    elif cidade in satelites:
        return cidade
    else:
        return 'Goiás'

df['endereco'] = df['local'].map(lambda x: get_local(x))

# handle some features where missing actually means "doesnt have it"
for var in ['churrasqueira', 'brinquedoteca', 'tv', 'piscina', 'playground', 'sauna', 'academia', 'portaria', 'jogos', 'festas']:
    df[var] = df[var].fillna(0)

# reorder and rename columns and save to CSV
df = df[['anuncio_id', 'preco_total', 'area_util', 'endereco', 'vagas', 'banheiros', 'quartos', 'churrasqueira', 'idade', 'brinquedoteca', 'tv', 'piscina', 'playground', 'sauna', 'academia', 'portaria', 'jogos', 'festas', 'suites', 'titulo', 'local', 'descricao', 'iptu', 'condominio', 'andares']]
df.columns = ['anuncio_id', 'preco_total', 'area_util', 'local', 'vagas', 'banheiros', 'quartos', 'churrasqueira', 'idade', 'brinquedoteca', 'tv', 'piscina', 'playground', 'sauna', 'academia', 'portaria', 'jogos', 'festas', 'suites', 'titulo', 'endereco', 'descricao', 'iptu', 'condominio', 'andares']
df.to_csv('wimoveis.csv', index = False, encoding = 'utf-8')

There! Now we have a clean, usable dataset.

train the model

I tried a few different algorithms to estimate the properties’ asking prices: linear regression, SVM, random forest, boosted trees, neural networks. For each of these algorithms (except linear regression) I tweaked the corresponding parameters a bunch of times (and for neural networks I tried lots of different architectures). The clear winner was boosted trees (which won’t be so surprising to Kaggle competitors).
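
I’m not reproducing the whole horse race here, but the gist of it was a cross-validated comparison along these lines (a rough sketch, using default hyperparameters and only a few numeric features - the full feature set is built in the script further below, and the neural networks are left out of the sketch):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

# quick-and-dirty feature matrix (numeric columns only)
df = pd.read_csv('wimoveis.csv')
df = df.dropna(subset = ['preco_total', 'area_util'])
X = df[['area_util', 'quartos', 'banheiros', 'suites', 'vagas']].fillna(0).values
y = np.log(df['preco_total'].values)

# cross-validated median relative error for each candidate algorithm
candidates = {
    'linear regression': LinearRegression(),
    'SVM': SVR(),
    'random forest': RandomForestRegressor(random_state = 42),
    'boosted trees': GradientBoostingRegressor(random_state = 42),
}
for name, model in candidates.items():
    yhat = cross_val_predict(model, X, y, cv = 10)
    rel_errors = np.absolute(np.exp(yhat) - np.exp(y)) / np.exp(y)
    print(name, round(np.median(rel_errors), 3))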

Just a quick note: we discarded lots of features in the previous step because of missing data. Here we’ll add some of them back. People don’t always tick the “barbecue” box when filling out their listings, but they usually mention it in the text field. So the code below scans the text field looking for certain words.

import math
import numpy as np
import pandas as pd
from unicodedata import normalize
from sklearn.utils import shuffle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

# set seed
random_state = 42

# load data
df = pd.read_csv('wimoveis.csv')
 
# keep only certain features
df = df[[
    'preco_total', # asking price
    'area_util', # m2
    'local', # location
    'iptu', # property tax
    'condominio', # condo fee
    'quartos', # num of bedrooms
    'suites', # num of suites
    'banheiros', # num of bathrooms
    'vagas', # num of parking spots 
    'titulo', # title of the listing
    'endereco', # address of the listing
    'descricao' # description of the listing
    ]]

# apartment we want to appraise
x_new = pd.DataFrame({
    'preco_total': [1.0], # just so we can drop it later
    'area_util': [120], 
    'local': ['Noroeste'],
    'iptu': [745],
    'condominio': [580],
    'quartos': [3],
    'suites': [3],
    'banheiros': [2],
    'vagas': [2],
    'titulo': '',
    'endereco': '',
    'descricao': '',
    })

# don't worry, we'll drop this before
# training the model
df = df.append(x_new)

# impute median values for missing condo fees and property tax
for var in ['iptu', 'condominio']:
    median = df[df[var] > 0][var].median()
    df[var] = df[var].fillna(median)

# drop outliers
df = df[df['area_util'] < 50000]
df = df[df['preco_total'] < 70000000]
df = df[df['condominio'] < 1500000]
df = df[df['iptu'] < 20000]
df = df[df['vagas'] < 10]

# take the log of all quantitative variables
for var in ['preco_total', 'area_util', 'iptu', 'condominio']:
    df[var] = df[var].map(lambda x: math.log(x))

# merge text columns (title + location + address + description)
# (this will let us extract some features later)
df['fulltext'] = df['local'] + ' ' + df['titulo'] + ' ' + df['endereco'] + ' ' + df['descricao']

# is it Caldas Novas? (that's a city in the state of Goiás)
def is_caldas(s):
    if 'caldas novas' in s.lower():
        return 1
    return 0
df['caldas'] = df['fulltext'].map(lambda x: is_caldas(str(x)))

# Goiania? (that's the capital of the state of Goiás)
def is_goiania(s):
    if ('goiania' in s.lower()) or ('goiânia' in s.lower()):
        return 1
    return 0
df['goiania'] = df['fulltext'].map(lambda x: is_goiania(str(x)))

# barbecue?
def has_grill(s):
    if 'churrasqueira' in s.lower():
        return 1
    return 0
df['churrasqueira'] = df['fulltext'].map(lambda x: has_grill(str(x)))

# pool?
def has_pool(s):
    if 'piscina' in s.lower():
        return 1
    return 0
df['piscina'] = df['fulltext'].map(lambda x: has_pool(str(x)))

# playground?
def has_playground(s):
    if ('playground' in s.lower()) or ('brinquedoteca' in s.lower()):
        return 1
    return 0
df['playground'] = df['fulltext'].map(lambda x: has_playground(str(x)))

# sauna?
def has_sauna(s):
    if 'sauna' in s.lower():
        return 1
    return 0
df['sauna'] = df['fulltext'].map(lambda x: has_sauna(str(x)))

# A/C?
def has_ac(s):
    if 'ar condicionado' in s.lower():
        return 1
    return 0
df['ar'] = df['fulltext'].map(lambda x: has_ac(str(x)))

# unobstructed view?
def has_view(s):
    if 'vista livre' in s.lower():
        return 1
    return 0
df['vista'] = df['fulltext'].map(lambda x: has_view(str(x)))

# high floor?
def is_high(s):
    if 'andar alto' in s.lower():
        return 1
    return 0
df['alto'] = df['fulltext'].map(lambda x: is_high(str(x)))

# faces sunrise?
def is_sunrise(s):
    if 'nascente' in s.lower():
        return 1
    return 0
df['nascente'] = df['fulltext'].map(lambda x: is_sunrise(str(x)))

# porcelain tiles?
def porcelanato(s):
    if 'porcelanato' in s.lower():
        return 1
    return 0
df['porcelanato'] = df['fulltext'].map(lambda x: porcelanato(str(x)))

# accepts mortgage?
def mortgage_ok(s):
    if 'aceita financiamento' in s.lower():
        return 1
    return 0
df['financiamento'] = df['fulltext'].map(lambda x: mortgage_ok(str(x)))

# renovated?
def is_renovated(s):
    if ('reformado' in s.lower()) or ('reformada' in s.lower()):
        return 1
    return 0
df['reformado'] = df['fulltext'].map(lambda x: is_renovated(str(x)))

# cabinets?
def has_cabinets(s):
    if ('armarios' in s.lower()) or ('armários' in s.lower()):
        return 1
    return 0
df['armarios'] = df['fulltext'].map(lambda x: has_cabinets(str(x)))

# private garage?
def has_parking(s):
    if 'garagem' in s.lower():
        return 1
    return 0
df['garagem'] = df['fulltext'].map(lambda x: has_parking(str(x)))

# recode location
# (to standardize spellings and do some aggregation)
new_locals = {
    'Asa Sul': 'asa_sul',
    'Asa Norte': 'asa_norte',
    'Goiás': 'goias',
    'Águas Claras': 'aguas_claras',
    'Taguatinga': 'taguatinga',
    'Guará': 'guara',
    'Sudoeste': 'sudoeste',
    'Noroeste': 'noroeste',
    'Lago Norte': 'lago_norte',
    'Samambaia': 'samambaia',
    'Ceilândia': 'ceilandia',
    'Centro': 'outros', # melhorar isso aqui depois (tem de tudo)
    'Setor De Industrias': 'asa_sul', # eh quase tudo SIG
    'Sobradinho': 'sobradinho',
    'Núcleo Bandeirante': 'bandeirante',
    'Riacho Fundo': 'riacho',
    'Vicente Pires': 'vicente',
    'Park Sul': 'parksul',
    'Recanto das Emas': 'recanto',
    'Lago Sul': 'lago_sul',
    'Gama': 'gama',
    'Setor De Industria Graficas': 'asa_sul',
    'Setor Habitacional Jardim Botânico': 'outros',
    'Octogonal': 'octogonal',
    'Planaltina': 'planaltina',
    'Cruzeiro': 'cruzeiro',
    'Santa Maria': 'santamaria',
    'São Sebastião': 'saosebastiao',
    'Setor Da Industria E Abastecimento': 'outros',
    'Zona Industrial': 'outros',
    'Paranoá': 'paranoa',
    'Setor De Autarquias Sul': 'asa_sul',
    'Setor Comercial Sul': 'asa_sul',
    'Setor Bancario Sul': 'asa_sul',
    'Setores Complementares': 'outros',
    'Park Way': 'parkway',
    'Candangolândia': 'candangolandia',
    'Setor De Radio E Televisao Sul': 'asa_sul',
    'Taquari': 'outros',
    'Setor Hoteleiro Sul': 'asa_sul',
    'Setor de Múltiplas Atividades Sul': 'outros',
    'Setor de Armazenagem e Abastecimento Norte': 'outros',
    'Setor Hospitalar Local Sul': 'asa_sul',
    'Zona Civico-administrativa': 'asa_sul',
    'Setor de Grandes Áreas Norte': 'asa_norte',
    'Setor De Clubes Esportivos Norte': 'lago_norte',
    'Setor De Clubes Esportivos Sul': 'lago_sul',
    'Zona Rural': 'outros',
    'Setor De Diversoes Norte': 'asa_norte',
    'Superquadra Sudoeste': 'sudoeste',
    'Setor de Mansões Dom Bosco': 'outros',
    'Setor Bancario Norte': 'asa_norte',
    'Setor Comercial Norte': 'asa_norte',
    'Setor De Oficinas Norte': 'asa_norte',
    'Setor Hoteleiro Norte': 'asa_norte',
    'Setor de Hotéis e Turismo Norte': 'asa_norte',
    'Quadra Mista Sudoeste': 'sudoeste',
    'Superquadra Noroeste': 'noroeste',
    'Setor Habitacional Jardins Mangueiral': 'outros',
    'Vila Planalto': 'outros',
    'Alphaville': 'outros',
    'Granja Do Torto': 'outros',
    'Comércio Local Noroeste': 'noroeste',
    'Superquadra Sul': 'asa_sul',
    'Setor Terminal Norte': 'asa_norte',
    'Setor Terminal Sul': 'asa_sul',
    'Centro Comercial Noroeste': 'noroeste',
    'Mansões do Lago': 'outros',
    'Setor de Garagens e Concessionárias de Veículos': 'outros',
    'Setor Comercial Local Residencial Norte': 'asa_norte',
    'Centro de Atividades': 'lago_norte',
    'Setor de Habitações Individuais Norte': 'lago_norte',
    'Superquadra Norte': 'asa_norte',
    'Centro Comercial Sudoeste': 'sudoeste',
    }
df['local'] = df['local'].map(lambda x: new_locals[x])
#print(df['local'].value_counts())

# dummify location
locais = pd.get_dummies(df['local'], prefix = 'local', drop_first = True)
for col in locais.columns:
    df[col] = locais[col]
del df['local']

# if city is Brasília, get more location details
def detail_location(s):
    locations = {

        # industras graficas
        ' sig ': 'sig',
        'graficas': 'sig',
        'setor de industrias': 'sig',

        # industria e abastecimento
        ' sia ': 'sia',
        'setor de industria e abastecimento': 'sia',

        # armazenagem e abastecimento norte
        ' saan': 'saan',
        'setor de armazenagem e abastecimento norte': 'saan',

        # armazenagem e abastecimento sul
        ' saas': 'saas',
        'setor de armazenagem e abastecimento sul': 'saas',

        # autarquias norte
        ' san ': 'san',
        'setor de autarquias norte': 'san',

        # autarquias sul
        ' sas ': 'sas',
        'setor de autarquias sul': 'sas',

        # comercial norte
        ' cln ': 'cln',
        ' scln ': 'cln',
        'comercial local norte': 'cln',
        'comercio local norte': 'cln',
        ' scn ': 'scn',
        'setor comercial norte': 'scn',
        ' sdn ': 'sdn',
        'setor de diversoes norte': 'sdn',

        # comercial sul
        ' cls ': 'cls',
        ' scls ': 'cls',
        'comercial local sul': 'cls',
        'comercio local sul': 'cls',
        ' scs ': 'scs',
        'setor comercial sul': 'scs',
        ' sds ': 'sds',
        'setor de diversoes sul': 'sds',

        # comercial noroeste
        ' clnw ': 'clnw',

        # comercial sudoeste
        ' clsw ': 'clsw',

        # bancario norte
        ' sbn ': 'sbn',
        'setor bancario norte': 'sbn',

        # bancario sul
        ' sbs ': 'sbs',
        'setor bancario sul': 'sbs',

        # grandes areas norte
        ' sgan ': 'sgan',
        'setor de grandes areas norte': 'sgan',

        # grandes areas sul
        ' sgas ': 'sgas',
        'setor de grandes areas sul': 'sgas',

        # hoteleiro norte
        ' shn ': 'shn',
        ' shtn ': 'shn',
        'setor hoteleiro norte': 'shn',
        'setor de hoteis e turismo norte': 'shn',

        # hoteleiro sul
        ' shs ': 'shs',
        ' shts ': 'shs',
        'setor hoteleiro sul': 'shs',
        'setor de hoteis e turismo sul': 'shs',

        # hospitalar norte
        ' shln ': 'shln',
        'setor hospitalar local norte': 'shln',

        # hospitalar sul
        ' shls ': 'shls',
        'setor hospitalar local sul': 'shls',

        # radio e tv norte
        ' srtvn ': 'srtvn',
        'setor de radio norte': 'srtvn',

        # radio e tv sul
        ' srtvs ': 'srtvs',
        'setor de radio sul': 'srtvs',

        # oficinas
        ' sof ': 'sof',
        'setor de oficina': 'sof',

        # superquadras
        'sqn': 'sqn',
        'super quadra norte': 'sqn',
        'superquadra norte': 'sqn',
        'sqs': 'sqs',
        'super quadra sul': 'sqs',
        'superquadra sul': 'sqs',
        'sqsw': 'sqsw',
        'super quadra sudoeste': 'sqsw',
        'superquadra sudoeste': 'sqsw',
        'sqnw': 'sqnw',
        'super quadra noroeste': 'sqnw',
        'superquadra noroeste': 'sqnw',

        'qmsw': 'qmsw',
        'quadra mista sudoeste': 'qmsw',
        'qrsw': 'qrsw',
    }
    s = normalize('NFKD', s).encode('ASCII', 'ignore').decode('ASCII')
    s = s.lower()
    for key in locations.keys():
        if key in s:
            return locations[key]
    return 'outros'

df['local_det'] = df['fulltext'].map(lambda x: detail_location(str(x)))

# dummify location details (for when city=Brasília)
locais = pd.get_dummies(df['local_det'], prefix = 'local_det', drop_first = True)
for col in locais.columns:
    df[col] = locais[col]
del df['local_det']

# drop text columns
del df['fulltext']
del df['titulo']
del df['endereco']
del df['descricao']

# drop row that contains the property we want to appraise
x_new = df[df['preco_total'] == 0]
del x_new['preco_total']
x_new = x_new.values
df = df[df['preco_total'] > 0]

# split X and Y
y = df['preco_total'].values
del df['preco_total']
X = df.values

# shuffle sample order
X, y = shuffle(X, y, random_state = random_state)

# instantiate model
model = GradientBoostingRegressor(loss = 'quantile', alpha = 0.5, n_estimators = 1000, random_state = random_state)

# train model
yhat = cross_val_predict(model, X, y, cv = 10)

# put estimated prices back in R$
yhat_reais = np.exp(yhat)

# put observed prices back in R$
y_reais = np.exp(y)

# compute errors
erros = yhat_reais - y_reais

# compute median absolute error
erro_absoluto_mediano = np.median(np.absolute(erros))
print('erro absoluto mediano:', erro_absoluto_mediano)

# compute proportional error (error / asking price)
erros_relativos = erros / y_reais
erro_relativo_mediano = np.median(np.absolute(erros_relativos))
erro_relativo_medio = np.mean(np.absolute(erros_relativos))
print('erro relativo mediano:', erro_relativo_mediano)
print('erro relativo medio:', erro_relativo_medio)

This gives me a median absolute error of R$ 46k. In proportional terms (i.e., error / asking price) we have a median absolute error of 10% and a mean absolute error of 23%. Which is in line with previous work (see here), where the mean absolute error is 25%-30%, and here, where the mean absolute error is 22%.

We’re not capturing everything here. Say, maybe the property is next door to a police station or to a church or to a loud bar. Maybe there was a murder on the premises. Etc etc. My point is not that these estimates should be final. My point is simply that these estimates are probably closer to the truth than the ones being produced today by professional appraisers all over Brazil.

appraise!

Alright then, time to appraise our Noroeste apartment. Just append the following lines to the previous code block and run it.

# train models for lower bound, point estimate, and upper bound
model_lower = GradientBoostingRegressor(loss = 'quantile', alpha = 0.25, n_estimators = 1000, random_state = random_state)
model_mid = GradientBoostingRegressor(loss = 'quantile', alpha = 0.5, n_estimators = 1000, random_state = random_state)
model_upper = GradientBoostingRegressor(loss = 'quantile', alpha = 0.75, n_estimators = 1000, random_state = random_state)
model_lower.fit(X, y)
model_mid.fit(X, y)
model_upper.fit(X, y)

# appraise
yhat_lower = model_lower.predict(x_new)
yhat_mid = model_mid.predict(x_new)
yhat_upper = model_upper.predict(x_new)
print(np.exp(yhat_lower), np.exp(yhat_mid), np.exp(yhat_upper))

And voilà, we have our point estimate: R$ 978k (that’s about US$ 254k).

We also have a prediction interval with lower and upper bounds: [788k, 1060k]. To produce this interval I used something similar to quantile regression. The lower bound is an estimate of the 25th percentile of the distribution. The upper bound is an estimate of the 75th percentile. The point estimate is an estimate of the 50th percentile (i.e., the median). As we have three different models, the lower and upper bounds are not centered around the point estimate (we actually have three point estimates). More details here.

text

Here I’m scanning the property descriptions for words like “A/C”, “barbecue”, etc, and featurizing them as dummies. But you can use the texts themselves in the model. Just insert the following code between the line where you shuffle the samples and the line where you instantiate the model:

# little function to make corpus
def make_corpus(df):
    for row in df.iterrows():
        yield str(row[1]['fulltext'])

# vetorize corpus and create TFIDF matrix
vectorizer = TfidfVectorizer(strip_accents = 'unicode', lowercase = True, ngram_range = (1, 2))
tfidf = vectorizer.fit_transform(make_corpus(df))

# reduce dimensions (i.e., extract topics)
svd = TruncatedSVD(n_components = 400)
topics = svd.fit_transform(tfidf)

# add topics to the dataframe
for n in range(0, topics.shape[1]):
    df[str(n)] = topics[:,n]

# rescale topics to make them compatible with the m2 scale
scaler = MinMaxScaler(feature_range = (df['area_util'].min(), df['area_util'].max()))
for col in df.columns:
    if col.isdigit():
        df[col] = scaler.fit_transform(df[col].values.reshape(-1, 1))

The TFIDF matrix is too big - we end up with more columns than samples. So we don’t use it directly, we use LSA to reduce the TFIDF matrix to a documents × topics matrix of 400 topics.

This improves the performance of the model a bit. But it’s ultimately nonsensical: when you’re appraising a new property you could keep tweaking the text until you get the price you want. So I did this just to inspect which topic vectors would be more relevant (see here), and then which words had more weight in these topics. This helped me decide which words to look for in the text fields (sauna, pool, etc).
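
If you want to do that inspection yourself, here is a minimal sketch; it reuses the vectorizer and svd objects from the block above (the vocabulary method is get_feature_names_out in recent scikit-learn versions and get_feature_names in older ones):

import numpy as np

# terms of the TFIDF vocabulary, in column order
terms = vectorizer.get_feature_names_out() # get_feature_names() in older sklearn

# ten highest-weighted terms for each of the first ten topics
for topic_idx in range(10):
    weights = svd.components_[topic_idx]
    top = np.argsort(weights)[::-1][:10]
    print(topic_idx, [terms[i] for i in top])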

images

I’m still figuring out the best way to use the pictures. I tried using the metadata first: number of pictures, height and width of the pictures, etc. That didn’t improve the model. (I know, shocking. But I like to try the simple things first.)

I also checked whether the dominant colors help us predict the price. To do that I clustered the pixels of every picture of the property. Each pixel is defined by three values: R, G, B, each of which can vary from 0 to 255 and represents the intensity of each of the three primary colors (red, green, and blue). So the pixels exist in the same tridimensional space and therefore we can cluster them. The centroid of each cluster is a dominant color.

Ideally we’d use DBSCAN for this, as the clusters may have widely different sizes and shapes and we don’t even know a priori how many clusters each picture has. But DBSCAN just takes forever to run. So I used k-means instead. I used the elbow technique to find the ideal number of clusters and it turns out that for most images that number was two or three.
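If you want to play with the dominant-color idea yourself, here’s a minimal sketch of that step, assuming a single photo and three clusters (the file name is just a placeholder):

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# load one listing photo (placeholder file name) and flatten it into pixels
img = Image.open('listing_photo.jpg').convert('RGB')
pixels = np.array(img).reshape(-1, 3)  # one row per pixel, columns = R, G, B

# cluster the pixels; the centroids are the dominant colors
kmeans = KMeans(n_clusters = 3, random_state = 42)
kmeans.fit(pixels)
print(kmeans.cluster_centers_)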

That was a massive waste of time. K-means is faster but it still took almost a week to run. And in the end those centroids didn’t improve the model one bit.

A friend who knows a lot more about images than I do suggested that I try something along these lines. I.e., having a branched neural network where I can input both structured features (m2, location, etc) and image features. So that’s what I’m trying right now. It’s tricky though because the number of pictures varies across samples, the pictures are of different sizes, and the pictures aren’t standardized in any way (some listings have five pictures of the kitchen and none of the bedrooms, others have no pictures of the kitchen, etc).

The same friend also suggested that I use some pre-trained neural network capable of identifying objects like “window”, “A/C unit”, and so on, and then use the identified objects as features. That’s the next item on my to-do list.

All that said, the truth is that I’m not sure the images will be useful in the end. It’s like with the texts: you could keep taking new pictures of the property until you get the “right” price. I think that’s harder to do with pictures than with texts, but who knows. I need to think more about it. Suggestions are welcome.

“someone must be doing it already!”

You bet. ZAP is doing it. Which makes sense: they have tons of data, so why not use it? In fact just last month they announced the next step: they’ll start buying property themselves, picking the ones that their model suggests are underpriced.

In the US Zillow is doing it. I bet that there are plenty of similar initiatives all over the world.

So I’m not proposing anything new here. Which makes it hard to understand why the heck Brazilian banks are not doing it yet. They have billions at stake.

incentives matter

I know better than to second guess the choices of people with skin in the game. But 70% of all mortgages in Brazil are concentrated in a state-owned bank - Caixa Econômica Federal (CEF). And when the state is involved the incentives are different. It’s not about delivering results but about broadening your mandate, blame-shifting, and securing resources. (If you want to have an idea of how government works, watch HBO’s Chernobyl.)

So it’s not a market failure that we have here, but by and large a state failure. CEF’s bureaucrats do not get punished for using the wrong tool to do property valuation. In a state-owned company there is a lot of noise between your actions and your punishments/rewards. Which helps explain CEF’s long history of incompetence and corruption.

Not to mention that an entire ecosystem has evolved around the status quo. You see, to be contracted by CEF as an appraiser you need to be certified by the Federal Council of Realtors (yes, that exists). There’s an exam for that and there are schools that charge good money to help prepare you for that exam. So, there are lots of people who benefit from the current system, which makes it harder to change.

so long

I guess this is it. I hope this post has inspired you to roll up your sleeves and do your own valuations from now on.

Happy appraisals!

how much is your apartment worth?

I’m buying an apartment and in the process I’ve learned a lot about property valuation. I’ve learned that there are people who do that for a living, that there are academic researchers who specialize in the subject, and that there are even regulations and licensing requirements.

In particular, I’ve learned that people are using statistical models - like linear regression - to appraise properties. I thought that was super cool, so I got curious and read a bunch of model-based appraisal reports. And that, dear reader, was a fascinating lesson in the misuse of quantitative tools. And by the entities we’d expect to be most proficient in their use: banks.

In what follows I show how Brazilian banks are doing it today and why it’s seriously messed up. Then in part 2 (forthcoming) I’ll show a better way to do it.

(Trigger warning: if you’ve recently bought or sold property in Brazil, and if you paid those obscene appraisal fees that Brazilian banks charge, you may want to save yourself some aggravation and stop reading here.)

regression and Rio

The folks using statistical models to appraise real estate in Brazil are usually banks. If you don’t pay your mortgage the bank gets your property, so the bank needs to know beforehand whether the property is worth enough $.

In Brazil 70% of all mortgages are concentrated in one bank: Caixa Econômica Federal (CEF), which is owned by the state. You’d think that the bank with the most mortgages would be pretty good at estimating the value of real estate. You’d be wrong.

I downloaded dozens of property valuation reports by CEF. Well, they are not actually made by CEF: CEF outsources the appraisals to other companies. But CEF reviews and approves every appraisal report. And in any case the ultimate responsibility rests with CEF.

Let’s look at this report, which is pretty typical.

The property here is a plot of 41,695 m2 in a small town not far from Rio de Janeiro (about 100km away). The appraiser started by gathering data on 38 other plots, all in the same small town. For each of the plots he collected four variables: area (in m2), whether the lot is paved, average family income of the area, and price (in R$) per m2. Then he dropped 13 of the 38 samples and used the remaining 25 to run a linear regression: price per m2 ~ area + paved + income. He then used the resulting model to estimate the price of the original plot. The resulting estimate was R$ 1.056.866,78, with an 80% confidence interval of [R$ 898.423,01, R$ 1.215.513,48]. The appraiser saw fit to manually adjust the estimated value to R$ 990.000,00 because, well, there’s some stuff that the model doesn’t capture.

There is a lot that’s wrong here, but the main thing is: the appraiser doesn’t test the model.

Normally, in machine learning tasks like this, we train the model using only a subset of the samples, ask the model to produce estimates for the samples that were left out, then check how close these estimates are to the actual values. Ideally we repeat this several times over, rotating which samples are left out at each iteration.
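To make that concrete, here’s a minimal sketch of what such a test could look like with scikit-learn. The X and y below are random placeholders standing in for the 25 plots’ features and prices per m2, not the report’s actual data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# placeholder data standing in for the 25 samples (columns: area, paved, income)
X = np.random.rand(25, 3)
y = np.random.rand(25)

# 5-fold cross-validation: train on 20 samples, test on the 5 left out, rotate
cv = KFold(n_splits = 5, shuffle = True, random_state = 42)
errors = -cross_val_score(LinearRegression(), X, y, cv = cv, scoring = 'neg_mean_absolute_error')
print(errors.mean())  # average out-of-sample error across the five rotations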

But here there is no separation between training and testing. The appraiser used 100% of the 25 samples to train the model. There were no samples left for testing the model. So we have absolutely no idea how good or bad the model is. Maybe the actual market value of the plot is indeed close to R$ 1.056.866,78. But maybe it’s R$ 2 million. Or R$ 500k. We have no idea. Since the model isn’t tested, its performance is a complete mystery.

In response to which the appraiser may direct your attention to page 11 of the report, where you see this scatter plot of observed vs estimated values:

Clearly the model has a tremendously good fit: all estimates are super close to the actual values.

Except that that’s cheating: the appraiser used the same 25 samples for both training and testing the model. You don’t know a model’s performance until it has been subjected to samples it hasn’t seen before.

Not that it would make much sense to split training and testing samples with n=25. But if n=25 the thing to do is get more samples (more on this later). A small sample size doesn’t give you a license to simply not test your model.

“but that’s just one report”

Nope, that’s how every single model-based appraisal report I downloaded does it. Every. Single. One. No exceptions. At all. Google ‘CEF laudo imóvel’ or ‘Caixa Econômica laudo imóvel’ or ‘laudo avaliação imóvel’ and check for yourself. The only appraisals that are not like that are the ones that don’t use any statistical model whatsoever.

In other words: billions of R$ in real estate transactions are based on property valuations that are completely worthless.

let’s worry about all the wrong things

If you ask the appraiser why he didn’t test the model he’ll be genuinely shocked and answer that he did test it. And he did run a bunch of tests, as it’s clear from the report: he tested whether the residuals are normally distributed, whether each coefficient is statistically significant, whether the model as a whole is statistically significant, and so on. Other reports go as far as testing for heteroskedasticity and autocorrelation.

None of which makes the slightest sense here. This is not an econometrics problem. We’re not interested in the effect of each additional m2 on the price of the property. This is a machine learning problem. We’re interested in producing price estimates as close as possible to the actual prices. We don’t care about the assumptions behind linear regression in this context.

In fact we don’t care about linear regression at all. The appraiser could have (and probably should have) used boosted trees or any other machine learning algorithm. The way to go about these things is to try different algorithms and pick the one that produced the best estimates. There is no reason to limit yourself to linear regression.

regulation and its unintended consequences

To be fair, this is not entirely the appraiser’s fault. It turns out that there is a set of semi-official guidelines for how to use linear regression to do property valuation. That’s regulation NBR-14653-2. It is not legally binding - you don’t go to jail or lose your license if you violate it. But it ends up being enforced anyway. CEF won’t subcontract your company to do appraisals if you don’t follow it.

Regulation NBR-14653-2 tells appraisers to check for normality of the residuals, heteroskedasticity, autocorrelation, etc. It doesn’t say a word about testing the performance of the model. It’s completely silent on the topic of training vs testing samples, cross validation, accuracy, etc. In other words, regulation NBR-14653-2 recommends an econometric approach to a machine learning problem, which is bonkers.

more wrongness (and a lesson in public policy)

Suppose for a moment that the econometric approach were the right one here. Even then the current appraisals would be worthless.

Take the report we discussed before. The sample size is 25. That’s just not good enough. “Oh, but it’s a small town, there probably aren’t that many plots for sale.” Yes, but Brazil has over five thousand small towns. Your samples don’t need to be in the same town where the property you’re appraising is. Yes, different towns will differ in GDP per capita, homicide rate, etc. But we have data on all that, so we can include those indicators in our models. And/or dummify “name of town”.
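(Dummifying the town name, by the way, is a one-liner with pandas. A toy sketch, with made-up data:)

import pandas as pd

# made-up samples from two towns
df = pd.DataFrame({'town': ['Paraty', 'Itaguai', 'Paraty'], 'price_m2': [800, 650, 820]})

# one dummy column per town
df = pd.get_dummies(df, columns = ['town'])
print(df)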

Such a small sample is particularly egregious here, since CEF has 70% of all mortgages in Brazil, so they surely have a ton of data they could have used (or shared with the company they contracted to do the job). Imagine having millions of samples and then using only 25.

(I suspect this has to do with appraisers’ lack of programming skills. They use software with graphical interfaces and probably just type every data point manually. So 1 million samples would be of no use to them. But I’m just guessing here.)

Also, the appraiser doesn’t tell us which 25 - out of the original 38 - samples he actually used in the regression. I considered trying to replicate his results but there are 5,414,950,296 possible combinations of 25 elements out of 38, so that might take a while to run.
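(If you want to check that count, it’s one line of Python:)

from math import comb
print(comb(38, 25))  # 5414950296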

The appraiser also doesn’t tell us why he dropped 13 of the 38 original samples. Were they outliers? Or maybe dropping them helped produce that incredible R2 of 0.98 we see on page 9…?

At times it feels like the appraiser doesn’t really understand what he is doing. Like when he reports a p-value for the dependent variable (R$ per m2). Only independent variables have coefficients. Maybe he is confusing the constant and the dependent variable?

I also don’t know what the variable “average family income in the area” means or where it comes from. What’s “area” here? The neighborhood? The block? The zip code? What’s the source? He says it comes from “senso”, a misspelling of “censo” - census. But I have no idea which census he is talking about.

It’s also weird that he codes “is paved?” as no=1 and yes=2, instead of no=0 and yes=1.

So, just like in the other reports I read, I get the sense that the appraiser doesn’t have any quantitative training. It looks like all he can do is operate the software that produces the estimates (appraisers seem to like SisDEA). You input the data, you press this and that button, the software spits out content that mostly looks Greek to you (some of it is actual Greek), plus an estimate for the property you’re supposed to appraise. You copy and paste everything into a Word document, save it as a PDF file and email it to your client. That’s the job.

The stuff you copied and pasted contains, among other things, tests of the assumptions behind linear regression. You don’t understand any of that, but you see words that also appear on regulation NBR-14653-2 - normality, heteroskedasticity, autocorrelation -, so clearly you’re following the rules. No one can yell at you or fire you, right?

In other words, regulation substitutes for actual knowledge.

(Let this be a lesson to the perpetually outraged online mobs demanding that algorithms be regulated. They think they’ll get Andrew Ng to write the regulations. In the real world regulations are produced by a mix of bureaucrats with no skin in the game and politicians with too much skin in the game.)

“well, criticizing is easy! why don’t you do something about it instead?”

Fair enough. In part 2 I’ll show how we can improve things.

finding books to read

Despite all the hype about AI, product recommendations still suck. Amazon offers me the exact same Bose headphones I bought last month, plus two hundred close substitutes. Spotify thinks that because I enjoy Eurodance from the 1990s I will also enjoy Brazilian music from the 1990s. Netflix thinks that because I enjoy Black Mirror I will also enjoy Dear White People - and that’s despite Netflix knowing that I’m a fan of Rick & Morty, South Park, and any show that has Ricky Gervais. Goodreads thinks that because I enjoyed Ready Player One I will also enjoy Spell or High Water even though I’ve already read Spell or High Water and rated it only one star on Goodreads itself. No wonder there is zero peer-reviewed evidence that Cambridge Analytica swayed even a single voter in 2016. The singularity is nowhere near.

I could write to each of those companies or annoy them on Twitter or whatnot. But I believe in the DIY ethic.

I chose to start with Goodreads. It has by far the most egregious recommender system. And you hear people talk about their favorite TV shows or self-tracking wristbands all the time but you don’t always get to hear people talk about their favorite books.

“do the simple thing first”

I used something called user-based collaborative filtering, which is fancy language for “find other people who like the sort of stuff you like and then check what else they also like.”

There are lots of ways to do user-based collaborative filtering. Some algorithms only require grade school math while others rely on deep learning. (Aggarwal’s Recommender Systems is by far the best textbook on the subject that I know of, in case you’re interested.) Here I used something that only requires grade school math and can’t even be called an “algorithm”: cosine similarity.

But before I talk about cosine similarity let me talk about the data we need to collect in order to use it.

scraping

You start by collecting, for every item you ever rated, all other ratings it has received. And you also collect the identity (name or user ID) of all the other raters - all the people who have rated items in common with you. So I had to collect, for each book I ever rated on Goodreads, all other ratings that book received on Goodreads, along with the user IDs of the other raters.

Unfortunately, Goodreads doesn’t let us see all the ratings each book has received. Goodreads lets us see the full distribution of ratings: forty thousand 5-star ratings, twenty thousand 4-star ratings, and so on (see below). But it doesn’t let us see all the ratings themselves. And if I can’t see a rating I can’t know whose rating it is, so for our purposes here it might as well not exist.

How many ratings does Goodreads let us see? If you visit a book’s page - say, Ender’s Game page - you immediately see up to 30 ratings (fewer if it’s an obscure book that not even 30 people have bothered to rate). If the book has more than 30 ratings you can click “next” on the bottom of the page and that shows you 30 more ratings. You can keep clicking “next” until you get to page ten, which means you can get up to 300 ratings in total. That’s regardless of the book’s popularity: it can be Pride and Prejudice, which has 2.3 million ratings, and you still won’t be able to see beyond page ten.

Now, we can bypass that limit - to a point. We can filter the contents and ask to see just 1-star ratings or just 2-star ratings and so on. And each time we will get back up to ten pages (of 1-star ratings or 2-star ratings or whatever else we chose). Goodreads ratings go from 1 to 5 stars, integers only, so in practice this filtering lets us multiply five-fold the total number of ratings we can collect for each book: we increase it from 300 to 1500.

We can even go beyond 1500. See the “Sort” button in the above picture? We can sort the ratings from older to newer or from newer to older. Either way we still get back up to ten pages. So we can get 1500 ratings sorted from older to newer and then get 1500 ratings sorted from newer to older. That way we end up with 3000 ratings (some - or all - of which will be duplicates when the book is not popular enough to have received more than 3000 ratings in total).

I didn’t do the sorting thing though. It only occurred to me after I had done everything and I’m in no mood to go back and redo everything. So what I have is only up to 1500 ratings per book. Sorry, reviewer #2.

Alright then, enough talk. Here is the Python code I wrote to get all the ratings I ever gave on Goodreads. I used Selenium and Chrome. If you’re new to scraping then here’s a tutorial. You don’t need to be logged in to scrape Goodreads (though that reduces the size of the data a little: some Goodreads users choose to make their profiles available only to people who are logged in).

import time
import pickle
from selenium import webdriver
from bs4 import BeautifulSoup

# initialize browser
path_to_executable = '/path/to/chromedriver'
browser = webdriver.Chrome(path_to_executable)

# get my own reviews
my_url = 'https://www.goodreads.com/review/list/123456789-your-user-id-here?print=true'
browser.get(my_url)
time.sleep(5)
source = browser.page_source
soup = BeautifulSoup(source, 'lxml')

# get total number of pages
pagination_div = soup.find('div', {'id': 'reviewPagination'})
page_links = pagination_div.find_all('a')
last_page = page_links[-2] # page_links[-1] is "next >>"
total_pages = int(last_page.text)
print('total_pages:', total_pages)

my_reviews = {}
for page in range(1, total_pages + 1):

    # get page and extract table that contains the reviews
    new_page = my_url + '&page=' + str(page)
    browser.get(new_page)
    time.sleep(5)
    source = browser.page_source
    soup = BeautifulSoup(source, 'lxml')
    table = soup.find('table', {'id': 'books'})
    rows = table.find_all('tr', class_ = 'bookalike review')

    # for each review get book title, link, and my rating
    for i, row in enumerate(rows):
        title_column = row.find('td', class_ = 'field title')
        title = title_column.find('a').text
        href = title_column.find('a')['href']
        rating_column = row.find('td', class_ = 'field rating')
        rating_div = rating_column.find('div', class_ = 'value')
        rating_span = rating_div.find('span', class_ = 'staticStars')
        if rating_span.has_attr('title'):
            my_rating = rating_span['title']
            my_reviews[str(page) + '_' + str(i)] = (title, href, my_rating)

with open('my_reviews.pkl', mode = 'wb') as f:
    pickle.dump(my_reviews, f)

And here is the code I wrote to scrape all the ratings of every book I’ve ever rated. Now, this code takes a while to run - it’s getting every rating of every book I ever read -, so I used Chrome in headless mode.

import time
import pickle
from selenium import webdriver
from bs4 import BeautifulSoup

# load my own reviews
with open('my_reviews.pkl', mode = 'rb') as f:
    my_reviews = pickle.load(f)

# get book URLs
book_endpoints = [e[1] for e in my_reviews.values()]

# initialize browser
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('headless')
chrome_options.add_argument('log-level=3')
chrome_options.add_experimental_option('prefs', {'profile.default_content_setting_values.automatic_downloads': 1})
executable_path = '/path/to/chromedriver'
browser = webdriver.Chrome(executable_path = executable_path, chrome_options = chrome_options)

# get book reviews
all_reviews = {}
for i, book_endpoint in enumerate(book_endpoints):
    print(' ')
    print(i, book_endpoint)

    # get book id
    if '.' in book_endpoint:
        book_id = int(book_endpoint[11:book_endpoint.index('.')])
    elif '-' in book_endpoint:
        book_id = int(book_endpoint[11:book_endpoint.index('-')])
    else:
        book_id = book_endpoint[11:]

    # create empty list to store book's reviews
    all_reviews[book_id] = []

    source = ''

    # iterate through all possible ratings (1-5)
    for rating in range(1, 6):
        print('rating:', rating)

        # get source code of book's first page and soupify it
        browser.get('https://www.goodreads.com' + book_endpoint + '?rating=' + str(rating))
        time.sleep(5)
        source = browser.page_source
        soup = BeautifulSoup(source, 'lxml')

        page = 1
        while True:

            # get divs that contain the reviews
            reviews = soup.find_all('div', class_ = 'review')

            for review in reviews:

                # get endpoint of the reviewer's URL
                reviewer = review.find('a', class_ = 'user')['href']

                # get text
                text_div = review.find('div', class_ = 'reviewText stacked')
                if text_div:
                    text = text_div.find('span', class_ = 'readable').text
                else:
                    text = None

                # store review
                all_reviews[book_id].append((reviewer, rating, text))

            print('page:', page, 'book reviews:', len(set(all_reviews[book_id])))

            # go to next page
            if '<a class="next_page"' in source:
                try:
                    element = browser.find_element_by_partial_link_text('next ')
                    browser.execute_script("arguments[0].click();", element)
                    while browser.page_source == source:
                        time.sleep(1)
                    source = browser.page_source
                    soup = BeautifulSoup(source, 'lxml')
                    page += 1
                except:
                    print('"next" present but couldnt click it')
                    break
            else:
                print('no "next" element')
                break

with open('raters.pkl', mode = 'wb') as f:
    pickle.dump(all_reviews, f)

On Goodreads you can give a star-based rating, as we’ve discussed, but you can also write a review about the book. The code above gets these reviews too. I won’t use them here, but I may in a future blog post (it might be fun to inspect the adjectives most frequently associated with each book).

les data

In total I got 292,043 ratings from 197,548 different Goodreads users. That’s all from 357 books, which is the number of books I had rated by the time I did this. That’s all the data I need to find people who like the same sort of stuff I like.

In fact, that’s actually more data than I need. I have to discard everyone with whom I have only one book in common (more about that in a minute). After doing that I’m left with 43,825 raters and the following distribution:

As you see, the vast majority of these users have only a couple of books in common with me: they are concentrated in the 2-10 range of the X axis. Only a few users have rated more than 10 books in common with me. The maximum is 68 books. That’s Stephen.

Turns out Stephen and I have a similar sci-fi taste. We’re both into the classics - Heinlein, Bradbury, Asimov - and into more recent authors like Neal Stephenson, Charles Stross, and Ted Chiang. We’re both also into economic and moral philosophy: Friedman, Hayek, Emerson, Nietzsche.

But no, I can’t just go after the books Stephen has read and I haven’t. Life is not so simple. His interests are far more diverse than mine: he’s into horror, fantasy, comics, crime, and dozens of other genres, whereas I can go a whole year reading nothing but sci-fi. I’ve only read 2% of the books he’s read, but he’s read 24% of the books I’ve read. So it’s not like Stephen and I are “book twins”, it’s just that Stephen reads a lot, so inevitably he will look like many people’s “book twin” at first glance.

Not to mention that our ratings don’t necessarily coincide. For instance, Stephen is much more indulgent with slow-paced novels than I am. He 4-starred Olaf Stapledon’s Last and First Men and C.J. Cherryh’s Downbelow Station, both of which I abandoned after just a few chapters. I like books that hook you from the very first paragraph.

The moon blew up without warning and for no apparent reason. (Seveneves, Neal Stephenson)

I did two things on my seventy-fifth birthday. I visited my wife’s grave. Then I joined the army. (Old Man’s War, John Scalzi)

Lolita, light of my life, fire of my loins. (Lolita, Vladimir Nabokov)

There are other complications but I’m getting ahead of myself. For now I just want to find potentially similar people, not evaluate each of them. Stephen is a potential neighbor - to use the jargon of user-based collaborative filtering -, but we won’t know for sure until much later. Keep reading.

There’s one complication I need to mention right now though. When I check Stephen’s profile on Goodreads I can see that we actually have rated 93 books in common, not 68. My dataset is lying to me. Why? Well, remember how I only got up to 1500 ratings for each book because Goodreads doesn’t let us see all the ratings? Turns out that for 25 of those 93 books Stephen’s ratings were not in the (up to) 1500 ratings I scraped for each book. Hence those 25 ratings didn’t make it into the dataset.

I could fix that by scraping each rater’s profile and checking all the books we actually have in common. But with 43,825 raters that would take forever. I could parallelize the scraping - have, say, 43 bots scrape ~1k raters each. But Goodreads would surely understand that as an attack and block my IP address. I could get around that by using proxies or remote servers but that’s a lot more effort than I’m willing to put into this. Sorry again, reviewer #2.

Alright, time to go back to the question I left hanging before: why did I have to drop users with only one book in common with me? To answer that I need to (finally) explain cosine similarity.

cosine similarity

Say we arrange the scraped data in a matrix where each column represents a book, each row represents a user, and each cell represents how many stars that user rated that book. Here I have 43,825 users and 357 books, so that’s a 43,825 by 357 matrix.

(Most users won’t have rated most books, so most of the cells are NULL or None or whatever null-like representation you prefer. It can’t be zero though - that would later be interpreted as a rating of zero stars -, so this isn’t really a sparse matrix, it’s something else. I don’t know that it has a specific name.)

Now suppose I pluck two arbitrary (row) vectors from that matrix and, in these vectors, I keep only the columns corresponding to books that both users have rated (i.e., I drop all columns corresponding to books that only one or neither user has rated). If you project the resulting two vectors onto a Euclidean space there will be an angle between them.

Here’s the thing: the cosine of that angle is a measure of how similar the two vectors are. Ta-da. Now you know what cosine similarity means.


A concrete example might help.

Let’s pluck two users from our matrix: myself and someone else. Say, Geneva. We have rated a total of five books in common according to the dataset. Here are our ratings for those five books:

book Geneva Thiago
The Moon is a Harsh Mistress 4 2
Neuromancer 3 3
Snow Crash 4 5
The End of Eternity 4 5
The Last Policeman 4 5

These are our two vectors then: [4, 3, 4, 4, 4] (Geneva’s ratings) and [2, 3, 5, 5, 5] (my ratings).

The first step here is to compute the Euclidean norm of each vector. It’s not hard: for each vector you square every element, you sum up all these squares, then you take the square root of that sum. So for Geneva’s vector the norm is \(\sqrt{4^2 + 3^2 + 4^2 + 4^2 + 4^2}\approx 8.54\). For my own vector the norm is \(\sqrt{2^2 + 3^2 + 5^2 + 5^2 + 5^2}\approx 9.38\).

Next we divide each element of a vector by the vector’s norm. Hence Geneva’s vector becomes \([\frac{4}{8.54}, \frac{3}{8.54}, \frac{4}{8.54}, \frac{4}{8.54}, \frac{4}{8.54}]\) = [0.47, 0.35, 0.47, 0.47, 0.47] and mine becomes \([\frac{2}{9.38}, \frac{3}{9.38}, \frac{5}{9.38}, \frac{5}{9.38}, \frac{5}{9.38}]\) = [0.21, 0.32, 0.53, 0.53, 0.53].

Finally, the cosine itself. It’s just the dot product of those two normalized vectors: 0.47(0.21) + 0.35(0.32) + 0.47(0.53) + 0.47(0.53) + 0.47(0.53) = 0.10 + 0.11 + 0.25 + 0.25 + 0.25 \(\approx\) 0.96.

How similar are the two vectors if the cosine of their angle is 0.96? They are about as similar as they can be: for any two vectors of our Goodreads matrix the cosine will be in the 0-1 range. The closer to 0, the more different the two vectors are. The closer to 1, the more similar.

(Before the math police come after me: “in abstract” a cosine can vary from -1 to +1. But in our Goodreads matrix we only have nonnegative values; Goodreads doesn’t let you rate a book negative stars. So here the cosine between any of our two vectors has to be in the 0 to +1 range.)
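If you’d rather let numpy check the arithmetic, here’s the same Geneva example in a few lines:

import numpy as np

geneva = np.array([4, 3, 4, 4, 4])
thiago = np.array([2, 3, 5, 5, 5])

# dot product of the two vectors divided by the product of their norms
cosine = np.dot(geneva, thiago) / (np.linalg.norm(geneva) * np.linalg.norm(thiago))
print(cosine)  # ~0.96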

If I compute the cosine between my own vector and every other vector in our matrix then I’ll know which users are most similar to me, book-wise.

Perhaps now you can see why I dropped users who only had one book in common with me. An angle cannot exist between two points, only between two vectors. And I need at least two points to form one vector.

Note that I have only considered the five books that Geneva and I have both rated. We can’t compute the cosine between two vectors that are not the same size. If I add a sixth book that’ll be a book that either only Geneva has rated or only I have rated or neither of us has rated. We’d have one vector with six elements and one vector with five elements (no, we can’t input zero or any other value to fill the gap). That wouldn’t work.

Another thing to keep in mind is that, as I explained before, I only got up to 1500 ratings for each book, so the number of books I have in common with each of the other Goodreads users is actually higher than what’s in the dataset. I checked Geneva’s profile and it turns out that we share 21 books, not five. So by computing the cosines between my ratings and everyone else’s ratings I get a rough idea about who my closest neighbors probably are; I don’t get a precise, 100% accurate ranking of my closest neighbors.

too many book twins

So I sat down and computed the cosine similarity between me and every other person on my Goodreads dataset. (Well, actually I sat down and wrote code to do that for me.)

Turns out there are lots of people for whom cosine=1, the maximum possible value. But no, I don’t have lots of book twins out there. The problem is: a cosine of 1 is pretty common when the two vectors are short.

Take user Ajay, for instance. We have two books in common: How to Lie with Statistics and Skin in the Game. He rated both 5 stars. I rated both 3 stars. If you do the math you’ll find a cosine of 1. Or take Derek. We also have two books in common: The Signal and the Noise and Neuromancer. He rated them 3 and 2 stars respectively. I rated them 4 and 3 stars respectively. The cosine is also essentially 1 (0.998).

Let’s put this in perspective. Remember Stephen? Our cosine similarity is 0.93. Now that’s a guy with whom I share 68 books. Should Ajay and Derek rank higher than Stephen on my similarity ranking? I don’t think so. I’d much rather read a book suggested by Stephen than a book suggested by Ajay or Derek.

Alright then, should I drop everyone with fewer than, say, ten books in common with me? Nope. Not all books are equally important. The two books I share with Ajay and the two books I share with Derek are all pretty popular. Lots of people have read them, most of whom are certainly not in my tribe. Now what if I had only two books in common with someone but those two books were, say, David Wingrove’s The Middle Kingdom and John Ringo’s Live Free or Die?

That would be a different story. Those two books are excellent but little known. The Middle Kingdom, published in 1989 - back when China’s GDP was 3.8% of what it is today -, shows a future where history has been rewritten and schoolchildren learn that the Chinese have been ruling the world since general Ban Chao (an actual person) destroyed the Roman Empire in the first century AD. The Middle Kingdom makes Lord of the Rings and Game of Thrones seem juvenile and unimaginative. And yet it has only 1.3k ratings on Goodreads. Live Free or Die, in turn, is about a guy named Tyler Vernon. Earth has been conquered by an alien race - the Horvath. Tyler sells maple syrup to other alien races (it works as a mixture of cocaine and aphrodisiac for them). He then uses the money to buy alien military technology and fight back against the Horvath. It’s great, unapologetic, libertarian sci-fi - government fails to fight off hostile aliens, private sector steps up; it’s Robert Heinlein meets Ayn Rand. And yet only 7.3k brave souls have rated it on Goodreads.

My point is: if I only have two books in common with you and these two books are The Middle Kingdom and Live Free or Die then I definitely want to hear your book recommendations. Those two books are too good and too obscure. Reading them certifies you as the sort of weirdo I want to get book recommendations from. That’s why I can’t simply drop people with whom I have only two or three books in common. They may be the right books.

We can think about it from a probabilistic perspective. The fact that you’ve read The Middle Kingdom and Live Free or Die tells me more about you than the fact that you’ve read, say, Ender’s Game. Don’t get me wrong, I love Ender’s Game - it’s on my top 5 list of all-time favorites. But nearly a million people have rated Ender’s Game on Goodreads. The more popular the book, the more diverse its readers. Surely there are Ender’s Game readers who enjoy boring novels like Anne of Green Gables or The Handmaid’s Tale. You telling me that you’ve read Ender’s Game doesn’t tell me much about what type of reader you are. Same with Ready Player One, Snow Crash, and World War Z: excellent books, but so popular that they attract all sorts of people, including lots of people I absolutely don’t want to get book recommendations from. In probabilistic terms, P(good recommendations|has read The Middle Kingdom) > P(good recommendations|has read World War Z).

What to do then?

measuring how weird you are

The solution is to use something called Inverse User Frequency (IUF). Here is how it works. Goodreads has ~65 million users. To Kill a Mockingbird has been rated by 3.6 million users. So its IUF is log(65/3.6) \(\approx\) 2.9. The Middle Kingdom, on the other hand, has been rated by only 1.3k users. So its IUF is log(65/0.0013) \(\approx\) 10.8. So, The Middle Kingdom is almost four times weirder than To Kill a Mockingbird (without taking the log the differences would be too extreme; we want to take the book’s popularity into account but not too much). (If you’re familiar with natural language processing you will notice this is similar to Inverse Document Frequency.)
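In code the IUF is a one-liner. A quick sketch, using the approximate counts above:

import numpy as np

total_users = 65_000_000

def iuf(n_raters):
    # natural log of (total users / users who rated the book)
    return np.log(total_users / n_raters)

print(iuf(3_600_000))  # To Kill a Mockingbird, ~2.9
print(iuf(1_300))      # The Middle Kingdom, ~10.8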

So instead of computing raw cosine similarities we are going to compute weighted cosine similarities, with each rating being weighted by the IUF of the corresponding book.

To give a concrete example, let’s go back to Geneva. Here are the IUFs of the books we have in common.

book raters IUF
The Moon is a Harsh Mistress 91k 6.5
Neuromancer 216k 5.7
Snow Crash 200k 5.7
The End of Eternity 34k 7.5
The Last Policeman 20k 8

Now that we have the IUFs we incorporate them in the cosine computation.

\[\dfrac{0.10(6.5) + 0.11(5.7) + 0.25(5.7) + 0.25(7.5) + 0.25(8.0)}{6.5 + 5.7 + 5.7 + 7.5 + 8.0}\\ \approx 0.20\]

That’s more like it. The books I have in common with Geneva are pretty popular and that’s reflected in our similarity going down from 0.96 to 0.20. Good. Now, the decrease is not as drastic as it seems. Weighting by IUF makes all cosine similarities smaller. Let’s go back to the hypothetical example of someone with whom I share two books and they are The Middle Kingdom and Live Free or Die. If that other person has rated both 4 and I have rated both 5 then our raw cosine similarity is 1 but our weighted cosine similarity is only 0.5.
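Here’s that weighted cosine in numpy, using the Geneva example again:

import numpy as np

geneva = np.array([4, 3, 4, 4, 4])
thiago = np.array([2, 3, 5, 5, 5])
iuf = np.array([6.5, 5.7, 5.7, 7.5, 8.0])

# element-wise products of the normalized vectors, weighted by each book's IUF
products = (geneva / np.linalg.norm(geneva)) * (thiago / np.linalg.norm(thiago))
print(np.sum(products * iuf) / np.sum(iuf))  # ~0.20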

still doesn’t quite feel like my tribe…

I adjusted the cosine similarities to account for the IUF, looked into the people with the highest similarity scores and… things were still a bit odd. When I looked into the top 10 folks and the books they’ve rated - and what their ratings were - I just couldn’t quite picture myself taking book recommendations from them in real life (like, if I had met them at a café). I know, this is super wishy-washy, but I’m sure you get the idea.

Turns out I had neglected to take into consideration that some people are just more generous and forgiving than others. Some people have a mean rating of 4.5 while others have a mean rating of 2.5. So we need to mean-center people’s ratings before we compute the similarity scores. As in: if your mean rating is 2.5 and you rated Snow Crash 4 stars then I’ll actually consider it 1.5 stars when computing our cosine similarity.
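Mean-centering itself is trivial. A sketch, with hypothetical ratings:

import numpy as np

# hypothetical ratings, one array per user
ratings = {'stephen': np.array([5, 4, 4, 2]), 'geneva': np.array([4, 3, 4, 4])}

# subtract each user's own mean before computing similarities
centered = {user: r - r.mean() for user, r in ratings.items()}
print(centered)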

So I redid everything, this time mean-centering the ratings.

did it work?

Well, we can’t answer that question just yet. First we need to answer another question: how do we know how good (or bad) the results are?

Turns out there is a simple metric for that. I can use the similarities I generated to “predict” my own ratings for books I have already rated, then compare the predicted ratings to my actual ratings.

Say I take Stephen and Geneva, whom we’ve encountered before. Stephen rated Neal Stephenson’s Snow Crash five stars. Geneva rated it four stars. My similarity with Stephen is 0.93. My similarity with Geneva is 0.20. So if I wanted to use only Stephen and Geneva to produce a prediction, I could make it \(\dfrac{5(0.93) + 4(0.20)}{0.93 + 0.20} \approx 4.82\). In other words, I could use a weighted average of Stephen’s and Geneva’s ratings to “predict” what my own rating of Snow Crash would be (the weights being our respective similarities). My actual rating of Snow Crash is five stars, which is pretty close to 4.82.
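In code the prediction is just a similarity-weighted average. Here it is for the Snow Crash example:

import numpy as np

neighbor_ratings = np.array([5, 4])    # Stephen's and Geneva's ratings of Snow Crash
similarities = np.array([0.93, 0.20])  # my similarity with each of them

# similarity-weighted average of the neighbors' ratings
print(np.sum(neighbor_ratings * similarities) / np.sum(similarities))  # ~4.82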

I can do this operation for all books I ever rated. And then I can check how close my predicted ratings and my actual ratings are. How can we measure that? Hmm, if only there were a way to compare the similarity of two vectors… Oh wait, there is: cosine similarity. So I use cosine similarity twice: first to compute user-user similarities, then to assess the quality of the predictions those similarities produce. What a handy tool.

Now, here I don’t want to use all Goodreads users I scraped. The whole point of this exercise is to find my tribe, i.e., the people whose literary tastes are most similar to mine. That’s the folks I want to get book recommendations from. So I have to define a threshold - say, my top 100 or my top 10000 most similar users. And then I predict the ratings for the books I have rated in common with these top k people - my top k “neighbors”, as textbooks refer to them.

There is a trade off in choosing k. The higher the k the more neighbors you have to get predictions from, so the more books you have predictions for. But the higher the k the more you get predictions from neighbors who are not so close to you. Here I tried a bunch of different values for k and it turns out that k=1000 seemed like a good compromise.

Now we’re ready to answer the original question: how good are the results? Turns out the cosine similarity between my predicted ratings and my actual ratings, using my top 1000 neighbors, is 0.91 (after adding back the mean, which we removed before - remember?). Since the cosine maxes out at 1, that seems pretty good. Looks like I’ve finally found my tribe.

so what?

Ok, so how do we generate book recommendations? Well, I just “predicted” what my ratings would be for books I have already rated. That’s how I assessed the quality of the similarities. (That’s how we assess any predictive model: we “predict” things that have already happened, then check how right or wrong those “predictions” are.) Now, to get actual book recommendations I simply predict what my ratings will be for books I have not rated and I pick the books which I’m predicted to rate highly. So this time I will predict without quotes: what I’m predicting hasn’t happened yet.

I went back to my top 1000 neighbors and scraped their ratings for every book they have ever rated (and that I haven’t). Then I predicted what my own rating would be for each one of those books.

That resulted in a ranking of 29,335 books. Here are the top 12:

  • Jim Butcher’s Storm Front. This is the first of a series of books, several of which were in the top 100. I had never heard of it before. I’m not really into the fantasy genre, but I guess I’ll give it a try.

  • Brandon Sanderson’s The Way of Kings. Another first book of a fantasy series I had never heard of.

  • Carl Sagan’s Cosmos. I’ve seen the Neil deGrasse Tyson show and I loved it, so I’m surely going to read this.

  • Brian Vaughan and Fiona Staples’ Saga. I’ve never read a graphic novel before and I never thought I would, but I guess there is a first time for everything.

  • Terry Pratchett’s The Color of Magic. Yet more fantasy. This one I had seen around but the book description hadn’t appealed to me.

  • Mario Puzo’s The Godfather. Ok, this probably should have been in my want-to-read shelf already.

  • Yuval Harari’s Sapiens. At one point I had this book in my want-to-read shelf but then I read Harari’s other book, Homo Deus, and I didn’t really enjoy it, so I gave up on Sapiens. I guess I’ll put it back there now.

  • Brandon Sanderson’s The Final Empire. Yet more fantasy I had never heard of.

  • David Weber’s On Basilisk Station. Now, that’s what I’m talking about! Finally some sci-fi in my recommendations. This is probably the one I’ll read first.

  • Patrick Rothfuss’ The Name of the Wind. More. Freaking. Fantasy. This one I had seen around before but the title is so uninspiring that I never even bothered to read the book’s description.

  • Dostoyevsky’s The Brothers Karamazov. Not sure what to do about this one. I endured Tolstoy’s Anna Karenina and War and Peace and from what I’ve heard Dostoyevsky is similarly soporific. I may not have the fortitude to go through this one.

  • J.J. Benítez’ Jerusalén. This one I read before - some 25 years ago, as a kid. (But I rated a different edition on Goodreads, so it ended up on my recommendations here.) This was the first book about time travel I ever read and I absolutely loved it. Now, my thirteen-year-old self was not a particularly discriminating reader. So I think I’ll pass. I don’t want to ruin the good memories I have of it.

And voilà, I have built and used my own book recommender system.

These results are less exciting than I expected. I don’t understand all the fantasy books. I’ve read a fantasy novel here and there but I’ve read all of Asimov’s Robot, Empire, and Foundation novels, most Neal Stephenson books, a lot of Robert Heinlein and John Scalzi - so why the heck is there only one sci-fi book in my top 12 recommendations?

I guess some of the oddities may be due to my including non-fiction books when computing the cosine similarities. The fact that you and I have both read Daron Acemoglu’s Economic origins of dictatorship and democracy and William Greene’s Econometric analysis and Luciano Ramalho’s Fluent Python may result in a high similarity score but that doesn’t mean we like the same stuff when it comes to fiction. Dropping non-fiction books might improve the results. That would require going through each of my books and flagging them as either fiction or non-fiction, then redoing everything I’ve done here, so I’m not doing it. You and I will just have to live with that, reviewer #2.

But there is a lot more room for improvement here - it’s not just about dropping non-fiction books. Let’s talk about some possibilities.

what next?

If you’re building a recommender system on top of cosine similarity or some other such metric the way to go about it is to tweak the similarity formula in all sorts of ways - give more weight to this or that, etc -, check how good or bad the resulting predictions are in each case, then pick the formula that yields the best predictions. I did a little tweaking here (IUF, mean-centering), but if you’re serious about it you should try a lot more.

Besides, there are other ways to build recommender systems that don’t rely on hard-coded similarity formulas. For instance, we could use linear regression to learn the weight each neighbor should have. We’d still end up with a weighted average for the predictions, but the weights would have been learned from the data.

Alternatively we could use algorithms like random forest, support vector machine, or neural network. These algorithms have one big advantage over a linear regression: they learn weird relationships between the variables you have. It could be, for instance, that I’m unlikely to enjoy books that are recommended by both Stephen and Geneva but likely to enjoy books that are recommended only by Geneva and not by Stephen or vice-versa. Or that I’m super likely to enjoy a book recommended by Peter only if his rating is exactly five stars and all other ratings are exactly one star. Linear regression won’t capture any of that unless you know beforehand what these oddities are and model them explicitly. But random forest, SVM, and neural networks can learn those oddities from the data and incorporate them into their predictions. The drawback is that if you don’t have a lot of data these algorithms can easily mistake noise for signal and incorrectly pick up idiosyncrasies.

You could also use clustering algorithms, like k-means or DBSCAN. They work by grouping (clustering) samples (in our case, those would be Goodreads users) together according to their similarities.

Now, using linear regression, random forest, SVM, neural network, or clustering requires overcoming a problem: missing data. Here my top 1000 neighbors don’t have a single book that they all have rated in common. There are too many missing data points to run a regression or do anything of the kind. So before doing any of that you’ll need to either impute the missing data points or use matrix factorization to reduce the dimensionality of your dataset. The exception is k-means, which is relatively straightforward to adapt for use with massive missing data (see the Aggarwal book I mentioned before, chapter 12).

An alternative that bypasses the missing data issue entirely is to model the data as a big graph, which is just fancy data science jargon for a network, like Facebook or Twitter: you have nodes (people and pages) and links between them (A follows B, A is friends with B, etc). Here we could model each Goodreads user as a node and each rating that two users have in common as a link between them. Then we could use a class of algorithms known as community detection algorithms to find what the “cliques” are.

Here I was just building a recommender system for myself, for the fun of it, so I’m not going to do any of that. But if you’re doing something serious with it then you probably should try at least some of these alternatives and see which one yields the best predictions.


This is it. Happy reading!

Outside of a dog, a book is man’s best friend. Inside of a dog it’s too dark to read. (Groucho Marx)