Mancur Olson and stock picking

I just finished re-reading Mancur Olson’s “The rise and decline of nations”, published in 1982. (According to my notes I had read parts of it back in grad school, for my general exams, but I have absolutely no memory of that.) Olson offers an elegant, concise explanation for why economic growth varies over time. TL;DR: parasitic coalitions - unions, subsidized industries, licensed professions, etc - multiply, causing distributive conflicts and allocative inefficiency, and those coalitions can’t be destroyed unless there is radical institutional change, like foreign occupation or totalitarianism. In a different life I would ponder the political implications of the theory. But I’ve been in a mercenary mood lately, so instead I’ve been wondering what the financial implications are for us folks trying to grow a retirement fund.

the argument

Small groups organize more easily than large groups. That’s why tariffs exist: car makers are few and each has a lot to gain from restricting competition in the car industry, whereas consumers are many and each has comparatively less to gain from promoting competition in the car industry. This argument is more fully developed in Olson’s previous book “The logic of collective action”. What Olson does in “The rise and decline of nations” is to use that logic to understand why rich countries decline.

Once enacted, laws that benefit particular groups at the expense of the rest of society - like tariff laws - are hard to eliminate. The group that benefits from the special treatment will fight for its continuation. Organizations will be created for that purpose. Ideologies will be fomented to justify the special treatment (dependency theory, for instance, is a handy justification for tariffs and other types of economic malfeasance).

Meanwhile, the rest of society will be mostly oblivious to the existence of that special treatment - as Olson reminds us, “information about collective goods is itself a collective good and accordingly there is normally little of it” (a tariff is a collective good to the protected industry). We all have better things to do with our lives; and even if we could keep up with all the rent-seeking that goes on we’d be able to do little about it, so the rational thing to do is to stay ignorant (more on this).

Hence interest groups and special treatments multiply; competition is restricted, people invest in the wrong industries, money is redistributed from the many to the few. The returns to lobbying relative to the returns of producing stuff increase (more on this). Economic growth slows down. Doing business gets costlier. As Olson puts it, “the accumulation of distributional coalitions increases the complexity of regulation”. Entropy. Sclerosis.

Until… your country gets invaded by a foreign power. Or a totalitarian regime takes over and ends freedom of association. One way or another the distributional coalitions must be obliterated - that’s what reverses economic decline. That’s the most important take-away from the book. Not that Olson is advocating foreign invasion or totalitarianism. (Though he does quote Thomas Jefferson: “the tree of liberty must be refreshed from time to time with the blood of patriots and tyrants”.) Olson is merely arguing that that’s how things work.

I won’t get into the empirical evidence here. The book is from 1982 so, unsurprisingly, Olson wasn’t too worried about identification strategy, DAGs, RCTs. If the book came out today the reception would probably be way less enthusiastic. There are some regressions here and there but the book is mostly narrative-driven. But since 1982 many people have tested Olson’s argument and a 2016 paper found that things look good:

Overall, the bulk of the evidence from over 50 separate studies favors Olson’s theory of institutional sclerosis. The overall degree of support appears to be independent of the methodological approach between econometric regression analysis on growth rates versus narrative case studies, publication in an economics or a political science journal, location of authorship from an American or European institution, or the year of publication.

can Olson help us make money?

There isn’t a whole lot of actionable knowledge in the book. We don’t have many wars anymore:

I mean, between 2004 and 2019 Iraq’s GDP grew at over twice the rate of Egypt’s or Saudi Arabia’s. And between 2002 and 2019 Afghanistan’s GDP grew at a rate 36% faster than Pakistan’s. But that’s about it. A Foreign-Occupied Countries ETF would have a dangerously small number of holdings.

And there aren’t many totalitarian regimes in place:

(Chart: waves of democracy over time.)

A Dear Leader ETF would also be super concentrated. (Not to mention that it would contain North Korea and Turkmenistan.)

In any case, GDP growth and stock market growth are different things. Over the last five years the S&P500 outgrew the US GDP by 83%. Conversely, the annualized return of the MSCI Spain was lower than 1% between 1958 and 2007 even though the annualized GDP growth was over 3.5% in that same period.

So, country-wise there isn’t a lot we can do here.

What about industry-wise?

Olson’s book is about countries - it’s in the very title. But his argument extends to industries. Olson himself says so when he talks about the work of economist Peter Murrell:

Murrell also worked out an ingenious set of tests of the hypothesis that the special-interest groups in Britain reduced the country’s rate of growth in comparison with West Germany’s. If the special-interest groups were in fact causally connected with Britain’s slower growth, Murrell reasoned, this should put old British industries at a particular disadvantage in comparison with their West German counterparts, whereas in new industries where there may not yet have been time enough for special-interest organizations to emerge in either country, British and West German performance should be more nearly comparable.

In other words, new industries will have fewer barriers to entry and other organizational rigidities. This seemingly banal observation, which Olson saw as nothing more than a means to test his theory, may actually explain why we don’t have flying cars.

Back in 1796 Edward Jenner, noting that milkmaids were often immune to smallpox, and that they often had cowpox, took aside his gardener’s eight-year-old son and inoculated the boy with cowpox. Jenner later inoculated the boy with smallpox, to see what would happen, and it turned out that the boy didn’t get sick - the modern vaccine was invented. Now imagine if something like the FDA existed back then. We probably wouldn’t have vaccines today, or ever.

It’s the same with the internet, social media, Uber, just about any new industry or technology: at first the pioneers can do whatever they want. But then parasitic coalitions form. They can form both outside the new sector and inside it. Taxi drivers will lobby against Uber. Uber will lobby for the creation of entry barriers, to avoid new competitors - incumbents love (the right type of) regulation. Both sources of pressure will reduce efficiency and slow growth.

Investment-wise, what does that mean? It means it will become a lot harder for, say, ARK funds to keep their current growth rates. ARK funds invest in disruptive technologies - everything from genomics to fintech to space exploration. Those are all new industries, with few parasitic coalitions, so right now they’re booming. (Well, they may also be booming because we’re in a big bull market.) But - if Olson is right - as those industries mature they will become more regulated and grow more slowly. ARK will need to continually look for new industries, shedding its holdings in older industries as it moves forward.

In short: Olson doesn’t help us pick particular countries to invest in, and he doesn’t help us pick particular industries to invest in, but he helps us manage our expectations about future returns. Democratic stability means that fewer of us die in wars and revolutions, but it also means that buying index funds/ETFs may not work so well in the future. Democratic stability - and the parasitic coalitions it fosters - means that today’s 20-year-old kids may be forced to pick stocks if they want to grow a retirement fund. (Though if you’re not Jim Simons what are your chances of beating the market? Even Simons is right only 50.75% of the time. Are we all going to become quant traders? Will there be any anomalies left to exploit when that happens?)

Or maybe when things get bad enough we will see revolutions and wars and then economic growth will be restarted? Maybe it’s all cyclical? Or maybe climate change will be catastrophic enough to emulate the effects of war and revolution? Maybe we will have both catastrophic climate change and wars/revolutions?

but what if the country has the right policies?

Olson gives one policy prescription: liberalize trade and join an economic bloc. Free trade disrupts local coalitions, and joining an economic bloc increases the cost of lobbying (it’s easier for a Spanish lobby to influence the national government in Madrid than the EU government in Brussels). But policy is endogenous: the decision to liberalize trade or join an economic bloc will be fought by the very parasitic coalitions we would like to disrupt. And if somehow the correct policy is chosen, it is always reversible - as the example of Brexit makes clear. Also, not all economic blocs are market-friendly. Mercosur, for instance, has failed to liberalize trade; instead, it subjects Argentine consumers to tariffs demanded by Brazilian producers and vice-versa. (I have no source to quote here except that I wasted several years of my life in Mercosur meetings, as a representative of Brazil’s Ministry of Finance.)

So it won’t do to look at the Heritage Foundation’s index of economic freedom and find, say, the countries whose index show the most “momentum”. One day Argentina is stabilizing its currency, Brazil is privatizing Telebrás. Latin America certainly looked promising in the 1990s. Then there is a policy reversal and we’re back to import-substitution. You can’t trust momentum. You can’t trust economic reform. You can’t trust any change that doesn’t involve the complete elimination of parasitic coalitions. And even then things may decline sooner than you’d expect - Chile is now rolling back some of its most important advances.

what if ideas matter?

In the last chapter of the book Olson allows himself to daydream:

Suppose […] that the message of this book was then passed on to the public through the educational system and the mass media, and that most people came to believe that the argument in this book was true. There would then be irresistible political support for policies to solve the problem that this book explains. A society with the consensus that has just been described might choose the most obvious and far-reaching remedy: it might simply repeal all special-interest legislation or regulation and at the same time apply rigorous anti-trust laws to every type of cartel or collusion that used its power to obtain prices or wages above competitive level.

That sounds lovely, but it’s hard to see it happening in our lifetimes. Even if ideas do matter, the Overton window is just too narrow for that sort of idea. It’s probably The Great Stagnation from now on.

measuring statism

When it comes to economic freedom, Brazil ranks a shameful 144th - behind former Soviet republics like Ukraine (134th) and Uzbekistan (114th). Down here the recently announced PlayStation 5 is going to sell for twice the price it sells for in the US, because decades ago some academics and bureaucrats decided that heavy taxes on videogames would make industrialists manufacture “useful” stuff instead (cars and whatnot). We even try to regulate the work of people who watch parked cars for a living (I kid you not).

But how does that interventionist appetite vary across economic activities?

People try to answer that question in all sorts of ways. They look at how much money each industry spends on lobbying. They look at how often lobbyists travel to the country’s capital. They look at how well-funded each regulatory agency is. They count the number of words in each industry’s regulations (RegData is a cool example; Letícia Valle is doing something similar for Brazil).

Here I’ll try something else: I’ll look at the companies that get mentioned in the Diário Oficial da União, which is the official gazette where all the acts of the Brazilian government are published. I’ll look up what each company does (mining? banking? retail? etc) so that I can aggregate company mentions by economic sector. That will let me know how much government attention each sector receives.

(Yes, there are lots of caveats. We’ll get to them in due time - chill out, reviewer #2.)

mining the Diário Oficial da União

Each company in Brazil has a unique identifier. It’s like a Social Security Number, but for organizations. It’s a 14-digit number; we call it CNPJ (Cadastro Nacional de Pessoas Jurídicas). Here is an example: 20.631.209/0001-60. When a company is mentioned in the Diário Oficial da União, that company’s CNPJ is there.

I had already downloaded the Diário Oficial da União, for something else. Well, not all of it: the Diário was launched in 1862, but the online version only goes back to 2002. So here we have about 18 years of government publications.

To get all CNPJs mentioned in the Diário I used this regular expression:
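A pattern along these lines does the job - a sketch that assumes CNPJs appear in the Diário in their standard printed form (00.000.000/0000-00):

```python
import re

# matches CNPJs in the standard formatted layout, e.g. 20.631.209/0001-60
# (an assumption; adjust if the Diário also prints them as bare 14-digit runs)
CNPJ_PATTERN = re.compile(r'\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b')

sample = 'Contrato firmado com a empresa X, CNPJ 20.631.209/0001-60.'
print(CNPJ_PATTERN.findall(sample))
```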


I didn’t mine the whole Diário. The Diário is divided into three sections: section 1, which has laws and regulations; section 2, which has personnel-related publications (promotions, appointments, etc); and section 3, which has procurement-related publications (invitations for bids, contracts, etc). I only mined section 1, as the other sections would merely add noise to our analysis. Just because your company won a contract to supply toilet paper to some government agency doesn’t mean your industry is regulated.

That regular expression resulted in 1.5 million matches (after dropping matches that were not valid CNPJs). In other words, section 1 of the Diário contains 1.5 million CNPJ mentions between 2002 and mid-2020 (I scraped the Diário back in July and I was too lazy to scrape the rest of it now).
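Dropping the invalid matches means verifying the CNPJ’s two check digits, which are computed with a modulo-11 scheme over the preceding digits; a self-contained sketch:

```python
def cnpj_is_valid(cnpj):
    '''True if a CNPJ (formatted or bare digits) has correct check digits.'''
    digits = [int(c) for c in cnpj if c.isdigit()]
    if len(digits) != 14:
        return False
    def check_digit(ds):
        # weights cycle 2, 3, ..., 9 starting from the rightmost digit
        weights = [(i % 8) + 2 for i in range(len(ds))]
        remainder = sum(d * w for d, w in zip(reversed(ds), weights)) % 11
        return 0 if remainder < 2 else 11 - remainder
    return (digits[12] == check_digit(digits[:12]) and
            digits[13] == check_digit(digits[:13]))

print(cnpj_is_valid('20.631.209/0001-60'))  # True
```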

The result was a table that looks like this:

date        cnpj
2010-09-28  39302369000194
2010-09-28  39405063000163
2010-09-28  60960994000110
2010-09-29  31376361000160
2010-09-29  76507706000106
2010-09-29  08388911000140

That’s it for the Diário part. Now on to economic sectors.

from CNPJs to CNAEs

The Brazilian government has a big list of economic activities, and assigns to each of them a 5-digit numeric code. Hence 35140 is “distribution of electric energy”, 64212 is “commercial banks”, and so on. We call that taxonomy the CNAE (Classificação Nacional de Atividades Econômicas). There are some 700 CNAE codes in total.

(The CNAE is similar to the International Standard Industrial Classification of All Economic Activities - ISIC. You can find CNAE-ISIC correspondence tables here).

When you start a company in Brazil you’re required to inform the government what type of economic activity your company will be doing, and you do that by informing the appropriate CNAE code. That means each CNPJ is associated with a CNAE code. Fortunately, that data is publicly available.

I parsed that data to create a big CNPJ->CNAE table. If you want to do that yourself you need to download all 20 zip files, unzip each of them, and then run something like this:

import os
import pandas as pd

# path to the unzipped data files
path_to_files = '/Users/thiagomarzagao/Desktop/data/cnaes/'

# position of the CNPJ and CNAE fields
colspecs = [
    (4, 17), # CNPJ
    (376, 382), # CNAE
]

# fix indexing
colspecs = [(t[0] - 1, t[1]) for t in colspecs]

# set field names and types
names = {
    'CNPJ': str,
    'CNAE': str,
}

# load and merge datasets
dfs = []
for fname in os.listdir(path_to_files):
    if ('.csv' in fname) or ('.zip' in fname):
        continue # skip anything that isn't an unzipped data file
    df_new = pd.read_fwf(
        path_to_files + fname,
        skiprows = 1,
        skipfooter = 1,
        header = None,
        colspecs = colspecs,
        names = list(names.keys()),
        dtype = names
    )
    dfs.append(df_new)
df = pd.concat(dfs)

# drop CNPJs that don't have a valid CNAE
df = df.dropna()
df = df[df['CNAE'] != '0000000']

# drop duplicates
df = df.drop_duplicates()

# save to csv
df.to_csv(path_to_files + 'cnpjs_to_cnaes.csv', index = False)

I joined the table produced by that script with the table I had created before (with dates and CNPJs - see above). The result was something like this:

date        cnpj            cnae
2010-09-28  39302369000194  85996
2010-09-28  39405063000163  46192
2010-09-28  60960994000110  58115
2010-09-29  31376361000160  80111
2010-09-29  76507706000106  32302
2010-09-29  08388911000140  80111
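The join itself is a plain pandas merge; a minimal sketch with made-up rows (column names assumed from the two tables above):

```python
import pandas as pd

# mentions extracted from the Diário (date, CNPJ)
mentions = pd.DataFrame({
    'date': ['2010-09-28', '2010-09-29'],
    'cnpj': ['39302369000194', '31376361000160']})

# CNPJ->CNAE lookup built from the registry files
cnpj_to_cnae = pd.DataFrame({
    'CNPJ': ['39302369000194', '31376361000160'],
    'CNAE': ['85996', '80111']})

# inner join: keep only mentions whose CNPJ has a known CNAE
merged = mentions.merge(cnpj_to_cnae, left_on = 'cnpj', right_on = 'CNPJ')
merged = merged[['date', 'cnpj', 'CNAE']].rename(columns = {'CNAE': 'cnae'})
print(merged)
```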

That’s all we need to finally learn which sectors get the most government attention!

But first a word from my inner reviewer #2.


Perhaps pet shops appear in the Diário Oficial da União a lot simply because there are a lot of pet shops - and not because pet shops are heavily regulated. Also, the Diário Oficial da União only covers the federal government - which means that I am ignoring all state and municipal regulations/interventions. (Some nice folks are trying to standardize non-federal publications; they could use your help if you can spare the time.) Finally, each CNPJ can be associated with multiple CNAE codes; one of them has to be picked as the “primary” one, and that’s the one I’m using here, but it’s possible that using each CNPJ’s secondary CNAE codes might change the results.

This whole idea could be completely bonkers - please let me know what you think.

statism across economic sectors

Here are the 30 economic sectors whose companies most often show up in the Diário. Mouse over each bar to see the corresponding count.

(The description of some sectors is shortened/simplified to fit the chart. Sometimes the full description includes lots of footnotes and exceptions - “retail sale of X except of the A, B, and C types”, that sort of thing.)
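The counts behind a chart like this are a one-liner once the merged table is in hand; a sketch with invented CNAE codes (sector descriptions would come from a separate CNAE lookup table):

```python
import pandas as pd

# stand-in for the merged date/cnpj/cnae table, with made-up codes
merged = pd.DataFrame({'cnae': ['47717', '94308', '47717', '90019', '47717']})

# mentions per sector, largest first; .head(30) gives the chart's top 30
counts = merged['cnae'].value_counts().head(30)
print(counts)
```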

Wow. Lots to unpack here.


The top result - “retail sale of pharmaceutical goods” - is a big surprise to me. I mean, yes, I know that selling drugs to people is a heavily regulated activity. But I thought it was the job of states or municipalities to authorize/inspect each individual drugstore. I thought the federal government only laid out general guidelines.

Well, I was wrong. Turns out if you want to open a drugstore in Brazil, you need the permission of the federal government. And that permission, if granted, is published in the Diário. Also, it falls to the federal government to punish your little pharmacy when you do something wrong - irregular advertising, improper storage of medicine, and a myriad of other offenses. Those punishments also go in the Diário.

Now, we need to keep in mind that there are a lot more drugstores than, say, nuclear power plants. I’m sure that nuclear plants are under super intense, minute regulation, but because they are rare they don’t show up in the Diário very often. So we can’t conclude that selling drugs to people is the most heavily regulated sector of the Brazilian economy.

We could normalize the counts by the number of CNPJs in each economic sector. But I think the raw counts tell an interesting story. The raw counts give us a sense of how “busy” the state gets with each economic sector. I’m sure that nuclear plants are more heavily regulated than drugstores, but I bet that a lot more bureaucrat-hours go into regulating drugstores than into regulating nuclear plants.
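That normalization would just divide each sector’s mention count by its number of registered CNPJs; a sketch with invented numbers:

```python
import pandas as pd

# made-up figures, only to illustrate the normalization
mention_counts = pd.Series({'drugstores': 120000, 'nuclear plants': 300})
n_companies = pd.Series({'drugstores': 90000, 'nuclear plants': 3})

# mentions per registered company in each sector
mentions_per_company = mention_counts / n_companies
print(mentions_per_company)
```

With numbers like these the per-company intensity flips the ranking: each nuclear plant gets far more government attention than each drugstore, even though drugstores dominate the raw counts.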

NGOs (sort of)

The second most frequent category - “social rights organizations” - corresponds to what most people call non-governmental organizations. Except that this is Brazil, where NGOs are not really NGOs: they receive a lot of funding from the state and they perform all kinds of activities on behalf of the state. Which explains why CNPJs belonging to NGOs (well, “NGOs”) have appeared over 90 thousand times in the Diário since 2002.

I looked into some of the cases. There are NGOs receiving state funding to provide healthcare in remote areas; to provide computer classes to kids in poor neighborhoods; to fight for the interests of disabled people; just about anything you can think of. Brazil has turned NGOs into government agencies. Our non-governmental organizations are independent from the Brazilian government in the same way that the Hitlerjugend was independent from the Reich.

Needless to say, NGOs are often at the center of corruption scandals. People come up with a pretty name and a CNPJ, apply for funding to do some social work, and then just pocket the money.


As if government-funded NGOs weren’t embarrassing enough, a good chunk of the performing arts in Brazil is also government-funded. Hence the third and fifth most frequent categories here: “performing arts, concerts, dancing” and “movies, videos, and TV shows”.

You don’t have to be a good producer. As long as you are in the good graces of the government, you can apply for funding from the Ministry of Culture and use it to do your play/movie/concert. That’s why CNPJs belonging to those two categories have appeared in the Diário over 90 thousand times since 2002. I did the math and that’s about fifteen CNPJs a day receiving funding to have people dancing or acting or whatnot. Mind you, that’s just the federal funding.

And that’s just the official funding. Sometimes taxpayers end up funding “cultural” productions through less-than-transparent means. For instance, back in 2009 there was a biopic about Lula da Silva, who happened to be the president at the time. Well, turns out that 12 of the 17 companies that invested in the production of the movie had received hundreds of millions of dollars in government contracts. Neat, right?

Every now and then a good movie or play comes out of it. If you haven’t seen Tropa de Elite yet you should stop whatever you’re doing and go watch it (it’s on Netflix). But nearly all productions are flops. For every Tropa de Elite there are thirty Lula, Filho do Brasil.

If you want to have a taste of how bad most productions are, here is a teaser of Xuxa e os Duendes, which pocketed a few million bucks of taxpayers’ money. Trust me, you don’t need to understand Portuguese to assess the merits of the thing:

Meanwhile, we have over 40 thousand homicides a year, only a tiny fraction of which get solved. But what do I know, maybe Xuxa e os duendes is the sort of thing Albert Hirschman had in mind when he talked about backward and forward linkages.

I leave the analysis of the other categories as an exercise for the reader. If you want to see the full results, it’s here.

to do

It would be interesting to see how these counts vary by state or municipality; how they correlate with other measures of statism; how they change over the years; and so on.

That’s it. Remember, folks: these are the people in charge of public policy.

word embeddings for bureaucratese

You can find pre-trained word embeddings for hundreds of different languages - FastText alone has pre-trained embeddings for 157 languages (see here). But a single language can come in multiple “flavors”. I’m not talking about dialects, but about the different vocabulary and writing styles you find in news articles vs social media vs academic papers, etc. Most word embeddings come from a limited number of sources, with Wikipedia and news articles being the most common ones. If you have, say, social media texts, using word embeddings trained on Wikipedia entries may not yield good results.

So I decided to train my own Brazilian Portuguese word embeddings on the source that interests me the most: government publications. Decrees, invitations for bids, contracts, appointments, all that mind-numbingly boring stuff that makes up the day-to-day life of the public sector. Those embeddings might help me in future text-related tasks, like classifying government decrees and identifying certain types of contracts. I imagine it could be useful for other folks working with Brazilian government publications, so here’s how I did that.

I started by scraping the official bulletin where all the acts of the Brazilian government are published: the Diário Oficial da União. To give you an idea of how much text that is, the Diário’s 2020-07-06 issue has a total of 344 pages - with a tiny font and single spacing. (The Brazilian state is humongous and the size of the Diário reflects that.) The Diário is available online going as far back as 2002-01-01 and I scraped all of it. That amounted to about 8GB of zip files. Here is how to scrape it yourself (I used Python for everything):

import os
import requests
from bs4 import BeautifulSoup

# month names in Portuguese (the archive's URLs may use ascii spellings; adjust if needed)
months = [
    'janeiro', 'fevereiro', 'marco', 'abril', 'maio', 'junho',
    'julho', 'agosto', 'setembro', 'outubro', 'novembro', 'dezembro'
]

# base URL of the Diário's online archive (fill in the real address)
base_url = 'https://www.example.gov.br/diario'

# path to the folder that will store the zip files
basepath = '/path/to/diario/' # create the folder first

# loop through years and months
for year in range(2002, 2021): # change end year if you're in the future
    for month in months:
        print(year, month)
        url = '{}/{}/{}'.format(base_url, str(year), month)
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        tags = soup.find_all('a', class_ = 'link-arquivo')
        urls = [base_url + e['href'] for e in tags]
        fnames = [e.text for e in tags]
        for url, fname in zip(urls, fnames):
            if not os.path.isfile(basepath + fname):
                r = requests.get(url)
                with open(basepath + fname, mode = 'wb') as f:
                    f.write(r.content)
After the scraping is done you can unzip each of those 400+ files manually or you can automate the job:

import os
import zipfile

ipath = '/path/to/diario/'
opath = '/path/to/diario_xml/' # create folder first
for fname in os.listdir(ipath):
    if '.zip' in fname:
        year = fname[5:9]
        month = fname[3:5]
        section = fname[2:3]
        print(year, month, section, fname)
        destination = opath + year + '/' + month + '/' + section + '/'
        os.makedirs(destination, exist_ok = True)
        try:
            with zipfile.ZipFile(ipath + fname) as zip_ref:
                zip_ref.extractall(destination)
        except zipfile.BadZipFile:
            print('error; moving on') # some zip files are corrupted
This won’t give you all the text in the Diário Oficial da União since 2002-01-01. Some zip files are corrupted and most issues are incomplete. For 2016, for instance, only the May issues are available. And for all years except 2019 and 2020 one of the sections (section 3) is missing entirely (the Diário is divided into three sections - 1, 2, and 3). Also, after you unzip the files you find out that in many cases the text is not in XML but in JPEG format. I wasn’t in the mood to do OCR so I just ignored the JPEG files.

If you want to get in touch with the Diário’s publisher to discuss those problems be my guest. Here I don’t care much about those problems because all I need to train my word embeddings is a ton of data, not all of the data. And with the XML files that I got I have over 4 million government acts, which is probably way more than I need here.

After unzipping everything I trained my word embeddings. I chose to go with gensim’s implementation of word2vec. The beauty of gensim’s implementation is that you can stream the texts one by one straight from disk, without having to keep everything in memory. Now, that’s a little tricky to accomplish. Gensim’s documentation says that instead of a list of documents you can use a generator, but I found that not to be the case. I got this error: TypeError: You can't pass a generator as the sentences argument. Try an iterator. But I googled around and found a nifty workaround that tricks gensim into using a generator by wrapping it inside an iterator. So here I have a generator (yield_docs) that yields one document at a time and then I wrap it inside an iterator (SentencesIterator) so that gensim won’t complain.

About the documents, I have some 4.2 million XML files in total. In theory all these XML files should be easily parsable - they have tags with metadata, main content, etc. But in reality many are invalid. They have unclosed quotation marks and other problems that trip BeautifulSoup’s parser. So I ignored all the metadata and just focused on the stuff inside the <Texto> (text) tags, which is always a collection of <p> tags. Now, different paragraphs of the same publication can talk about entirely different issues, so instead of treating each publication (i.e., each XML file) as a document I’m treating each <p> content as a document. That should yield more coherent word associations. So while I have 4.2 million XML files, in the end I have 72 million documents, one corresponding to each <p> tag. That’s… a lot of text.

Back to word2vec. I don’t really know the ideal number of dimensions here. I found a nice paper that provides a way to estimate the ideal number of dimensions for any dimensionality reduction algorithm. But it’s too computationally expensive: you need to create a graph where each unique token is a node and the co-occurrences are represented by edges. I tried it but the thing got impossibly slow at around 200k nodes - and I have over 1M unique tokens. By my estimates it would take about half a year for the nodes to be created, and then I would need to find the graph’s maximum clique, which is also computationally expensive. So… no. If I had a specific text classification task in mind I would just try different numbers of dimensions and see what works best, but that’s not what I’m doing right now. So instead of relying on any theoretical or empirical approaches I just went with 300 dimensions because that’s a nice round number and I’ve seen it used in other word embeddings.

I’m discarding all words that appear in fewer than 1000 paragraphs (probably too rare to matter) and I’m using a window of 10 (the maximum distance between the current and the predicted word in a sentence).

Here’s the code:

import os
from bs4 import BeautifulSoup
from string import punctuation
from gensim.models import Word2Vec

def tokenize(raw_text):
    'Hey, dogs are awesome!' -> ['hey', 'dogs', 'are', 'awesome']

    using `re` would probably make it run faster but I got lazy

    # lowercase everything
    text = raw_text.lower()

    # remove punctuation
    for p in punctuation:
        text = text.replace(p, ' ')

    # remove extra whitespaces
    while '  ' in text:
        text = text.replace('  ', ' ')

    # tokenize
    tokens = text.strip().split()

    # remove digits
    tokens = [e for e in tokens if not e.isdigit()]

    return tokens

def yield_docs():
    crawl XML files, split each one in paragraphs
    and yield one tokenized paragraph at a time
    n = 0
    path = '/path/to/diario_xml/'
    for root, dirpath, fnames in os.walk(path):
        if not dirpath:
            for fname in fnames:
                if '.xml' in fname:
                    filepath = root + '/' + fname
                    with open(filepath, mode = 'r') as f:
                        content =
                    soup = BeautifulSoup(content, features = 'lxml')
                    paragraphs = soup.find_all('p')
                    for p in paragraphs:
                        n += 1
                        tokens = tokenize(p.text)
                        if len(tokens) > 1:
                            yield tokens

class SentencesIterator():
    this tricks gensim into using a generator,
    so that I can stream the documents from disk
    and not run out of memory; I stole this code
    from here:
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        # reset the generator
        self.generator = self.generator_function()
        return self

    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        return result

# train word2vec
model = Word2Vec(
    SentencesIterator(yield_docs),
    size = 300,
    window = 10,
    min_count = 1000,
    workers = 6
)

# save to disk
model.save('word2vec.model')

And voilà, we have our word embeddings. We have a total of 27198 unique tokens (remember, we ignored any tokens that appeared in fewer than 1000 paragraphs) and 300 dimensions, so our word embeddings are a 27198x300 matrix. If you’re not familiar with word2vec Andrew Ng explains it here. The TL;DR is that word2vec’s output is a matrix where each unique token is represented as a vector - in our case, a 300-dimensional vector. That allows us to do a bunch of interesting stuff with that vocabulary - for instance, we can compute the cosine similarity between any two words to see how related they are. In gensim there is a neat method for that. For instance, suppose we want to find the words most related to “fraude” (fraud):

model.wv.most_similar(positive = ['fraude'])
>>> [('fraudes', 0.5694327354431152),
>>> ('conluio', 0.5639076232910156),
>>> ('superfaturamento', 0.5263874530792236),
>>> ('irregularidade', 0.4860353469848633),
>>> ('dolo', 0.47721606492996216),
>>> ('falsidade', 0.47426754236221313),
>>> ('suspeita', 0.47147220373153687),
>>> ('favorecimento', 0.4686395227909088),
>>> ('ilícito', 0.4681907892227173),
>>> ('falha', 0.4664713442325592)]

We can see that bid rigging (“conluio”) and overpricing (“superfaturamento”) are the two most fraud-related words in government publications (“fraudes” is just the plural form of “fraude”). Kinda cool to see it. You can also cluster the word embeddings to find groups of inter-related words; use t-SNE to reduce dimensionality to 2 so you can plot the embeddings on an XY plot; and try a bunch of other fun ideas.
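For instance, here’s a minimal t-SNE sketch (the `embed_2d` helper is mine, not part of gensim; it works on any matrix of word vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(vectors, random_state = 42):
    # reduce word vectors (n_words x n_dims) to 2-D
    # coordinates that you can put on an XY scatter plot
    tsne = TSNE(n_components = 2, init = 'random',
                random_state = random_state)
    return tsne.fit_transform(np.asarray(vectors))

# usage, assuming the trained model from above:
# words = model.wv.index2word[:500]
# coords = embed_2d([model.wv[w] for w in words])
```

Then it’s just a matter of scattering `coords` with matplotlib and annotating each point with its word.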

Here I trained the word embeddings from scratch but you could also take pre-trained Brazilian Portuguese embeddings and use the Diário to fine-tune them. You could also tweak the parameters, changing the window (10 here) and the number of dimensions (300). Whatever works best for the task you have at hand.

That’s all for today! And remember, bureaucratese is bad writing - don’t spend too long reading those texts, lest you start emulating them. The best antidote to bureaucratese (or to any bad writing) that I know is William Zinsser.

how much is your apartment worth? (part 2)

So, Brazilian banks are using predictive models to do property valuation but they are doing it wrong. It’s time for us data folks to step in and cause some disruption.

our hypothetical property

Let’s assume that we want to find the market value of an apartment in Brasília. It’s 120m2, it’s in the Noroeste sector (Brasília is a planned city, so it has sectors instead of neighborhoods), it has three bedrooms, three bathrooms, and two parking spots. The condo fee is R$ 580/month and the property tax is R$ 745/month.

getting data

It’s 2019, so there’s no need to go knocking on doors asking people how much they paid for their apartments, how many bedrooms they have, etc. We can simply scrape online listing sites like ZAP, wimoveis, or Viva Real. Any of the three would do (and scraping all three of them would be best). Here I’ll scrape wimoveis for the simple reason that it is the easiest one to scrape.

Granted, that just gives us asking prices, not transaction prices. But the alternative is to go knocking on doors asking people about their apartments. (Unless of course you are a bank and you have tons of mortgage data.)

Here’s the (Python) code I wrote to scrape the result pages:

import time
import requests

destination = '/Volumes/UNTITLED/wimoveis/paginas/'
base_url = ''
num_pages = 1557 # number of results pages
for i in range(1, num_pages + 1):
    print('page', i)
    query_url = base_url + 'apartamentos-venda-distrito-federal-goias-pagina-{}.html'.format(i)
    response = requests.get(query_url)
    if response.status_code == 200:
        # save source code of the page
        with open(destination + 'pagina_{}.html'.format(i), mode = 'w') as f:
            f.write(response.text)
    time.sleep(1) # pause between requests, to go easy on the server

Every page of results contains up to 20 listings. But it only has summary information for each listing. The full data is in each listing’s own URL. So we need to get the URL of every listing from every page. I use BeautifulSoup for that:

import os
import pandas as pd
from bs4 import BeautifulSoup

hrefs = []
path = '/Volumes/UNTITLED/wimoveis/paginas/'
for fname in os.listdir(path):
    if ('.html' in fname) and ('._' not in fname):
        with open(path + fname, mode = 'r') as f:
            html = f.read()
            soup = BeautifulSoup(html, 'lxml')
            h4 = soup.find_all('h4', class_ = 'aviso-data-title')
            href = [e.find('a')['href'] for e in h4]
            hrefs += href

df = pd.DataFrame(hrefs)
df.to_csv('hrefs.csv', index = False)

Now we’re finally ready to scrape the listings themselves, with all the data we need (price, m2, pictures, etc).

import os
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

basepath = '/Volumes/UNTITLED/wimoveis/anuncios/'
hrefs = pd.read_csv('hrefs.csv') # get URLs
hrefs = set(hrefs['href']) # remove duplicate URLs
for i, href in enumerate(hrefs):

    # get ID of the listing
    id_anuncio = re.findall(r'[0-9]{1,20}\.html', href)[0].replace('.html', '')

    # if listing has been downloaded before, ignore
    path = basepath + id_anuncio + '/'
    if os.path.exists(path):
        continue

    # get the source code of the listing;
    # doesn't always work on the first try, so
    # wait for 60s and try again if necessary;
    # looks like this risks infinite loops, but
    # somehow that didn't happen
    url = '' + href    
    while True:
        try:
            response = requests.get(url)
            break
        except requests.exceptions.RequestException:
            print('error; waiting')
            time.sleep(60)

    # if it worked, move on
    if response.status_code == 200:
        print(i, path)
        os.mkdir(path) # create destination directory
        html = response.text # get source code

        # save source code to file
        with open(path + 'anuncio_' + str(i) + '.html', mode = 'w') as f:
            f.write(html)

        # now the time-consuming part: getting the
        # pictures of the listing
        pic_path = path + 'pics/'
        os.mkdir(pic_path) # create destination directory

        # find URLs of the pictures
        soup = BeautifulSoup(html, 'lxml')
        figures = soup.find_all('figure', class_ = 'slide-content')
        links = [e.find('img')['data-flickity-lazyload'] for e in figures]

        # try downloading each picture
        for n, link in enumerate(links):
            while True:
                try:
                    response = requests.get(link, stream = True)
                    break
                except requests.exceptions.RequestException:
                    print('conn error; waiting')
                    time.sleep(60)

            # if it worked, save picture to file
            if response.status_code == 200:
                with open(pic_path + str(n) + '.jpg', mode = 'wb') as f:
                    for chunk in response:
                        f.write(chunk)

This will take a couple of days to run, because of all the pictures you’re downloading.

In the end we’ll have over 15k samples. That’s up from the current 25-250 samples that real estate appraisers are using.

parsing data

Ok, what we have now is a huge mess of HTML and JPG files. The data we need is all buried in those files. We need to extract it.

For now I’ll ignore the JPG files and only use the HTML files.

import os
import pandas as pd
from bs4 import BeautifulSoup

dados = []
basepath = '/Volumes/UNTITLED/wimoveis/anuncios/'
for i, anuncio_id in enumerate(os.listdir(basepath)):

    # Dropbox creates an annoying hidden folder,
    # let's ignore it
    if '.' in anuncio_id:
        continue

    anuncio_path = basepath + anuncio_id + '/'
    for fname in os.listdir(anuncio_path):
        if '.html' in fname:
            print(i, fname)

            # empty dict to store the listing data
            dados_anuncio = {}

            # read source code of the listing
            with open(anuncio_path + fname, mode = 'r') as f:
                source = f.read()

            # soupify source code
            soup = BeautifulSoup(source, 'lxml')

            # get ID of the listing
            dados_anuncio['anuncio_id'] = anuncio_id

            # get title of the listing
            title = soup.find('h2', class_ = 'title-type-sup')
            if title:
                title = title.text.strip()
            title2 = None
            titles2 = soup.find_all('div', class_ = 'section-title')
            if len(titles2) > 1:
                title2 = titles2[1].text.strip()
            dados_anuncio['titulo'] = title
            dados_anuncio['titulo_compl'] = title2

            # get location of the property
            local = soup.find('h2', class_ = 'title-location')
            if local:
                local = local.text.strip()
            local2 = soup.find('div', class_ = 'section-location')
            if local2:
                local2 = local2.text.strip()
            dados_anuncio['local'] = local
            dados_anuncio['local_compl'] = local2

            # get asking price (and rent price, if available)
            price_block = soup.find('div', class_ = 'block-price-container')
            price_list = price_block.find_all('div', class_ = 'block-price block-row')
            for e in price_list:
                operation = e.find('div', class_ = 'price-operation').text.strip()
                price = e.find('div', class_ = 'price-items').text.strip()
                dados_anuncio[operation] = price

            # get condo fee and property tax
            expense_list = price_block.find_all('div', class_ = 'block-expensas block-row')
            for e in expense_list:
                full_text = e.text.strip()
                idx = full_text.index('R$')
                expense = full_text[:idx]
                price = full_text[idx:]
                dados_anuncio[expense] = price

            # get ID of the seller
            anunciante = soup.find('h3', class_ = 'publisher-subtitle').text.strip()
            dados_anuncio['anunciante'] = anunciante

            # get text description of the property
            descricao = soup.find('div', {'id': 'verDatosDescripcion'}).text.strip()
            dados_anuncio['descricao'] = descricao

            # get structured features of the property
            # (those that have accompanying icons)
            features_block = soup.find('ul', class_ = 'section-icon-features')
            lis = features_block.find_all('li', class_ = 'icon-feature')
            for e in lis:
                label = e.find('span').text.strip()
                value = e.find('b').text.strip()
                dados_anuncio[label] = value

            # get other features of the property
            # (those that do not have accompanying icons)
            areas = soup.find_all('ul', class_ = 'section-bullets')
            for area in areas:
                lis = area.find_all('li')
                for e in lis:
                    if e.find('b'):
                        area_feature_label = e.find('h4').text.strip()
                        area_feature_value = e.find('b').text.strip()
                        dados_anuncio[area_feature_label] = area_feature_value
                    else:
                        area_feature = e.string
                        dados_anuncio[area_feature] = '1'

            # clean some garbage from the data
            for key in dados_anuncio.keys():
                v = dados_anuncio[key]
                if v:
                    v = v.replace('\n', ' ')
                    v = v.replace('\r', ' ')
                    v = v.replace('\t', ' ')
                    while '  ' in v:
                        v = v.replace('  ', ' ')
                    dados_anuncio[key] = v.strip()

            # append row of data
            dados.append(dados_anuncio)

# save everything as a CSV file
df = pd.DataFrame(dados)
df.to_csv('listings_data.csv', index = False)

Hooray, now we’ve put all the (non-image) data in a CSV file with proper column names and everything.

throwing data away

Now that we have all that data it’s time to throw some of it away.

You see, people are lazy. When they list their properties on wimoveis they don’t bother to tick all the boxes - “pool”, “playground”, “A/C”, etc. Whatever they consider relevant they’ll write down in the text field (often with lots of adjectives and exclamation marks). The result is that our CSV file is mostly empty: most of its cells are missing data. This varies according to the feature we’re talking about. But the vast majority of the features have simply too many missing data points to be useful. So let’s clean up a bit.
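Before deciding what to drop it’s worth quantifying the damage. A quick sketch (the `missing_report` helper is mine; point it at the CSV file we created earlier):

```python
import pandas as pd

def missing_report(df):
    # fraction of missing cells per column, worst offenders first
    return df.isnull().mean().sort_values(ascending = False)

# usage:
# df = pd.read_csv('listings_data.csv')
# print(missing_report(df).head(20))
```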

import numpy as np
import pandas as pd

# column names to load (and their new names)
colnames = {
    'anuncio_id': 'anuncio_id',
    'Venda': 'preco_total',
    'titulo': 'titulo',
    'titulo_compl': 'titulo_compl',
    'local': 'local',
    'local_compl': 'local_compl',
    'Área útil': 'area_util',
    'Vagas': 'vagas',
    'descricao': 'descricao',
    'Andares': 'andares',
    'Banheiros': 'banheiros',
    'Brinquedoteca': 'brinquedoteca',
    'Churrasqueira': 'churrasqueira',
    'Condomínio ': 'condominio',
    'Elevador': 'elevador',
    'IPTU ': 'iptu',
    'Piscina': 'piscina',
    'Circuito de TV': 'tv',
    'Fitness/Sala de Ginástica': 'academia',
    'Idade do imóvel': 'idade',
    'Playground': 'playground',
    'Portaria 24 horas': 'portaria',
    'Quartos': 'quartos',
    'Salão de Jogos': 'jogos',
    'Salão de festas': 'festas',
    'Sauna': 'sauna',
    'Suítes': 'suites',
}

# load data
df = pd.read_csv('listings_data.csv', usecols = colnames)

# rename columns
old_names = list(df.columns)
new_names = [colnames[k] for k in old_names]
df.columns = new_names

# merge location columns
df['local'] = df['local'].fillna('')
df['local_compl'] = df['local_compl'].fillna('')
df['local'] = df['local'] + ' ' + df['local_compl']
del df['local_compl']

# clean up location
def fix_local(v):
    v_new = v.replace('\n', ' ')
    v_new = v_new.replace('\r', ' ')
    v_new = v_new.replace('\t', ' ')
    while '  ' in v_new:
        v_new = v_new.replace('  ', ' ')
    v_new = v_new.strip()
    return v_new

df['local'] = df['local'].map(lambda x: fix_local(x))

# drop sample if no location
df = df[df['local'] != '']

# drop sample if no price or m2
df = df.dropna(axis = 0, how = 'any', subset = ['preco_total', 'area_util'])

# merge title columns
df['titulo'] = df['titulo'].fillna('')
df['titulo_compl'] = df['titulo_compl'].fillna('')
df['titulo'] = df['titulo'] + ' ' + df['titulo_compl']
del df['titulo_compl']

# transform some float variables into int
for var in ['vagas', 'andares', 'banheiros', 'quartos', 'suites']:
    df[var] = df[var].fillna(0)
    df[var] = df[var].map(lambda x: int(x))

df['idade'] = df['idade'].map(lambda x: int(x) if str(x).isdigit() else 0)

# convert money columns from str to float
def fix_money_value(v):
    try:
        v_new = v.replace('R$', '')
        v_new = v_new.replace('.', '')
        v_new = v_new.replace(' ', '')
        v_new = float(v_new)
        return v_new
    except (AttributeError, ValueError):
        return None

for var in ['preco_total', 'iptu', 'condominio']:
    df[var] = df[var].map(lambda x: fix_money_value(x))

# convert m2 from string to float
def fix_area_value(v):
    v_new = v.replace('m2', '').replace('m²', '')
    v_new = v_new.replace('.', '')
    v_new = float(v_new)
    return v_new

df['area_util'] = df['area_util'].map(lambda x: fix_area_value(x))

# drop absurd values
df = df[df['preco_total'] > 1000]
df = df[df['area_util'] > 1]

# recode location
satelites = [
    'Núcleo Bandeirante',
    'Santa Maria',
    'São Sebastião',
    'Recanto das Emas',
    'Lago Sul',
    'Riacho Fundo',
    'Lago Norte',
    'Águas Claras',
    'Park Way',
    'Jardim Botânico',
    'Vicente Pires',
]

def get_local(v):
    splitted = v.split(',')
    setor = splitted[-2].strip()
    cidade = splitted[-1].strip()
    if cidade == 'Brasília':
        return setor
    elif cidade in satelites:
        return cidade
    else:
        return 'Goiás'

df['endereco'] = df['local'].map(lambda x: get_local(x))

# handle some features where missing actually means "doesnt have it"
for var in ['churrasqueira', 'brinquedoteca', 'tv', 'piscina', 'playground', 'sauna', 'academia', 'portaria', 'jogos', 'festas']:
    df[var] = df[var].fillna(0)

# reorder and rename columns and save to CSV
df = df[['anuncio_id', 'preco_total', 'area_util', 'endereco', 'vagas', 'banheiros', 'quartos', 'churrasqueira', 'idade', 'brinquedoteca', 'tv', 'piscina', 'playground', 'sauna', 'academia', 'portaria', 'jogos', 'festas', 'suites', 'titulo', 'local', 'descricao', 'iptu', 'condominio', 'andares']]
df.columns = ['anuncio_id', 'preco_total', 'area_util', 'local', 'vagas', 'banheiros', 'quartos', 'churrasqueira', 'idade', 'brinquedoteca', 'tv', 'piscina', 'playground', 'sauna', 'academia', 'portaria', 'jogos', 'festas', 'suites', 'titulo', 'endereco', 'descricao', 'iptu', 'condominio', 'andares']
df.to_csv('wimoveis.csv', index = False, encoding = 'utf-8')

There! Now we have a clean, usable dataset.

train the model

I tried a few different algorithms to estimate the properties’ asking prices: linear regression, SVM, random forest, boosted trees, neural networks. For each of these algorithms (except linear regression) I tweaked the corresponding parameters a bunch of times (and for neural networks I tried lots of different architectures). The clear winner was boosted trees (which won’t be so surprising to Kaggle competitors).
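That shoot-out can be sketched like this (a simplified version, not the exact grid I searched; the candidate parameters here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def compare_models(X, y, cv = 10):
    # cross-validated median absolute error for each candidate algorithm
    candidates = {
        'linear': LinearRegression(),
        'svm': SVR(),
        'random forest': RandomForestRegressor(n_estimators = 100),
        'boosted trees': GradientBoostingRegressor(n_estimators = 100),
    }
    scores = {}
    for name, model in candidates.items():
        s = cross_val_score(model, X, y, cv = cv,
                            scoring = 'neg_median_absolute_error')
        scores[name] = -np.mean(s)
    return scores
```

Lower is better; run it on the `X` and `y` built below to reproduce the comparison.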

Just a quick note: we discarded lots of features in the previous step because of missing data. Here we’ll add some of them back. People don’t always tick the “barbecue” box when filling out their listings, but they usually mention it in the text field. So the code below scans the text field looking for certain words.

import math
import numpy as np
import pandas as pd
from unicodedata import normalize
from sklearn.utils import shuffle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

# set seed
random_state = 42

# load data
df = pd.read_csv('wimoveis.csv')
# keep only certain features
df = df[[
    'preco_total', # asking price
    'area_util', # m2
    'local', # location
    'iptu', # property tax
    'condominio', # condo fee
    'quartos', # num of bedrooms
    'suites', # num of suites
    'banheiros', # num of bathrooms
    'vagas', # num of parking spots 
    'titulo', # title of the listing
    'endereco', # address of the listing
    'descricao' # description of the listing
]]

# apartment we want to appraise
x_new = pd.DataFrame({
    'preco_total': [1.0], # just so we can drop it later
    'area_util': [120], 
    'local': ['Noroeste'],
    'iptu': [745],
    'condominio': [580],
    'quartos': [3],
    'suites': [3],
    'banheiros': [2],
    'vagas': [2],
    'titulo': '',
    'endereco': '',
    'descricao': '',
})

# don't worry, we'll drop this before
# training the model
df = df.append(x_new)

# impute median values for missing condo fees and property tax
for var in ['iptu', 'condominio']:
    median = df[df[var] > 0][var].median()
    df[var] = df[var].fillna(median)

# drop outliers
df = df[df['area_util'] < 50000]
df = df[df['preco_total'] < 70000000]
df = df[df['condominio'] < 1500000]
df = df[df['iptu'] < 20000]
df = df[df['vagas'] < 10]

# take the log of all quantitative variables
for var in ['preco_total', 'area_util', 'iptu', 'condominio']:
    df[var] = df[var].map(lambda x: math.log(x))

# merge text columns (title + location + address + description)
# (this will let us extract some features later)
df['fulltext'] = df['local'] + ' ' + df['titulo'] + ' ' + df['endereco'] + ' ' + df['descricao']

# is it Caldas Novas? (that's a city in the state of Goiás)
def is_caldas(s):
    if 'caldas novas' in s.lower():
        return 1
    return 0
df['caldas'] = df['fulltext'].map(lambda x: is_caldas(str(x)))

# Goiania? (that's the capital of the state of Goiás)
def is_goiania(s):
    if ('goiania' in s.lower()) or ('goiânia' in s.lower()):
        return 1
    return 0
df['goiania'] = df['fulltext'].map(lambda x: is_goiania(str(x)))

# barbecue?
def has_grill(s):
    if 'churrasqueira' in s.lower():
        return 1
    return 0
df['churrasqueira'] = df['fulltext'].map(lambda x: has_grill(str(x)))

# pool?
def has_pool(s):
    if 'piscina' in s.lower():
        return 1
    return 0
df['piscina'] = df['fulltext'].map(lambda x: has_pool(str(x)))

# playground?
def has_playground(s):
    if ('playground' in s.lower()) or ('brinquedoteca' in s.lower()):
        return 1
    return 0
df['playground'] = df['fulltext'].map(lambda x: has_playground(str(x)))

# sauna?
def has_sauna(s):
    if 'sauna' in s.lower():
        return 1
    return 0
df['sauna'] = df['fulltext'].map(lambda x: has_sauna(str(x)))

# A/C?
def has_ac(s):
    if 'ar condicionado' in s.lower():
        return 1
    return 0
df['ar'] = df['fulltext'].map(lambda x: has_ac(str(x)))

# unobstructed view?
def has_view(s):
    if 'vista livre' in s.lower():
        return 1
    return 0
df['vista'] = df['fulltext'].map(lambda x: has_view(str(x)))

# high floor?
def is_high(s):
    if 'andar alto' in s.lower():
        return 1
    return 0
df['alto'] = df['fulltext'].map(lambda x: is_high(str(x)))

# faces sunrise?
def is_sunrise(s):
    if 'nascente' in s.lower():
        return 1
    return 0
df['nascente'] = df['fulltext'].map(lambda x: is_sunrise(str(x)))

# porcelain tiles?
def porcelanato(s):
    if 'porcelanato' in s.lower():
        return 1
    return 0
df['porcelanato'] = df['fulltext'].map(lambda x: porcelanato(str(x)))

# accepts mortgage?
def mortgage_ok(s):
    if 'aceita financiamento' in s.lower():
        return 1
    return 0
df['financiamento'] = df['fulltext'].map(lambda x: mortgage_ok(str(x)))

# renovated?
def is_renovated(s):
    if ('reformado' in s.lower()) or ('reformada' in s.lower()):
        return 1
    return 0
df['reformado'] = df['fulltext'].map(lambda x: is_renovated(str(x)))

# cabinets?
def has_cabinets(s):
    if ('armarios' in s.lower()) or ('armários' in s.lower()):
        return 1
    return 0
df['armarios'] = df['fulltext'].map(lambda x: has_cabinets(str(x)))

# private garage?
def has_parking(s):
    if 'garagem' in s.lower():
        return 1
    return 0
df['garagem'] = df['fulltext'].map(lambda x: has_parking(str(x)))

# recode location
# (to standardize spellings and do some aggregation)
new_locals = {
    'Asa Sul': 'asa_sul',
    'Asa Norte': 'asa_norte',
    'Goiás': 'goias',
    'Águas Claras': 'aguas_claras',
    'Taguatinga': 'taguatinga',
    'Guará': 'guara',
    'Sudoeste': 'sudoeste',
    'Noroeste': 'noroeste',
    'Lago Norte': 'lago_norte',
    'Samambaia': 'samambaia',
    'Ceilândia': 'ceilandia',
    'Centro': 'outros', # improve this later (it has a bit of everything)
    'Setor De Industrias': 'asa_sul', # almost all of it is SIG
    'Sobradinho': 'sobradinho',
    'Núcleo Bandeirante': 'bandeirante',
    'Riacho Fundo': 'riacho',
    'Vicente Pires': 'vicente',
    'Park Sul': 'parksul',
    'Recanto das Emas': 'recanto',
    'Lago Sul': 'lago_sul',
    'Gama': 'gama',
    'Setor De Industria Graficas': 'asa_sul',
    'Setor Habitacional Jardim Botânico': 'outros',
    'Octogonal': 'octogonal',
    'Planaltina': 'planaltina',
    'Cruzeiro': 'cruzeiro',
    'Santa Maria': 'santamaria',
    'São Sebastião': 'saosebastiao',
    'Setor Da Industria E Abastecimento': 'outros',
    'Zona Industrial': 'outros',
    'Paranoá': 'paranoa',
    'Setor De Autarquias Sul': 'asa_sul',
    'Setor Comercial Sul': 'asa_sul',
    'Setor Bancario Sul': 'asa_sul',
    'Setores Complementares': 'outros',
    'Park Way': 'parkway',
    'Candangolândia': 'candangolandia',
    'Setor De Radio E Televisao Sul': 'asa_sul',
    'Taquari': 'outros',
    'Setor Hoteleiro Sul': 'asa_sul',
    'Setor de Múltiplas Atividades Sul': 'outros',
    'Setor de Armazenagem e Abastecimento Norte': 'outros',
    'Setor Hospitalar Local Sul': 'asa_sul',
    'Zona Civico-administrativa': 'asa_sul',
    'Setor de Grandes Áreas Norte': 'asa_norte',
    'Setor De Clubes Esportivos Norte': 'lago_norte',
    'Setor De Clubes Esportivos Sul': 'lago_sul',
    'Zona Rural': 'outros',
    'Setor De Diversoes Norte': 'asa_norte',
    'Superquadra Sudoeste': 'sudoeste',
    'Setor de Mansões Dom Bosco': 'outros',
    'Setor Bancario Norte': 'asa_norte',
    'Setor Comercial Norte': 'asa_norte',
    'Setor De Oficinas Norte': 'asa_norte',
    'Setor Hoteleiro Norte': 'asa_norte',
    'Setor de Hotéis e Turismo Norte': 'asa_norte',
    'Quadra Mista Sudoeste': 'sudoeste',
    'Superquadra Noroeste': 'noroeste',
    'Setor Habitacional Jardins Mangueiral': 'outros',
    'Vila Planalto': 'outros',
    'Alphaville': 'outros',
    'Granja Do Torto': 'outros',
    'Comércio Local Noroeste': 'noroeste',
    'Superquadra Sul': 'asa_sul',
    'Setor Terminal Norte': 'asa_norte',
    'Setor Terminal Sul': 'asa_sul',
    'Centro Comercial Noroeste': 'noroeste',
    'Mansões do Lago': 'outros',
    'Setor de Garagens e Concessionárias de Veículos': 'outros',
    'Setor Comercial Local Residencial Norte': 'asa_norte',
    'Centro de Atividades': 'lago_norte',
    'Setor de Habitações Individuais Norte': 'lago_norte',
    'Superquadra Norte': 'asa_norte',
    'Centro Comercial Sudoeste': 'sudoeste',
}
df['local'] = df['local'].map(lambda x: new_locals[x])

# dummify location
locais = pd.get_dummies(df['local'], prefix = 'local', drop_first = True)
for col in locais.columns:
    df[col] = locais[col]
del df['local']

# if city is Brasília, get more location details
def detail_location(s):
    locations = {

        # industras graficas
        ' sig ': 'sig',
        'graficas': 'sig',
        'setor de industrias': 'sig',

        # industria e abastecimento
        ' sia ': 'sia',
        'setor de industria e abastecimento': 'sia',

        # armazenagem e abastecimento norte
        ' saan': 'saan',
        'setor de armazenagem e abastecimento norte': 'saan',

        # armazenagem e abastecimento sul
        ' saas': 'saas',
        'setor de armazenagem e abastecimento sul': 'saas',

        # autarquias norte
        ' san ': 'san',
        'setor de autarquias norte': 'san',

        # autarquias sul
        ' sas ': 'sas',
        'setor de autarquias sul': 'sas',

        # comercial norte
        ' cln ': 'cln',
        ' scln ': 'cln',
        'comercial local norte': 'cln',
        'comercio local norte': 'cln',
        ' scn ': 'scn',
        'setor comercial norte': 'scn',
        ' sdn ': 'sdn',
        'setor de diversoes norte': 'sdn',

        # comercial sul
        ' cls ': 'cls',
        ' scls ': 'cls',
        'comercial local sul': 'cls',
        'comercio local sul': 'cls',
        ' scs ': 'scs',
        'setor comercial sul': 'scs',
        ' sds ': 'sds',
        'setor de diversoes sul': 'sds',

        # comercial noroeste
        ' clnw ': 'clnw',

        # comercial sudoeste
        ' clsw ': 'clsw',

        # bancario norte
        ' sbn ': 'sbn',
        'setor bancario norte': 'sbn',

        # bancario sul
        ' sbs ': 'sbs',
        'setor bancario sul': 'sbs',

        # grandes areas norte
        ' sgan ': 'sgan',
        'setor de grandes areas norte': 'sgan',

        # grandes areas sul
        ' sgas ': 'sgas',
        'setor de grandes areas sul': 'sgas',

        # hoteleiro norte
        ' shn ': 'shn',
        ' shtn ': 'shn',
        'setor hoteleiro norte': 'shn',
        'setor de hoteis e turismo norte': 'shn',

        # hoteleiro sul
        ' shs ': 'shs',
        ' shts ': 'shs',
        'setor hoteleiro sul': 'shs',
        'setor de hoteis e turismo sul': 'shs',

        # hospitalar norte
        ' shln ': 'shln',
        'setor hospitalar local norte': 'shln',

        # hospitalar sul
        ' shls ': 'shls',
        'setor hospitalar local sul': 'shls',

        # radio e tv norte
        ' srtvn ': 'srtvn',
        'setor de radio norte': 'srtvn',

        # radio e tv sul
        ' srtvs ': 'srtvs',
        'setor de radio sul': 'srtvs',

        # oficinas
        ' sof ': 'sof',
        'setor de oficina': 'sof',

        # superquadras
        'sqn': 'sqn',
        'super quadra norte': 'sqn',
        'superquadra norte': 'sqn',
        'sqs': 'sqs',
        'super quadra sul': 'sqs',
        'superquadra sul': 'sqs',
        'sqsw': 'sqsw',
        'super quadra sudoeste': 'sqsw',
        'superquadra sudoeste': 'sqsw',
        'sqnw': 'sqnw',
        'super quadra noroeste': 'sqnw',
        'superquadra noroeste': 'sqnw',

        'qmsw': 'qmsw',
        'quadra mista sudoeste': 'qmsw',
        'qrsw': 'qrsw',
    }
    s = normalize('NFKD', s).encode('ASCII', 'ignore').decode('ASCII')
    s = s.lower()
    for key in locations.keys():
        if key in s:
            return locations[key]
    return 'outros'

df['local_det'] = df['fulltext'].map(lambda x: detail_location(str(x)))

# dummify location details (for when city=Brasília)
locais = pd.get_dummies(df['local_det'], prefix = 'local_det', drop_first = True)
for col in locais.columns:
    df[col] = locais[col]
del df['local_det']

# drop text columns
del df['fulltext']
del df['titulo']
del df['endereco']
del df['descricao']

# drop row that contains the property we want to appraise
x_new = df[df['preco_total'] == 0]
del x_new['preco_total']
x_new = x_new.values
df = df[df['preco_total'] > 0]

# split X and Y
y = df['preco_total'].values
del df['preco_total']
X = df.values

# shuffle sample order
X, y = shuffle(X, y, random_state = random_state)

# instantiate model
model = GradientBoostingRegressor(loss = 'quantile', alpha = 0.5, n_estimators = 1000, random_state = random_state)

# train model
yhat = cross_val_predict(model, X, y, cv = 10)

# put estimated prices back in R$
yhat_reais = np.exp(yhat)

# put observed prices back in R$
y_reais = np.exp(y)

# compute errors
erros = yhat_reais - y_reais

# compute median absolute error
erro_absoluto_mediano = np.median(np.absolute(erros))
print('erro absoluto mediano:', erro_absoluto_mediano)

# compute proportional error (error / asking price)
erros_relativos = erros / y_reais
erro_relativo_mediano = np.median(np.absolute(erros_relativos))
erro_relativo_medio = np.mean(np.absolute(erros_relativos))
print('erro relativo mediano:', erro_relativo_mediano)
print('erro relativo medio:', erro_relativo_medio)

This gives me a median absolute error of R$ 46k. In proportional terms (i.e., error / asking price) we have a median absolute error of 10% and a mean absolute error of 23%. That is in line with previous work (see here, where the mean absolute error is 25%-30%, and here, where the mean absolute error is 22%).

We’re not capturing everything here. Say, maybe the property is next door to a police station or to a church or to a loud bar. Maybe there was a murder in the premises. Etc etc. My point is not that these estimates should be final. My point is simply that these estimates are probably closer to the truth than the ones being produced today by professional appraisers all over Brazil.


Alright then, time to appraise our Noroeste apartment. Just append the following lines to the previous code block and run it.

# train models for lower bound, point estimate, and upper bound
model_lower = GradientBoostingRegressor(loss = 'quantile', alpha = 0.25, n_estimators = 1000, random_state = random_state)
model_mid = GradientBoostingRegressor(loss = 'quantile', alpha = 0.5, n_estimators = 1000, random_state = random_state)
model_upper = GradientBoostingRegressor(loss = 'quantile', alpha = 0.75, n_estimators = 1000, random_state = random_state)
model_lower.fit(X, y)
model_mid.fit(X, y)
model_upper.fit(X, y)

# appraise
yhat_lower = model_lower.predict(x_new)
yhat_mid = model_mid.predict(x_new)
yhat_upper = model_upper.predict(x_new)
print(np.exp(yhat_lower), np.exp(yhat_mid), np.exp(yhat_upper))

And voilà, we have our point estimate: R$ 978k (that’s about US$ 254k).

We also have a prediction interval with lower and upper bounds: [788k, 1060k]. To produce this interval I used something similar to quantile regression. The lower bound is an estimate of the 25th percentile of the distribution. The upper bound is an estimate of the 75th percentile. The point estimate is an estimate of the 50th percentile (i.e., the median). As we have three different models, the lower and upper bounds are not centered around the point estimate (we actually have three point estimates). More details here.


Here I’m scanning the property descriptions for words like “A/C”, “barbecue”, etc, and featurizing them as dummies. But you can use the texts themselves in the model. Just insert the following code between the line where you shuffle the samples and the line where you instantiate the model:

# little function to make corpus
def make_corpus(df):
    for row in df.iterrows():
        yield str(row[1]['fulltext'])

# vectorize corpus and create TFIDF matrix
vectorizer = TfidfVectorizer(strip_accents = 'unicode', lowercase = True, ngram_range = (1, 2))
tfidf = vectorizer.fit_transform(make_corpus(df))

# reduce dimensions (i.e., extract topics)
svd = TruncatedSVD(n_components = 400)
topics = svd.fit_transform(tfidf)

# add topics to the dataframe
for n in range(0, topics.shape[1]):
    df[str(n)] = topics[:,n]

# rescale topics to make them compatible with the m2 scale
scaler = MinMaxScaler(feature_range = (df['area_util'].min(), df['area_util'].max()))
for col in df.columns:
    if col.isdigit():
        df[col] = scaler.fit_transform(df[col].values.reshape(-1, 1))

The TFIDF matrix is too big - we end up with more columns than samples. So we don’t use it directly: we use LSA (truncated SVD) to reduce the TFIDF matrix to a documents × topics matrix of 400 topics.

This improves the performance of the model a bit. But it’s ultimately nonsensical: when you’re appraising a new property you could keep tweaking the text until you get the price you want. So I did this just to inspect which topic vectors would be more relevant (see here), and then which words had more weight in these topics. This helped me decide which words to look for in the text fields (sauna, pool, etc).


I’m still figuring out the best way to use the pictures. I tried using the metadata first: number of pictures, height and width of the pictures, etc. That didn’t improve the model. (I know, shocking. But I like to try the simple things first.)

I also checked whether the dominant colors help us predict the price. To do that I clustered the pixels of every picture of the property. Each pixel is defined by three values - R, G, B - each of which varies from 0 to 255 and represents the intensity of one of the three primary colors (red, green, and blue). So the pixels all exist in the same three-dimensional space and therefore we can cluster them. The centroid of each cluster is a dominant color.

Ideally we’d use DBSCAN for this, as the clusters may have widely different sizes and shapes and we don’t even know a priori how many clusters each picture has. But DBSCAN just takes forever to run. So I used k-means instead. I used the elbow technique to find the ideal number of clusters and it turns out that for most images that number was two or three.
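For reference, the dominant-color extraction can be sketched in a few lines. This is a minimal illustration with synthetic pixels standing in for a real photo (a real image would first be reshaped from height × width × 3 to n_pixels × 3); the color values and cluster count are my assumptions, with k=3 matching what the elbow method suggested for most images here:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# synthetic "image": pixels scattered around three hypothetical colors
# (a real picture would be img.reshape(-1, 3) instead)
colors = np.array([[200, 180, 160],   # beige wall
                   [90, 60, 30],      # wooden floor
                   [250, 250, 250]])  # white ceiling
pixels = np.vstack([c + rng.normal(scale=5, size=(500, 3)) for c in colors])
pixels = np.clip(pixels, 0, 255)

# k-means with k=3; each centroid is one dominant color of the image
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pixels)
dominant_colors = km.cluster_centers_.round().astype(int)
print(dominant_colors)
```

The centroids come back as three RGB triples, which could then be fed to the price model as six or nine extra columns.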

That was a massive waste of time. K-means is faster but it still took almost a week to run. And in the end those centroids didn’t improve the model one bit.

A friend who knows a lot more about images than I do suggested that I try something along these lines. I.e., having a branched neural network where I can input both structured features (m2, location, etc) and image features. So that’s what I’m trying right now. It’s tricky though because the number of pictures varies across samples, the pictures are of different sizes, and the pictures aren’t standardized in any way (some listings have five pictures of the kitchen and none of the bedrooms, others have no pictures of the kitchen, etc).

The same friend also suggested that I use some pre-trained neural network capable of identifying objects like “window”, “A/C unit”, and so on, and then use the identified objects as features. That’s the next item on my to-do list.

All that said, the truth is that I’m not sure the images will be useful in the end. It’s like with the texts: you could keep taking new pictures of the property until you get the “right” price. I think that’s harder to do with pictures than with texts, but who knows. I need to think more about it. Suggestions are welcome.

“someone must be doing it already!”

You bet. ZAP is doing it. Which makes sense: they have tons of data, so why not use it? In fact just last month they announced the next step: they’ll start buying property themselves, picking the ones that their model suggests are underpriced.

In the US Zillow is doing it. I bet that there are plenty of similar initiatives all over the world.

So I’m not proposing anything new here. Which makes it hard to understand why the heck Brazilian banks are not doing it yet. They have billions at stake.

incentives matter

I know better than to second guess the choices of people with skin in the game. But 70% of all mortgages in Brazil are concentrated in a state-owned bank - Caixa Econômica Federal (CEF). And when the state is involved the incentives are different. It’s not about delivering results but about broadening your mandate, blame-shifting, and securing resources. (If you want to have an idea of how government works, watch HBO’s Chernobyl.)

So it’s not a market failure that we have here, but by and large a state failure. CEF’s bureaucrats do not get punished for using the wrong tool to do property valuation. In a state-owned company there is a lot of noise between your actions and your punishments/rewards. Which helps explain CEF’s long history of incompetence and corruption.

Not to mention that an entire ecosystem has evolved around the status quo. You see, to be contracted by CEF as an appraiser you need to be certified by the Federal Council of Realtors (yes, that exists). There’s an exam for that and there are schools that charge good money to help prepare you for that exam. So, there are lots of people who benefit from the current system, which makes it harder to change.

so long

I guess this is it. I hope this post has inspired you to roll up your sleeves and do your own valuations from now on.

Happy appraisals!

how much is your apartment worth?

I’m buying an apartment and in the process I’ve learned a lot about property valuation. I’ve learned that there are people who do that for a living, that there are academic researchers who specialize in the subject, and that there are even regulations and licensing requirements.

In particular, I’ve learned that people are using statistical models - like linear regression - to appraise properties. I thought that was super cool, so I got curious and read a bunch of model-based appraisal reports. And that, dear reader, was a fascinating lesson in the misuse of quantitative tools. And by the entities we’d expect to be most proficient in their use: banks.

In what follows I show how Brazilian banks are doing it today and why it’s seriously messed up. Then in part 2 (forthcoming) I’ll show a better way to do it.

(Trigger warning: if you’ve recently bought or sold property in Brazil, and if you paid those obscene appraisal fees that Brazilian banks charge, you may want to save yourself some aggravation and stop reading here.)

regression and Rio

The folks using statistical models to appraise real estate in Brazil are usually banks. If you don’t pay your mortgage the bank gets your property, so the bank needs to know beforehand whether the property is worth enough $.

In Brazil 70% of all mortgages are concentrated in one bank: Caixa Econômica Federal (CEF), which is owned by the state. You’d think that the bank with the most mortgages would be pretty good at estimating the value of real estate. You’d be wrong.

I downloaded dozens of property valuation reports by CEF. Well, they are not actually made by CEF: CEF outsources the appraisals to other companies. But CEF reviews and approves every appraisal report. And in any case the ultimate responsibility rests with CEF.

Let’s look at this report, which is pretty typical.

The property here is a plot of 41.695m2 in a small town not far from Rio de Janeiro (about 100km away). The appraiser started by gathering data on 38 other plots, all in the same small town. For each of the plots he collected four variables: area (in m2), whether the lot is paved, average family income of the area, and price (in R$) per m2. Then he dropped 13 of the 38 samples and used the remaining 25 to run a linear regression: price per m2 ~ area + paved + income. He then used the resulting model to estimate the price of the original plot. The resulting estimate was R$ 1.056.866,78, with an 80% confidence interval of [R$ 898.423,01, R$ 1.215.513,48]. The appraiser saw fit to manually adjust the estimated value to R$ 990.000,00 because, well, there’s some stuff that the model doesn’t capture.

There is a lot that’s wrong here, but the main thing is: the appraiser doesn’t test the model.

Normally, in machine learning tasks like this, we train the model using only a subset of the samples, ask the model to produce estimates for the samples that were left out, then check how close these estimates are to the actual values. Ideally we repeat this several times over, rotating which samples are left out at each iteration.

But here there is no separation between training and testing. The appraiser used 100% of the 25 samples to train the model. There were no samples left for testing the model. So we have absolutely no idea how good or bad the model is. Maybe the actual market value of the plot is indeed close to R$ 1.056.866,78. But maybe it’s R$ 2 million. Or R$ 500k. We have no idea. Since the model isn’t tested, its performance is a complete mystery.

In response to which the appraiser may direct your attention to page 11 of the report, where you see this scatter plot of observed vs estimated values:

Clearly the model has a tremendously good fit: all estimates are super close to the actual values.

Except that that’s cheating: the appraiser used the same 25 samples for both training and testing the model. You don’t know a model’s performance until it has been subjected to samples it hasn’t seen before.

Not that it would make much sense to split training and testing samples with n=25. But if n=25 the thing to do is get more samples (more on this later). A small sample size doesn’t give you a license to simply not test your model.

“but that’s just one report”

Nope, that’s how every single model-based appraisal report I downloaded does it. Every. Single. One. No exceptions. At all. Google ‘CEF laudo imóvel’ or ‘Caixa Econômica laudo imóvel’ or ‘laudo avaliação imóvel’ and check for yourself. The only appraisals that are not like that are the ones that don’t use any statistical model whatsoever.

In other words: billions of R$ in real estate transactions are based on property valuations that are completely worthless.

let’s worry about all the wrong things

If you ask the appraiser why he didn’t test the model he’ll be genuinely shocked and answer that he did test it. And he did run a bunch of tests, as it’s clear on the report: he tested whether the residuals are normally distributed, whether each coefficient is statistically significant, whether the model as a whole is statistically significant, and so on. Other reports go as far as testing for heteroskedasticity and autocorrelation.

None of which makes the slightest sense here. This is not an econometrics problem. We’re not interested in the effect of each additional m2 on the price of the property. This is a machine learning problem. We’re interested in producing price estimates as close as possible to the actual prices. We don’t care about the assumptions behind linear regression in this context.

In fact we don’t care about linear regression at all. The appraiser could have (and probably should have) used boosted trees or any other machine learning algorithm. The way to go about these things is to try different algorithms and pick the one that produced the best estimates. There is no reason to limit yourself to linear regression.
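That try-and-compare workflow can be sketched like this (synthetic data; in practice x and y would be the real features and log prices, and you'd likely try more than two candidates):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
x = rng.uniform(size=(500, 4))
# a nonlinear price surface, where boosted trees should beat a straight line
y = 10 * np.sin(np.pi * x[:, 0]) + 5 * x[:, 1] ** 2 + rng.normal(scale=0.5, size=500)

candidates = {
    'linear regression': LinearRegression(),
    'boosted trees': GradientBoostingRegressor(random_state=0),
}
results = {}
for name, model in candidates.items():
    # 5-fold cross-validated mean absolute error; lower is better
    scores = cross_val_score(model, x, y, cv=5,
                             scoring='neg_mean_absolute_error')
    results[name] = -scores.mean()
    print(name, round(results[name], 2))
```

Whichever algorithm produces the lowest cross-validated error wins; no assumption-checking rituals required.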

regulation and its unintended consequences

To be fair, this is not entirely the appraiser’s fault. It turns out that there is a set of semi-official guidelines for how to use linear regression to do property valuation. That’s regulation NBR-14653-2. It is not legally binding - you don’t go to jail or lose your license if you violate it. But it ends up being enforced anyway. CEF won’t subcontract your company to do appraisals if you don’t follow it.

Regulation NBR-14653-2 tells appraisers to check for normality of the residuals, heteroskedasticity, autocorrelation, etc. It doesn’t say a word about testing the performance of the model. It’s completely silent on the topic of training vs testing samples, cross validation, accuracy, etc. In other words, regulation NBR-14653-2 recommends an econometric approach to a machine learning problem, which is bonkers.

more wrongness (and a lesson in public policy)

Suppose for a moment that the econometric approach were the right one here. Even then the current appraisals would be worthless.

Take the report we discussed before. The sample size is 25. That’s just not good enough. “Oh, but it’s a small town, there probably aren’t that many plots for sale.” Yes, but Brazil has over five thousand small towns. Your samples don’t need to be in the same town where the property you’re appraising is. Yes, different towns will differ in GDP per capita, homicide rate, etc. But we have data on all that, so we can include those indicators in our models. And/or dummify “name of town”.

Such a small sample is particularly egregious here, since CEF has 70% of all mortgages in Brazil, so they surely have a ton of data they could have used (or shared with the company they contracted to do the job). Imagine having millions of samples and then using only 25.

(I suspect this has to do with appraisers’ lack of programming skills. They use software with graphical interfaces and probably just type every data point manually. So 1 million samples would be of no use to them. But I’m just guessing here.)

Also, the appraiser doesn’t tell us which 25 - out of the original 38 - samples he actually used in the regression. I considered trying to replicate his results but there are 5.414.950.296 possible combinations of 25 elements out of 38, so that might take a while to run.

The appraiser also doesn’t tell us why he dropped 13 of the 38 original samples. Were they outliers? Or maybe dropping them helped produce that incredible R2 of 0.98 we see on page 9…?

At times it feels like the appraiser doesn’t really understand what he is doing. Like when he reports a p-value for the dependent variable (R$ per m2). P-values attach to estimated coefficients, and only the independent variables (and the constant) have coefficients. Maybe he is confusing the constant with the dependent variable?

I also don’t know what the variable “average family income in the area” means or where it comes from. What’s “area” here? The neighborhood? The block? The zip code? What’s the source? He says it comes from “senso”, a misspelling of “censo” - census. But I have no idea which census he is talking about.

It’s also weird that he codes “is paved?” as no=1 and yes=2, instead of no=0 and yes=1.

So, just like in the other reports I read, I get the sense that the appraiser doesn’t have any quantitative training. It looks like all he can do is operate the software that produces the estimates (appraisers seem to like SisDEA). You input the data, you press this and that button, the software spits out content that mostly looks Greek to you (some of it is actual Greek), plus an estimate for the property you’re supposed to appraise. You copy and paste everything into a Word document, save it as a PDF file and email it to your client. That’s the job.

The stuff you copied and pasted contains, among other things, tests of the assumptions behind linear regression. You don’t understand any of that, but you see words that also appear on regulation NBR-14653-2 - normality, heteroskedasticity, autocorrelation -, so clearly you’re following the rules. No one can yell at you or fire you, right?

In other words, regulation substitutes for actual knowledge.

(Let this be a lesson to the perpetually outraged online mobs demanding that algorithms be regulated. They think they’ll get Andrew Ng to write the regulations. In the real world regulations are produced by a mix of bureaucrats with no skin in the game and politicians with too much skin in the game.)

“well, criticizing is easy! why don’t you do something about it instead?”

Fair enough. In part 2 I’ll show how we can improve things.