real estate appraisal in Brazil

New manuscript. Abstract:

Brazilian banks commonly use linear regression to appraise real estate: they regress price on features like area, location, etc., and use the resulting model to estimate the market value of the target property. But Brazilian banks do not test the predictive performance of those models, which for all we know are no better than random guesses. That introduces huge inefficiencies in the real estate market. Here we propose a machine learning approach to the problem. We scrape real estate data from 15 thousand online listings and use it to fit a boosted trees model. The resulting model has a median absolute error of 8.16%. We provide all data and source code.
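(If you just want the general recipe, it is easy to sketch with off-the-shelf tools. Here is a minimal sketch with scikit-learn's gradient boosting - the file name, feature columns, and model settings are made up for illustration, not the ones used in the manuscript:)

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# hypothetical listings file with hypothetical feature columns
listings = pd.read_csv('listings.csv')
X = listings[['area', 'bedrooms', 'bathrooms', 'latitude', 'longitude']]
y = listings['price']

# hold out part of the data so we can actually test predictive performance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# median absolute (percentage) error on the held-out listings
errors = np.abs(model.predict(X_test) - y_test) / y_test
print(np.median(errors) * 100)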

putting a price on tenure

New manuscript. Abstract:

Government employees in Brazil are granted tenure three years after taking their entrance exams. Firing a tenured government employee is all but impossible, so tenure is a big perquisite. But exactly how big is it? No one has ever attempted to estimate the monetary equivalent of tenure for Brazilian government workers. We do that in this paper. We use a modified version of the Sharpe ratio to estimate what the risk-adjusted salaries of government workers should be. The difference between actual salary and risk-adjusted salary gives us an estimate of how much tenure is worth for each employee. We find that the median value of tenure is R$ 4,517 for federal government employees, R$ 2,560 for state government employees, and R$ 672 for municipal government employees.

Mancur Olson and stock picking

I just finished re-reading Mancur Olson’s “The rise and decline of nations”, published in 1982. (According to my notes I had read parts of it back in grad school, for my general exams, but I have absolutely no memory of that.) Olson offers an elegant, concise explanation for why economic growth varies over time. TL;DR: parasitic coalitions - unions, subsidized industries, licensed professions, etc - multiply, causing distributive conflicts and allocative inefficiency, and those coalitions can’t be destroyed unless there is radical institutional change, like foreign occupation or totalitarianism. In a different life I would ponder the political implications of the theory. But I’ve been in a mercenary mood lately, so instead I’ve been wondering what the financial implications are for us folks trying to grow a retirement fund.

the argument

Small groups organize more easily than large groups. That’s why tariffs exist: car makers are few and each has a lot to gain from restricting competition in the car industry, whereas consumers are many and each has comparatively less to gain from promoting competition in the car industry. This argument is more fully developed in Olson’s previous book “The logic of collective action”. What Olson does in “The rise and decline of nations” is to use that logic to understand why rich countries decline.

Once enacted, laws that benefit particular groups at the expense of the rest of society - like tariff laws - are hard to eliminate. The group that benefits from the special treatment will fight for its continuation. Organizations will be created for that purpose. Ideologies will be fomented to justify the special treatment (dependency theory, for instance, is a handy justification for tariffs and other types of economic malfeasance).

Meanwhile, the rest of society will be mostly oblivious to the existence of that special treatment - as Olson reminds us, “information about collective goods is itself a collective good and accordingly there is normally little of it” (a tariff is a collective good to the protected industry). We all have better things to do with our lives; and even if we could keep up with all the rent-seeking that goes on we’d be able to do little about it, so the rational thing to do is to stay ignorant (more on this).

Hence interest groups and special treatments multiply; competition is restricted, people invest in the wrong industries, money is redistributed from the many to the few. The returns to lobbying relative to the returns of producing stuff increase (more on this). Economic growth slows down. Doing business gets costlier. As Olson puts it, “the accumulation of distributional coalitions increases the complexity of regulation”. Entropy. Sclerosis.

Until… your country gets invaded by a foreign power. Or a totalitarian regime takes over and ends freedom of association. One way or another the distributional coalitions must be obliterated - that’s what reverses economic decline. That’s the most important take-away from the book. Not that Olson is advocating foreign invasion or totalitarianism. (Though he does quote Thomas Jefferson: “the tree of liberty must be refreshed from time to time with the blood of patriots and tyrants”.) Olson is merely arguing that that’s how things work.

I won’t get into the empirical evidence here. The book is from 1982 so, unsurprisingly, Olson wasn’t too worried about identification strategy, DAGs, RCTs. If the book came out today the reception would probably be way less enthusiastic. There are some regressions here and there but the book is mostly narrative-driven. But since 1982 many people have tested Olson’s argument and a 2016 paper found that things look good:

Overall, the bulk of the evidence from over 50 separate studies favors Olson’s theory of institutional sclerosis. The overall degree of support appears to be independent of the methodological approach between econometric regression analysis on growth rates versus narrative case studies, publication in an economics or a political science journal, location of authorship from an American or European institution, or the year of publication.

can Olson help us make money?

There isn’t a whole lot of actionable knowledge in the book. We don’t have many wars anymore:

I mean, between 2004 and 2019 Iraq’s GDP grew at over twice the rate of Egypt’s or Saudi Arabia’s. And between 2002 and 2019 Afghanistan’s GDP grew at a rate 36% faster than Pakistan’s. But that’s about it. A Foreign-Occupied Countries ETF would have a dangerously small number of holdings.

And there aren’t many totalitarian regimes in place:

[chart: waves of democracy]

A Dear Leader ETF would also be super concentrated. (Not to mention that it would contain North Korea and Turkmenistan.)

In any case, GDP growth and stock market growth are different things. Over the last five years the S&P500 outgrew the US GDP by 83%. Conversely, the annualized return of the MSCI Spain was lower than 1% between 1958 and 2007 even though the annualized GDP growth was over 3.5% in that same period.

So, country-wise there isn’t a lot we can do here.

What about industry-wise?

Olson’s book is about countries - it’s in the very title. But his argument extends to industries. Olson himself says so when he talks about the work of economist Peter Murrell:

Murrell also worked out an ingenious set of tests of the hypothesis that the special-interest groups in Britain reduced the country’s rate of growth in comparison with West Germany’s. If the special-interest groups were in fact causally connected with Britain’s slower growth, Murrell reasoned, this should put old British industries at a particular disadvantage in comparison with their West German counterparts, whereas in new industries where there may not yet have been time enough for special-interest organizations to emerge in either country, British and West German performance should be more nearly comparable.

In other words, new industries will have fewer barriers to entry and other organizational rigidities. This seemingly banal observation, which Olson saw as nothing more than a means to test his theory, may actually explain why we don’t have flying cars.

Back in 1796 Edward Jenner, noting that milkmaids were often immune to smallpox, and that they often had cowpox, took aside his gardener’s eight-year-old son and inoculated the boy with cowpox. Jenner later inoculated the boy with smallpox, to see what would happen, and it turned out that the boy didn’t get sick - the modern vaccine was invented. Now imagine if something like the FDA existed back then. We probably wouldn’t have vaccines today, or ever.

It’s the same with the internet, social media, Uber, just about any new industry or technology: at first the pioneers can do whatever they want. But then parasitic coalitions form. They can form both outside the new sector and inside it. Taxi drivers will lobby against Uber. Uber will lobby for the creation of entry barriers, to avoid new competitors - incumbents love (the right type of) regulation. Both sources of pressure will reduce efficiency and slow growth.

Investment-wise, what does that mean? That means it will become a lot harder for, say, ARK funds to keep their current growth rate. ARK funds invest in disruptive technologies - everything from genomics to fintech to space exploration. They are all new industries, with few parasitic coalitions, so right now they’re booming. (Well, they may also be booming because we’re in a big bull market.) But - if Olson is right - as those industries mature they will become more regulated and grow slower. ARK will need to continually look for new industries, shedding its holdings in older industries as it moves forward.

In short: Olson doesn’t help us pick particular countries to invest in, and he doesn’t help us pick particular industries to invest in, but he helps us manage our expectations about future returns. Democratic stability means that fewer of us die in wars and revolutions, but it also means that buying index funds/ETFs may not work so well in the future. Democratic stability - and the parasitic coalitions it fosters - means that today’s 20-year-old kids may be forced to pick stocks if they want to grow a retirement fund. (Though if you’re not Jim Simons what are your chances of beating the market? Even Simons is right only 50.75% of the time. Are we all going to become quant traders? Will there be any anomalies left to exploit when that happens?)

Or maybe when things get bad enough we will see revolutions and wars and then economic growth will be restarted? Maybe it’s all cyclical? Or maybe climate change will be catastrophic enough to emulate the effects of war and revolution? Maybe we will have both catastrophic climate change and wars/revolutions?

but what if the country has the right policies?

Olson gives one policy prescription: liberalize trade and join an economic bloc. Free trade disrupts local coalitions, and joining an economic bloc increases the cost of lobbying (it’s easier for a Spanish lobby to influence the national government in Madrid than the EU government in Brussels). But policy is endogenous: the decision to liberalize trade or join an economic bloc will be fought by the very parasitic coalitions we would like to disrupt. And if somehow the correct policy is chosen, it is always reversible - as the example of Brexit makes clear. Also, not all economic blocs are market-friendly. Mercosur, for instance, has failed to liberalize trade; instead, it subjects Argentine consumers to tariffs demanded by Brazilian producers and vice-versa. (I have no source to quote here except that I wasted several years of my life in Mercosur meetings, as a representative of Brazil’s Ministry of Finance.)

So it won’t do to look at the Heritage Foundation’s index of economic freedom and find, say, the countries whose index shows the most “momentum”. One day Argentina is stabilizing its currency, Brazil is privatizing Telebrás. Latin America certainly looked promising in the 1990s. Then there is a policy reversal and we’re back to import-substitution. You can’t trust momentum. You can’t trust economic reform. You can’t trust any change that doesn’t involve the complete elimination of parasitic coalitions. And even then things may decline sooner than you’d expect - Chile is now rolling back some of its most important advances.

what if ideas matter?

In the last chapter of the book Olson allows himself to daydream:

Suppose […] that the message of this book was then passed on to the public through the educational system and the mass media, and that most people came to believe that the argument in this book was true. There would then be irresistible political support for policies to solve the problem that this book explains. A society with the consensus that has just been described might choose the most obvious and far-reaching remedy: it might simply repeal all special-interest legislation or regulation and at the same time apply rigorous anti-trust laws to every type of cartel or collusion that used its power to obtain prices or wages above competitive level.

That sounds lovely, but it’s hard to see it happening in our lifetimes. Even if ideas do matter, the Overton window is just too narrow for that sort of idea. It’s probably The Great Stagnation from now on.

measuring statism

When it comes to economic freedom, Brazil ranks a shameful 144th - behind former Soviet republics like Ukraine (134th) and Uzbekistan (114th). Down here the recently announced PlayStation 5 is going to sell for twice the price it sells for in the US, because decades ago some academics and bureaucrats decided that heavy taxes on videogames would make industrialists manufacture “useful” stuff instead (cars and whatnot). We even try to regulate the work of people who watch parked cars for a living, if you can believe that (I kid you not).

But how does that interventionist appetite vary across economic activities?

People try to answer that question in all sorts of ways. They look at how much money each industry spends on lobbying. They look at how often lobbyists travel to the country’s capital. They look at how well-funded each regulatory agency is. They count the number of words in each industry’s regulations (RegData is a cool example; Letícia Valle is doing something similar for Brazil).

Here I’ll try something else: I’ll look at the companies that get mentioned in the Diário Oficial da União, which is the official gazette where all the acts of the Brazilian government are published. I’ll look up what each company does (mining? banking? retail? etc) so that I can aggregate company mentions by economic sector. That will let me know how much government attention each sector receives.

(Yes, there are lots of caveats. We’ll get to them in due time - chill out, reviewer #2.)

mining the Diário Oficial da União

Each company in Brazil has a unique identifier. It’s like a Social Security Number, but for organizations. It’s a 14-digit number; we call it CNPJ (Cadastro Nacional de Pessoas Jurídicas). Here is an example: 20.631.209/0001-60. When a company is mentioned in the Diário Oficial da União, that company’s CNPJ is there.

I had already downloaded the Diário Oficial da União, for something else. Well, not all of it: the Diário was launched in 1862, but the online version only goes back to 2002. So here we have about 18 years of government publications.

To get all CNPJs mentioned in the Diário I used this regular expression:

'([0-9]{2}[\.]?[0-9]{3}[\.]?[0-9]{3}[\/]?[0-9]{4}[-]?[0-9]{2})'
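In case it helps, here is a minimal sketch of how one might apply that regular expression and keep only the matches with valid check digits (CNPJ check digits follow the standard weighted mod-11 scheme; this is my own illustration, not necessarily the exact filtering used here):

import re

CNPJ_REGEX = r'([0-9]{2}[\.]?[0-9]{3}[\.]?[0-9]{3}[\/]?[0-9]{4}[-]?[0-9]{2})'

def is_valid_cnpj(cnpj):
    '''
    checks the two CNPJ check digits (weighted sum mod 11)
    '''
    digits = [int(c) for c in cnpj]
    if len(digits) != 14:
        return False
    checks = (
        (12, [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]),
        (13, [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]),
        )
    for n, weights in checks:
        remainder = sum(d * w for d, w in zip(digits[:n], weights)) % 11
        expected = 0 if remainder < 2 else 11 - remainder
        if digits[n] != expected:
            return False
    return True

def extract_cnpjs(text):
    '''
    finds CNPJ-looking strings, strips the punctuation
    and drops the ones that fail the check-digit test
    '''
    matches = re.findall(CNPJ_REGEX, text)
    cnpjs = [re.sub(r'[^0-9]', '', m) for m in matches]
    return [c for c in cnpjs if is_valid_cnpj(c)]

# the example CNPJ from above
print(extract_cnpjs('CNPJ: 20.631.209/0001-60'))
# ['20631209000160']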

I didn’t mine the whole Diário. The Diário is divided into three sections: section 1, which has laws and regulations; section 2, which has personnel-related publications (promotions, appointments, etc); and section 3, which has procurement-related publications (invitations for bids, contracts, etc). I only mined section 1, as the other sections would merely add noise to our analysis. Just because your company won a contract to supply toilet paper to some government agency doesn’t mean your industry is regulated.

That regular expression resulted in 1.5 million matches (after dropping matches that were not valid CNPJs). In other words, section 1 of the Diário contains 1.5 million CNPJ mentions between 2002 and mid-2020 (I scraped the Diário back in July and I was too lazy to scrape the rest of it now).

The result was a table that looks like this:

date cnpj
2010-09-28 39302369000194
2010-09-28 39405063000163
2010-09-28 60960994000110
2010-09-29 31376361000160
2010-09-29 76507706000106
2010-09-29 08388911000140

That’s it for the Diário part. Now on to economic sectors.

from CNPJs to CNAEs

The Brazilian government has a big list of economic activities, and assigns to each of them a 5-digit numeric code. Hence 35140 is “distribution of electric energy”, 64212 is “commercial banks”, and so on. We call that taxonomy the CNAE (Classificação Nacional de Atividades Econômicas). There are some 700 CNAE codes in total.

(The CNAE is similar to the International Standard Industrial Classification of All Economic Activities - ISIC. You can find CNAE-ISIC correspondence tables here).

When you start a company in Brazil you’re required to inform the government what type of economic activity your company will be doing, and you do that by informing the appropriate CNAE code. That means each CNPJ is associated with a CNAE code. Fortunately, that data is publicly available.

I parsed that data to create a big CNPJ->CNAE table. If you want to do that yourself you need to download all 20 zip files, unzip each of them, and then run something like this:

import os
import pandas as pd

# path to the unzipped data files
path_to_files = '/Users/thiagomarzagao/Desktop/data/cnaes/'

# position of the CNPJ and CNAE fields
colspecs = [
    (4, 17), # CNPJ
    (376, 382), # CNAE
    ]

# the file layout gives 1-indexed, inclusive positions;
# pandas wants 0-indexed, half-open intervals
colspecs = [(t[0] - 1, t[1]) for t in colspecs]

# set field names and types
names = {
    'CNPJ': str,
    'CNAE': str,
}

# load and merge datasets
# (pandas >= 2.0 removed DataFrame.append, so we collect the pieces and concat once)
dfs = []
for fname in os.listdir(path_to_files):
    if ('.csv' in fname) or ('.zip' in fname):
        continue
    df_new = pd.read_fwf(
        path_to_files + fname, 
        skiprows = 1, 
        skipfooter = 1, 
        header = None, 
        colspecs = colspecs, 
        names = list(names.keys()), 
        dtype = names
        )
    dfs.append(df_new)
df = pd.concat(dfs, ignore_index = True)

# drop CNPJs that don't have a valid CNAE
df = df.dropna()
df = df[df['CNAE'] != '0000000']

# drop duplicates
df = df.drop_duplicates()

# save to csv
df.to_csv(path_to_files + 'cnpjs_to_cnaes.csv', index = False)

I joined the table produced by that script with the table I had created before (with dates and CNPJs - see above). The result was something like this:

date cnpj cnae
2010-09-28 39302369000194 85996
2010-09-28 39405063000163 46192
2010-09-28 60960994000110 58115
2010-09-29 31376361000160 80111
2010-09-29 76507706000106 32302
2010-09-29 08388911000140 80111

That’s all we need to finally learn which sectors get the most government attention!
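In case it helps, here is a minimal sketch of that last step - joining the two tables and counting mentions by sector - in pandas. The cnpjs_to_cnaes.csv file comes from the script above; the name of the mentions file is hypothetical, but its columns are the ones shown in the first table (reading everything as strings keeps the leading zeros):

import pandas as pd

# dates and CNPJs extracted from section 1 of the Diário (hypothetical file name)
mentions = pd.read_csv('cnpj_mentions.csv', dtype = str)

# CNPJ -> CNAE lookup produced by the script above
cnpjs_to_cnaes = pd.read_csv('cnpjs_to_cnaes.csv', dtype = str)

# attach a CNAE code to each mention
merged = mentions.merge(cnpjs_to_cnaes, left_on = 'cnpj', right_on = 'CNPJ', how = 'inner')

# count mentions by economic sector and keep the top 30
print(merged['CNAE'].value_counts().head(30))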

But first a word from my inner reviewer #2.

caveats

Perhaps pet shops appear in the Diário Oficial da União a lot simply because there are a lot of pet shops - and not because pet shops are heavily regulated. Also, the Diário Oficial da União only covers the federal government - which means that I am ignoring all state and municipal regulations/interventions. (Some nice folks are trying to standardize non-federal publications; they could use your help if you can spare the time.) Finally, each CNPJ can be associated with multiple CNAE codes; one of them has to be picked as the “primary” one, and that’s the one I’m using here, but it’s possible that using each CNPJ’s secondary CNAE codes might change the results.

This whole idea could be completely bonkers - please let me know what you think.

statism across economic sectors

Here are the 30 economic sectors whose companies most often show up in the Diário. Mouse over each bar to see the corresponding count.

(The description of some sectors is shortened/simplified to fit the chart. Sometimes the full description includes lots of footnotes and exceptions - “retail sale of X except of the A, B, and C types”, that sort of thing.)

Wow. Lots to unpack here.

drugs

The top result - “retail sale of pharmaceutical goods” - is a big surprise to me. I mean, yes, I know that selling drugs to people is a heavily regulated activity. But I thought it was the job of states or municipalities to authorize/inspect each individual drugstore. I thought the federal government only laid out general guidelines.

Well, I was wrong. Turns out if you want to open a drugstore in Brazil, you need the permission of the federal government. And that permission, if granted, is published in the Diário. Also, it falls to the federal government to punish your little pharmacy when you do something wrong - irregular advertising, improper storage of medicine, and a myriad of other offenses. Those punishments also go in the Diário.

Now, we need to keep in mind that there are a lot more drugstores than, say, nuclear power plants. I’m sure that nuclear plants are under super intense, minute regulation, but because they are rare they don’t show up in the Diário very often. So we can’t conclude that selling drugs to people is the most heavily regulated sector of the Brazilian economy.

We could normalize the counts by the number of CNPJs in each economic sector. But I think the raw counts tell an interesting story. The raw counts give us a sense of how “busy” the state gets with each economic sector. I’m sure that nuclear plants are more heavily regulated than drugstores, but I bet that a lot more bureaucrat-hours go into regulating drugstores than into regulating nuclear plants.
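(If you did want to normalize, the CNPJ->CNAE table above already gives you the denominator. A rough sketch, reusing the files from the earlier sketch:)

import pandas as pd

mentions = pd.read_csv('cnpj_mentions.csv', dtype = str)
cnpjs_to_cnaes = pd.read_csv('cnpjs_to_cnaes.csv', dtype = str)
merged = mentions.merge(cnpjs_to_cnaes, left_on = 'cnpj', right_on = 'CNPJ', how = 'inner')

# raw mention counts by sector
counts = merged['CNAE'].value_counts()

# number of registered CNPJs in each sector
cnpjs_per_sector = cnpjs_to_cnaes.groupby('CNAE')['CNPJ'].nunique()

# mentions per registered CNPJ, by sector
normalized = (counts / cnpjs_per_sector).dropna().sort_values(ascending = False)
print(normalized.head(30))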

NGOs (sort of)

The second most frequent category - “social rights organizations” - corresponds to what most people call non-governmental organizations. Except that this is Brazil, where NGOs are not really NGOs: they receive a lot of funding from the state and they perform all kinds of activities on behalf of the state. Which explains why CNPJs belonging to NGOs (well, “NGOs”) have appeared over 90 thousand times in the Diário since 2002.

I looked into some of the cases. There are NGOs receiving state funding to provide healthcare in remote areas; to provide computer classes to kids in poor neighborhoods; to fight for the interests of disabled people; just about anything you can think of. Brazil has turned NGOs into government agencies. Our non-governmental organizations are independent from the Brazilian government in the same way that the Hitlerjugend was independent from the Reich.

Needless to say, NGOs are often at the center of corruption scandals. People come up with a pretty name and a CNPJ, apply for funding to do some social work, and then just pocket the money.

arts

As if government-funded NGOs weren’t embarrassing enough, a good chunk of the performing arts in Brazil is also government-funded. Hence the third and fifth most frequent categories here: “performing arts, concerts, dancing” and “movies, videos, and TV shows”.

You don’t have to be a good producer. As long as you are in the good graces of the government, you can apply for funding from the Ministry of Culture and use it to do your play/movie/concert. That’s why CNPJs belonging to those two categories have appeared in the Diário over 90 thousand times since 2002. I did the math and that’s about fifteen CNPJs a day receiving funding to have people dancing or acting or whatnot. Mind you, that’s just the federal funding.

And that’s just the official funding. Sometimes taxpayers end up funding “cultural” productions through less-than-transparent means. For instance, back in 2009 there was a biopic about Lula da Silva, who happened to be the president at the time. Well, turns out that 12 of the 17 companies that invested in the production of the movie had received hundreds of millions of dollars in government contracts. Neat, right?

Every now and then a good movie or play comes out of it. If you haven’t seen Tropa de Elite yet you should stop whatever you’re doing and go watch it (it’s on Netflix). But nearly all productions are flops. For every Tropa de Elite there are thirty Lula, Filho do Brasil.

If you want to have a taste of how bad most productions are, here is a teaser of Xuxa e os Duendes, which pocketed a few million bucks in taxpayers’ money. Trust me, you don’t need to understand Portuguese to assess the merits of the thing:

Meanwhile, we have over 40 thousand homicides a year, only a tiny fraction of which get solved. But what do I know, maybe Xuxa e os duendes is the sort of thing Albert Hirschman had in mind when he talked about backward and forward linkages.

I leave the analysis of the other categories as an exercise for the reader. If you want to see the full results, it’s here.

to do

It would be interesting to see how these counts vary by state or municipality; how they correlate with other measures of statism; how they change over the years; and so on.

That’s it. Remember, folks: these are the people in charge of public policy.

word embeddings for bureaucratese

You can find pre-trained word embeddings for hundreds of different languages - FastText alone has pre-trained embeddings for 157 languages (see here). But a single language can come in multiple “flavors”. I’m not talking about dialects, but about the different vocabulary and writing styles you find in news articles vs social media vs academic papers, etc. Most word embeddings come from a limited number of sources, with Wikipedia and news articles being the most common ones. If you have, say, social media texts, using word embeddings trained on Wikipedia entries may not yield good results.

So I decided to train my own Brazilian Portuguese word embeddings on the source that interests me the most: government publications. Decrees, invitations for bids, contracts, appointments, all that mind-numbingly boring stuff that makes up the day-to-day life of the public sector. Those embeddings might help me in future text-related tasks, like classifying government decrees and identifying certain types of contracts. I imagine it could be useful for other folks working with Brazilian government publications, so here’s how I did that.

I started by scraping the official bulletin where all the acts of the Brazilian government are published: the Diário Oficial da União. To give you an idea of how much text that is, the Diário’s 2020-07-06 issue has a total of 344 pages - with a tiny font and single spacing. (The Brazilian state is humongous and the size of the Diário reflects that.) The Diário is available online going as far back as 2002-01-01 and I scraped all of it. That amounted to about 8GB of zip files. Here is how to scrape it yourself (I used Python for everything):

import os
import requests
from bs4 import BeautifulSoup

months = [
    # month names in Portuguese
    'janeiro',
    'fevereiro',
    'marco',
    'abril',
    'maio',
    'junho',
    'julho',
    'agosto',
    'setembro',
    'outubro',
    'novembro',
    'dezembro'
]

# path to the folder that will store the zip files
basepath = '/path/to/diario/' # create the folder first

# loop through years and months
for year in range(2002, 2021): # change end year if you're in the future
    for month in months:
        print(year, month)
        url = 'http://www.in.gov.br/dados-abertos/base-de-dados/publicacoes-do-dou/{}/{}'.format(str(year), month)
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        tags = soup.find_all('a', class_ = 'link-arquivo')
        urls = ['http://www.in.gov.br' + e['href'] for e in tags]
        fnames = [e.text for e in tags]
        for url, fname in zip(urls, fnames):
            if not os.path.isfile(basepath + fname):
                try:
                    r = requests.get(url)
                    with open(basepath + fname, mode = 'wb') as f:
                        f.write(r.content)
                except:
                    continue

After the scraping is done you can unzip each of those 400+ files manually or you can automate the job:

import os
import zipfile

ipath = '/path/to/diario/'
opath = '/path/to/diario_xml/' # create folder first
for fname in os.listdir(ipath):
    if '.zip' in fname:
        year = fname[5:9]
        month = fname[3:5]
        section = fname[2:3]
        print(year, month, section, fname)
        destination = opath + year + '/' + month + '/' + section + '/'
        os.makedirs(destination, exist_ok = True) # don't crash if the folder already exists
        try:
            with zipfile.ZipFile(ipath + fname) as zip_ref:
                zip_ref.extractall(destination)
        except:
            print('error; moving on') # some zip files are corrupted

This won’t give you all the text in the Diário Oficial da União since 2002-01-01. Some zip files are corrupted and most issues are incomplete. For 2016, for instance, only the May issues are available. And for all years except 2019 and 2020 one of the sections (section 3) is missing entirely (the Diário is divided into three sections - 1, 2, and 3). Also, after you unzip the files you find out that in many cases the text is not in XML but in JPEG format. I wasn’t in the mood to do OCR so I just ignored the JPEG files.

If you want to get in touch with the Diário’s publisher to discuss those problems, be my guest. Here I don’t care much about those problems because all I need to train my word embeddings is a ton of data, not all of the data. And with the XML files that I got I have over 4 million government acts, which is probably way more than I need here.

After unzipping everything I trained my word embeddings. I chose to go with gensim’s implementation of word2vec. The beauty of gensim’s implementation is that you can stream the texts one by one straight from disk, without having to keep everything in memory. Now, that’s a little tricky to accomplish. Gensim’s documentation says that instead of a list of documents you can use a generator, but I found that not to be the case. I got this error: TypeError: You can't pass a generator as the sentences argument. Try an iterator. But I googled around and found a nifty workaround that tricks gensim into using a generator by wrapping it inside an iterator. So here I have a generator (yield_docs) that yields one document at a time and then I wrap it inside an iterator (SentencesIterator) so that gensim won’t complain.

About the documents, I have some 4.2 million XML files in total. In theory all these XML files should be easily parsable - they have tags with metadata, main content, etc. But in reality many are invalid. They have unclosed quotation marks and other problems that trip BeautifulSoup’s parser. So I ignored all the metadata and just focused on the stuff inside the <Texto> (text) tags, which is always a collection of <p> tags. Now, different paragraphs of the same publication can talk about entirely different issues, so instead of treating each publication (i.e., each XML file) as a document I’m treating each <p> content as a document. That should yield more coherent word associations. So while I have 4.2 million XML files, in the end I have 72 million documents, one corresponding to each <p> tag. That’s… a lot of text.

Back to word2vec. I don’t really know the ideal number of dimensions here. I found a nice paper that provides a way to estimate the ideal number of dimensions for any dimensionality reduction algorithm. But it’s too computationally expensive: you need to create a graph where each unique token is a node and the co-occurrences are represented by edges. I tried it but the thing got impossibly slow at around 200k nodes - and I have over 1M unique tokens. By my estimates it would take about half a year for the nodes to be created, and then I would need to find the graph’s maximum clique, which is also computationally expensive. So… no. If I had a specific text classification task in mind I would just try different numbers of dimensions and see what works best, but that’s not what I’m doing right now. So instead of relying on any theoretical or empirical approaches I just went with 300 dimensions because that’s a nice round number and I’ve seen it used in other word embeddings.

I’m discarding all words that appear fewer than 1,000 times in the corpus (probably too rare to matter) and I’m using a window of 10 (the maximum distance between the current and the predicted word within a sentence).

Here’s the code:

import os
from bs4 import BeautifulSoup
from string import punctuation
from gensim.models import Word2Vec

def tokenize(raw_text):
    '''
    'Hey, dogs are awesome!' -> ['hey', 'dogs', 'are', 'awesome']

    using `re` would probably make it run faster but I got lazy
    '''

    # lowercase everything
    text = raw_text.lower()

    # remove punctuation
    for p in punctuation:
        text = text.replace(p, ' ')

    # remove extra whitespaces
    while '  ' in text:
        text = text.replace('  ', ' ')

    # tokenize
    tokens = text.strip().split()

    # remove digits
    tokens = [e for e in tokens if not e.isdigit()]

    return tokens

def yield_docs():
    '''
    crawl XML files, split each one in paragraphs
    and yield one tokenized paragraph at a time
    '''
    n = 0
    path = '/path/to/diario_xml/'
    for root, dirnames, fnames in os.walk(path):
        if not dirnames: # only the leaf folders (year/month/section) contain XML files
            for fname in fnames:
                if '.xml' in fname:
                    filepath = root + '/' + fname
                    with open(filepath, mode = 'r') as f:
                        content = f.read()
                    soup = BeautifulSoup(content, features = 'lxml')
                    paragraphs = soup.find_all('p')
                    for p in paragraphs:
                        print(n)
                        n += 1
                        tokens = tokenize(p.text)
                        if len(tokens) > 1:
                            yield tokens

class SentencesIterator():
    '''
    this tricks gensim into using a generator,
    so that I can stream the documents from disk
    and not run out of memory; I stole this code
    from here: 

    https://jacopofarina.eu/posts/gensim-generator-is-not-iterator/
    '''
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        # reset the generator
        self.generator = self.generator_function()
        return self

    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        else:
            return result

# train word2vec
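# note: in gensim >= 4.0 the `size` parameter below is called `vector_size`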
model = Word2Vec(
    SentencesIterator(yield_docs), 
    size = 300, 
    window = 10, 
    min_count = 1000, 
    workers = 6
    )

# save to disk
model.save('word2vec.model')

And voilà, we have our word embeddings. We have a total of 27198 unique tokens (remember, we ignored any tokens that appeared in fewer than 1000 paragraphs) and 300 dimensions, so our word embeddings are a 27198x300 matrix. If you’re not familiar with word2vec Andrew Ng explains it here. The TL;DR is that word2vec’s output is a matrix where each unique token is represented as a vector - in our case, a 300-dimensional vector. That allows us to do a bunch of interesting stuff with that vocabulary - for instance, we can compute the cosine similarity between any two words to see how related they are. In gensim there is a neat method for that. For instance, suppose we want to find the words most related to “fraude” (fraud):

model.wv.most_similar(positive = ['fraude'])
>>> [('fraudes', 0.5694327354431152),
>>> ('conluio', 0.5639076232910156),
>>> ('superfaturamento', 0.5263874530792236),
>>> ('irregularidade', 0.4860353469848633),
>>> ('dolo', 0.47721606492996216),
>>> ('falsidade', 0.47426754236221313),
>>> ('suspeita', 0.47147220373153687),
>>> ('favorecimento', 0.4686395227909088),
>>> ('ilícito', 0.4681907892227173),
>>> ('falha', 0.4664713442325592)]

We can see that bid rigging (“conluio”) and overpricing (“superfaturamento”) are the two most fraud-related words in government publications (“fraudes” is just the plural form of “fraude”). Kinda cool to see it. You can also cluster the word embeddings to find groups of inter-related words; use t-SNE to reduce dimensionality to 2 so you can plot the embeddings on an XY plot; and try a bunch of other fun ideas.
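If you want to compare a specific pair of words, or grab the raw vectors and do the cosine math yourself, gensim makes that easy too (the word pair here is just an example - the numbers you get depend on the trained model):

import numpy as np

# cosine similarity between two specific words
print(model.wv.similarity('fraude', 'conluio'))

# the raw 300-dimensional vectors, if you want to do the math yourself
a = model.wv['fraude']
b = model.wv['conluio']
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))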

Here I trained the word embeddings from scratch but you could also take pre-trained Brazilian Portuguese embeddings and use the Diário to fine-tune them. You could also tweak the parameters, changing the window (10 here) and the number of dimensions (300). Whatever works best for the task you have at hand.
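If you go the fine-tuning route with gensim, one option is to keep training an existing full Word2Vec model on the Diário’s paragraphs. A rough sketch, assuming you have the full saved model (not just the exported vectors) and reusing the yield_docs/SentencesIterator defined above - the file names are placeholders:

from gensim.models import Word2Vec

# a previously trained, full Word2Vec model for Brazilian Portuguese
pretrained = Word2Vec.load('pretrained_ptbr.model')

# add the Diário's vocabulary and keep training on its paragraphs
corpus = SentencesIterator(yield_docs)
pretrained.build_vocab(corpus, update = True)
pretrained.train(
    corpus,
    total_examples = pretrained.corpus_count,
    epochs = pretrained.epochs
    )
pretrained.save('word2vec_finetuned.model')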

That’s all for today! And remember, bureaucratese is bad writing - don’t spend too long reading those texts, lest you start emulating them. The best antidote to bureaucratese (or to any bad writing) that I know is William Zinsser.