text-analyzing Brazilian music

People have text-analyzed American songs to exhaustion. We’ve learned, among other things, that pop music has become dumber and that Aesop Rock beats Shakespeare in vocabulary size. I thought it would be fun to make similar comparisons for Brazilian artists. Alas, as far as I could google no one has done that yet. So, I did it myself. Here’s my report.

the data

Our data source is Vagalume, from which I scraped the lyrics. (I give all the code in the end.) In total we have eight genres, 576 artists, and 77,962 lyrics (that’s after eliminating artists whose combined lyrics summed up to less than 10,000 words; more on this later). These are the genres:

  • Rock.
  • Rap.
  • Samba. The music they play at the big Carnival parade in Rio.
  • Pagode. Derived from samba but more popular, especially in the periphery of Rio de Janeiro.
  • MPB. That slow, relaxing music you often hear in elevators.
  • Sertanejo. Country music meets salsa.
  • Axé. Think Macarena, but with racier choreographies. This is what they play at the Carnival parades in Bahia.
  • Forró. Like salsa, but faster.

Here’s how the data are distributed:

(The genre assignment - which is Vagalume’s, not mine - is not exclusive: some artists belong to more than one genre. In these cases I arbitrarily picked one of the genres, to avoid double counting in the pie charts.)

how to measure vocabulary size?

At first glance vocabulary size is a simple metric: the number of (unique) words each artist uses. But some artists have only 2-3 songs while others have 400+ songs. Naturally the more lyrics you have (and the longer your lyrics) the more unique words you will tend to use (up to the point where you exhaust your vocabulary).

Can’t we divide the number of unique words by the total number of words, for each artist? No, we can’t. Say you know a total of 10,000 words and you have written 100 lyrics with 100 words each, without ever repeating a word. In this case your “unique words / total words” ratio is 1. But if you suddenly write another batch of 100 lyrics with 100 words each your ratio will fall to 0.5 - even though your vocabulary has remained the same. No good.

So, one measure (unique words) is overly kind to prolific artists while the other measure (unique words / total words) is overly kind to occasional artists. To get around that I: a) discard all artists whose combined lyrics sum up to less than 10,000 words (that’s how we got to 576 artists and 77,962 lyrics, down from 809 artists and 95,851 lyrics); b) for each non-discarded artist, randomly select 1,000 samples of 10,000 words each and compute the average number of unique words. In other words, I truncate my data and then use bootstrapped samples.

One final issue before we dive into the data. Naturally, the person singing the song is not necessarily the person who wrote the song. So when I talk about artist XYZ’s vocabulary that’s not really his or her personal vocabulary; that’s just short for “the vocabulary we find in the songs that artist XYZ sings”, which is too cumbersome to say. (And heck, no one is forcing people to go on stage and sing dumb songs. If they do it then they should bear the reputational cost.)

Enough talk, let’s see the data.

mandatory descriptive statistics

The mean vocabulary is 1,758 words and the 95% confidence interval is [1,120, 2,458]. N = 576. Here’s the distribution:

the 30 largest vocabularies

Ok, time to see the winners (mouse over each bar to see the corresponding number):

What do you know! The largest vocabulary of Brazilian music is that of a rap band: 2,961 words. That’s almost twice the mean (1,758) and almost four times the lowest vocabulary (804). Not bad. Street music beats highfalutin MPB: Chico Buarque would need to learn some 200 new words to reach Facção Central.

Rap didn’t just take the first prize: rap bands occupy 16 of the top 30 positions. And that’s despite rap representing only 7% of our lyrics. Rap folks, take a bow.

Here, check Diário de um Detento, by Racionais Mc’s. It’s one of their most famous songs. And it’s pretty representative of Brazilian rap music, as it talks about life in the periphery of the big cities - violence, drugs, poverty, etc.

Another big surprise (to me at least) was sertanejo, which occupies 4 of the top 30 positions. When I think about sertanejo what comes to mind are cheesy duos like Zezé di Camargo & Luciano or Jorge & Mateus. The sertanejo singers we see above were completely unfamiliar to me.

I googled around and turns out these are all old-timers, mostly retired. So, it seems that what has now degenerated into tche tcherere tche tche was once a vocabulary-rich genre. (Unfortunately Vagalume doesn’t have lyrics’ release dates, so we can’t do a proper time series analysis.)

Same with Luiz Gonzaga: when I think forró I imagine tacky bands like Banda Calypso or Aviões do Forró. But Luiz Gonzaga is an old-timer. So, yet another vocabulary-rich genre has degenerated - this time into “suck it ‘cause it tastes like grapes”.

Rap, vintage sertanejo, and Luiz Gonzaga aside, what we see above are MPB’s biggest stars. These are world-renowned artists with long and established careers. You won’t find their songs among Spotify’s top 50, but they have their public. (Fittingly, Chico Buarque is a distant relative of Aurélio Buarque de Hollanda, the lexicographer who edited the most popular dictionary of Brazilian Portuguese.) I find most MPB mind-numbingly boring but here’s a playlist of Chico Buarque’s songs, so you can check for yourself.

One absence is noteworthy: rock. I was sure folks like Raimundos and Skank would come out on top. Alas, I was wrong. Raimundos’ vocabulary is 2,239 and Skank’s, 2,246 - they rank #81st and #79th respectively. Not exactly bad, but far from top 30. (I keep hearing that Brazilian rock is dead; maybe that’s true after all.) Anyway, here’s a Raimundos’ song, in case you’ve never had a taste of Brazilian rock before.

Here are the most frequent words of the top 30 artists:

No clear dominant theme here. Life (vida), other (outro), brother (irmão), water (água), path (caminho), moment (momento), soul (alma), hour (hora), dream (sonho), star (estrela), son (filho), hand (mão), night (noite), word (palavra), time (tempo), year (ano), stone (pedra), eye (olho), father (pai).

the 30 smallest vocabularies

Now on to the losers:

Sertanejo is by far the most frequent genre here: it takes up 12 of the bottom 30 positions. I can’t say I’m shocked. (I googled around and these are all contemporary sertanejo artists; no old-timers here.)

I can hear sertanejo fans complaining: “wait, maybe it’s just that sertanejo is overrepresented in the data” (though, given what we just saw, one wonders whether many sertanejo fans would know the word “overrepresented”; or “data”). Indeed, sertanejo accounts for 28% of our lyrics - by far the largest piece of the pie. But then how come not a single contemporary sertanejo artist appears in the top 30? Sorry, sertanejo fans: it’s time to acknowledge the misery of your musical taste and look for something better (the top 30 above might be a good place to start).

Rock is the second most represented category here. A bit of a surprise to me. I mean, fine, there isn’t a single rock artist in the top 30; but dammit, must rock also account for a fifth of the bottom 30? If Brazilian rock isn’t dead yet then maybe it’s time we euthanize it.

The rest is forró and pagode, about which no one could seriously have had high expectations (if you did then your musical taste is beyond hope; just give up on music and download some podcasts instead).

I expected axé to be in the bottom 30. Maybe Carnival in Bahia is not the nightmarish experience I picture after all.

The amplitude of the spectrum is large. The average of the top 30 vocabularies is 2.6 times larger than the average of the bottom 30 vocabularies. Maybe this will help you grasp the abyss: the combined vocabulary of Victor & Vinicius, Abril, Lipstick, Agnela, Lucas & Felipe, Forró Lagosta Bronzeada, Leva Nóiz, Hevo84, Forró Boys, Drive, Cacio & Marcos, Marcos & Claudio, Banda Djavú, Renan & Ray, Raffael Machado, TNT, Sambô, and Roberta Campos (after accounting for the words that they all have in common) is still smaller than the vocabulary of Racionais Mc’s alone.

In short, it’s official: Brazilian popular music is garbage. And now we have data to back up that claim.

If I’m allowed a short digression, the problem is not just the poverty of the vocabulary but the poverty of the underlying sentiments and ideas as well. Let me give you a taste. What you see below is Michel Teló’s “Oh, if I catch you” (I translated it for your benefit):

Wow. Wow.
This way you’ll kill me.
Oh, if I catch you.
Oh, if I catch you.

Delicious. Delicious.
This way you’ll kill me.
Oh, if I catch you.
Oh, if I catch you.

Saturday in the club.
People started dancing.
Then the prettiest girl passed by.
I got bold and went talk to her.


So: guy is in the club, sees pretty girl, goes talk to her. Next to Michel Teló Nicki Minaj is Chaucer.

No, I didn’t pick some little-known, abnormally bad outlier song just to make things appear worse than they are: “Oh, if I catch you” reached #1 in 23 European and Latin American countries. There is even a The Baseballs version, if you can believe that. Michel Teló is export-grade garbage.

Here are the most frequent words of the bottom 30 artists:

Romantic words are a lot more frequent here than in the top 30: kiss (beijo), love (amor), hurt/hurts (dói), together (junto), “missing someone/something” (saudade; kinda hard to translate this one), heart (coração), and so on. We also see that “thing” (coisa) is possibly the most frequent word here, which tells us that these artists are not even trying (they have no concept of le mot juste).

the middle of the scale

The extremes tell a pretty coherent story: rap, MPB, and vintage sertanejo on one end, forró, pagode, and contemporary sertanejo on the other. But things get fuzzier as we move away from the extremes. Here are some surprises:

  • Chiclete com Banana, an axé band whose best-known chorus is “aê aê aê aê aê”, beats MPB god, Grammy-winner, Girl from Ipanema co-author Tom Jobim.
  • Daniel, a corny sertanejo singer, beats Maria Rita and Toquinho, two revered favorites of MPB fans.
  • Legião Urbana, a mediocre rock band that is insufferably popular in Brasília (where I live), only appears in the #188th position - behind forró band Mastruz com Leite.
  • And my favorite finding: Molejo, Aviões do Forró, and Wesley Safadão, respectively the trashiest pagode band, forró band, and sertanejo singer of all times, all beat Grammy-winner, celebrated MPB singer and composer Maria Gadú.

Here’s the whole data if you’re interested.

to do

There is a lot more we could do with these data. For instance, we could do some sentiment analysis. There is no word->sentiment dictionary in Portuguese, but we could use automatic translation and then use SentiWordNet. I suspect that axé and sertanejo will be at opposite ends of the happy-sad spectrum.

We could also use co-sine similarity to check how “repetitive” each genre is. To me all axé songs sound the same, so I suspect there is little textual variation in them. We could, for each genre, take each possible pair of songs and compute the average co-sine similarity of all pairs

Just because you have a large vocabulary doesn’t mean you pick the right words. A plausible observable implication of careful word choice is the use of rarer, lesser-known words. So it might be worthwhile to compute the average inverse document frequency (IDF) of each artist and compare them.

I read somewhere that sertanejo, forró, and pagode are beginning to merge into one single genre (I shudder at the thought of a song that is simultaneously sertanejo, forró, and pagode). I wonder if that may already show in our data, so it migh be interesting to clusterize the lyrics (say, using k-means) and check whether the resulting clusters correspond to genre labels.

Finally, we could use recurrent neural networks (RNN) to automate some artists (like people have done with Obama), just for the fun of it. As with any RNN the more training texts the better, so Chico Buarque, with his 416 lyrics, would be a great candidate for automation. I bet that most of his fans wouldn’t be able to tell human Chico Buarque from bot Chico Buarque.


Here’s the Python code I wrote to scrape Vagalume:

scrape lyrics from vagalume.com.br
(author: thiagomarzagao.com)

import json
import time
import pickle
import requests
from bs4 import BeautifulSoup

# get each genre's URL
basepath = 'http://www.vagalume.com.br'
r = requests.get(basepath + '/browse/style/')
soup = BeautifulSoup(r.text)
genres = [u'Rock',
genre_urls = {}
for genre in genres:
    genre_urls[genre] = soup.find('a', class_ = 'eA', text = genre).get('href')

# get each artist's URL, per genre
artist_urls = {e: [] for e in genres}
for genre in genres:
    r = requests.get(basepath + genre_urls[genre])
    soup = BeautifulSoup(r.text)
    counter = 0
    for artist in soup.find_all('a', class_ = 'top'):
        counter += 1
        print 'artist {} \r'.format(counter)
        artist_urls[genre].append(basepath + artist.get('href'))
    time.sleep(2) # don't reduce the 2-second wait (here or below) or you get errors

# get each lyrics, per genre
api = 'http://api.vagalume.com.br/search.php?musid='
genre_lyrics = {e: {} for e in genres}
for genre in artist_urls:
    print len(artist_urls[genre])
    counter = 0
    artist1 = None
    for url in artist_urls[genre]:
        success = False
        while not success: # foor loop in case your connection flickers
                r = requests.get(url)
                success = True
        soup = BeautifulSoup(r.text)
        hrefs = soup.find_all('a')
        for href in hrefs:
            if href.has_attr('data-song'):
                song_id = href['data-song']
                print song_id
                success = False
                while not success:
                        song_metadata = requests.get(api + song_id).json()
                        success = True
                if 'mus' in song_metadata:
                    if 'lang' in song_metadata['mus'][0]: # discard if no language info
                        language = song_metadata['mus'][0]['lang']
                        if language == 1: # discard if language != Portuguese
                            if 'text' in song_metadata['mus'][0]: # discard if no lyrics
                                artist2 = song_metadata['art']['name']
                                if artist2 != artist1:
                                    if counter > 0:
                                        print artist1.encode('utf-8') # change as needed
                                        genre_lyrics[genre][artist1] = artist_lyrics
                                    artist1 = artist2
                                    artist_lyrics = []
                                lyrics = song_metadata['mus'][0]['text']
                                counter += 1
                                print 'lyrics {} \r'.format(counter)

    # serialize
    with open(genre + '.json', mode = 'wb') as fbuffer:
        json.dump(genre_lyrics[genre], fbuffer)

So, I loop through genres, artists, and songs. I get each song’s id and use Vagalume’s nice API to check whether the lyrics is available and whether it’s in Portuguese. For each genre I create a dict where each artist is a key and then I save the dict as a JSON file.

You’ll need the requests and BeautifulSoup packages to run this code.

Vagalume says that in the future you will need credentials to use their API (that’s how it works with Twitter, for instance). So, if you’re in the future you may have to change the code.

Here’s the code I wrote to measure the vocabularies:

count vocabularies of lyrics scraped off vagalume.com.br
(author: thiagomarzagao.com)

import os
import json
import nltk
import random
import pandas as pd

# organize lyrics by artist
basepath = '/path/to/JSON/files/'
rawdata = {}
for fname in os.listdir(basepath):
    if '.json' in fname:
        with open(basepath + fname, mode = 'rb') as fbuffer:
            genre = json.load(fbuffer)
            for artist in genre:
                rawdata[artist] = genre[artist]

# compute statistics per artist
data = {}
for artist in rawdata:
    print artist.encode('utf-8')
    total_lyrics = 0
    words = []
    for lyrics in rawdata[artist]:
        print(str(total_lyrics) + ' \r'),
        lowercased = lyrics.lower()
        tokens = nltk.word_tokenize(lowercased)
        if len(tokens) < 2:
        total_lyrics += 1
        for word in tokens:
    total_words = float(len(words))
    unique_words = float(len(set(words)))

    # discard if artist has less than 10,000 words
    if len(words) >= 10000:

        # bootstrap
        uniques = []
        for i in range(1000):
            sample = random.sample(words, 10000)
        vocabulary = float(sum(uniques)) / len(uniques)

        # store data
        data[artist] = {'total_lyrics': total_lyrics,
                        'total_words': total_words,
                        'unique_words': unique_words,
                        'vocabulary': vocabulary}

df = pd.DataFrame(data).T.sort(['vocabulary'], ascending = False)

You’ll need NLTK and pandas to run this code.

To produce the word clouds I used Andreas Mueller’s nifty word_cloud package.

Well, it’s a wrap. Say no to dumb music, kids!

classifying public procurement

New manuscript and app. Abstract:

The Brazilian government often misclassifies the goods it buys. That makes it hard to audit government expenditures. We cannot know whether the price paid for a ballpoint pen (code #7510) was reasonable if the pen was misclassified as a technical drawing pen (code #6675) or as any other good. This paper shows how we can use machine learning to reduce misclassification. I trained a support vector machine (SVM) classifier that takes a product description as input and returns the most likely category codes as output. I trained the classifier using 20 million goods purchased by the Brazilian government between 1999-04-01 and 2015-04-02. In 83.3% of the cases the correct category code was one of the three most likely category codes identified by the classifier. I used the trained classifier to develop a web app that might help the government reduce misclassification. I open sourced the code on GitHub; anyone can use and modify it.

saving TfidfVectorizer without pickles

As promised, here’s how to save a trained instance of scikit-learn’s TfidfVectorizer without using pickles - in other words, how to save it as human-readable, shareable data.

The general idea is in my previous post: a model is a set of coefficients so you just extract them and save them as you would save any other data (like the very data you used to train the model). That way you avoid the security and maintainability problems of using pickles. You extract the coefficients, save them as data, then later you load them and plug them back in.

Now, that’s easier to do with some models than with others. With scikit-learn’s SGDClassifier, for instance, that’s a breeze. But with TfidfVectorizer that’s a bit tricky. I had to do it anyway so I thought I should write a how-to of sorts.

First we instantiate our TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase = False,
                             min_df = 2,
                             norm = 'l2',
                             smooth_idf = True)

Once we’ve trained the vectorizer it will contain two important attributes: idf_, a numpy array that contains the inverse document frequencies (IDFs); and vocabulary_, a dictionary that maps each unique token to its column number on the TF-IDF matrix.

To extract the IDF array you can just print it to the screen and then copy and paste it to a .py file. The file will look like this:

import numpy as np

idfs = np.array([7.35689028227,

To extract the vocabulary you can do the same, but depending on how many tokens you have this may not be practical. An alternative is to use JSON. Like this:

import json

json.dump(vectorizer.vocabulary_, open('vocabulary.json', mode = 'wb'))

The vocabulary is now saved in the vocabulary.json file.

That’s it, we’ve disassembled our vectorizer. So far so good.

Now, it’s when we try to put everything back together that things get tricky.

We start by importing the TfidfVectorizer class. But we can’t instantiate the class right away. Here’s the problem: we are not allowed to assign arbitrary values to the idf_ attribute. If you instantiate the class and then try something like vectorizer.idf_ = idfs you get an AttributeError exception.

from idfs import idfs # numpy array with our pre-computed idfs
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase = False,
                             min_df = 2,
                             norm = 'l2',
                             smooth_idf = True)

vectorizer.idf_ = idfs
AttributeError: can't set attribute

The problem is that the idf_ attribute is kind of “read-only”. I say “kind of” because that’s not exactly true: if you train the vectorizer then idf_ will change (it’ll have the IDFs). But idf_ behaves as read-only if you try to plug the IDFs directly, without training the vectorizer.

That happens because idf_ is defined with a @property decorator and has no corresponding setter method - check TfidfVectorizer’s source code.

I can’t imagine why the scikit-learn folks made that choice. That’s a bunch of smart people with a lot of programming experience, so I imagine they had good reasons. But that choice is getting in the way of proper model persistence, so here’s how we get around it:

from idfs import idfs # numpy array with our pre-computed idfs
from sklearn.feature_extraction.text import TfidfVectorizer

# subclass TfidfVectorizer
class MyVectorizer(TfidfVectorizer):
    # plug our pre-computed IDFs
    TfidfVectorizer.idf_ = idfs

# instantiate vectorizer
vectorizer = MyVectorizer(lowercase = False,
                          min_df = 2,
                          norm = 'l2',
                          smooth_idf = True)

So, what’s happening here? We are creating a new class - MyVectorizer -, which inherits all attributes (and everything else) that TfidfVectorizer has. And we are plugging our IDFs into the MyVectorizer class. When we instantiate MyVectorizer our pre-computed IDFs are already there, in the idf_ attribute. Problem solved.

But we’re not done yet. If you try to use the vectorizer now you’ll get an error:

vectorizer.transform(['hey macarena'])
ValueError: idf vector not fitted

So, we’re being told that our vectorizer hasn’t been trained, even though we’ve plugged our pre-computed IDFs. What’s going on?

When we try to use our vectorizer there is a function check_is_fitted that checks, well, whether we have fitted the vectorizer. You’d think it checks the idf_ attribute but it doen’t. Instead it checks the attribute of an attribute: ._tfidf._idf_diag, which is a sparse matrix made from the IDFs. So we need to plug that matrix into the vectorizer.

We can extract ._tfidf._idf_diag from the trained vectorizer, save it as data, then load and plug it - just like we did with the other attributes. But an easier alternative is to simply compute ._tfidf._idf_diag from our IDFs, using scipy.

import scipy.sparse as sp
from idfs import idfs # numpy array with our pre-computed idfs
from sklearn.feature_extraction.text import TfidfVectorizer

# subclass TfidfVectorizer
class MyVectorizer(TfidfVectorizer):
    # plug our pre-computed IDFs
    TfidfVectorizer.idf_ = idfs

# instantiate vectorizer
vectorizer = MyVectorizer(lowercase = False,
                          min_df = 2,
                          norm = 'l2',
                          smooth_idf = True)

# plug _tfidf._idf_diag
vectorizer._tfidf._idf_diag = sp.spdiags(idfs,
                                         diags = 0,
                                         m = len(idfs),
                                         n = len(idfs))

Problem solved. All we need to do now is plug the vocabulary.

vocabulary = json.load(open('vocabulary.json', mode = 'rb'))
vectorizer.vocabulary_ = vocabulary

Now our vectorizer works:

vectorizer.transform(['hey macarena'])
<1x505938 sparse matrix of type '<type 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>

And we’re done.

model persistence without pickles

So, I trained this SVM classifier and I wanted to use it in a web app I built. I used Python for everything, so it seemed straightforward at first: just use the pickle module to save the classifier to disk, then have the app load the pickle. But things got complicated. In the end I found a better way to achieve model persistence, so I thought I should share the experience.

The fundamental problem is that the classifier turned out huge. Not surprising: it was trained with 20 million documents and intended to pick one of 560 possible document categories. The resulting coefficient matrix has dimensions 560 (categories) by 505,938 (unique tokens). That’s a matrix with 283,325,280 cells. When pickled to a file it takes up 8GB of disk space.

I didn’t mind that at first. I thought “fine, so the app will take a few seconds to be ready after I deploy it, no problem”. But the app can’t load an 8GB pickle if there is only, say, 1GB of RAM. I did some tests and realized that I would need a server with at least 16GB of RAM to (barely!) host the app. I looked up server prices on Amazon Web Services and on Google Compute Engine. It would cost me some US$ 200 a month to keep the app alive. Not happening. (Have I mentioned that I live in Brazil and that our currency was massive devalued this year?)

So I gave up on hosting the app. I decided to open source the code instead and let users download and host the app themselves. But that turned my 8GB pickle into a problem. It’s ok to consume your own pickles (well, not really) but it’s not ok to expect other people to consume your pickles. Pickles can have malicious code. And pickles are not guaranteed to work across different versions of the same Python packages.

Now, a model is basically a bunch of coefficients - so why not store it as data? We shouldn’t have to store a model in a pickle or in any format that is not human readable. We can store a model as we store the very data that we used to estimate the model. And that’s what I propose we do.

I used scikit-learn’s stochastic gradient descent class to train my SVM classifier, which I instantiated with the following paramters:

from sklearn import linear_model

clf = linear_model.SGDClassifier(loss = 'modified_huber',
                                         penalty = 'l2',
                                         alpha = 0.0001,
                                         fit_intercept = True,
                                         n_iter = 60,
                                         shuffle = True,
                                         random_state = None,
                                         n_jobs = 4,
                                         learning_rate = 'optimal',
                                         eta0 = 0.0,
                                         power_t = 0.5,
                                         class_weight = None,
                                         warm_start = False)

Once the model is trained the coefficients are stored in the clf.coef_ attribute as a numpy array of dimensions 560 (classes) by 505,938 (unique tokens).

array([[ -9.86766641e-03,  -6.35323529e-03,  -2.78639093e-05, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -1.08928684e-03,  -5.49497974e-04,  -1.72031659e-08, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -1.13471047e-05,  -8.34983019e-06,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -4.71493271e-06,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -1.51909841e-03,  -1.58582663e-03,  -1.53207693e-04, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -8.46919968e-07,  -3.21041555e-07,  -3.67708175e-10, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]])

As you can see, extracting the coefficients is trivial: just get clf.coef_. But how do we store them as data? I toyed with a couple of ideas and in the end I chose HDF5. If you haven’t used it before, an HDF5 file is a “container” inside which you can store arrays. I had used HDF5 before and it’s great for fast retrieval of large amounts of data. To use it from Python you must have pytables installed. You don’t need to call pytables though - pandas has a nice interface to it. Here’s how I did it:

import pandas as pd

store = pd.HDFStore('coefficients.h5')
counter = 0
for row in clf.coef_:
    # annoyingly I couldn't store each row of the
    # numpy array directly, I had to convert each
    # row into a pandas DataFrame, as you see here,
    # or else I got an error message
    store['row' + str(counter)] = pd.DataFrame(row)
    counter += 1

That’s it - we have extracted our coefficients and stored them in an HDF5 file. Here I had 560 categories and 505,938 unique tokens, so my HDF5 file contains 560 pandas DataFrames, each of length 505,938.

We are not done though. Each of the 560 classes has not only 505,938 coefficients but also one intercept. These are stored in the clf.intercept_ attribute. You can store them with HDF5 as well but with only 560 intercepts I didn’t bother doing that. I just printed clf.intercept_ to the screen and then copied and pasted it into a .py file. Dirty, I know, but quick and easy. The file looks like this:

import numpy as np

intercepts = np.array([-1.0378634 , -1.00160445, -1.00561022, -1.35181801, -1.00007832,
                       -1.00009826, -1.00010426, -1.00985818, -1.00165959, -1.00020922,
                       -1.00417335, -1.003692  , -1.00178602, -1.00047299, -1.00073538,
                       -1.00008621, -1.00021723, -1.00037028, -1.00055338, -1.09941216,
                       -1.00037199, -1.00204897, -1.03388095, -1.00933133, -1.02132215,
                       -1.04021753, -1.00289487, -1.00191766, -1.00168548, -1.00053103,

Finally we need to extract our class labels. They are in clf.classes_. Same as with the intercepts: I just printed the array to the screen and then copied and pasted it into a .py file.

import numpy as np

class_labels = np.array(['11005', '11010', '11015', '11020', '11025', '11030', '11035',
                         '11040', '11045', '11055', '11080', '11090', '11095', '11110',
                         '11125', '11135', '11190', '11240', '11280', '11305', '11310',
                         '11315', '11325', '11330', '11340', '11370', '11375', '11385',

Now we have our model nicely stored as data. People can inspect the HDF5 and .py files without (much) risk of executing arbitrary code. Our model is human readable and shareable. Now my app is indeed open source.

Ok, so much for disassembling the model. How do we put it back together?

Quick and easy. Instantiate the model, load the class labels, the coefficients and the intercepts, and plug everything in:

import numpy as np
import pandas as pd
from sklearn import linear_model

# load class labels
from class_labels import class_labels # or however you named the file and array

# load intercepts
from intercepts import intercepts # or however you named the file and array

# instantiate model
clf = linear_model.SGDClassifier(loss = 'modified_huber',
                                 penalty = 'l2',
                                 alpha = 0.0001,
                                 fit_intercept = True,
                                 n_iter = 60,
                                 shuffle = True,
                                 random_state = None,
                                 n_jobs = 4,
                                 learning_rate = 'optimal',
                                 eta0 = 0.0,
                                 power_t = 0.5,
                                 class_weight = None,
                                 warm_start = False)

# load coefficients
store = pd.HDFStore('coefficients.h5')

# convert from pandas DataFrame back to numpy array
coefs = np.array([np.array(store['row' + str(i)]).T[0] for i in range(len(class_labels)]))

# close HDF5 file

# plug class labels
clf.classes_ = class_labels

# plug intercepts
clf.intercept_ = intercepts

# plug coefficients
clf.coef_ = coefs

And voilà, we have reconstructed our model. The labels, intercepts and coefficients are in their proper places (i.e., assigned to the proper clf attributes) and the model is ready to use. And everything runs a lot faster than if we were loading pickles.

Some models are more easily “datafied” than others. “Datafying” an instance of scikit-learn’s TfidfVectorizer’s class, for instance, is a bit tricky. I’ll cover that in another post.

from political science to data science

Today it’s been a year since I started working as a data scientist. Before that I was doing my Ph.D., in political science. I wonder what other people who’ve made this sort of transition - from some social science to data science - have learned. Here’s what I’ve found out (so far). Maybe this will encourage others to share their own experiences.

Causality doesn’t matter

Political scientists want to know what causes what. Does democracy increase GDP per capita? Does oil make a country more authoritarian? Can trade prevent war? And so on. But causality is hard to establish. You need to run controlled experiments or convince the reviewers that you chose the right instruments or that you matched on the right confounders. And Gary King is always around the corner waiting to stab you in the heart.

So I was overwhelmed with joy when I found out that causality is not a big deal in the world of data science. Take text classification, for instance. You’re mapping word counts to categories. Yes, in a way, the categories do cause the word counts - “let me be clear” appears a lot in Obama’s speeches because they are Obama’s. But we just don’t care about the effect of Obama on the frequency of “let me be clear”. We only care about classifying the text correctly.

That’s liberating. Think of the time and effort that political scientists spend defending their causal claims. Or debunking others’ causal claims. Or trying to accurately measure causal effects. When you remove causality from the equation you suddenly have a lot more time to work on other, potentially more interesting aspects of data analysis.

Debates have clear winners

Most political scientists have no notion of test set or validation set. They take a bunch of data and use 100% of it as a training set. They run their regressions and interpret the equations that map dependent variables to independent variables. And that’s it. Pages and pages of articles and books and blog posts dedicated to training sets.

Which is of course ok - political scientists are trying to understand how the world works - but is also frustrating: debates never end. When you limit yourself to the training data you just don’t have a clear metric of success. It’s largely about how convincingly you can defend your theoretical and methodological choices. That leaves ample room for subjectivity, so it’s never entirely clear who are the winners and losers.

In the world of data science, however, the training data is just the beginning. Once you’ve trained your model you use it on data it hasn’t seen before and you observe how well it performs. And there are clear metrics for that: accuracy, recall, F-measure, etc. You can compare different models and immediately spot the winners and losers. You can argue all you want that your model is inherently “sounder” because of XYZ but that doesn’t matter. What matters is whether your model misclassifies dogs as cats less often than other models do.

There is less ideological bias

Hopefully I don’t need to convince you that social scientists are overwhemingly on the left of the ideology spectrum. To give you but one number, the Democratic:Republican ratio is 5.6:1 among political scientists. The ideological distribution is so skewed that there are survival guides for non-leftists who want to thrive in academia.

Reviewers being humans, sometimes they dislike your instruments or matching or sample because they dislike your paper’s conclusions. If you are in political science you probably lean left, so you’re unlikely to have seen this happen to one of your papers. But try to argue that Latin America’s left turn in the 2000s has eroded democracy in the region. Boy, brace yourself for some angry reviews. It’s not that the methodological criticisms will be unfounded. It’s just that they would likely not exist if your conclusions were reversed.

In data science, on the other hand, you have Kaggle competitions. Winners and losers are not decided on the basis of subjective evaluations of your methods or sample or theory. Winners and losers are decided on the basis of who gets the higher F-measure. It’s a fair fight. So, it’s not just that debates don’t linger forever (see above), but also that they resolved in a much more rigorous way. Sometimes debates do end in political science - but not always as they should.

You want more error, not less

Political scientists want a good model fit. But as I mentioned before all they have is a training set. They have no notion of prediction or out-of-sample testing, so they end up overfitting. The fit is too good - the model doesn’t generalize.

It’s not that political scientists don’t know about overfitting. They do. But if all your data are training data then how the heck can you assess overfitting?

Political scientists believe that you can avoid overfitting by avoiding kitchen-sink regression. If I include only theoretically relevant regressors then everything should be ok, right?

Except we can always twist theory to accomodate whatever variables we need in order to get what we want. Maybe if I square this variable then I’ll get the proper p-values. And then I come up with some creative explanation of why its effect is not linear. Reviewers (ideally) assess theory consistency, of course, but then we’re back to the subjectivity and bias problems I discussed before.

This was perhaps my biggest methodological shock when I started doing data science. At first I struggled with the idea of a regularization term. What? So you want me to bias my coefficients toward zero? Then Andrew Ng’s machine learning course taught me that there is data you use to train your model and data you use to test it. And then regularization made sense. A bit more error is good.

Programming matters

Political scientists’ code is rigid: each script is meant to produce a pre-determined set of estimates, using a pre-determined dataset. The code always takes the same input and always return the same output. It does one thing and one thing alone. It doesn’t matter much if the code could be faster or prettier: as long as it replicates what you did in your paper you have fulfilled your duty.

Data scientists’ code is flexible: your code needs to accept a variety of inputs and return a variety of outputs. That’s because you’re not writing code for yourself, but for others. For instance: you have data in MySQL and other people in your organization want insights from those data. You then write a web app where people can, say, fit a regression line through some subset of the data. You don’t know what subsets people will choose. And you don’t know what the regression lines will look like. So your code needs to handle many different scenarios. What if the user chooses a non-valid subset? Say, years 2012-2014 when the data end in 2010? What if the user chooses all the data and that overloads the server? What if the regression tool you’re using under the hood (say, R’s lm() function) returns an error because the chosen subset is too small? In short: data scientists’ code has a lot more IF/THEN/ELSE statements than political scientists’ code.

So, programming matters in data science. It’s about both depth and breadth: you need a firmer grasp of basic programming concepts, like conditionals and functions (that’s the depth) and you need to learn web development, SQL, NoSQL, database administration, messaging, server maintenance, security, testing (that’s the breadth).

Some of my political scientist friends would rather poke their own eyes out than endure that sort of technical work. Some even consider technical work to be beneath them. I suppose I understand. You survived Social Origins of Dictatorship and Democracy’s 592-page discussion of regime change, class, and modernization - surely you’re destined to higher purposes. You are here to understand the world, not center HTML tables.

But if you don’t mind getting your hands dirty then programming can be a lot of fun - and at least as intelectually rewarding than political science. With programming there is no “arguing your way out” of anything: either your code works or it doesn’t. And when it doesn’t you have to figure out why and that requires a lot of sinapses. As in political science, it’s largely about hypothesis testing - if the code isn’t working because of XYZ then if I try ABC I should get this result, otherwise I should get that other result. Except that there is a finish line: you’ll never really know what makes democracy work but you’ll eventually figure out which regex matches the string you’re looking for. And you will get to the finish line or die trying - you can’t just declare regex to be a meaningless social construct and move on. You can’t get away with vague wording or selective omissions. The machine is not a journal editor - you can’t bullshit your way through it.

You lose freedom

Wait, don’t quit grad school just yet - there’s a lot to miss about academia. First and foremost, I miss the freedom to work on whatever I wanted to.

Don’t get the wrong idea: I thoroughly enjoy what I’m doing right now (I help automate cartel detection; you know, when companies that should compete against each other decide to shake hands instead). I’m learning a lot and it’s by far the most rewarding job I’ve ever had. But I can’t just wake up tomorrow and decide to work on face recognition for the next six months. I can do whatever I want to do in my spare time, but I still need to show up Mon-Fri and work on cartel detection. Sure, I could find another job. But in academia you don’t need to find another job just because you’ve got a new research interest. You stay in the same job, you just start doing something else. And that is awesome. I’m ok with foregoing that freedom, but I imagine not everyone is.

This is it. If you too are in data science as a refugee from the social sciences I would love to hear from you. How did you get here? How is the new life playing out for you? Do you intend to go back some day?