doing data science when you live in a failed state

Brazil is the undisputed world leader in homicides: over 50 thousand a year, which is more than Europe, Oceania, United States, Russia, and China combined. Yes, combined. Yes, the whole freaking Europe. Yes, the supposedly gun-loving United States. Yes, China with its 1.3 billion people. Brazil beats these continents and countries by 4,473 homicides, which is roughly equivalent to Uganda or to ten Canadas. No, I’m not making these numbers up. Take a moment to let that sink in.

As you might guess, a country with lots of homicides also tends to have lots of robbery. I’d love to take my MacBook Pro to a coffee shop and work there all day like I used to when I was in grad school - back when I lived in lovely, safe, Columbus, Ohio. But if I do that in Brasília I’ll probably come back home empty handed (if I come back home at all). You can’t parade Apple gear around when you live in a failed state.

I finally got tired of working from home all weekend, so I decided to enable SSH and HTTP connections into my home network, so I can use my Mac remotely as if it were an AWS server. That way I can go to the coffee shop with my old, cheap Lenovo - or even a tablet or smartphone - and use it to connect to my MacBook, which will remain safe and sound back home. It took some doing and I imagine others may be going through the same problem (i.e., wanting to work at a coffee shop but living in an episode of The Walking Dead), so here’s a how-to.

My setup is: Humax HG100R-L2 modem (that’s what most clients of NET - Brazil’s largest cable company - have), AirPort Extreme Base Station router, MacBook Pro. Your setup will likely differ, but you can probably tweak the instructions here to fit whatever you have.

step 1: your modem

If you have both a modem and a router then the easiest way to go about this is to put your modem in ‘bridge mode’. That means disabling your modem’s advanced settings and delegating them to your router. That way you only need to worry about router settings. You won’t need to worry about complex interactions between your modem settings and router settings.

Head to http://192.168.0.1/ on your browser. You should see the page below.

If you’ve never changed them, your id and password are ‘admin’ and ‘password’ respectively. Sign in. You should see the following, except with your WiFi network name and password shown under “SSID(2.4GHz)” and “Senha” respectively. (Your password will be shown in plain characters, not as a bunch of dots, so don’t let your neighbors peek.) (Yes, Humax’ settings are in a mix of Portuguese and English. It beats me too.)

Click “Advanced Network Settings” (lower right corner). You should see something like this:

Click on “Definir” (between “Status” and “Back Up”, second column from the left). You should see a page with a bit more options than the following one (that’s because your modem is not in bridge mode yet).

On the “Modo Switch” menu, choose “Bridge”, then click “Aplicar”. Click “ok” on whatever confirmation pop up appears. This will make you go offline for a couple of minutes, as your modem resets itself. Wait until it’s back up online again and voilà, your modem is now in bridge mode.

(If you ever need to tweak your modem settings again, it’s no longer http://192.168.0.1/ but http://192.168.100.1)

step 2: your router

On to your router now. We need to tell it to accept incoming SSH and HTTP connections. In order to do that we need to tell your router to map those types of connections to specific ports.

On your Mac, open the AirPort Utility app.

Click on the AirPort Extreme picture to go into your router’s settings and go to the ‘Network’ tab. You should see something like this:

We’ll make a lot of changes here. First, on the “Router Mode” dropdown menu, choose “DHCP and NAT” if that’s not the chosen value already. Then click the “+” button near “DHCP Reservations”. That will open a small page. You’ll make it look like the one below by selecting the exact same choices. (To do that you’ll need to know your MAC address, which you can find out in your Mac by going into “System Preferences”, “Network”, “Advanced”; it’s the combination of digits you see right next to “Wi-Fi Address”.) When everything matches, click “Save”.

Now you’re back to this:

Click the “+” button near “Port Settings”. A small page will pop up. Tweak all the fields so that it looks exactly like this:

Click “Save”. Then click the “+” button near “Port Settings” again. The same small page will pop up. Make it look exactly like this:

Click “Save”. Then click “Update”. Your router will go crazy for a moment as it does its magic. Wait until it comes back up online and voilà, you have allowed SSH and HTTP connections into your home network. SSH connections will be forwarded to port 22 and HTTP connections will be forwarded to port 8080.

step 3: your Mac

This part is simple. Go to “System Preferences”, “Sharing”, and enable Remote Login:

If your firewall is active then you need to tell it to allow incoming traffic through ports 22 and 8080. This can be a bit tricky and it depends on your OS version. This may help. Alternatively, you can take the lazy and insecure path of simply disabling your firewall altogether (“System Preferences”, “Security and Privacy”, “Firewall”).

step 4: your IP address

You need to know your home network’s public IP address (the one your router shows to the outside world) so you can reach your MacBook from the outside. This should tell you. Write it down.

My experience with NET in Brazil (and with TimeWarnerCable in the US) is that IP addresses don’t change that often. But they do sometimes. If that bothers you you may ask that your cable provider give you a static IP address (they may charge a small fee for that). (EDIT: alternatively, you can use a Dynamic DNS service - like this; h/t Thompson Marzagão.)

step 5: your coffee shop

Take whatever cheap, inconspicuous piece of hardware you have at hand to your favorite coffee shop. Launch a terminal and do ssh myusername@myipaddress, where myusername is the username you normally use to log into your Mac and myipaddress is the IP address you wrote down in step 4. Enter your password and that’s it, you are now inside your Mac. You can cd into different directories, run code, do whatever you want.

If your coffee shop hardware is a tablet or smartphone, Termius is a terrific SSH client for mobile devices.
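
If the ssh command just hangs, it helps to check whether the forwarded ports answer at all before blaming your password. Here's a minimal sketch using only Python's standard library. The function name port_is_open is made up for this example, and you'd call it with the IP address from step 4.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, host unreachable, or name not resolved
        return False

# Usage (myipaddress is the placeholder from step 4, not a real address):
# for port in (22, 8080):
#     print(port, port_is_open('myipaddress', port))
```

A False for port 22 usually means the problem is in the router or firewall settings (steps 2 and 3), not in SSH itself. Port 8080 will only answer while something (like the Jupyter server in step 6) is actually listening on it.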

step 6 (optional): your data science

Wondering why I made you enable HTTP connections? Well, here comes the really fun part: Jupyter notebooks. You can start a Jupyter server in your Mac and then, with your coffee shop cheapoware, use your browser to write code interactively and have it run on your Mac. Jupyter’s default language is Python but you can install kernels for an increasingly large number of languages, like R and Julia.

On your Mac, do pip install jupyter to install Jupyter and then do jupyter notebook --ip='0.0.0.0' --port='8080' --no-browser to start the Jupyter server. You’ll be given a url. Something like http://0.0.0.0:8080/?token=sfdsfs90809809s8dfs0df8sdf. Replace 0.0.0.0 by myipaddress (see step 4). That’s the address you’ll use at the coffee shop to launch Jupyter notebooks.
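
If you'd rather not edit that url by hand every time, the swap can be scripted. This is just a sketch with the standard library. public_jupyter_url is a name I made up, and 203.0.113.7 is a placeholder address standing in for myipaddress.

```python
from urllib.parse import urlsplit, urlunsplit

def public_jupyter_url(local_url, public_ip):
    """Replace the host in the url Jupyter prints, keeping port, path, and token."""
    parts = urlsplit(local_url)
    netloc = public_ip
    if parts.port is not None:
        netloc = '{}:{}'.format(public_ip, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(public_jupyter_url('http://0.0.0.0:8080/?token=abc123', '203.0.113.7'))
# http://203.0.113.7:8080/?token=abc123
```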

(If your cheapoware is a laptop things should work right out-of-the-box. If it’s an iOS device then you have some additional steps to take - see here.)

step 7: your venti caramel macchiato

That’s it! You have now reduced your likelihood of getting mugged and minimized your losses in case you do get mugged. Time to grab your katana and go mingle with the locals.

using deep learning to detect fake exports

New paper. Abstract:

Normally exports of goods and products are transactions encouraged by the governments of countries. Typically these incentives are promoted by tax exemptions or lower tax collections. However, exports fraud may occur with objectives not related to tax evasion, for example money laundering. This article presents the results obtained in implementing the unsupervised Deep Learning model to classify Brazilian exporters regarding the possibility of committing fraud in exports. Assuming that the vast majority of exporters have explanatory features of their export volume which interrelate in a standard way, we used the AutoEncoder to detect anomalous situations with regards to the data pattern. The databases used in this work come from exports of goods and products that occurred in Brazil in 2014, provided by the Secretariat of Federal Revenue of Brazil. From attributes that characterize export companies, the model was able to detect anomalies in at least twenty exporters.

Django for Flask users

I’m using Django for a serious project for the first time. I had played with Django a couple of times before, but I’m a long-time Flask fanboy and I invariably gave up in frustration (“why would anyone ever need separate files for settings, urls, and views?!”). Well, turns out Django is pretty cool if you want to put a bunch of apps under the same umbrella. Now, the official tutorial is a bit too verbose if you’re impatient. And if you’re used to Flask’s minimalism, you will get impatient with Django at times. So, here are a few potentially useful pointers (largely for my own future consultation).

To get started, just pip install Django, run django-admin startproject mysite, then run python manage.py startapp myapp. (Replace mysite and myapp by whatever names you want.) This should create the essential files and directories you’ll need.

making urls work

In Flask you create your views and map your urls all at once:

@app.route('/')
def index():
    return 'Hello World!'

This is about as simple as it gets (unless you want to get really minimalist).

In Django you can’t do that. You have to define your views in one place and map your urls elsewhere. The usual way to do it is to define your views in your (aptly named) myapp/views.py file, like this:

from django.http import HttpResponse

def index(request):
    return HttpResponse('Hello World!')

Unlike in Flask you can’t just do return 'Hello World!' - the returned object cannot be a string, so we need to import HttpResponse. Also unlike in Flask, we must feed the request to the function - there is no global request object in Django, so we need to pass it around explicitly (more about this in a moment).

Now on to mapping urls. This requires changing two different files. The first is your mysite/urls.py file, wherein you’ll put this:

from django.conf.urls import include, url

urlpatterns = [url(r'^myapp/', include('myapp.urls'))]

This piece of code tells mysite (the big project inside which your various apps will live) to defer to myapp (one of your various apps) whenever someone hits http://blablabla/myapp/. (That r'^myapp/' thing is a regular expression that matches any url path that starts with myapp/.)

So, mysite/urls.py is a big dispatcher: it’ll check the url and send the request to the appropriate app. Here we only have one app (myapp), but if you’re using Django you’ll likely have several apps, in which case the urlpatterns list will contain several url() objects.

Now, myapp must be prepared to receive the baton. For that to happen your myapp/urls.py file (not your mysite/urls.py file!) must look like this:

from django.conf.urls import url
from . import views

urlpatterns = [url(r'^$', views.index, name = 'index')]

Here we have another regex: r'^$'. It matches only when nothing is left of the url after myapp/, so it captures requests for http://blablabla/myapp/ itself. (Django chops off the part that mysite/urls.py already matched before handing the rest over, so you don’t need to repeat myapp/ in the regex here.) We’re telling myapp that any such requests should be handled by the view function named index - which you defined before, in your myapp/views.py file (see above).

So, myapp/urls.py is a secondary dispatcher: it’ll check the url and send the request to the appropriate view. Here we only have one view (the app’s index page), but in real life you’ll have several views, in which case the urlpatterns list will contain several url() objects.
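
If the two-level dispatching still feels abstract, here's a toy imitation in plain Python. To be clear, this is not Django's actual resolver, just a sketch of the idea: the names resolve, mysite_urlpatterns, and myapp_urlpatterns are made up, and the patterns are plain (regex, target) tuples instead of url() objects.

```python
import re

def index(request):
    return 'Hello World!'

# stand-in for myapp/urls.py: each regex maps to a view function
myapp_urlpatterns = [(r'^$', index)]

# stand-in for mysite/urls.py: each regex maps to an app's list of patterns
mysite_urlpatterns = [(r'^myapp/', myapp_urlpatterns)]

def resolve(path, patterns):
    """Return the view function for `path`, or None if nothing matches."""
    for regex, target in patterns:
        match = re.search(regex, path)
        if match is None:
            continue
        if callable(target):
            return target  # we reached a view
        # chop off the part that matched and let the app handle the rest
        found = resolve(path[match.end():], target)
        if found is not None:
            return found
    return None

print(resolve('myapp/', mysite_urlpatterns) is index)  # True
print(resolve('myapp/nope', mysite_urlpatterns))       # None
```

Note that the path has no leading slash, which is why the regexes don't start with one either.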

That’s it. If you run python manage.py runserver and then open http://127.0.0.1:8000/myapp/ in your browser you should be greeted by the Hello World! message.

If you really want to you can have a single-file Django project: check this. But if your project is so small that you can have a single file then maybe you’d be better off using Flask or CherryPy or some other minimalist web framework.

request and session

Accessing request and session data in Flask is a no brainer. There is a global request object and a global session object and, well, you just do whatever you want to do with them.

from flask import request
from flask import session

@app.route('/', methods=['GET', 'POST'])  # Flask routes only accept GET unless told otherwise
def hello():
    if request.method == 'POST':
        user_input = request.form['user_input']
        session['foo'] = 'bar'
    elif request.method == 'GET':
        session['foo'] = 'macarena'
    return session['foo']

In Django, as I mentioned before, there is no global request object - you need to explicitly pass request around to work with it. There is no global session object either. Instead, session is an attribute of request. This is how the above snippet translates into Django:

from django.http import HttpResponse

def hello(request):
    if request.method == 'POST':
        user_input = request.POST['user_input']
        request.session['foo'] = 'bar'
    elif request.method == 'GET':
        request.session['foo'] = 'macarena'
    return HttpResponse(request.session['foo'])

So, session becomes request.session and request.form becomes request.POST.

templates

You must tell Django where to look for templates. Open mysite/settings.py, locate the TEMPLATES list and edit DIRS.

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'DIRS': ['/path/to/my/templates/folder/',
                 '/path/to/my/other/templates/folder/'],
        'APP_DIRS': True,
        'OPTIONS': {
            # ...
        },
    },
]

There are a few syntax differences between Jinja2 (Flask’s default templating language) and Django’s templating language (DTL). For instance, to access the first element of mylist it’s {{ mylist[0] }} in Jinja2 but {{ mylist.0 }} in DTL. But most of the syntax is identical. Template inheritance works the same way, with {% extends 'parent.html' %} and {% block blockname %}{% endblock %}. Same with loops, if/elses, and variables: {% for bla in blablabla %}{% endfor %}, {% if something %}{% elif somethingelse %}{% else %}{% endif %}, {{ some_variable }}. If you’re porting something from Flask to Django there is a chance your templates will work just as they are.

You need to change your views though. In Flask you render a template and pass variables to it like this:

from flask import render_template

@app.route('/')
def hello():
    return render_template('mytemplate.html', 
                           some_var = 'foo', 
                           other_var = 'bar')

In Django you do it like this:

from django.shortcuts import render

def hello(request):
    return render(request,
                  'mytemplate.html', 
                  {'some_var': 'foo',
                   'other_var': 'bar'})

So, in Django you must pass the request object to render the template. And your template variables must be passed as a dict.

connections

In both Flask and Django you can use something like pyodbc or pymssql to connect to your databases. But you can put a layer of abstraction on top of that. In Flask there is Flask-SQLAlchemy. Here’s their quickstart snippet:

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:////tmp/test.db'
db = SQLAlchemy(app)

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True)
    email = db.Column(db.String(120), unique=True)

    def __init__(self, username, email):
        self.username = username
        self.email = email

    def __repr__(self):
        return '<User %r>' % self.username

In Django the connection and the models go into separate scripts. You set up the connection by adding an entry to the DATABASES dict in your mysite/settings.py file:

DATABASES = {

    # ... your other db connections ...

    'my_database_name': {
        'ENGINE': 'django.db.backends.sqlite3', 
        'NAME': 'my_database_name',
        'USER': 'my_username',
        'PASSWORD': 'my_password',
        'HOST': 'my.host.address',
        'PORT': 'my_port'}
    }

Then, in your myapp/models.py, you define your models.

from django.db import models

class User(models.Model):
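    # a field named id must be the primary key (Django adds one automatically if you omit it)
    id = models.IntegerField(primary_key = True)
    username = models.CharField(max_length = 80, unique = True)
    email = models.CharField(max_length = 120, unique = True)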

You don’t have to use any models though. If you prefer to run raw SQL queries you can do it like this:

from django.db import connections

cursor = connections['my_database_name'].cursor()
cursor.execute('SELECT * FROM sometable')
results = cursor.fetchall()

Just as you would do with pyodbc (except that here you don’t need to .commit() after every database modification).
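
That cursor follows Python's standard DB-API, so you can try the pattern without any Django project at all. Here's a sketch using the sqlite3 module from the standard library with an in-memory database. The rows are made up, and two things differ from the Django version: sqlite3 uses ? as its parameter placeholder (Django's cursor uses %s) and here you do have to commit yourself.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database, nothing touches the disk
cursor = conn.cursor()

cursor.execute('CREATE TABLE sometable (username TEXT, email TEXT)')
cursor.executemany(
    'INSERT INTO sometable VALUES (?, ?)',
    [('ana', 'ana@example.com'), ('bob', 'bob@example.com')],
)
conn.commit()

# pass user-supplied values as parameters, never by string formatting
cursor.execute('SELECT username, email FROM sometable WHERE username = ?', ('ana',))
results = cursor.fetchall()
print(results)  # [('ana', 'ana@example.com')]
```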

afterwards

I’m just trying to get you past the initial rage over all the boilerplate code Django requires. :-) This is all just about syntax - I’m merely “translating” Flask to Django. If you’re serious about Django you should invest some time in learning Django’s semantics. Their official tutorial is a good place to start. Have fun!

I need to use Google App Engine to text my girlfriend

This is the story of how I had to build and deploy a freaking app just so I can text my girlfriend when I’m at the office. Perhaps it’ll help others who are also subject to the arbitrary rules of IT departments everywhere. (Dilberts of the world, unite!)

For some two years now my messaging app of choice has been Telegram. It’s lightweight, end-to-end encrypted, well designed, and free; it’s impossible not to love it. Now, I hate typing on those tiny on-screen keyboards, so most of the time what I actually use is Telegram’s desktop app. Problem is, I can’t use it when I’m at work. My organization’s IT department blocks access to Telegram’s servers (don’t ask). I can install the app, but it doesn’t connect to anything; it can’t send or receive messages.

So, I looked into Telegram’s competitors. I tried WhatsApp, but its desktop version is blocked as well at my organization. And in any case I tried it at home and it’s sheer garbage: the desktop app needs your phone to work (!) and it crashes every ~15 minutes. (I keep pestering my friends to switch from WhatsApp to Telegram but WhatsApp is hugely popular in Brazil and network externalities get in the way.)

Then it hit me: why not Slack? The IT department doesn’t block it and I already use Slack for professional purposes. Why not use it to talk to my girlfriend too? I created a channel, got her to sign up, and we tried it for a couple of days.

Turns out Slack solved the desktop problem at the cost of creating a mobile problem. I don’t have any issues with Slack’s web interface - I keep my channels open on Chrome at all times and that works just fine. But when I switch to mobile… boy, that’s one crappy iOS app. Half the time it just doesn’t launch. Half the time it takes forever to sync. Granted, my iPhone 5 is a bit old. But the Telegram iOS app runs as smooth and fast as it did two years ago, so the hardware is not at fault here.

As an aside, turns out Slack’s desktop app is also ridiculously heavy. I don’t really use it (I use Slack’s web interface instead), but that’s dispiriting nonetheless.

I tried Facebook’s Messenger. Blocked. I tried a bunch of lesser-known alternatives. Blocked.

Eventually I gave up on trying different messaging apps and asked the IT department to unblock access to Telegram’s servers. They said no - because, well, reasons. (In the words of Thomas Sowell, “You will never understand bureaucracies until you understand that for bureaucrats procedure is everything and outcomes are nothing”.)

The IT guys told me I could appeal to a higher instance (some committee or another), but I’ve been working in the government for a while and I’ve learned to pick my fights. Also, I believe in Balaji Srinivasan’s “don’t argue” policy.

So, I rolled up my sleeves and decided to build my own solution.

I don’t need to build a full-fledged messaging app. What I need is extremely simple: a middleman. Something that serves as a bridge between my office computer and Telegram’s servers. I need a web app that my office computer can visit and to which I can POST strings and have those strings sent to my girlfriend’s Telegram account.

That app needs to be hosted somewhere, so the first step is choosing a platform. I briefly considered using my personal laptop for that, just so I didn’t have to deal with commercial cloud providers. But I worry about exposing to the world my personal files, laptop camera, browser history, and the like. Also, I want 24/7 availability and sometimes I have to bring my laptop to the office.

I settled on Google App Engine. I used it before (to host an app that lets people replicate my Ph.D. research) and I liked the experience. And, more importantly, it has a free tier. GAE has changed quite a bit since the last time I used it (early 2014), but it has an interactive tutorial that got me up to speed in a matter of minutes.

You can choose a number of programming languages on GAE. I picked Python because that’s what I’m fastest at. (In hindsight, perhaps I should’ve used this as a chance to learn some basic Go.)

Instead of starting from scratch I started with GAE’s default “Hello, world!” Python app. The underlying web framework is Flask. That’s my go-to framework for almost all things web and that made things easier. Using Flask, this is how you control what happens when a user visits your app’s homepage:

# this is all in the main.py file of GAE's default "Hello, world!" Python app
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello, world!'

I don’t want a static webpage though, I want to communicate with Telegram’s servers. In order to do that I use a Python module called telepot. This is how it works: you create a Telegram bot account and then you use telepot to control that bot. (In other words, the sender of the messages will not be you, it will be the bot.)

When you create your bot you receive a token credential, which you will then pass to telepot.

import telepot
bot = telepot.Bot('YOUR_TOKEN')
bot.getMe()

You can now make your bot do stuff, like sending messages. Now, Telegram enforces a sort of Asimovian law: a bot cannot text a human unless it has been texted by that human first. In other words, bots can’t initiate conversations. So I created my bot, told my girlfriend its handle (@bot_username), and had her text it. That message (like all Telegram messages) came with metadata (see here), which included my girlfriend’s Telegram ID. That’s all I need to enable my bot to text her.

girlfriend_id = 'SOME_SEQUENCE_OF_DIGITS'
bot.sendMessage(girlfriend_id, 'How you doing?')
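
In case you're wondering how to fish the ID out of that metadata: each incoming message arrives as a nested structure in which the sender sits under the message's from key. Here's a sketch that assumes a list of update dicts shaped like what the Telegram Bot API's getUpdates method returns. The function name sender_ids and the sample values are made up for illustration.

```python
def sender_ids(updates):
    """Map each sender's first name to their Telegram user ID."""
    ids = {}
    for update in updates:
        message = update.get('message')
        if message is None:
            continue  # other kinds of updates (e.g., edited messages)
        sender = message.get('from')
        if sender is None:
            continue  # some messages (e.g., channel posts) carry no sender
        ids[sender.get('first_name', '?')] = sender['id']
    return ids

# made-up sample in the shape of a Bot API update
sample_updates = [{
    'update_id': 1,
    'message': {
        'message_id': 10,
        'from': {'id': 123456789, 'is_bot': False, 'first_name': 'Alice'},
        'chat': {'id': 123456789, 'type': 'private', 'first_name': 'Alice'},
        'date': 1500000000,
        'text': 'hi bot',
    },
}]

print(sender_ids(sample_updates))  # {'Alice': 123456789}
```

In a private chat the chat ID and the sender's ID are the same number, so either one works as the first argument to sendMessage.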

Now let’s merge our web app code and our telepot code in our main.py file:

import telepot
from flask import Flask
app = Flask(__name__)

bot = telepot.Bot('YOUR_TOKEN')
bot.getMe()
girlfriend_id = 'SOME_SEQUENCE_OF_DIGITS'

@app.route('/')
def textGirlfriend():
    bot.sendMessage(girlfriend_id, 'How you doing?')
    return 'message sent!'

(This can be misused in a number of ways. You could, say, set up a cron job to text ‘thinking of you right now!’ to your significant other at certain intervals, several times a day. Please don’t.)

The rest of the default “Hello, world!” Python app remains the same except for two changes: a) you need to install telepot; use pip install with the -t option to specify the lib directory in your repository; and b) you need to add ssl under the libraries header of your app.yaml file.

So, I created a web app that my IT department does not block and that texts my girlfriend when visited. But I don’t want to text ‘How you doing?’ every time. So far, the app doesn’t let me choose the content of the message.

Fixing that in Flask is quite simple. We just have to: a) add a text field to the homepage; b) add a ‘submit’ button to the homepage; and c) tell the app what to do when the user clicks ‘submit’. (We could get fancy here and create HTML templates but let’s keep things simple for now.)

import telepot
from flask import Flask
from flask import request # so that we can get the user's input
app = Flask(__name__)

bot = telepot.Bot('YOUR_TOKEN')
bot.getMe()
girlfriend_id = 'SOME_SEQUENCE_OF_DIGITS'

@app.route('/')
def getUserInput():
    return '<form method="POST" action="/send"><input type="text" name="msg" size="150"><br><input type="submit" value="submit"></form>'

@app.route('/send', methods = ['POST'])
def textGirlfriend():
    bot.sendMessage(girlfriend_id, request.form['msg'])
    return 'message sent!'

And voilà, I can now web-text my girlfriend.

Yeah, I know, that would hardly win a design contest. But it works.

This is where I’m at right now. I did this last night, so there is still a lot of work ahead. Right now I can send messages this way, but if my girlfriend simply hits ‘reply’ her message goes to the bot’s account and I just don’t see it. I could have the app poll the bot’s account every few seconds and alert me when a new message comes in, but instead I think I’ll just create a Telegram group that has my girlfriend, myself, and my bot; I don’t mind reading messages on my phone, I just don’t like to type on my phone. Another issue is that I want to be able to web-text my family’s Telegram group, which means adding radio buttons or a drop-down menu to the homepage so I can choose between multiple receivers. Finally, I want to be able to attach images to my messages - right now I can only send text. But the core is built; I’m free from the tyranny of on-screen keyboards.
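
For the multiple receivers part, this is roughly what I have in mind. It's only a sketch, not something I've deployed: RECEIVERS, build_form, and pick_chat_id are hypothetical names, and the IDs are placeholders.

```python
from html import escape

# hypothetical receivers: label shown in the menu -> Telegram chat ID (placeholders)
RECEIVERS = {
    'girlfriend': 'SOME_SEQUENCE_OF_DIGITS',
    'family': 'ANOTHER_SEQUENCE_OF_DIGITS',
}

def build_form(receivers):
    """Build the homepage form with a drop-down menu of receivers."""
    options = ''.join(
        '<option value="{0}">{0}</option>'.format(escape(name, quote=True))
        for name in receivers
    )
    return ('<form method="POST" action="/send">'
            '<select name="to">' + options + '</select><br>'
            '<input type="text" name="msg" size="150"><br>'
            '<input type="submit" value="submit"></form>')

def pick_chat_id(receivers, chosen):
    """Return the chat ID for the chosen label, or None if the label is unknown."""
    return receivers.get(chosen)
```

The homepage view would return build_form(RECEIVERS) and the /send view would call pick_chat_id(RECEIVERS, request.form['to']) before sending. Only the labels go to the browser; the chat IDs stay on the server.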

This is it. In your face, IT department.

text-analyzing Brazilian music

People have text-analyzed American songs to exhaustion. We’ve learned, among other things, that pop music has become dumber and that Aesop Rock beats Shakespeare in vocabulary size. I thought it would be fun to make similar comparisons for Brazilian artists. Alas, as far as I could google no one has done that yet. So, I did it myself. Here’s my report.

the data

Our data source is Vagalume, from which I scraped the lyrics. (I give all the code in the end.) In total we have eight genres, 576 artists, and 77,962 lyrics (that’s after eliminating artists whose combined lyrics summed up to less than 10,000 words; more on this later). These are the genres:

  • Rock.
  • Rap.
  • Samba. The music they play at the big Carnival parade in Rio.
  • Pagode. Derived from samba but more popular, especially in the periphery of Rio de Janeiro.
  • MPB. That slow, relaxing music you often hear in elevators.
  • Sertanejo. Country music meets salsa.
  • Axé. Think Macarena, but with racier choreographies. This is what they play at the Carnival parades in Bahia.
  • Forró. Like salsa, but faster.

Here’s how the data are distributed:

(The genre assignment - which is Vagalume’s, not mine - is not exclusive: some artists belong to more than one genre. In these cases I arbitrarily picked one of the genres, to avoid double counting in the pie charts.)

how to measure vocabulary size?

At first glance vocabulary size is a simple metric: the number of (unique) words each artist uses. But some artists have only 2-3 songs while others have 400+ songs. Naturally the more lyrics you have (and the longer your lyrics) the more unique words you will tend to use (up to the point where you exhaust your vocabulary).

Can’t we divide the number of unique words by the total number of words, for each artist? No, we can’t. Say you know a total of 10,000 words and you have written 100 lyrics with 100 words each, without ever repeating a word. In this case your “unique words / total words” ratio is 1. But if you suddenly write another batch of 100 lyrics with 100 words each your ratio will fall to 0.5 - even though your vocabulary has remained the same. No good.
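
Here's that arithmetic in code, in case you want to play with the numbers. The vocabulary is synthetic (word0, word1, ...) and unique_ratio is just a name I picked for the example.

```python
def unique_ratio(words):
    """Unique words divided by total words."""
    return len(set(words)) / len(words)

# a made-up vocabulary of 10,000 distinct words
vocabulary = ['word{}'.format(i) for i in range(10000)]

first_batch = list(vocabulary)   # 100 lyrics x 100 words, no word repeated
second_batch = list(vocabulary)  # another 100 lyrics drawn from the same vocabulary

print(unique_ratio(first_batch))                 # 1.0
print(unique_ratio(first_batch + second_batch))  # 0.5
```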

So, one measure (unique words) is overly kind to prolific artists while the other measure (unique words / total words) is overly kind to occasional artists. To get around that I: a) discard all artists whose combined lyrics sum up to less than 10,000 words (that’s how we got to 576 artists and 77,962 lyrics, down from 809 artists and 95,851 lyrics); b) for each non-discarded artist, randomly select 1,000 samples of 10,000 words each and compute the average number of unique words. In other words, I truncate my data and then use bootstrapped samples.
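
In code, the procedure looks more or less like the sketch below. It is a simplified illustration, not the exact script (the full code comes at the end). The function name mean_vocabulary is made up here, and I'm assuming the samples are drawn with replacement, which is what random.choices does.

```python
import random

def mean_vocabulary(words, sample_size=10000, n_samples=1000, seed=None):
    """Average number of unique words across random samples of fixed size."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        # assumption: sampling with replacement
        sample = rng.choices(words, k=sample_size)
        total += len(set(sample))
    return total / n_samples

# Usage, where artist_words is the list of all words in an artist's lyrics:
# if len(artist_words) >= 10000:   # discard artists below the threshold
#     print(mean_vocabulary(artist_words))
```

Because every artist is measured on samples of the same size, prolific and occasional artists end up on equal footing.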

One final issue before we dive into the data. Naturally, the person singing the song is not necessarily the person who wrote the song. So when I talk about artist XYZ’s vocabulary that’s not really his or her personal vocabulary; that’s just short for “the vocabulary we find in the songs that artist XYZ sings”, which is too cumbersome to say. (And heck, no one is forcing people to go on stage and sing dumb songs. If they do it then they should bear the reputational cost.)

Enough talk, let’s see the data.

mandatory descriptive statistics

The mean vocabulary is 1,758 words and the 95% confidence interval is [1,120, 2,458]. N = 576. Here’s the distribution:

the 30 largest vocabularies

Ok, time to see the winners (mouse over each bar to see the corresponding number):

What do you know! The largest vocabulary of Brazilian music is that of a rap band: 2,961 words. That’s about 1.7 times the mean (1,758) and about 3.7 times the smallest vocabulary (804). Not bad. Street music beats highfalutin MPB: Chico Buarque would need to learn some 200 new words to reach Facção Central.

Rap didn’t just take the first prize: rap bands occupy 16 of the top 30 positions. And that’s despite rap representing only 7% of our lyrics. Rap folks, take a bow.

Here, check Diário de um Detento, by Racionais Mc’s. It’s one of their most famous songs. And it’s pretty representative of Brazilian rap music, as it talks about life in the periphery of the big cities - violence, drugs, poverty, etc.

Another big surprise (to me at least) was sertanejo, which occupies 4 of the top 30 positions. When I think about sertanejo what comes to mind are cheesy duos like Zezé di Camargo & Luciano or Jorge & Mateus. The sertanejo singers we see above were completely unfamiliar to me.

I googled around and turns out these are all old-timers, mostly retired. So, it seems that what has now degenerated into tche tcherere tche tche was once a vocabulary-rich genre. (Unfortunately Vagalume doesn’t have lyrics’ release dates, so we can’t do a proper time series analysis.)

Same with Luiz Gonzaga: when I think forró I imagine tacky bands like Banda Calypso or Aviões do Forró. But Luiz Gonzaga is an old-timer. So, yet another vocabulary-rich genre has degenerated - this time into “suck it ‘cause it tastes like grapes”.

Rap, vintage sertanejo, and Luiz Gonzaga aside, what we see above are MPB’s biggest stars. These are world-renowned artists with long and established careers. You won’t find their songs among Spotify’s top 50, but they have their public. (Fittingly, Chico Buarque is a distant relative of Aurélio Buarque de Hollanda, the lexicographer who edited the most popular dictionary of Brazilian Portuguese.) I find most MPB mind-numbingly boring but here’s a playlist of Chico Buarque’s songs, so you can check for yourself.

One absence is noteworthy: rock. I was sure folks like Raimundos and Skank would come out on top. Alas, I was wrong. Raimundos’ vocabulary is 2,239 and Skank’s, 2,246 - they rank 81st and 79th respectively. Not exactly bad, but far from top 30. (I keep hearing that Brazilian rock is dead; maybe that’s true after all.) Anyway, here’s a Raimundos song, in case you’ve never had a taste of Brazilian rock before.

Here are the most frequent words of the top 30 artists:

No clear dominant theme here. Life (vida), other (outro), brother (irmão), water (água), path (caminho), moment (momento), soul (alma), hour (hora), dream (sonho), star (estrela), son (filho), hand (mão), night (noite), word (palavra), time (tempo), year (ano), stone (pedra), eye (olho), father (pai).

the 30 smallest vocabularies

Now on to the losers:

Sertanejo is by far the most frequent genre here: it takes up 12 of the bottom 30 positions. I can’t say I’m shocked. (I googled around and these are all contemporary sertanejo artists; no old-timers here.)

I can hear sertanejo fans complaining: “wait, maybe it’s just that sertanejo is overrepresented in the data” (though, given what we just saw, one wonders whether many sertanejo fans would know the word “overrepresented”; or “data”). Indeed, sertanejo accounts for 28% of our lyrics - by far the largest piece of the pie. But then how come not a single contemporary sertanejo artist appears in the top 30? Sorry, sertanejo fans: it’s time to acknowledge the misery of your musical taste and look for something better (the top 30 above might be a good place to start).

Rock is the second most represented category here. A bit of a surprise to me. I mean, fine, there isn’t a single rock artist in the top 30; but dammit, must rock also account for a fifth of the bottom 30? If Brazilian rock isn’t dead yet then maybe it’s time we euthanize it.

The rest is forró and pagode, about which no one could seriously have had high expectations (if you did then your musical taste is beyond hope; just give up on music and download some podcasts instead).

I expected axé to be in the bottom 30. Maybe Carnival in Bahia is not the nightmarish experience I picture after all.

The amplitude of the spectrum is large. The average of the top 30 vocabularies is 2.6 times larger than the average of the bottom 30 vocabularies. Maybe this will help you grasp the abyss: the combined vocabulary of Victor & Vinicius, Abril, Lipstick, Agnela, Lucas & Felipe, Forró Lagosta Bronzeada, Leva Nóiz, Hevo84, Forró Boys, Drive, Cacio & Marcos, Marcos & Claudio, Banda Djavú, Renan & Ray, Raffael Machado, TNT, Sambô, and Roberta Campos (after accounting for the words that they all have in common) is still smaller than the vocabulary of Racionais Mc’s alone.

In short, it’s official: Brazilian popular music is garbage. And now we have data to back up that claim.

If I’m allowed a short digression, the problem is not just the poverty of the vocabulary but the poverty of the underlying sentiments and ideas as well. Let me give you a taste. What you see below is Michel Teló’s “Oh, if I catch you” (I translated it for your benefit):

Wow. Wow.
This way you’ll kill me.
Oh, if I catch you.
Oh, if I catch you.

Delicious. Delicious.
This way you’ll kill me.
Oh, if I catch you.
Oh, if I catch you.

Saturday in the club.
People started dancing.
Then the prettiest girl passed by.
I got bold and went to talk to her.

(repeat)

So: guy is in the club, sees pretty girl, goes to talk to her. Next to Michel Teló, Nicki Minaj is Chaucer.

No, I didn’t pick some little-known, abnormally bad outlier song just to make things appear worse than they are: “Oh, if I catch you” reached #1 in 23 European and Latin American countries. There is even a version by The Baseballs, if you can believe that. Michel Teló is export-grade garbage.

Here are the most frequent words of the bottom 30 artists:

Romantic words are a lot more frequent here than in the top 30: kiss (beijo), love (amor), hurt/hurts (dói), together (junto), “missing someone/something” (saudade; kinda hard to translate this one), heart (coração), and so on. We also see that “thing” (coisa) is possibly the most frequent word here, which tells us that these artists are not even trying (they have no concept of le mot juste).

the middle of the scale

The extremes tell a pretty coherent story: rap, MPB, and vintage sertanejo on one end, forró, pagode, and contemporary sertanejo on the other. But things get fuzzier as we move away from the extremes. Here are some surprises:

  • Chiclete com Banana, an axé band whose best-known chorus is “aê aê aê aê aê”, beats MPB god, Grammy-winner, Girl from Ipanema co-author Tom Jobim.
  • Daniel, a corny sertanejo singer, beats Maria Rita and Toquinho, two revered favorites of MPB fans.
  • Legião Urbana, a mediocre rock band that is insufferably popular in Brasília (where I live), only appears in the 188th position - behind forró band Mastruz com Leite.
  • And my favorite finding: Molejo, Aviões do Forró, and Wesley Safadão, respectively the trashiest pagode band, forró band, and sertanejo singer of all time, all beat Grammy-winner, celebrated MPB singer and composer Maria Gadú.

Here’s the whole data if you’re interested.

to do

There is a lot more we could do with these data. For instance, we could do some sentiment analysis. There is no word->sentiment dictionary in Portuguese, but we could use automatic translation and then use SentiWordNet. I suspect that axé and sertanejo will be at opposite ends of the happy-sad spectrum.

We could also use cosine similarity to check how “repetitive” each genre is. To me all axé songs sound the same, so I suspect there is little textual variation in them. We could, for each genre, take each possible pair of songs and compute the average cosine similarity of all pairs.
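Here is a minimal sketch of that idea in Python 3, using only the standard library. It assumes plain bag-of-words counts and whitespace tokenization (the vocabulary script uses NLTK's tokenizer instead), and the function names and toy lyrics are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty lyric is treated as similar to nothing
    return dot / (norm_a * norm_b)


def average_pairwise_similarity(lyrics):
    """Mean cosine similarity over every pair of lyrics in a list."""
    bags = [Counter(text.lower().split()) for text in lyrics]
    pairs = list(combinations(bags, 2))
    if not pairs:
        return 0.0  # fewer than two lyrics: nothing to compare
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# toy lyrics, invented for this example
repetitive = ['ai ai ai', 'ai ai', 'ai ai ai ai']
varied = ['vida caminho pedra', 'noite estrela sonho', 'tempo palavra alma']
print(average_pairwise_similarity(repetitive))
print(average_pairwise_similarity(varied))
```

The number of pairs grows quadratically with the number of songs, so for a genre with thousands of lyrics it would probably make sense to average over a random sample of pairs instead of all of them.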

Just because you have a large vocabulary doesn’t mean you pick the right words. A plausible observable implication of careful word choice is the use of rarer, lesser-known words. So it might be worthwhile to compute the average inverse document frequency (IDF) of each artist and compare them.
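A rough sketch of how that could look in Python 3, under a few assumptions of mine: each lyric counts as one document, IDF is the plain log(N / document frequency) variant with no smoothing, and tokenization is a simple whitespace split. The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def average_idf_per_artist(lyrics_by_artist):
    """Map each artist to the mean IDF of the distinct words they use.

    lyrics_by_artist: dict of artist name -> list of lyrics (strings).
    Each lyric is treated as one document.
    """
    documents = [set(text.lower().split())
                 for lyrics in lyrics_by_artist.values()
                 for text in lyrics]
    n_docs = len(documents)

    # document frequency: in how many lyrics does each word appear?
    doc_freq = defaultdict(int)
    for doc in documents:
        for word in doc:
            doc_freq[word] += 1
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

    averages = {}
    for artist, lyrics in lyrics_by_artist.items():
        vocab = set(w for text in lyrics for w in text.lower().split())
        if vocab:  # skip artists with no words at all
            averages[artist] = sum(idf[w] for w in vocab) / len(vocab)
    return averages


# toy data, invented for this example
toy = {'artist A': ['amor amor coisa', 'amor beijo'],
       'artist B': ['amor pedra', 'caminho estrela']}
print(average_idf_per_artist(toy))
```

A higher average means the artist leans on words that few lyrics use. With real data it may be worth dropping words that appear in only one lyric, since typos and onomatopoeia would otherwise look like rare vocabulary.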

I read somewhere that sertanejo, forró, and pagode are beginning to merge into one single genre (I shudder at the thought of a song that is simultaneously sertanejo, forró, and pagode). I wonder if that may already show in our data, so it might be interesting to cluster the lyrics (say, using k-means) and check whether the resulting clusters correspond to genre labels.
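The clustering itself needs a library and is left out here; the sketch below (Python 3) only covers the last step, checking clusters against genre labels. It assumes every song already has a cluster id from k-means, and it uses cluster purity as the measure, which is my choice for illustration and not something from the analysis above. The function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, genre_labels):
    """Share of songs whose genre is the majority genre of their cluster."""
    by_cluster = defaultdict(list)
    for cluster, genre in zip(cluster_ids, genre_labels):
        by_cluster[cluster].append(genre)
    # for each cluster, count the songs that belong to its most common genre
    majority_total = sum(Counter(genres).most_common(1)[0][1]
                         for genres in by_cluster.values())
    return majority_total / len(genre_labels)


# toy example: cluster 0 mixes three genres, cluster 1 is mostly rap
clusters = [0, 0, 0, 1, 1, 1]
genres = ['sertanejo', 'forró', 'pagode', 'rap', 'rap', 'MPB']
print(cluster_purity(clusters, genres))
```

A purity close to 1 would mean the clusters line up with the genre labels. If sertanejo, forró, and pagode songs kept landing in the same clusters, that would be consistent with the merging story. Purity tends to go up as the number of clusters grows, so it only makes sense to compare runs with the same k.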

Finally, we could use recurrent neural networks (RNN) to automate some artists (like people have done with Obama), just for the fun of it. As with any RNN the more training texts the better, so Chico Buarque, with his 416 lyrics, would be a great candidate for automation. I bet that most of his fans wouldn’t be able to tell human Chico Buarque from bot Chico Buarque.

code

Here’s the Python code I wrote to scrape Vagalume:

'''
scrape lyrics from vagalume.com.br
(author: Thiago Marzagao)
'''

import json
import time
import pickle
import requests
from bs4 import BeautifulSoup

# get each genre's URL
basepath = 'http://www.vagalume.com.br'
r = requests.get(basepath + '/browse/style/')
soup = BeautifulSoup(r.text)
genres = [u'Rock',
          u'Ax\u00E9',
          u'Forr\u00F3',
          u'Pagode',
          u'Samba',
          u'Sertanejo',
          u'MPB',
          u'Rap']
genre_urls = {}
for genre in genres:
    genre_urls[genre] = soup.find('a', class_ = 'eA', text = genre).get('href')

# get each artist's URL, per genre
artist_urls = {e: [] for e in genres}
for genre in genres:
    r = requests.get(basepath + genre_urls[genre])
    soup = BeautifulSoup(r.text)
    counter = 0
    for artist in soup.find_all('a', class_ = 'top'):
        counter += 1
        print 'artist {} \r'.format(counter)
        artist_urls[genre].append(basepath + artist.get('href'))
    time.sleep(2) # don't reduce the 2-second wait (here or below) or you get errors

# get each lyrics, per genre
api = 'http://api.vagalume.com.br/search.php?musid='
genre_lyrics = {e: {} for e in genres}
for genre in artist_urls:
    print len(artist_urls[genre])
    counter = 0
    artist1 = None
    for url in artist_urls[genre]:
        success = False
        while not success: # retry loop in case your connection flickers
            try:
                r = requests.get(url)
                success = True
            except:
                time.sleep(2)
        soup = BeautifulSoup(r.text)
        hrefs = soup.find_all('a')
        for href in hrefs:
            if href.has_attr('data-song'):
                song_id = href['data-song']
                print song_id
                time.sleep(2)
                success = False
                while not success:
                    try:
                        song_metadata = requests.get(api + song_id).json()
                        success = True
                    except:
                        time.sleep(2)
                if 'mus' in song_metadata:
                    if 'lang' in song_metadata['mus'][0]: # discard if no language info
                        language = song_metadata['mus'][0]['lang']
                        if language == 1: # discard if language != Portuguese
                            if 'text' in song_metadata['mus'][0]: # discard if no lyrics
                                artist2 = song_metadata['art']['name']
                                if artist2 != artist1:
                                    if counter > 0:
                                        print artist1.encode('utf-8') # change as needed
                                        genre_lyrics[genre][artist1] = artist_lyrics
                                    artist1 = artist2
                                    artist_lyrics = []
                                lyrics = song_metadata['mus'][0]['text']
                                artist_lyrics.append(lyrics)
                                counter += 1
                                print 'lyrics {} \r'.format(counter)

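    # store the last artist of this genre; the loop above only stores an
    # artist once the next one shows up, so the last one would be lost
    if artist1 is not None:
        genre_lyrics[genre][artist1] = artist_lyrics

    # serialize
    with open(genre + '.json', mode = 'wb') as fbuffer:
        json.dump(genre_lyrics[genre], fbuffer)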

So, I loop through genres, artists, and songs. I get each song’s id and use Vagalume’s nice API to check whether the lyrics are available and whether they’re in Portuguese. For each genre I create a dict where each artist is a key and then I save the dict as a JSON file.
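The nested ifs in the scraper can also be read as one small filtering step. Here is a sketch in Python 3 of that step as a standalone function. The field names ('mus', 'lang', 'text', 'art', 'name') and the language code 1 for Portuguese are the ones the scraper relies on; the function name and the sample responses are made up for illustration.

```python
def extract_portuguese_lyrics(song_metadata):
    """Return (artist, lyrics) for a Portuguese song, or None otherwise.

    song_metadata is the dict parsed from the API's JSON response.
    """
    songs = song_metadata.get('mus')
    if not songs:                # no song entry in the response
        return None
    song = songs[0]
    if song.get('lang') != 1:    # no language info, or not Portuguese
        return None
    if 'text' not in song:       # no lyrics available
        return None
    return song_metadata['art']['name'], song['text']


# hypothetical responses, shaped like the fields the scraper reads
ok = {'art': {'name': 'Some Artist'},
      'mus': [{'lang': 1, 'text': 'la la la'}]}
foreign = {'art': {'name': 'Some Artist'},
           'mus': [{'lang': 2, 'text': 'la la la'}]}
print(extract_portuguese_lyrics(ok))
print(extract_portuguese_lyrics(foreign))
```

Keeping the filter in one function makes it easy to test against saved responses without hitting the API.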

You’ll need the requests and BeautifulSoup packages to run this code.

Vagalume says that in the future you will need credentials to use their API (that’s how it works with Twitter, for instance). So, if you’re in the future you may have to change the code.

Here’s the code I wrote to measure the vocabularies:

'''
count vocabularies of lyrics scraped off vagalume.com.br
(author: Thiago Marzagao)
'''

import os
import json
import nltk
import random
import pandas as pd

# organize lyrics by artist
basepath = '/path/to/JSON/files/'
rawdata = {}
for fname in os.listdir(basepath):
    if '.json' in fname:
        with open(basepath + fname, mode = 'rb') as fbuffer:
            genre = json.load(fbuffer)
            for artist in genre:
                rawdata[artist] = genre[artist]

# compute statistics per artist
data = {}
for artist in rawdata:
    print artist.encode('utf-8')
    total_lyrics = 0
    words = []
    for lyrics in rawdata[artist]:
        print(str(total_lyrics) + ' \r'),
        lowercased = lyrics.lower()
        tokens = nltk.word_tokenize(lowercased)
        if len(tokens) < 2:
            continue
        total_lyrics += 1
        for word in tokens:
            words.append(word)
    total_words = float(len(words))
    unique_words = float(len(set(words)))

    # discard if artist has less than 10,000 words
    if len(words) >= 10000:

        # bootstrap
        uniques = []
        for i in range(1000):
            sample = random.sample(words, 10000)
            uniques.append(len(set(sample)))
        vocabulary = float(sum(uniques)) / len(uniques)

        # store data
        data[artist] = {'total_lyrics': total_lyrics,
                        'total_words': total_words,
                        'unique_words': unique_words,
                        'vocabulary': vocabulary}

# one row per artist, largest vocabulary first
df = pd.DataFrame(data).T.sort_values('vocabulary', ascending = False)

You’ll need NLTK and pandas to run this code.
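The resampling step is the heart of the script, so here it is pulled out as a small Python 3 function. The sample size of 10,000 words and the 1,000 repetitions are the script's values; the function name and the seed parameter are my additions, there only to make runs repeatable.

```python
import random


def estimate_vocabulary(words, sample_size=10000, n_samples=1000, seed=None):
    """Average number of unique words across random fixed-size samples.

    Returns None when there are fewer than sample_size words, mirroring
    the script's rule of discarding artists with too few words.
    """
    if len(words) < sample_size:
        return None
    rng = random.Random(seed)
    uniques = [len(set(rng.sample(words, sample_size)))
               for _ in range(n_samples)]
    return sum(uniques) / len(uniques)


# toy example: two artists with the same number of words
poor = ['amor'] * 50 + ['coisa'] * 50
rich = ['palavra{}'.format(i) for i in range(100)]
print(estimate_vocabulary(poor, sample_size=10, n_samples=200, seed=0))
print(estimate_vocabulary(rich, sample_size=10, n_samples=200, seed=0))
```

Drawing samples of the same size for every artist is what makes the numbers comparable: an artist with 400 lyrics would otherwise beat one with 40 just by having more text. Note that random.sample draws without replacement, so strictly speaking this is repeated subsampling and not a textbook bootstrap.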

To produce the word clouds I used Andreas Mueller’s nifty word_cloud package.

Well, it’s a wrap. Say no to dumb music, kids!