model persistence without pickles

So, I trained this SVM classifier and I wanted to use it in a web app I built. I used Python for everything, so it seemed straightforward at first: just use the pickle module to save the classifier to disk, then have the app load the pickle. But things got complicated. In the end I found a better way to achieve model persistence, so I thought I should share the experience.

The fundamental problem is that the classifier turned out huge. Not surprising: it was trained with 20 million documents and intended to pick one of 560 possible document categories. The resulting coefficient matrix has dimensions 560 (categories) by 505,938 (unique tokens). That’s a matrix with 283,325,280 cells. When pickled to a file it takes up 8GB of disk space.

I didn’t mind that at first. I thought “fine, so the app will take a few seconds to be ready after I deploy it, no problem”. But the app can’t load an 8GB pickle if there is only, say, 1GB of RAM. I did some tests and realized that I would need a server with at least 16GB of RAM to (barely!) host the app. I looked up server prices on Amazon Web Services and on Google Compute Engine. It would cost me some US$ 200 a month to keep the app alive. Not happening. (Have I mentioned that I live in Brazil and that our currency was massive devalued this year?)

So I gave up on hosting the app. I decided to open source the code instead and let users download and host the app themselves. But that turned my 8GB pickle into a problem. It’s ok to consume your own pickles (well, not really) but it’s not ok to expect other people to consume your pickles. Pickles can have malicious code. And pickles are not guaranteed to work across different versions of the same Python packages.

Now, a model is basically a bunch of coefficients - so why not store it as data? We shouldn’t have to store a model in a pickle or in any format that is not human readable. We can store a model as we store the very data that we used to estimate the model. And that’s what I propose we do.

I used scikit-learn’s stochastic gradient descent class to train my SVM classifier, which I instantiated with the following paramters:

from sklearn import linear_model

clf = linear_model.SGDClassifier(loss = 'modified_huber',
                                         penalty = 'l2',
                                         alpha = 0.0001,
                                         fit_intercept = True,
                                         n_iter = 60,
                                         shuffle = True,
                                         random_state = None,
                                         n_jobs = 4,
                                         learning_rate = 'optimal',
                                         eta0 = 0.0,
                                         power_t = 0.5,
                                         class_weight = None,
                                         warm_start = False)

Once the model is trained the coefficients are stored in the clf.coef_ attribute as a numpy array of dimensions 560 (classes) by 505,938 (unique tokens).

array([[ -9.86766641e-03,  -6.35323529e-03,  -2.78639093e-05, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -1.08928684e-03,  -5.49497974e-04,  -1.72031659e-08, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -1.13471047e-05,  -8.34983019e-06,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [ -4.71493271e-06,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -1.51909841e-03,  -1.58582663e-03,  -1.53207693e-04, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [ -8.46919968e-07,  -3.21041555e-07,  -3.67708175e-10, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]])

As you can see, extracting the coefficients is trivial: just get clf.coef_. But how do we store them as data? I toyed with a couple of ideas and in the end I chose HDF5. If you haven’t used it before, an HDF5 file is a “container” inside which you can store arrays. I had used HDF5 before and it’s great for fast retrieval of large amounts of data. To use it from Python you must have pytables installed. You don’t need to call pytables though - pandas has a nice interface to it. Here’s how I did it:

import pandas as pd

store = pd.HDFStore('coefficients.h5')
counter = 0
for row in clf.coef_:
    # annoyingly I couldn't store each row of the
    # numpy array directly, I had to convert each
    # row into a pandas DataFrame, as you see here,
    # or else I got an error message
    store['row' + str(counter)] = pd.DataFrame(row)
    counter += 1
store.close() 

That’s it - we have extracted our coefficients and stored them in an HDF5 file. Here I had 560 categories and 505,938 unique tokens, so my HDF5 file contains 560 pandas DataFrames, each of length 505,938.

We are not done though. Each of the 560 classes has not only 505,938 coefficients but also one intercept. These are stored in the clf.intercept_ attribute. You can store them with HDF5 as well but with only 560 intercepts I didn’t bother doing that. I just printed clf.intercept_ to the screen and then copied and pasted it into a .py file. Dirty, I know, but quick and easy. The file looks like this:

import numpy as np

intercepts = np.array([-1.0378634 , -1.00160445, -1.00561022, -1.35181801, -1.00007832,
                       -1.00009826, -1.00010426, -1.00985818, -1.00165959, -1.00020922,
                       -1.00417335, -1.003692  , -1.00178602, -1.00047299, -1.00073538,
                       -1.00008621, -1.00021723, -1.00037028, -1.00055338, -1.09941216,
                       -1.00037199, -1.00204897, -1.03388095, -1.00933133, -1.02132215,
                       -1.04021753, -1.00289487, -1.00191766, -1.00168548, -1.00053103,
                       ...

Finally we need to extract our class labels. They are in clf.classes_. Same as with the intercepts: I just printed the array to the screen and then copied and pasted it into a .py file.

import numpy as np

class_labels = np.array(['11005', '11010', '11015', '11020', '11025', '11030', '11035',
                         '11040', '11045', '11055', '11080', '11090', '11095', '11110',
                         '11125', '11135', '11190', '11240', '11280', '11305', '11310',
                         '11315', '11325', '11330', '11340', '11370', '11375', '11385',
                         ...

Now we have our model nicely stored as data. People can inspect the HDF5 and .py files without (much) risk of executing arbitrary code. Our model is human readable and shareable. Now my app is indeed open source.

Ok, so much for disassembling the model. How do we put it back together?

Quick and easy. Instantiate the model, load the class labels, the coefficients and the intercepts, and plug everything in:

import numpy as np
import pandas as pd
from sklearn import linear_model

# load class labels
from class_labels import class_labels # or however you named the file and array

# load intercepts
from intercepts import intercepts # or however you named the file and array

# instantiate model
clf = linear_model.SGDClassifier(loss = 'modified_huber',
                                 penalty = 'l2',
                                 alpha = 0.0001,
                                 fit_intercept = True,
                                 n_iter = 60,
                                 shuffle = True,
                                 random_state = None,
                                 n_jobs = 4,
                                 learning_rate = 'optimal',
                                 eta0 = 0.0,
                                 power_t = 0.5,
                                 class_weight = None,
                                 warm_start = False)

# load coefficients
store = pd.HDFStore('coefficients.h5')

# convert from pandas DataFrame back to numpy array
coefs = np.array([np.array(store['row' + str(i)]).T[0] for i in range(len(class_labels)]))

# close HDF5 file
store.close()

# plug class labels
clf.classes_ = class_labels

# plug intercepts
clf.intercept_ = intercepts

# plug coefficients
clf.coef_ = coefs

And voilà, we have reconstructed our model. The labels, intercepts and coefficients are in their proper places (i.e., assigned to the proper clf attributes) and the model is ready to use. And everything runs a lot faster than if we were loading pickles.

Some models are more easily “datafied” than others. “Datafying” an instance of scikit-learn’s TfidfVectorizer’s class, for instance, is a bit tricky. I’ll cover that in another post.

from political science to data science

Today it’s been a year since I started working as a data scientist. Before that I was doing my Ph.D., in political science. I wonder what other people who’ve made this sort of transition - from some social science to data science - have learned. Here’s what I’ve found out (so far). Maybe this will encourage others to share their own experiences.

Causality doesn’t matter

Political scientists want to know what causes what. Does democracy increase GDP per capita? Does oil make a country more authoritarian? Can trade prevent war? And so on. But causality is hard to establish. You need to run controlled experiments or convince the reviewers that you chose the right instruments or that you matched on the right confounders. And Gary King is always around the corner waiting to stab you in the heart.

So I was overwhelmed with joy when I found out that causality is not a big deal in the world of data science. Take text classification, for instance. You’re mapping word counts to categories. Yes, in a way, the categories do cause the word counts - “let me be clear” appears a lot in Obama’s speeches because they are Obama’s. But we just don’t care about the effect of Obama on the frequency of “let me be clear”. We only care about classifying the text correctly.

That’s liberating. Think of the time and effort that political scientists spend defending their causal claims. Or debunking others’ causal claims. Or trying to accurately measure causal effects. When you remove causality from the equation you suddenly have a lot more time to work on other, potentially more interesting aspects of data analysis.

Debates have clear winners

Most political scientists have no notion of test set or validation set. They take a bunch of data and use 100% of it as a training set. They run their regressions and interpret the equations that map dependent variables to independent variables. And that’s it. Pages and pages of articles and books and blog posts dedicated to training sets.

Which is of course ok - political scientists are trying to understand how the world works - but is also frustrating: debates never end. When you limit yourself to the training data you just don’t have a clear metric of success. It’s largely about how convincingly you can defend your theoretical and methodological choices. That leaves ample room for subjectivity, so it’s never entirely clear who are the winners and losers.

In the world of data science, however, the training data is just the beginning. Once you’ve trained your model you use it on data it hasn’t seen before and you observe how well it performs. And there are clear metrics for that: accuracy, recall, F-measure, etc. You can compare different models and immediately spot the winners and losers. You can argue all you want that your model is inherently “sounder” because of XYZ but that doesn’t matter. What matters is whether your model misclassifies dogs as cats less often than other models do.

There is less ideological bias

Hopefully I don’t need to convince you that social scientists are overwhemingly on the left of the ideology spectrum. To give you but one number, the Democratic:Republican ratio is 5.6:1 among political scientists. The ideological distribution is so skewed that there are survival guides for non-leftists who want to thrive in academia.

Reviewers being humans, sometimes they dislike your instruments or matching or sample because they dislike your paper’s conclusions. If you are in political science you probably lean left, so you’re unlikely to have seen this happen to one of your papers. But try to argue that Latin America’s left turn in the 2000s has eroded democracy in the region. Boy, brace yourself for some angry reviews. It’s not that the methodological criticisms will be unfounded. It’s just that they would likely not exist if your conclusions were reversed.

In data science, on the other hand, you have Kaggle competitions. Winners and losers are not decided on the basis of subjective evaluations of your methods or sample or theory. Winners and losers are decided on the basis of who gets the higher F-measure. It’s a fair fight. So, it’s not just that debates don’t linger forever (see above), but also that they resolved in a much more rigorous way. Sometimes debates do end in political science - but not always as they should.

You want more error, not less

Political scientists want a good model fit. But as I mentioned before all they have is a training set. They have no notion of prediction or out-of-sample testing, so they end up overfitting. The fit is too good - the model doesn’t generalize.

It’s not that political scientists don’t know about overfitting. They do. But if all your data are training data then how the heck can you assess overfitting?

Political scientists believe that you can avoid overfitting by avoiding kitchen-sink regression. If I include only theoretically relevant regressors then everything should be ok, right?

Except we can always twist theory to accomodate whatever variables we need in order to get what we want. Maybe if I square this variable then I’ll get the proper p-values. And then I come up with some creative explanation of why its effect is not linear. Reviewers (ideally) assess theory consistency, of course, but then we’re back to the subjectivity and bias problems I discussed before.

This was perhaps my biggest methodological shock when I started doing data science. At first I struggled with the idea of a regularization term. What? So you want me to bias my coefficients toward zero? Then Andrew Ng’s machine learning course taught me that there is data you use to train your model and data you use to test it. And then regularization made sense. A bit more error is good.

Programming matters

Political scientists’ code is rigid: each script is meant to produce a pre-determined set of estimates, using a pre-determined dataset. The code always takes the same input and always return the same output. It does one thing and one thing alone. It doesn’t matter much if the code could be faster or prettier: as long as it replicates what you did in your paper you have fulfilled your duty.

Data scientists’ code is flexible: your code needs to accept a variety of inputs and return a variety of outputs. That’s because you’re not writing code for yourself, but for others. For instance: you have data in MySQL and other people in your organization want insights from those data. You then write a web app where people can, say, fit a regression line through some subset of the data. You don’t know what subsets people will choose. And you don’t know what the regression lines will look like. So your code needs to handle many different scenarios. What if the user chooses a non-valid subset? Say, years 2012-2014 when the data end in 2010? What if the user chooses all the data and that overloads the server? What if the regression tool you’re using under the hood (say, R’s lm() function) returns an error because the chosen subset is too small? In short: data scientists’ code has a lot more IF/THEN/ELSE statements than political scientists’ code.

So, programming matters in data science. It’s about both depth and breadth: you need a firmer grasp of basic programming concepts, like conditionals and functions (that’s the depth) and you need to learn web development, SQL, NoSQL, database administration, messaging, server maintenance, security, testing (that’s the breadth).

Some of my political scientist friends would rather poke their own eyes out than endure that sort of technical work. Some even consider technical work to be beneath them. I suppose I understand. You survived Social Origins of Dictatorship and Democracy’s 592-page discussion of regime change, class, and modernization - surely you’re destined to higher purposes. You are here to understand the world, not center HTML tables.

But if you don’t mind getting your hands dirty then programming can be a lot of fun - and at least as intelectually rewarding than political science. With programming there is no “arguing your way out” of anything: either your code works or it doesn’t. And when it doesn’t you have to figure out why and that requires a lot of sinapses. As in political science, it’s largely about hypothesis testing - if the code isn’t working because of XYZ then if I try ABC I should get this result, otherwise I should get that other result. Except that there is a finish line: you’ll never really know what makes democracy work but you’ll eventually figure out which regex matches the string you’re looking for. And you will get to the finish line or die trying - you can’t just declare regex to be a meaningless social construct and move on. You can’t get away with vague wording or selective omissions. The machine is not a journal editor - you can’t bullshit your way through it.

You lose freedom

Wait, don’t quit grad school just yet - there’s a lot to miss about academia. First and foremost, I miss the freedom to work on whatever I wanted to.

Don’t get the wrong idea: I thoroughly enjoy what I’m doing right now (I help automate cartel detection; you know, when companies that should compete against each other decide to shake hands instead). I’m learning a lot and it’s by far the most rewarding job I’ve ever had. But I can’t just wake up tomorrow and decide to work on face recognition for the next six months. I can do whatever I want to do in my spare time, but I still need to show up Mon-Fri and work on cartel detection. Sure, I could find another job. But in academia you don’t need to find another job just because you’ve got a new research interest. You stay in the same job, you just start doing something else. And that is awesome. I’m ok with foregoing that freedom, but I imagine not everyone is.


This is it. If you too are in data science as a refugee from the social sciences I would love to hear from you. How did you get here? How is the new life playing out for you? Do you intend to go back some day?

measuring academese

We all know that academic writing stinks, but exactly how bad is it? And in what disciplines does academic writing stink the most? To answer these questions I scraped 174,527 academic articles, came up with a few indicators of “academese” and compared these indicators across disciplines.

our data source

I scraped those articles from SciELO. In case you’ve never heard of SciELO, it’s like JSTOR, but focused on Latin American journals. I would rather have scraped JSTOR, but it’s gated. I could scrape it using my OSU credentials, but the Aaron Swartz brouhaha is not an encouraging precedent. Also, JSTOR articles are usually in PDF, which sucks for text mining (especially when the PDF is the scan of a document, which is often the case in JSTOR). SciELO, by contrast, is ungated (it doesn’t even have a “terms of use” page, so I can’t possibly be breaking any rules) and its articles are in HTML, which is much easier to parse than PDF.

Most articles on SciELO are in Portuguese or Spanish but that’s not a problem here. The flaws of academic writing - cluttered sentences, meaningless words, plenty of jargon - are not language-specific. And thanks to the Normands the mumbo jumbo barely changes from one language to another: in English you problematize the discursive paradigm and in Portuguese você problematiza o paradigma discursivo.

summary statistics

The 174,527 articles I scraped are all the articles in Portuguese from all the 281 active journals on SciELO. Here’s how they are distributed:

We’ll look into specific disciplines in a minute.

(SciELO doesn’t categorize its journals, so the categorizations above are my own, partly based on Latindex’s. In case you are curious, here are the journal ISSN codes and respective areas.)

academese and how to measure it

…the secret of good writing is to strip every sentence to its cleanest components. Every word that serves no function, every long word that could be a short word, every adverb that carries the same meaning that’s already in the verb, every passive construction that leaves the reader unsure of who is doing what - these are the thousand and one adulterants that weaken the strength of a sentence. And they usually occur in proportion to education and rank. (William Zinsser, On Writing Well.)

As we see, academese is not a particular type of bad writing. The same set of flaws that we’ve been calling academese could very well be called journalese or bureaucratese. The specific jargon may change (academics critique, bureaucrats impact) but the underlying sins are the same: clutter and vagueness.

Now that we know what bad writing is we can measure it. Here are my indicators:

  • Word length.
  • Sentence length.
  • Adverbs/total number of words. To identify adverbs I use the 1,000 most frequent adverbs in the Portuguese Corpus.
  • Gerunds/total number of words. I count as a gerund any word that ends in ‘ndo’, which is how gerunds end in Portuguese - as in fazendo (doing), escrevendo (writing), etc.
  • Inane words/total number of words. I don’t have an exhaustive list of inane words. Instead I selected any words that match one of the following regular expressions:
    • '^problematiz.*': problematizar (to problematize), problematização (problematization), etc.
    • '^paradigma.*': paradigma (paradigme), paradigmático (paradigmatic), etc.
    • '^discursiv.*': discursivo (discursive), discursividade (discursivity), etc.
    • '^dial\u00E9tic.*': dialética, dialético (both mean dialectic), etc.
    • '^critic.*' and '^cr\u00EDtic.*': criticar (to criticize), crítico (critical, critic), criticado (criticized), etc. You may have qualms about this but think: are people supposed to do anything a-critically in academia? If not then critical is inane unless we’re talking about a specific disagreement (as in “Hayek criticized Keynes’ monetary theory”). Elsewhere - “critical theory”, “a critical look at”, etc - critical is pointless.

I picked these words because they are not specific to any disciplines. Sociologists and physicists alike can talk about paradigmes, problematize phenomena, or take a critical look at theories and experiments. Sociologists and physicists may differ in how often they use such inane words (that’s what I want to measure) but they both have the same opportunities to use them.

I considered enriching my list of inane words with words from Sokal’s hoax and from the Bad Writing Contest. But that might stack the deck against the humanities. Physicists probably don’t have many chances to write entelechy, otherness, or essentiality.

I’d rather use a machine learning approach but I couldn’t find a corpora of academic articles pre-labeled as academese/not academese (let alone one such corpora that happens to be in Portuguese). Hence what’s left is this sort of “dictionary” approach, which is much weaker - I’m certainly missing A LOT of relevant features (and possibly using some irrelevant ones). But it’ll have to do for now.

so, who stinks the most?

Here’s the ratio inane words / total number of words:

Adverbs / total number of words:

Gerunds / total number of words:

Average sentence length:

Average word length:

Clearly it’s in the humanities and social sciences that academese really thrives. Not exactly a shocking finding, but it’s nice to finally have some numbers. I didn’t expect the difference in inane words usage to be so large. Conversely, I expected word length to vary a lot more (maybe scientific names are inflating word size in the hard sciences?).

zooming in

Let’s zoom in on some of the humanities and social sciences. Here’s our articles:

Now let’s see how these different disciplines compare. Here’s the ratio inane words / total number of words:

Adverbs / total number of words:

Gerunds / total number of words:

Average sentence length:

Average word length:

Well, these are some disappointing data. I expected big differences - especially between pairs like economics and anthropology, or business and history. Turns out I was wrong. Sadly, there isn’t a clear “worst offender” for us to shame. Philosophy wins (well, loses) when it comes to poor word choices, but not by much, and philosophers don’t use longer sentences or words than the other social scientists.

what’s next

This analysis is not language-specific - what makes bad writing bad is the same set of flaws in English and in Portuguese. But perhaps this analysis is culture-specific? For instance, what passes for economics in Brazil is much different than what we find on the American Economic Review - there is a lot more “problematization of paradigms” in the tropics, so to speak. So I wonder how things would change if I used American articles instead. I’m not going to scrape any gated databases, lest I get in legal trouble. But maybe Google Scholar searches might contain enough ungated results? Or maybe there is some (large enough) ungated database that I don’t know of?

the gory details

Here’s the code I used to scrape SciELO, in case it may be useful to you. (Please, be considerate. SciELO is super nice to us scrapers: it’s ungated, there are no captchas, everything is in plain HTML, and - crucially - they don’t prohibit scraping. So don’t overparallelize or otherwise disrupt their servers.)

import os
import re
import requests
from bs4 import BeautifulSoup

# get URL of each journal (active journals only)
startURL = 'http://www.scielo.br/scielo.php?script=sci_subject&lng=pt&nrm=iso'
session = requests.Session()
startHTML = session.get(startURL).text
soup = BeautifulSoup(startHTML)
regex = re.compile(soup.find_all('p')[4].text) # regex to filter out non-active journals
sections = soup.find_all(text = regex)
journalURLs = []
for i in range(len(sections)):
    newSection = sections[i].next_element.find_all('a')
    newJournalURLs = [URL.get('href') for URL in newSection]
    journalURLs += newJournalURLs
basepath = '/Users/thiagomarzagao/Desktop/articles/' # change as needed

# go to each journal
for i, journalURL in enumerate(journalURLs):
    print 'journal num:', i + 1, 'de', len(journalURLs)
    k = 1
    IDstart = journalURL.index('pid=')
    IDend = journalURL.index('&lng')
    journalID = journalURL[IDstart+4:IDend]
    print 'journal id:', journalID    
    journalPath = basepath + journalID + '/'
    if os.path.exists(journalPath):
        continue
    else:
        os.makedirs(journalPath)
    journalStartHTML = session.get(journalURL.replace('serial', 'issues')).text
    journalSoup = BeautifulSoup(journalStartHTML)

    # go to each issue
    issuesURLs = [URL.get('href') for URL in journalSoup.find_all('a')[10:-3]][:-2]
    for n, issueURL in enumerate(issuesURLs):
        print 'issue num:', n + 1, 'de', len(issuesURLs)
        if issueURL:
            if 'pid=' in issueURL:
                issueStartHTML = session.get(issueURL).text
                issueSoup = BeautifulSoup(issueStartHTML)
                articlesURLs = [URL.get('href') for URL in issueSoup.find_all('a', text = u'texto em  Portugu\u00EAs')]
                if len(articlesURLs) > 0:

                    # go to each article and scrape it
                    for articleURL in articlesURLs:
                        articleHTML = session.get(articleURL).text
                        articleSoup = BeautifulSoup(articleHTML)
                        articleTextList = [p.text for p in articleSoup.find_all('p')]
                        articleText = ''
                        for piece in articleTextList:
                            articleText += piece

                        # save article to disk
                        with open(journalPath + str(k) + '.txt', mode = 'wb') as fbuffer:
                            fbuffer.write(articleText.encode('utf8'))
                        k += 1

The code above will yield a little garbage: a few articles not in Portuguese and some texts that are not articles (copyright notices mostly). Not enough garbage to be worth fixing though.

To parse the sentences I used the following regular expression, which I stole from here.

regex = re.compile("[^.!?\\s]" +
                   "[^.!?]*" +
                   "(?:" +
                   "  [.!?]" +
                   "  (?!['\"]?\\s|$)" +
                   "  [^.!?]*" +
                   ")*" +
                   "[.!?]?" +
                   "['\"]?" +
                   "(?=\\s|$)", 
                   re.MULTILINE)

To make the pie charts I used d3pie, which is amazing tool: you make the charts interactively and d3pie generates the code for you. (That sort of defeats the purpose of these posts, which is to help me practice D3, but it was just so quick and easy that I couldn’t resist.) To make the bar charts I just used plain D3. Here’s the code for the first bar chart (the code is almost the same for the others):

<!DOCTYPE html>
<meta charset="utf-8">
<style>

.axis path, .axis line {
    fill: none;
    stroke: black;
}

.axis text {
    font-family: sans-serif;
    font-size: 16px;
}

#tooltip {
    position: absolute;
    width: auto;
    height: auto;
    padding: 10px;
    background-color: white;
    -webkit-box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    -moz-box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    pointer-events: none;
}

#tooltip.hidden {
    display: none;
}

#tooltip p {
    margin: 0;
    font-family: sans-serif;
    font-size: 16px;
    line-height: 20px;
}

</style>
<body>

<div id="tooltip" class="hidden">
    <p><span id="value">100</span>
    </p>
</div>

<script src="d3/d3.min.js"></script>
<script>

var margins = {
    top: 12,
    left: 150,
    right: 24,
    bottom: 24
},
legendPanel = {
    width: 0
},
width = 500 - margins.left - margins.right - legendPanel.width + 150,
    height = 136 - margins.top - margins.bottom,
    dataset = [{
        data: [{
            month: 'humanities & social',
            count: 0.00080
        }, {
            month: 'biological & earth',
            count: 0.00012
        }, {
            month: 'exact',
            count: 0.00017
        }],
        name: ''
    }

    ],
    series = dataset.map(function (d) {
        return d.name;
    }),
    dataset = dataset.map(function (d) {
        return d.data.map(function (o, i) {
            // Structure it so that your numeric
            // axis (the stacked amount) is y
            return {
                y: o.count,
                x: o.month
            };
        });
    }),
    stack = d3.layout.stack();

stack(dataset);

var dataset = dataset.map(function (group) {
    return group.map(function (d) {
        // Invert the x and y values, and y0 becomes x0
        return {
            x: d.y,
            y: d.x,
            x0: d.y0
        };
    });
}),
    svg = d3.select('body')
        .append('svg')
        .attr('width', width + margins.left + margins.right + legendPanel.width)
        .attr('height', height + margins.top + margins.bottom)
        .append('g')
        .attr('transform', 'translate(' + margins.left + ',' + margins.top + ')'),
    xMax = d3.max(dataset, function (group) {
        return d3.max(group, function (d) {
            return d.x + d.x0;
        });
    }),
    xScale = d3.scale.linear()
        .domain([0, xMax])
        .range([0, width]),
    months = dataset[0].map(function (d) {
        return d.y;
    }),
    yScale = d3.scale.ordinal()
        .domain(months)
        .rangeRoundBands([0, 100], .1),
    xAxis = d3.svg.axis()
        .ticks(5)
        .scale(xScale)
        .orient('bottom'),
    yAxis = d3.svg.axis()
        .scale(yScale)
        .orient('left'),
    colours = d3.scale.category10(),
    groups = svg.selectAll('g')
        .data(dataset)
        .enter()
        .append('g')
        .style('fill', function (d, i) {
        return colours(i);
    }),
    rects = groups.selectAll('rect')
        .data(function (d) {
        return d;
    })
        .enter()
        .append('rect')
        .attr('x', function (d) {
        return xScale(d.x0);
    })
        .attr('y', function (d, i) {
        return yScale(d.y);
    })
        .attr('height', function (d) {
        return yScale.rangeBand();
    })
        .attr('width', function (d) {
        return xScale(d.x);
    })
        .on('mouseover', function (d) {
        var xPos = parseFloat(d3.select(this).attr('x')) / 2 + width / 2;
        var yPos = parseFloat(d3.select(this).attr('y')) + yScale.rangeBand() / 2;

        d3.select('#tooltip')
            .style('left', xPos + 'px')
            .style('top', yPos + 'px')
            .select('#value')
            .text(d.x);

        d3.select('#tooltip').classed('hidden', false);
    })
        .on('mouseout', function () {
        d3.select('#tooltip').classed('hidden', true);
    })

    svg.append('g')
        .attr('class', 'axis')
        .attr('transform', 'translate(0,' + height + ')')
        .call(xAxis);

svg.append('g')
    .attr('class', 'axis')
    .call(yAxis);

series.forEach(function (s, i) {
    svg.append('text')
        .attr('fill', 'white')
        .attr('x', width + margins.left + 8)
        .attr('y', i * 24 + 24)
        .text(s);
    svg.append('rect')
        .attr('fill', colours(i))
        .attr('width', 60)
        .attr('height', 20)
        .attr('x', width + margins.left + 90)
        .attr('y', i * 24 + 6);
});

</script>
</body>

This is it!

personal pronouns in Brazilian Portuguese

While English doesn’t have a singular you (unless you count thou), Portuguese has two: tu and você, with tu being preferred in some regions and você being preferred in others. A while ago an American friend who is learning Portuguese asked me where exactly people used one form or the other and, despite Portuguese being my first language, I couldn’t give her a proper answer. So, to atone for my unhelpfulness (and because I needed an excuse to start playing with D3) I scraped 680,045 geolocated tweets from Brazil and made the map below, where the higher the tu/você ratio the “redder” the state and the lower the tu/você ratio the “bluer” the state. Hover the mouse on the map to see the state names and respective ratios and number of tweets.

As we see, all states prefer você over tu - not a single state has a tu/você ratio higher than 1. And that preference is strong: in all states but one você is at least twice as popular as tu (i.e., the ratios are no higher than 0.5). The exception is Rio Grande do Sul (the purple, southernmost state), where the contest is close, with a tu/você ratio of 0.88. Which fits the popular perception of gaúchos (as natives from Rio Grande do Sul are often called) having a culture that is heavily influenced by the surrounding Spanish-speaking countries (tu is singular you in Spanish as well).

Other than in Rio Grande do Sul tu is most popular in Santa Catarina (tu/você ratio of 0.47), Maranhão (0.42), Amapá (0.35), Pará (0.30), and Roraima (0.30). Santa Catarina is in the South (it borders Rio Grande do Sul) while the other four states are in the North and Northeast. Você, on the other hand, is most popular in Sergipe (0.05), Minas Gerais (0.06), Goiás (0.06), Mato Grosso do Sul (0.08), Mato Grosso (0.08), São Paulo (0.08), and Espírito Santo (0.09) - basically the entire Southeast and Midwest regions.

I got curious to know how exactly people use tu. You see, conjugation is a tricky business in Romance languages. In English I go, you go, he/she/it goes, we go, you go, they go - the verb almost never changes. But in Portuguese eu vou (I go), tu vais (“singular you” go), você vai (“singular you” go), ele/ela vai (he/she/it goes), nós vamos (we go), vós ides (“y’all” go), vocês vão (“y’all” go), eles/elas vão (they go). The verb almost always changes. And as this example shows, tu and você call for different verb inflections: tu vais but você vai.

Or so Portuguese grammar dictates. But grammar rules are one thing and what people actually say are another. At least where I’m currently living (Brasília), when people use tu they often conjugate it as você: tu vai instead of tu vais; tu foi (“singular you” went) instead of tu foste; and so on. So, when people use tu do they conjugate it correctly?

Turns out they don’t. Out of the 680,045 tweets I scraped tu vai appears in 6,559 of them while tu vais appears in a mere 77 - too few to even allow state-wise comparisons.

Which is probably ok. By merging the conjugation rules of tu and você we simplify Portuguese grammar. Old English conjugation was hard: ic hæbbe (I have), þū hæfst (thou hast), hē/hit/hēo hæfð (he/it/she has), wē/gē/hīe habbaþ (we/”y’all”/they have) - and that’s just the nominative case; there were also the accusative, dative, and genitive cases, each calling for different verb inflections. But over the centuries English grammar got simplified and now conjugation is pretty straightforward - and I don’t see anyone complaining about that. Portuguese still needs to undergo a similar process. The less school time we spend learning arbitrary verb inflections the more time we can spend learning useful stuff.

In case this discussion has awaken your inner etymologist you may like to know that tu and thou are cognate - they both come from the same Proto-Indo-European word. And both have declined for the same reason: people were looking for less intimate ways to refer to one another. The English solution was to imitate the French by using the plural pronoun (you). The Portuguese solution was to imitate the Spanish by using Your Mercy - Vuestra Merced in Spanish, Vossa Mercê in Portuguese. Vossa Mercê is a bit cumbersome to say, so eventually people slurred it into você (just like Spanish speakers eventually slurred Vuestra Merced into usted).

(Btw, if you are into etymology you should check Kevin Stroud’s The History of English Podcast. It’s amazing to the point where I don’t even mind being stuck in traffic anymore.)

It would be interesting to know how usage of tu and você varies in other Portuguese-speaking countries, but only Brazil has a population big enough to yield 600k geolocated tweets each containing tu and/or você in under a week. It would also be interesting to know how tu and você usage has varied over time - in particular, when exactly você overtook tu as the most popular choice. But there was no Twitter before the 21st century (and Google Books N-Gram Viewer doesn’t have a Portuguese corpus).

Some caveats are in order. First, there may be intra-state variation. But for most cities there are too few tweets and I’d rather not say anything about how the people from this or that city speak based on only a handful of tweets. So, state capitals and other large cities are overrepresented here. Second, you need internet access and a device of sorts (a cell phone at least) in order to tweet, so the really poor are excluded from this analysis. There is some evidence that usage of tu declines with income, so I’m probably underestimating tu usage here. Third, people don’t necessarily tweet the same way they speak. For instance, perhaps the people of Rio Grande do Sul write tu more often than they speak it.

If you want the gory details, I used Twitter’s sample stream, which is a continuous stream of a small random sample of all tweets, in JSON format. Programming-wise I used Python’s Tweepy. Here’s my code (minus access key, token secret, etc; you can create your own Twitter-scraping credentials here):

import tweepy
 
consumer_key = "YourConsumerKey"
consumer_secret = "YourConsumerSecret"
access_token = "YourAccessToken"
access_token_secret = "YourTokenSecret"
 
class listener(tweepy.streaming.StreamListener):
    def on_data(self, data):
        print data
        return True
        
    def on_error(self, status):
        print status
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
 
while True:
    try:
        twitterStream = tweepy.Stream(auth, listener())
        twitterStream.filter(track = [u"voc\u00EA", 
                                      u"voce",
                                      u"vc",
                                      u"tu"])
    except:
        continue

This code will simply scrape the tweets and show them on your screen but won’t save them. To save, you need to point the stdout to some output file, as in: python scrapeTwitter.py > output.csv.

As you can see from the code, I limited my scraping to posts containing tu, você, voce, or vc (a common shorthand for você in Portuguese). The Twitter API doesn’t care about case, so that also yielded Tu, TU, vC, etc.

I also used Python to parse the tweets:

import os
import re
import json
import pandas as pd

# folder containing the files of scraped tweets; change as needed
path = '/Users/thiagomarzagao/Desktop/tweets/' 

# regex to remove some non-standard characters
vanilla = u'[^\u0041-\u005A \
              \u0061-\u007A \
              \u00C0-\u00D6 \
              \u00D8-\u00F6 \
              \u00F8-\u00FF \
              \u0100-\u017F \
              \u0020]'
regex = re.compile(vanilla)

# spelling variations of 'voce'
voces = set([u'voce', u'voc\u00EA', u'vc'])

# parse everything
tweets = []
f = 0
n = 0
for fname in os.listdir(path):
    if '.json' in fname:
        f += 1
        with open(path + fname, mode = 'rb') as fbuffer:
            for record in fbuffer:
                try:
                    if len(record) > 1:
                        jsondata = json.loads(record)
                        if 'place' in jsondata:
                            if jsondata['place']:
                                if jsondata['place']['country_code'] == 'BR':
                                    if jsondata['place']['place_type'] == 'city':
                                        n += 1
                                        print 'f:', f, fname
                                        print 'n:', n
                                        lowercased = jsondata['text'].lower()
                                        cleaned = regex.sub(' ', lowercased)
                                        splitted = cleaned.split()
                                        tu = False
                                        voce = False
                                        tuvais = False
                                        tuvai = False
                                        if 'tu' in splitted:
                                            tu = True
                                        for v in voces:
                                            if v in splitted:
                                                voce = True
                                        if 'tu vais' in cleaned:
                                            tuvais = True
                                        if 'tu vai ' in cleaned:
                                            tuvai = True
                                        location = jsondata['place']['full_name']
                                        comma = location.index(',')
                                        city = location[:comma].encode('utf8')
                                        state = location[comma + 2:].encode('utf8')
                                        if state == 'Brasil':
                                            state = city
                                        if state == 'Brazil':
                                            state = city
                                        tweet = (tu, voce, tuvais, tuvai, city, state)
                                        tweets.append(tweet)
                                        print cleaned.encode('utf8')
                except:
                    continue
                    
df = pd.DataFrame(tweets)
colnames = ['tu', 'voce', 'tuvais', 'tuvai', 'city', 'state']
df.to_csv('/Users/thiagomarzagao/Desktop/tweets.csv', # destination of tabulated data; change as needed
          index = False, 
          header = colnames)

(You may want to comment out line 67 - print cleaned.encode('utf8'). Otherwise every tweet you scraped will show on your screen as it is parsed. That can make you lose faith in mankind.)

To make the choropleth (the fancy word for a map whose colors vary according to data) I initially considered plotly, but as it turns out they don’t handle maps natively and I didn’t like the workarounds I found. So I turned to D3. It’s not very intuitive, especially if you’re used to making your plots in R, Matlab, or Stata. You have to juggle three languages - JavaScript, HTML, and CSS - to get your plot to work and that can be confusing in the beginning. But it got the job done. The map itself was originally in shapefile format and I used Carolina Bigonha’s br-atlas to convert it to Topojson format.

Below is the code I used to produce the choropleth. Here I only have 27 observations, so I didn’t mind hardcoding the data for each of them. But if you have more observations (say, counties) you probably want to automate the process by having a function map tu/você ratios to the RGB color spectrum.

<!DOCTYPE html>
<meta charset="utf-8">
<style>

.uf:hover {fill-opacity: 0.5;}

.uf.MS {fill: rgb(19,0,235);}
.uf.SC {fill: rgb(81,0,173);}
.uf.ES {fill: rgb(21,0,233);}
.uf.AC {fill: rgb(43,0,211);}
.uf.RS {fill: rgb(119,0,135);}
.uf.MT {fill: rgb(20,0,234);}
.uf.RR {fill: rgb(59,0,195);}
.uf.PB {fill: rgb(35,0,219);}
.uf.PI {fill: rgb(44,0,210);}
.uf.GO {fill: rgb(16,0,238);}
.uf.RN {fill: rgb(28,0,226);}
.uf.RJ {fill: rgb(34,0,220);}
.uf.PR {fill: rgb(24,0,230);}
.uf.AP {fill: rgb(66,0,188);}
.uf.SP {fill: rgb(20,0,234);}
.uf.DF {fill: rgb(37,0,217);}
.uf.BA {fill: rgb(26,0,228);}
.uf.PA {fill: rgb(59,0,195);}
.uf.PE {fill: rgb(55,0,199);}
.uf.MA {fill: rgb(75,0,179);}
.uf.TO {fill: rgb(31,0,223);}
.uf.CE {fill: rgb(42,0,212);}
.uf.RO {fill: rgb(48,0,206);}
.uf.AM {fill: rgb(57,0,197);}
.uf.SE {fill: rgb(13,0,241);}
.uf.MG {fill: rgb(15,0,239);}
.uf.AL {fill: rgb(27,0,227);}

#tooltip {
    position: absolute;
    width: auto;
    height: auto;
    padding: 5px;
    background-color: white;
    -webkit-box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    -moz-box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    pointer-events: none;
}

#tooltip.hidden {
    display: none;
}

#tooltip p {
    margin: 0;
    font-family: sans-serif;
    font-size: 16px;
    line-height: 20px;
}

</style>
<body>

<div id="tooltip" class="hidden">
    <p><strong></strong></p>
    <p><strong><span id="value"></span></strong></p>
    <p><span id="value2"></span></p>
    <p><span id="value3"></span></p>
</div>

<script src="d3/d3.min.js"></script>
<script src="topojson.v1.min.js"></script>
<script>

uf_dictionary = {"50": ["MS", "Mato Grosso do Sul", "0.08", "5,126"],
                 "42": ["SC", "Santa Catarina", "0.47", "45,222"],
                 "32": ["ES", "Espírito Santo", "0.09", "16,977"],
                 "12": ["AC", "Acre", "0.20", "1,871"],
                 "43": ["RS", "Rio Grande do Sul", "0.88", "83,133"],
                 "51": ["MT", "Mato Grosso", "0.08", "3,202"],
                 "14": ["RR", "Roraima", "0.30", "1,287"],
                 "25": ["PB", "Paraíba", "0.16", "3,941"],
                 "22": ["PI", "Piauí", "0.20", "1,364"],
                 "52": ["GO", "Goiás", "0.06", "8,512"],
                 "24": ["RN", "Rio Grande do Norte", "0.12", "3,663"],
                 "33": ["RJ", "Rio de Janeiro", "0.15", "152,458"],
                 "41": ["PR", "Paraná", "0.10", "43,169"],
                 "16": ["AP", "Amapá", "0.35", "1,672"],
                 "35": ["SP", "São Paulo", "0.08", "175,077"],
                 "53": ["DF", "Distrito Federal", "0.17", "15,218"],
                 "29": ["BA", "Bahia", "0.11", "10,649"],
                 "15": ["PA", "Pará", "0.30", "8,544"],
                 "26": ["PE", "Pernambuco", "0.27", "19,362"],
                 "21": ["MA", "Maranhão", "0.42", "3,540"],
                 "17": ["TO", "Tocantins", "0.14", "961"],
                 "23": ["CE", "Ceará", "0.20", "9,864"],
                 "11": ["RO", "Rondônia", "0.23", "2,017"],
                 "13": ["AM", "Amazonas", "0.29", "10,864"],
                 "28": ["SE", "Sergipe", "0.05", "2,198"], 
                 "31": ["MG", "Minas Gerais", "0.06", "47,396"],
                 "27": ["AL", "Alagoas", "0.11", "2,767"]};

var width = 960,
    height = 650

var projection = d3.geo.mercator()
    .translate([325, 250])
    .center([-53.12, -10.20])
    .scale(900);
    
var path = d3.geo.path()
    .projection(projection);

var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);

function idToStr(d) {
        return uf_dictionary[d];
};

d3.json("br-states.json", function(error, mymap) {
    svg.selectAll(".uf")
        .data(topojson.feature(mymap, mymap.objects.states).features)
        .enter()
        .append("path")
        .attr("class", function(d) {return "uf " + idToStr(d.id)[0];})
        .attr("d", path)
        .on("mouseover", function(d) {
            var coordinates = [0, 0];
            coordinates = d3.mouse(this);
            var x = coordinates[0] + 10;
            var y = coordinates[1] + 10;
            d3.select("#tooltip")
                .style("left", x + "px")
                .style("top", y + "px")
                .select("#value")
                .text(idToStr(d.id)[1])
            d3.select("#tooltip")
                .style("left", x + "px")
                .style("top", y + "px")
                .select("#value2")
                .text("tu/você ratio: " + idToStr(d.id)[2])
            d3.select("#tooltip")
                .style("left", x + "px")
                .style("top", y + "px")
                .select("#value3")
                .text("tweets: " + idToStr(d.id)[3])
            d3.select("#tooltip").classed("hidden", false);
            })
       .on("mouseout", function() {
            d3.select("#tooltip").classed("hidden", true);
            });
});

</script>
</body>

That’s it. More map-based posts should follow.

Excel for statistics

Excel 2007, like its predecessors, fails a standard set of intermediate-level accuracy tests in three areas: statistical distributions, random number generation, and estimation.

Engineers use Excel for many numerical tasks. They don’t use if for finite element computations (although I suppose you could coax Excel to do that too) because it wasn’t intended for that purpose. They shouldn’t use it for statistics either, and for the same reason - Excel doesn’t do a very good job at tasks for which it was not intended.

Each of the nine numbers given above is incorrect! The slope estimate has the wrong sign, the estimated standard errors of the coefficients are zero (making it impossible to construct t–statistics), and the values of R2, F and the regression sum of squares are negative! It’s obvious here that the output is garbage (even Excel seems to know this, as the #NUM!’s seem to imply), but what if the numbers that had come out weren’t absurd — just wrong?

Teaching statistics is a big challenge, teaching statistics with Excel is an even bigger challenge.

Friends Don’t Let Friends Use Excel for Statistics!