personal pronouns in Brazilian Portuguese

While English doesn’t have a singular you (unless you count thou), Portuguese has two: tu and você, with tu being preferred in some regions and você being preferred in others. A while ago an American friend who is learning Portuguese asked me where exactly people used one form or the other and, despite Portuguese being my first language, I couldn’t give her a proper answer. So, to atone for my unhelpfulness (and because I needed an excuse to start playing with D3) I scraped 680,045 geolocated tweets from Brazil and made the map below, where the higher the tu/você ratio the “redder” the state and the lower the tu/você ratio the “bluer” the state. Hover the mouse on the map to see the state names and respective ratios and number of tweets.

As we see, all states prefer você over tu - not a single state has a tu/você ratio higher than 1. And that preference is strong: in all states but one você is at least twice as popular as tu (i.e., the ratios are no higher than 0.5). The exception is Rio Grande do Sul (the purple, southernmost state), where the contest is close, with a tu/você ratio of 0.88. Which fits the popular perception of gaúchos (as natives from Rio Grande do Sul are often called) having a culture that is heavily influenced by the surrounding Spanish-speaking countries (tu is singular you in Spanish as well).

Other than in Rio Grande do Sul tu is most popular in Santa Catarina (tu/você ratio of 0.47), Maranhão (0.42), Amapá (0.35), Pará (0.30), and Roraima (0.30). Santa Catarina is in the South (it borders Rio Grande do Sul) while the other four states are in the North and Northeast. Você, on the other hand, is most popular in Sergipe (0.05), Minas Gerais (0.06), Goiás (0.06), Mato Grosso do Sul (0.08), Mato Grosso (0.08), São Paulo (0.08), and Espírito Santo (0.09) - basically the entire Southeast and Midwest regions.

I got curious to know how exactly people use tu. You see, conjugation is a tricky business in Romance languages. In English I go, you go, he/she/it goes, we go, you go, they go - the verb almost never changes. But in Portuguese eu vou (I go), tu vais (“singular you” go), você vai (“singular you” go), ele/ela vai (he/she/it goes), nós vamos (we go), vós ides (“y’all” go), vocês vão (“y’all” go), eles/elas vão (they go). The verb almost always changes. And as this example shows, tu and você call for different verb inflections: tu vais but você vai.

Or so Portuguese grammar dictates. But grammar rules are one thing and what people actually say are another. At least where I’m currently living (Brasília), when people use tu they often conjugate it as você: tu vai instead of tu vais; tu foi (“singular you” went) instead of tu foste; and so on. So, when people use tu do they conjugate it correctly?

Turns out they don’t. Out of the 680,045 tweets I scraped tu vai appears in 6,559 of them while tu vais appears in a mere 77 - too few to even allow state-wise comparisons.

Which is probably ok. By merging the conjugation rules of tu and você we simplify Portuguese grammar. Old English conjugation was hard: ic hæbbe (I have), þū hæfst (thou hast), hē/hit/hēo hæfð (he/it/she has), wē/gē/hīe habbaþ (we/”y’all”/they have) - and that’s just the nominative case; there were also the accusative, dative, and genitive cases, each calling for different verb inflections. But over the centuries English grammar got simplified and now conjugation is pretty straightforward - and I don’t see anyone complaining about that. Portuguese still needs to undergo a similar process. The less school time we spend learning arbitrary verb inflections the more time we can spend learning useful stuff.

In case this discussion has awaken your inner etymologist you may like to know that tu and thou are cognate - they both come from the same Proto-Indo-European word. And both have declined for the same reason: people were looking for less intimate ways to refer to one another. The English solution was to imitate the French by using the plural pronoun (you). The Portuguese solution was to imitate the Spanish by using Your Mercy - Vuestra Merced in Spanish, Vossa Mercê in Portuguese. Vossa Mercê is a bit cumbersome to say, so eventually people slurred it into você (just like Spanish speakers eventually slurred Vuestra Merced into usted).

(Btw, if you are into etymology you should check Kevin Stroud’s The History of English Podcast. It’s amazing to the point where I don’t even mind being stuck in traffic anymore.)

It would be interesting to know how usage of tu and você varies in other Portuguese-speaking countries, but only Brazil has a population big enough to yield 600k geolocated tweets each containing tu and/or você in under a week. It would also be interesting to know how tu and você usage has varied over time - in particular, when exactly você overtook tu as the most popular choice. But there was no Twitter before the 21st century (and Google Books N-Gram Viewer doesn’t have a Portuguese corpus).

Some caveats are in order. First, there may be intra-state variation. But for most cities there are too few tweets and I’d rather not say anything about how the people from this or that city speak based on only a handful of tweets. So, state capitals and other large cities are overrepresented here. Second, you need internet access and a device of sorts (a cell phone at least) in order to tweet, so the really poor are excluded from this analysis. There is some evidence that usage of tu declines with income, so I’m probably underestimating tu usage here. Third, people don’t necessarily tweet the same way they speak. For instance, perhaps the people of Rio Grande do Sul write tu more often than they speak it.

If you want the gory details, I used Twitter’s sample stream, which is a continuous stream of a small random sample of all tweets, in JSON format. Programming-wise I used Python’s Tweepy. Here’s my code (minus access key, token secret, etc; you can create your own Twitter-scraping credentials here):

import tweepy
consumer_key = "YourConsumerKey"
consumer_secret = "YourConsumerSecret"
access_token = "YourAccessToken"
access_token_secret = "YourTokenSecret"
class listener(tweepy.streaming.StreamListener):
    def on_data(self, data):
        print data
        return True
    def on_error(self, status):
        print status
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
while True:
        twitterStream = tweepy.Stream(auth, listener())
        twitterStream.filter(track = [u"voc\u00EA", 

This code will simply scrape the tweets and show them on your screen but won’t save them. To save, you need to point the stdout to some output file, as in: python > output.csv.

As you can see from the code, I limited my scraping to posts containing tu, você, voce, or vc (a common shorthand for você in Portuguese). The Twitter API doesn’t care about case, so that also yielded Tu, TU, vC, etc.

I also used Python to parse the tweets:

import os
import re
import json
import pandas as pd

# folder containing the files of scraped tweets; change as needed
path = '/Users/thiagomarzagao/Desktop/tweets/' 

# regex to remove some non-standard characters
vanilla = u'[^\u0041-\u005A \
              \u0061-\u007A \
              \u00C0-\u00D6 \
              \u00D8-\u00F6 \
              \u00F8-\u00FF \
              \u0100-\u017F \
regex = re.compile(vanilla)

# spelling variations of 'voce'
voces = set([u'voce', u'voc\u00EA', u'vc'])

# parse everything
tweets = []
f = 0
n = 0
for fname in os.listdir(path):
    if '.json' in fname:
        f += 1
        with open(path + fname, mode = 'rb') as fbuffer:
            for record in fbuffer:
                    if len(record) > 1:
                        jsondata = json.loads(record)
                        if 'place' in jsondata:
                            if jsondata['place']:
                                if jsondata['place']['country_code'] == 'BR':
                                    if jsondata['place']['place_type'] == 'city':
                                        n += 1
                                        print 'f:', f, fname
                                        print 'n:', n
                                        lowercased = jsondata['text'].lower()
                                        cleaned = regex.sub(' ', lowercased)
                                        splitted = cleaned.split()
                                        tu = False
                                        voce = False
                                        tuvais = False
                                        tuvai = False
                                        if 'tu' in splitted:
                                            tu = True
                                        for v in voces:
                                            if v in splitted:
                                                voce = True
                                        if 'tu vais' in cleaned:
                                            tuvais = True
                                        if 'tu vai ' in cleaned:
                                            tuvai = True
                                        location = jsondata['place']['full_name']
                                        comma = location.index(',')
                                        city = location[:comma].encode('utf8')
                                        state = location[comma + 2:].encode('utf8')
                                        if state == 'Brasil':
                                            state = city
                                        if state == 'Brazil':
                                            state = city
                                        tweet = (tu, voce, tuvais, tuvai, city, state)
                                        print cleaned.encode('utf8')
df = pd.DataFrame(tweets)
colnames = ['tu', 'voce', 'tuvais', 'tuvai', 'city', 'state']
df.to_csv('/Users/thiagomarzagao/Desktop/tweets.csv', # destination of tabulated data; change as needed
          index = False, 
          header = colnames)

(You may want to comment out line 67 - print cleaned.encode('utf8'). Otherwise every tweet you scraped will show on your screen as it is parsed. That can make you lose faith in mankind.)

To make the choropleth (the fancy word for a map whose colors vary according to data) I initially considered plotly, but as it turns out they don’t handle maps natively and I didn’t like the workarounds I found. So I turned to D3. It’s not very intuitive, especially if you’re used to making your plots in R, Matlab, or Stata. You have to juggle three languages - JavaScript, HTML, and CSS - to get your plot to work and that can be confusing in the beginning. But it got the job done. The map itself was originally in shapefile format and I used Carolina Bigonha’s br-atlas to convert it to Topojson format.

Below is the code I used to produce the choropleth. Here I only have 27 observations, so I didn’t mind hardcoding the data for each of them. But if you have more observations (say, counties) you probably want to automate the process by having a function map tu/você ratios to the RGB color spectrum.

<!DOCTYPE html>
<meta charset="utf-8">

.uf:hover {fill-opacity: 0.5;}

.uf.MS {fill: rgb(19,0,235);}
.uf.SC {fill: rgb(81,0,173);}
.uf.ES {fill: rgb(21,0,233);}
.uf.AC {fill: rgb(43,0,211);}
.uf.RS {fill: rgb(119,0,135);}
.uf.MT {fill: rgb(20,0,234);}
.uf.RR {fill: rgb(59,0,195);}
.uf.PB {fill: rgb(35,0,219);}
.uf.PI {fill: rgb(44,0,210);}
.uf.GO {fill: rgb(16,0,238);}
.uf.RN {fill: rgb(28,0,226);}
.uf.RJ {fill: rgb(34,0,220);}
.uf.PR {fill: rgb(24,0,230);}
.uf.AP {fill: rgb(66,0,188);}
.uf.SP {fill: rgb(20,0,234);}
.uf.DF {fill: rgb(37,0,217);}
.uf.BA {fill: rgb(26,0,228);}
.uf.PA {fill: rgb(59,0,195);}
.uf.PE {fill: rgb(55,0,199);}
.uf.MA {fill: rgb(75,0,179);}
.uf.TO {fill: rgb(31,0,223);}
.uf.CE {fill: rgb(42,0,212);}
.uf.RO {fill: rgb(48,0,206);}
.uf.AM {fill: rgb(57,0,197);}
.uf.SE {fill: rgb(13,0,241);}
.uf.MG {fill: rgb(15,0,239);}
.uf.AL {fill: rgb(27,0,227);}

#tooltip {
    position: absolute;
    width: auto;
    height: auto;
    padding: 5px;
    background-color: white;
    -webkit-box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    -moz-box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.4);
    pointer-events: none;

#tooltip.hidden {
    display: none;

#tooltip p {
    margin: 0;
    font-family: sans-serif;
    font-size: 16px;
    line-height: 20px;


<div id="tooltip" class="hidden">
    <p><strong><span id="value"></span></strong></p>
    <p><span id="value2"></span></p>
    <p><span id="value3"></span></p>

<script src="d3/d3.min.js"></script>
<script src="topojson.v1.min.js"></script>

uf_dictionary = {"50": ["MS", "Mato Grosso do Sul", "0.08", "5,126"],
                 "42": ["SC", "Santa Catarina", "0.47", "45,222"],
                 "32": ["ES", "Espírito Santo", "0.09", "16,977"],
                 "12": ["AC", "Acre", "0.20", "1,871"],
                 "43": ["RS", "Rio Grande do Sul", "0.88", "83,133"],
                 "51": ["MT", "Mato Grosso", "0.08", "3,202"],
                 "14": ["RR", "Roraima", "0.30", "1,287"],
                 "25": ["PB", "Paraíba", "0.16", "3,941"],
                 "22": ["PI", "Piauí", "0.20", "1,364"],
                 "52": ["GO", "Goiás", "0.06", "8,512"],
                 "24": ["RN", "Rio Grande do Norte", "0.12", "3,663"],
                 "33": ["RJ", "Rio de Janeiro", "0.15", "152,458"],
                 "41": ["PR", "Paraná", "0.10", "43,169"],
                 "16": ["AP", "Amapá", "0.35", "1,672"],
                 "35": ["SP", "São Paulo", "0.08", "175,077"],
                 "53": ["DF", "Distrito Federal", "0.17", "15,218"],
                 "29": ["BA", "Bahia", "0.11", "10,649"],
                 "15": ["PA", "Pará", "0.30", "8,544"],
                 "26": ["PE", "Pernambuco", "0.27", "19,362"],
                 "21": ["MA", "Maranhão", "0.42", "3,540"],
                 "17": ["TO", "Tocantins", "0.14", "961"],
                 "23": ["CE", "Ceará", "0.20", "9,864"],
                 "11": ["RO", "Rondônia", "0.23", "2,017"],
                 "13": ["AM", "Amazonas", "0.29", "10,864"],
                 "28": ["SE", "Sergipe", "0.05", "2,198"], 
                 "31": ["MG", "Minas Gerais", "0.06", "47,396"],
                 "27": ["AL", "Alagoas", "0.11", "2,767"]};

var width = 960,
    height = 650

var projection = d3.geo.mercator()
    .translate([325, 250])
    .center([-53.12, -10.20])
var path = d3.geo.path()

var svg ="body").append("svg")
    .attr("width", width)
    .attr("height", height);

function idToStr(d) {
        return uf_dictionary[d];

d3.json("br-states.json", function(error, mymap) {
        .data(topojson.feature(mymap, mymap.objects.states).features)
        .attr("class", function(d) {return "uf " + idToStr([0];})
        .attr("d", path)
        .on("mouseover", function(d) {
            var coordinates = [0, 0];
            coordinates = d3.mouse(this);
            var x = coordinates[0] + 10;
            var y = coordinates[1] + 10;
                .style("left", x + "px")
                .style("top", y + "px")
                .style("left", x + "px")
                .style("top", y + "px")
                .text("tu/você ratio: " + idToStr([2])
                .style("left", x + "px")
                .style("top", y + "px")
                .text("tweets: " + idToStr([3])
  "#tooltip").classed("hidden", false);
       .on("mouseout", function() {
  "#tooltip").classed("hidden", true);


That’s it. More map-based posts should follow.

Excel for statistics

Excel 2007, like its predecessors, fails a standard set of intermediate-level accuracy tests in three areas: statistical distributions, random number generation, and estimation.

Engineers use Excel for many numerical tasks. They don’t use if for finite element computations (although I suppose you could coax Excel to do that too) because it wasn’t intended for that purpose. They shouldn’t use it for statistics either, and for the same reason - Excel doesn’t do a very good job at tasks for which it was not intended.

Each of the nine numbers given above is incorrect! The slope estimate has the wrong sign, the estimated standard errors of the coefficients are zero (making it impossible to construct t–statistics), and the values of R2, F and the regression sum of squares are negative! It’s obvious here that the output is garbage (even Excel seems to know this, as the #NUM!’s seem to imply), but what if the numbers that had come out weren’t absurd — just wrong?

Teaching statistics is a big challenge, teaching statistics with Excel is an even bigger challenge.

Friends Don’t Let Friends Use Excel for Statistics!

now on GitHub Pages wouldn’t let me have interactive plots, so I just migrated this site to GitHub Pages. I believe no permalinks were harmed in the making of this change. All images, videos, and PDFs, must have been ported. And I tested the new layout on Chrome, Firefox, and Safari (IE users: you don’t have to live like that) and on iPhone and iPad. But please let me know if you find any issues.

There were some casualties: all comments up to now. Here I’m using Disqus to handle comments and, despite heavy googling, I couldn’t find a way to export comments to Disqus (that’s possible with though, by installing a plugin). So, I apologize to all past commenters. And a special apology to all the nice folks who have commented on my webscraping tutorial, which is by far the most popular content here (it accounts for way over half the visits) - I love it when people take the time to say it was useful or to provide feedback.

Migrating was pretty easy. I’m using Jekyll, a Ruby-based generator of static websites. Except for one pesky dependency, installing Jekyll was painless. Theme-wise I forked Hyde (yes, I see it), which I’m using almost unaltered. Most of the work went into migrating the posts. I found Exitwp, which by and large automates the task. But I had several posts with code and they gave me some trouble. For instance, Jekyll’s own syntax highlighter leaves something to be desired, so I had to google around to improve it a bit.

Using GitHub Pages has been a pleasant experience so far. I can now do pretty much whatever I want to with my site - which means it’s JavaScript time! I’m also enjoying being able to deploy the site locally before uploading it to GitHub. Jekyll is great too. It lets me write posts in Markdown, which is obviously good for writing math. I wonder how the Python alternative compares though, so I may try it out some day.

This is it for now. Posts with interactive plots should follow soon.

classifying goods and services

I needed to produce an automated classifier of goods and services: something that takes a short description and returns the corresponding category, according to some taxonomy. For instance, you enter “jelly beans” and the classifier returns “Harmonized System heading 1704 - Sugar Confection”. The end goal is to classify all goods and services purchased with taxpayers’ money in Brazil, in all levels (federal, state, municipal) and branches (executive, legislative, judiciary) of government. I’ve spent the last few weeks working on this and here I summarize what worked and what didn’t.

getting some data

First I needed to get some data. For availability reasons I decided to use the purchases of the Federal District. (The Federal District s kind of like a state, except that it contains Brasília, the nation’s capital, and is subject to some special rules.) So, I scraped all the purchases of the Federal District off their website. That yielded a total of 139,624 purchases, which is about everything the Federal District has bought since 2005 (earlier purchases are not on the website). Some purchases involved a single item and others involved multiple items. For each purchase I got several data fields (price, date, etc), but the only one that interested me was the description of whatever it is that was being bought (“jelly beans”, “aircraft parts”, etc). If you are into that, here’s the script I wrote to do the scraping and here’s the script I wrote to parse the HTML into JSON files. They’re both in Python. (The comments are in Portuguese - sorry!) Naturally, these scripts may become useless if the website changes (I last used them on 12/18/14).


I decided to start with clustering - more specifically, k-means. I thought that if I could cluster similar items together then maybe I wouldn’t even need a taxonomy at all, I could just create my own taxonomy based on whatever the clusters were. In case this is the first you hear about k-means, here’s how it works: you treat every text (in my case, every item description) as a vector of term-frequencies, randomly create K centroids (K being the number of clusters you want to produce), assign each vector to its closest centroid, compute the squared distances between each vector and its centroid, recompute the centroids based on that assignment, reassign the vectors based on the new centroids, and repeat these last two steps until assignments stop changing. In practice you don’t use the vectors of term-frequencies themselves, but a transformation thereof (TF-IDF, which gives more weight to more discriminant words). And you also normalize the vectors, to account for the different sizes of the texts. (See Manning, Raghavan, and Schültze’s Introduction for Information Retrieval, chapter 16, for a proper explanation of k-means.)

I tried tweaking the number of clusters and removing some stopwords. I also tried using different initial centroids and k-means++ (which avoids convergence to local optima).

Programming-wise, I used Python and its scikit-learn package.

The results were a complete mess. The clusters were all over the place: for instance, “surgical gloves” and “buses” were in the same cluster. I briefly considered trying other clustering algorithms (maybe a hierarchical one) but the results were so bad that I just gave up on clustering for good.

cosine similarity

I moved on to cosine similarity, using the Harmonized System. In case you’re not familiar with cosine similarity, it’s just the cosine of the angle between two vectors - in this case, we’re talking about vectors of term-frequencies (or, as discussed before, their normalized TF-IDF transformation). The Harmonized System, in turn, is the taxonomy used in international trade. It’s structured in 96 chapters (ex.: Chapter 74 - Copper and Articles Thereof), which unfold in headings (ex.: 7408 - Copper Wire), which unfold in subheadings (ex.: 7408.21 - Wire of Copper-Zinc Base Alloys-Brass). (I used the Portuguese translation, not the English version.)

So, I took the description of each item purchased by the Federal District, measured its cosine similarity to the description of each of the subheadings of the Harmonized System, and then classified the item in the subheading with highest similarity.

I also tried using the Harmonized System headings instead of the subheadings. I noticed that the descriptions of the headings and subheadings are pretty short, so I also tried merging them together (so as to produce a more informative description). I noticed that the item descriptions were usually in the singular whereas the Harmonized System descriptions were usually in the plural, so I applied the first step of the RSLP-S stemmer in order to bring everything to singular. I noticed that quantity- and measurement-related words and abbreviations - “kg”, “mm”, “caixa” (box) - were causing some confusion, so I also tried adding them to the list of stopwords.

Finally, I tried dimensionality reduction - instead of computing the similarity of vectors of term-frequencies, I computed the similarity of lower-dimensionality vectors, which perhaps could capture the underlying categories behind the descriptions. I tried Latent Semantic Analysis, Latent Dirichlet Allocation, and Hierarchical Dirichlet Process (which unlike the first two doesn’t require us to set the number of dimensions a priori).

Programming-wise, I used Python and its Natural Language Toolkit and gensim packages.

The results were awful. They were not entirely nonsensical but even the best results were still wrong: for instance, “battery charger” was classified under “battery”. You can see that something gets captured correctly, but not enough to yield usable results.

So, I gave up on the vector space model altogether and decided that I needed a supervised learning approach.


To do supervised classification we need labeled data - i.e., government-purchased items that have already been classified. That limits our choices because only the federal government bothers to classify its purchases - the states and municipalities do not (if they did I wouldn’t need to come up with a classifier in the first place).

The federal government uses two taxonomies: the CATMAT (short for Catalog of Materials) and the CATSER (short for Catalog of Services). The CATMAT and CATSER are like the Harmonized System, with numerical codes and hierarchical levels. If we merge the CATMAT and CATSER into a single taxonomy we have 79 groups, 670 classes, a couple thousand headings, and about 200,000 materials/services. To give a concrete example, material “191505 - Aircraft” goes under heading “16733 - Aircraft”, which goes under class “1505 - Aircraft”, which goes under group “15 - Aircraft and its structural components”.

As you see, the CATMAT/CATSER is not exactly a beauty of logic: in the above example (which is not fictional) the first three hierarchical levels all mean the same thing (aircraft) and the more detailed levels do not inherit the numerical code of the less detailed levels (material 191505 goes under heading 16733, which goes under class 1505). Worse still, in group ”15 - Aircraft” we also find class “1510 - Aircraft with fixed pitch propellers”, which clearly should not be a class but a heading under “1505 - Aircraft”.

I googled around but I didn’t find a correspondence table with the Harmonized System, so I had to do with the CATMAT/CATSER (the Harmonized System, on the other hand, doesn’t have services, only goods, so there is that).

labeled data

Ok, so much for the taxonomy - on to the data. There is a public API anyone can use to scrape the purchases of the federal government. The problem is, it doesn’t always return the item descriptions - sometimes it does, sometimes it just says “see documentation”. The API is still in beta, so I guess over time more data will be added, but for now it doesn’t quite serve my purposes.

I found the data in this DW. Now, before you get your hopes high (in case you’re looking for the same data), that DW is gated. If you want access you need to write to the Ministry of Planning and ask them to give you a password. I didn’t do that myself (the agency I work for did), so I don’t know whom to contact there. Also, your username is your CPF number - the CPF is the Brazilian equivalent of the Social Security Number. So I have no idea what happens when you ask for access but you are a foreigner.

Anyway. I scraped all 5,049,563 purchased items I found in the DW, which covers the 1998-2014 period. The DW is poorly documented, but it doesn’t look like all purchases are there, especially for the late 90s and early 00s period. But five million is good enough and for each purchased item I have its description and its CATMAT/CATSER classification - material/service, heading, class, and group.

The 5,049,563 descriptions of purchased items contain a total of 450,680 unique words. That gives us a term-document matrix with 2,275,737,052,840 cells, which is a bit too much. Thus after some pre-processing (I lowercased everything, removed some special characters - but kept all the accented letters used in Portuguese -, TF-IDF’ed, and normalized the data) I split the data into 504 chunks of 10,000 samples each (and one chunk of 9,563 samples). Then I split the chunks into 70% training (353 chunks), 15% validating (76 chunks), 15% testing (76 chunks).

support vector machines

I went for support vector machines (SVM) right off the bat. The math behind SVM is more involved than that of k-means or cosine similarity, but the gist of it is that you want to separate your apples and oranges by the widest possible margin. You don’t just draw a line in the middle, you also draw two other, parallel lines (the “support vectors”), some distance away from the middle line, and you want the space between these two lines to be as empty as possible. You achieve that goal by having your loss function penalize any classification that falls in that region (“soft-margin” classification) and by (sort of) projecting your data onto a higher-dimensional space where no classifications fall in that region (the “kernel trick”; you don’t really need to project your data onto a higher-dimensional space: you use kernels, functions that can get to the solution implicitly, using only the dot prodcuts of your samples). (Check Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning, chapter 12, for an introduction to SVM.)

I used scikit-learn’s SGDClassifier class. Its partial_fit method allows online learning, i.e., it allows us to use one chunk of the data at a time (as opposed to batch learning, where we use all the data at the same time). The SGDClassifier uses (as the name suggests) stochastic gradient descent as an optimizer (its SGD implementation is influenced by Léon Bottou’s). To do multiclass classification (number of classes > 2, which is the case here) it uses a “one vs all” approach. The SGDClassifier is a flexible class: we can tweak the loss function, the regularization term (the term that penalizes model complexity), and a number of hyperparameters. The downside is that it doesn’t let us use a non-linear kernel.

I tried different loss functions (hinge, squared hinge, modified Huber) and different regularization terms (L1, L2, ElasticNet). I tried classifying based on the CATMAT/CATSER group (79 groups in total) and on the CATMAT/CATSER class (670 classes in total). I tried with and without stopwords and I tried tweaking a number of other hyperparameters (like the number of passes through the data and the learning rate).

I also tried weighing different words based on their position in the description. That’s because the initial words are usually the most relevant: a typical item description would be, for instance, “t-shirt in black, made of cotton, size M”. It’s clear that we want the classifier to pay more attention to “t-shirt” than to the other words. Hence, in one experiment, I gave each word a weight that increases roughly exponentially the closer the word is to the beginning of the description. (I say roughly exponentially because, to keep the implementation simple, I simply repeated the word \( (n \times n) / length(description) \) times, with \( n \) being the inverse of the word’s position. This often resulted in decimal numbers, but we can’t repeat a word 0.3 times, so in these cases I rounded up the result. I could have solved this by removing the \( length(description) \) denominator, but that would take forever to run.) This departs from the bag-of-words model, but I thought it might work. I also tried simply repeating the first and second words a number of times, leaving the other words as they were.

Finally, I tried classifying each sample in both a CATMAT/CATSER group and a CATMAT/CATSER class but ensuring that the group and class were consistent. The goal was to avoid having a sample classified in group “aircraft” and in class “jelly beans”, for instance. To do that I applied a three-stage strategy. First I classified according to CATMAT/CATSER groups. Then instead of classifying according to CATMAT/CATSER classes I estimated the probability of each sample belonging to each of the 670 classes. I then assigned the sample to the CATMAT/CATSER class of highest probability among those classes that belonged to the predicted group.

The whole thing (training, validating, testing) takes about 12 minutes to run (on a late 2013 MacBook Pro with 2.3 GHz Intel Core i7, 16GB of RAM, using Python 2.7.9 and scikit-learn 0.15.2 on an IPython notebook). I’m storing each data chunk as a SciPy sparse matrix saved as a Python pickle - certainly not the most efficient format, but with only 12 minutes of runtime that’s good enough.


I got the best results using the modified Huber function as the loss function, L1 as the regularization term, and CATMAT/CATSER groups as the categories: 73% of correct classifications (76% with the validation data). With CATMAT/CATSER classes (instead of groups) performance degrades to 62% of correct classifications (66% with the validation data). The other choices - stopwords, number of passes through the data, learning rate, weighing words by their positions, ensuring group-class consistency, etc - didn’t make much difference.

I re-ran the algorithm using subsets of the data and here’s how the percentage of correct classifications change:

As we see, performance seems to plateau after 1 million training samples or so. This is bad news: it means that my having 5 million samples is, by and large, just adding computational cost.

In general the more frequent the group or class, the better the corresponding classifications. The correlation between frequency, on the one hand, and percentage of correct classifications, on the other, is 0.5 for groups and 0.35 for classes. But it’s not like the bulk of the misclassifications is concentrated in a few rare groups or classes. The median frequency (in the test data, which has 759,763 samples) is 187 for classes and 3,400 for groups. So, most groups and classes just don’t appear that often and the misclassifications are pretty spread across them.

I’m not sure algorithms - random forest or neural networks - would help (I may give them a try if time permits). (I did try multinomial logit, just for the fun of it, but the results worsened considerably.)

Fortunately, the SGDClassifier class can produce probability estimates - with the predict_proba method - so we have an uncertainty measure. Right now I’m inspecting the probability distributions to find a threshold above which the classifications are sufficiently correct. Then, when using the classifier to query the data, I can instruct it to discard any samples whose classification probability is lower than that threshold.

Suggestions are welcome!

machine learning course with Andrew Ng

Here. It starts on January 19th. I’ve taken it before and it’s excellent in every way.