We all know that academic writing stinks, but exactly how bad is it? And in what disciplines does academic writing stink the most? To answer these questions I scraped 174,527 academic articles, came up with a few indicators of “academese” and compared these indicators across disciplines.

our data source

I scraped those articles from SciELO. In case you’ve never heard of SciELO, it’s like JSTOR, but focused on Latin American journals. I would rather have scraped JSTOR, but it’s gated. I could scrape it using my OSU credentials, but the Aaron Swartz brouhaha is not an encouraging precedent. Also, JSTOR articles are usually in PDF, which sucks for text mining (especially when the PDF is the scan of a document, which is often the case in JSTOR). SciELO, by contrast, is ungated (it doesn’t even have a “terms of use” page, so I can’t possibly be breaking any rules) and its articles are in HTML, which is much easier to parse than PDF.

Most articles on SciELO are in Portuguese or Spanish but that’s not a problem here. The flaws of academic writing - cluttered sentences, meaningless words, plenty of jargon - are not language-specific. And thanks to the Normands the mumbo jumbo barely changes from one language to another: in English you problematize the discursive paradigm and in Portuguese você problematiza o paradigma discursivo.

summary statistics

The 174,527 articles I scraped are all the articles in Portuguese from all the 281 active journals on SciELO. Here’s how they are distributed:

We’ll look into specific disciplines in a minute.

(SciELO doesn’t categorize its journals, so the categorizations above are my own, partly based on Latindex’s. In case you are curious, here are the journal ISSN codes and respective areas.)

academese and how to measure it

…the secret of good writing is to strip every sentence to its cleanest components. Every word that serves no function, every long word that could be a short word, every adverb that carries the same meaning that’s already in the verb, every passive construction that leaves the reader unsure of who is doing what - these are the thousand and one adulterants that weaken the strength of a sentence. And they usually occur in proportion to education and rank. (William Zinsser, On Writing Well.)

As we see, academese is not a particular type of bad writing. The same set of flaws that we’ve been calling academese could very well be called journalese or bureaucratese. The specific jargon may change (academics critique, bureaucrats impact) but the underlying sins are the same: clutter and vagueness.

Now that we know what bad writing is we can measure it. Here are my indicators:

• Word length.
• Sentence length.
• Adverbs/total number of words. To identify adverbs I use the 1,000 most frequent adverbs in the Portuguese Corpus.
• Gerunds/total number of words. I count as a gerund any word that ends in ‘ndo’, which is how gerunds end in Portuguese - as in fazendo (doing), escrevendo (writing), etc.
• Inane words/total number of words. I don’t have an exhaustive list of inane words. Instead I selected any words that match one of the following regular expressions:
• '^problematiz.*': problematizar (to problematize), problematização (problematization), etc.
• '^paradigma.*': paradigma (paradigme), paradigmático (paradigmatic), etc.
• '^discursiv.*': discursivo (discursive), discursividade (discursivity), etc.
• '^dial\u00E9tic.*': dialética, dialético (both mean dialectic), etc.
• '^critic.*' and '^cr\u00EDtic.*': criticar (to criticize), crítico (critical, critic), criticado (criticized), etc. You may have qualms about this but think: are people supposed to do anything a-critically in academia? If not then critical is inane unless we’re talking about a specific disagreement (as in “Hayek criticized Keynes’ monetary theory”). Elsewhere - “critical theory”, “a critical look at”, etc - critical is pointless.

I picked these words because they are not specific to any disciplines. Sociologists and physicists alike can talk about paradigmes, problematize phenomena, or take a critical look at theories and experiments. Sociologists and physicists may differ in how often they use such inane words (that’s what I want to measure) but they both have the same opportunities to use them.

I considered enriching my list of inane words with words from Sokal’s hoax and from the Bad Writing Contest. But that might stack the deck against the humanities. Physicists probably don’t have many chances to write entelechy, otherness, or essentiality.

I’d rather use a machine learning approach but I couldn’t find a corpora of academic articles pre-labeled as academese/not academese (let alone one such corpora that happens to be in Portuguese). Hence what’s left is this sort of “dictionary” approach, which is much weaker - I’m certainly missing A LOT of relevant features (and possibly using some irrelevant ones). But it’ll have to do for now.

so, who stinks the most?

Here’s the ratio inane words / total number of words:

Adverbs / total number of words:

Gerunds / total number of words:

Average sentence length:

Average word length:

Clearly it’s in the humanities and social sciences that academese really thrives. Not exactly a shocking finding, but it’s nice to finally have some numbers. I didn’t expect the difference in inane words usage to be so large. Conversely, I expected word length to vary a lot more (maybe scientific names are inflating word size in the hard sciences?).

zooming in

Let’s zoom in on some of the humanities and social sciences. Here’s our articles:

Now let’s see how these different disciplines compare. Here’s the ratio inane words / total number of words:

Adverbs / total number of words:

Gerunds / total number of words:

Average sentence length:

Average word length:

Well, these are some disappointing data. I expected big differences - especially between pairs like economics and anthropology, or business and history. Turns out I was wrong. Sadly, there isn’t a clear “worst offender” for us to shame. Philosophy wins (well, loses) when it comes to poor word choices, but not by much, and philosophers don’t use longer sentences or words than the other social scientists.

what’s next

This analysis is not language-specific - what makes bad writing bad is the same set of flaws in English and in Portuguese. But perhaps this analysis is culture-specific? For instance, what passes for economics in Brazil is much different than what we find on the American Economic Review - there is a lot more “problematization of paradigms” in the tropics, so to speak. So I wonder how things would change if I used American articles instead. I’m not going to scrape any gated databases, lest I get in legal trouble. But maybe Google Scholar searches might contain enough ungated results? Or maybe there is some (large enough) ungated database that I don’t know of?

the gory details

Here’s the code I used to scrape SciELO, in case it may be useful to you. (Please, be considerate. SciELO is super nice to us scrapers: it’s ungated, there are no captchas, everything is in plain HTML, and - crucially - they don’t prohibit scraping. So don’t overparallelize or otherwise disrupt their servers.)

The code above will yield a little garbage: a few articles not in Portuguese and some texts that are not articles (copyright notices mostly). Not enough garbage to be worth fixing though.

To parse the sentences I used the following regular expression, which I stole from here.

To make the pie charts I used d3pie, which is amazing tool: you make the charts interactively and d3pie generates the code for you. (That sort of defeats the purpose of these posts, which is to help me practice D3, but it was just so quick and easy that I couldn’t resist.) To make the bar charts I just used plain D3. Here’s the code for the first bar chart (the code is almost the same for the others):

This is it!

# personal pronouns in Brazilian Portuguese

While English doesn’t have a singular you (unless you count thou), Portuguese has two: tu and você, with tu being preferred in some regions and você being preferred in others. A while ago an American friend who is learning Portuguese asked me where exactly people used one form or the other and, despite Portuguese being my first language, I couldn’t give her a proper answer. So, to atone for my unhelpfulness (and because I needed an excuse to start playing with D3) I scraped 680,045 geolocated tweets from Brazil and made the map below, where the higher the tu/você ratio the “redder” the state and the lower the tu/você ratio the “bluer” the state. Hover the mouse on the map to see the state names and respective ratios and number of tweets.

As we see, all states prefer você over tu - not a single state has a tu/você ratio higher than 1. And that preference is strong: in all states but one você is at least twice as popular as tu (i.e., the ratios are no higher than 0.5). The exception is Rio Grande do Sul (the purple, southernmost state), where the contest is close, with a tu/você ratio of 0.88. Which fits the popular perception of gaúchos (as natives from Rio Grande do Sul are often called) having a culture that is heavily influenced by the surrounding Spanish-speaking countries (tu is singular you in Spanish as well).

Other than in Rio Grande do Sul tu is most popular in Santa Catarina (tu/você ratio of 0.47), Maranhão (0.42), Amapá (0.35), Pará (0.30), and Roraima (0.30). Santa Catarina is in the South (it borders Rio Grande do Sul) while the other four states are in the North and Northeast. Você, on the other hand, is most popular in Sergipe (0.05), Minas Gerais (0.06), Goiás (0.06), Mato Grosso do Sul (0.08), Mato Grosso (0.08), São Paulo (0.08), and Espírito Santo (0.09) - basically the entire Southeast and Midwest regions.

I got curious to know how exactly people use tu. You see, conjugation is a tricky business in Romance languages. In English I go, you go, he/she/it goes, we go, you go, they go - the verb almost never changes. But in Portuguese eu vou (I go), tu vais (“singular you” go), você vai (“singular you” go), ele/ela vai (he/she/it goes), nós vamos (we go), vós ides (“y’all” go), vocês vão (“y’all” go), eles/elas vão (they go). The verb almost always changes. And as this example shows, tu and você call for different verb inflections: tu vais but você vai.

Or so Portuguese grammar dictates. But grammar rules are one thing and what people actually say are another. At least where I’m currently living (Brasília), when people use tu they often conjugate it as você: tu vai instead of tu vais; tu foi (“singular you” went) instead of tu foste; and so on. So, when people use tu do they conjugate it correctly?

Turns out they don’t. Out of the 680,045 tweets I scraped tu vai appears in 6,559 of them while tu vais appears in a mere 77 - too few to even allow state-wise comparisons.

Which is probably ok. By merging the conjugation rules of tu and você we simplify Portuguese grammar. Old English conjugation was hard: ic hæbbe (I have), þū hæfst (thou hast), hē/hit/hēo hæfð (he/it/she has), wē/gē/hīe habbaþ (we/”y’all”/they have) - and that’s just the nominative case; there were also the accusative, dative, and genitive cases, each calling for different verb inflections. But over the centuries English grammar got simplified and now conjugation is pretty straightforward - and I don’t see anyone complaining about that. Portuguese still needs to undergo a similar process. The less school time we spend learning arbitrary verb inflections the more time we can spend learning useful stuff.

In case this discussion has awaken your inner etymologist you may like to know that tu and thou are cognate - they both come from the same Proto-Indo-European word. And both have declined for the same reason: people were looking for less intimate ways to refer to one another. The English solution was to imitate the French by using the plural pronoun (you). The Portuguese solution was to imitate the Spanish by using Your Mercy - Vuestra Merced in Spanish, Vossa Mercê in Portuguese. Vossa Mercê is a bit cumbersome to say, so eventually people slurred it into você (just like Spanish speakers eventually slurred Vuestra Merced into usted).

(Btw, if you are into etymology you should check Kevin Stroud’s The History of English Podcast. It’s amazing to the point where I don’t even mind being stuck in traffic anymore.)

It would be interesting to know how usage of tu and você varies in other Portuguese-speaking countries, but only Brazil has a population big enough to yield 600k geolocated tweets each containing tu and/or você in under a week. It would also be interesting to know how tu and você usage has varied over time - in particular, when exactly você overtook tu as the most popular choice. But there was no Twitter before the 21st century (and Google Books N-Gram Viewer doesn’t have a Portuguese corpus).

Some caveats are in order. First, there may be intra-state variation. But for most cities there are too few tweets and I’d rather not say anything about how the people from this or that city speak based on only a handful of tweets. So, state capitals and other large cities are overrepresented here. Second, you need internet access and a device of sorts (a cell phone at least) in order to tweet, so the really poor are excluded from this analysis. There is some evidence that usage of tu declines with income, so I’m probably underestimating tu usage here. Third, people don’t necessarily tweet the same way they speak. For instance, perhaps the people of Rio Grande do Sul write tu more often than they speak it.

If you want the gory details, I used Twitter’s sample stream, which is a continuous stream of a small random sample of all tweets, in JSON format. Programming-wise I used Python’s Tweepy. Here’s my code (minus access key, token secret, etc; you can create your own Twitter-scraping credentials here):

This code will simply scrape the tweets and show them on your screen but won’t save them. To save, you need to point the stdout to some output file, as in: python scrapeTwitter.py > output.csv.

As you can see from the code, I limited my scraping to posts containing tu, você, voce, or vc (a common shorthand for você in Portuguese). The Twitter API doesn’t care about case, so that also yielded Tu, TU, vC, etc.

I also used Python to parse the tweets:

(You may want to comment out line 67 - print cleaned.encode('utf8'). Otherwise every tweet you scraped will show on your screen as it is parsed. That can make you lose faith in mankind.)

To make the choropleth (the fancy word for a map whose colors vary according to data) I initially considered plotly, but as it turns out they don’t handle maps natively and I didn’t like the workarounds I found. So I turned to D3. It’s not very intuitive, especially if you’re used to making your plots in R, Matlab, or Stata. You have to juggle three languages - JavaScript, HTML, and CSS - to get your plot to work and that can be confusing in the beginning. But it got the job done. The map itself was originally in shapefile format and I used Carolina Bigonha’s br-atlas to convert it to Topojson format.

Below is the code I used to produce the choropleth. Here I only have 27 observations, so I didn’t mind hardcoding the data for each of them. But if you have more observations (say, counties) you probably want to automate the process by having a function map tu/você ratios to the RGB color spectrum.

That’s it. More map-based posts should follow.

# now on GitHub Pages

Wordpress.com wouldn’t let me have interactive plots, so I just migrated this site to GitHub Pages. I believe no permalinks were harmed in the making of this change. All images, videos, and PDFs, must have been ported. And I tested the new layout on Chrome, Firefox, and Safari (IE users: you don’t have to live like that) and on iPhone and iPad. But please let me know if you find any issues.

There were some casualties: all comments up to now. Here I’m using Disqus to handle comments and, despite heavy googling, I couldn’t find a way to export Wordpress.com comments to Disqus (that’s possible with Wordpress.org though, by installing a plugin). So, I apologize to all past commenters. And a special apology to all the nice folks who have commented on my webscraping tutorial, which is by far the most popular content here (it accounts for way over half the visits) - I love it when people take the time to say it was useful or to provide feedback.

Migrating was pretty easy. I’m using Jekyll, a Ruby-based generator of static websites. Except for one pesky dependency, installing Jekyll was painless. Theme-wise I forked Hyde (yes, I see it), which I’m using almost unaltered. Most of the work went into migrating the posts. I found Exitwp, which by and large automates the task. But I had several posts with code and they gave me some trouble. For instance, Jekyll’s own syntax highlighter leaves something to be desired, so I had to google around to improve it a bit.

Using GitHub Pages has been a pleasant experience so far. I can now do pretty much whatever I want to with my site - which means it’s JavaScript time! I’m also enjoying being able to deploy the site locally before uploading it to GitHub. Jekyll is great too. It lets me write posts in Markdown, which is obviously good for writing math. I wonder how the Python alternative compares though, so I may try it out some day.

This is it for now. Posts with interactive plots should follow soon.

# classifying goods and services

I needed to produce an automated classifier of goods and services: something that takes a short description and returns the corresponding category, according to some taxonomy. For instance, you enter “jelly beans” and the classifier returns “Harmonized System heading 1704 - Sugar Confection”. The end goal is to classify all goods and services purchased with taxpayers’ money in Brazil, in all levels (federal, state, municipal) and branches (executive, legislative, judiciary) of government. I’ve spent the last few weeks working on this and here I summarize what worked and what didn’t.

getting some data

First I needed to get some data. For availability reasons I decided to use the purchases of the Federal District. (The Federal District s kind of like a state, except that it contains Brasília, the nation’s capital, and is subject to some special rules.) So, I scraped all the purchases of the Federal District off their website. That yielded a total of 139,624 purchases, which is about everything the Federal District has bought since 2005 (earlier purchases are not on the website). Some purchases involved a single item and others involved multiple items. For each purchase I got several data fields (price, date, etc), but the only one that interested me was the description of whatever it is that was being bought (“jelly beans”, “aircraft parts”, etc). If you are into that, here’s the script I wrote to do the scraping and here’s the script I wrote to parse the HTML into JSON files. They’re both in Python. (The comments are in Portuguese - sorry!) Naturally, these scripts may become useless if the website changes (I last used them on 12/18/14).

clustering

I decided to start with clustering - more specifically, k-means. I thought that if I could cluster similar items together then maybe I wouldn’t even need a taxonomy at all, I could just create my own taxonomy based on whatever the clusters were. In case this is the first you hear about k-means, here’s how it works: you treat every text (in my case, every item description) as a vector of term-frequencies, randomly create K centroids (K being the number of clusters you want to produce), assign each vector to its closest centroid, compute the squared distances between each vector and its centroid, recompute the centroids based on that assignment, reassign the vectors based on the new centroids, and repeat these last two steps until assignments stop changing. In practice you don’t use the vectors of term-frequencies themselves, but a transformation thereof (TF-IDF, which gives more weight to more discriminant words). And you also normalize the vectors, to account for the different sizes of the texts. (See Manning, Raghavan, and Schültze’s Introduction for Information Retrieval, chapter 16, for a proper explanation of k-means.)

I tried tweaking the number of clusters and removing some stopwords. I also tried using different initial centroids and k-means++ (which avoids convergence to local optima).

Programming-wise, I used Python and its scikit-learn package.

The results were a complete mess. The clusters were all over the place: for instance, “surgical gloves” and “buses” were in the same cluster. I briefly considered trying other clustering algorithms (maybe a hierarchical one) but the results were so bad that I just gave up on clustering for good.

cosine similarity

I moved on to cosine similarity, using the Harmonized System. In case you’re not familiar with cosine similarity, it’s just the cosine of the angle between two vectors - in this case, we’re talking about vectors of term-frequencies (or, as discussed before, their normalized TF-IDF transformation). The Harmonized System, in turn, is the taxonomy used in international trade. It’s structured in 96 chapters (ex.: Chapter 74 - Copper and Articles Thereof), which unfold in headings (ex.: 7408 - Copper Wire), which unfold in subheadings (ex.: 7408.21 - Wire of Copper-Zinc Base Alloys-Brass). (I used the Portuguese translation, not the English version.)

So, I took the description of each item purchased by the Federal District, measured its cosine similarity to the description of each of the subheadings of the Harmonized System, and then classified the item in the subheading with highest similarity.

I also tried using the Harmonized System headings instead of the subheadings. I noticed that the descriptions of the headings and subheadings are pretty short, so I also tried merging them together (so as to produce a more informative description). I noticed that the item descriptions were usually in the singular whereas the Harmonized System descriptions were usually in the plural, so I applied the first step of the RSLP-S stemmer in order to bring everything to singular. I noticed that quantity- and measurement-related words and abbreviations - “kg”, “mm”, “caixa” (box) - were causing some confusion, so I also tried adding them to the list of stopwords.

Finally, I tried dimensionality reduction - instead of computing the similarity of vectors of term-frequencies, I computed the similarity of lower-dimensionality vectors, which perhaps could capture the underlying categories behind the descriptions. I tried Latent Semantic Analysis, Latent Dirichlet Allocation, and Hierarchical Dirichlet Process (which unlike the first two doesn’t require us to set the number of dimensions a priori).

Programming-wise, I used Python and its Natural Language Toolkit and gensim packages.

The results were awful. They were not entirely nonsensical but even the best results were still wrong: for instance, “battery charger” was classified under “battery”. You can see that something gets captured correctly, but not enough to yield usable results.

So, I gave up on the vector space model altogether and decided that I needed a supervised learning approach.

taxonomy

To do supervised classification we need labeled data - i.e., government-purchased items that have already been classified. That limits our choices because only the federal government bothers to classify its purchases - the states and municipalities do not (if they did I wouldn’t need to come up with a classifier in the first place).

The federal government uses two taxonomies: the CATMAT (short for Catalog of Materials) and the CATSER (short for Catalog of Services). The CATMAT and CATSER are like the Harmonized System, with numerical codes and hierarchical levels. If we merge the CATMAT and CATSER into a single taxonomy we have 79 groups, 670 classes, a couple thousand headings, and about 200,000 materials/services. To give a concrete example, material “191505 - Aircraft” goes under heading “16733 - Aircraft”, which goes under class “1505 - Aircraft”, which goes under group “15 - Aircraft and its structural components”.

As you see, the CATMAT/CATSER is not exactly a beauty of logic: in the above example (which is not fictional) the first three hierarchical levels all mean the same thing (aircraft) and the more detailed levels do not inherit the numerical code of the less detailed levels (material 191505 goes under heading 16733, which goes under class 1505). Worse still, in group ”15 - Aircraft” we also find class “1510 - Aircraft with fixed pitch propellers”, which clearly should not be a class but a heading under “1505 - Aircraft”.

I googled around but I didn’t find a correspondence table with the Harmonized System, so I had to do with the CATMAT/CATSER (the Harmonized System, on the other hand, doesn’t have services, only goods, so there is that).

labeled data

Ok, so much for the taxonomy - on to the data. There is a public API anyone can use to scrape the purchases of the federal government. The problem is, it doesn’t always return the item descriptions - sometimes it does, sometimes it just says “see documentation”. The API is still in beta, so I guess over time more data will be added, but for now it doesn’t quite serve my purposes.

I found the data in this DW. Now, before you get your hopes high (in case you’re looking for the same data), that DW is gated. If you want access you need to write to the Ministry of Planning and ask them to give you a password. I didn’t do that myself (the agency I work for did), so I don’t know whom to contact there. Also, your username is your CPF number - the CPF is the Brazilian equivalent of the Social Security Number. So I have no idea what happens when you ask for access but you are a foreigner.

Anyway. I scraped all 5,049,563 purchased items I found in the DW, which covers the 1998-2014 period. The DW is poorly documented, but it doesn’t look like all purchases are there, especially for the late 90s and early 00s period. But five million is good enough and for each purchased item I have its description and its CATMAT/CATSER classification - material/service, heading, class, and group.

The 5,049,563 descriptions of purchased items contain a total of 450,680 unique words. That gives us a term-document matrix with 2,275,737,052,840 cells, which is a bit too much. Thus after some pre-processing (I lowercased everything, removed some special characters - but kept all the accented letters used in Portuguese -, TF-IDF’ed, and normalized the data) I split the data into 504 chunks of 10,000 samples each (and one chunk of 9,563 samples). Then I split the chunks into 70% training (353 chunks), 15% validating (76 chunks), 15% testing (76 chunks).

support vector machines

I went for support vector machines (SVM) right off the bat. The math behind SVM is more involved than that of k-means or cosine similarity, but the gist of it is that you want to separate your apples and oranges by the widest possible margin. You don’t just draw a line in the middle, you also draw two other, parallel lines (the “support vectors”), some distance away from the middle line, and you want the space between these two lines to be as empty as possible. You achieve that goal by having your loss function penalize any classification that falls in that region (“soft-margin” classification) and by (sort of) projecting your data onto a higher-dimensional space where no classifications fall in that region (the “kernel trick”; you don’t really need to project your data onto a higher-dimensional space: you use kernels, functions that can get to the solution implicitly, using only the dot prodcuts of your samples). (Check Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning, chapter 12, for an introduction to SVM.)

I used scikit-learn’s SGDClassifier class. Its partial_fit method allows online learning, i.e., it allows us to use one chunk of the data at a time (as opposed to batch learning, where we use all the data at the same time). The SGDClassifier uses (as the name suggests) stochastic gradient descent as an optimizer (its SGD implementation is influenced by Léon Bottou’s). To do multiclass classification (number of classes > 2, which is the case here) it uses a “one vs all” approach. The SGDClassifier is a flexible class: we can tweak the loss function, the regularization term (the term that penalizes model complexity), and a number of hyperparameters. The downside is that it doesn’t let us use a non-linear kernel.

I tried different loss functions (hinge, squared hinge, modified Huber) and different regularization terms (L1, L2, ElasticNet). I tried classifying based on the CATMAT/CATSER group (79 groups in total) and on the CATMAT/CATSER class (670 classes in total). I tried with and without stopwords and I tried tweaking a number of other hyperparameters (like the number of passes through the data and the learning rate).

I also tried weighing different words based on their position in the description. That’s because the initial words are usually the most relevant: a typical item description would be, for instance, “t-shirt in black, made of cotton, size M”. It’s clear that we want the classifier to pay more attention to “t-shirt” than to the other words. Hence, in one experiment, I gave each word a weight that increases roughly exponentially the closer the word is to the beginning of the description. (I say roughly exponentially because, to keep the implementation simple, I simply repeated the word $$(n \times n) / length(description)$$ times, with $$n$$ being the inverse of the word’s position. This often resulted in decimal numbers, but we can’t repeat a word 0.3 times, so in these cases I rounded up the result. I could have solved this by removing the $$length(description)$$ denominator, but that would take forever to run.) This departs from the bag-of-words model, but I thought it might work. I also tried simply repeating the first and second words a number of times, leaving the other words as they were.

Finally, I tried classifying each sample in both a CATMAT/CATSER group and a CATMAT/CATSER class but ensuring that the group and class were consistent. The goal was to avoid having a sample classified in group “aircraft” and in class “jelly beans”, for instance. To do that I applied a three-stage strategy. First I classified according to CATMAT/CATSER groups. Then instead of classifying according to CATMAT/CATSER classes I estimated the probability of each sample belonging to each of the 670 classes. I then assigned the sample to the CATMAT/CATSER class of highest probability among those classes that belonged to the predicted group.

The whole thing (training, validating, testing) takes about 12 minutes to run (on a late 2013 MacBook Pro with 2.3 GHz Intel Core i7, 16GB of RAM, using Python 2.7.9 and scikit-learn 0.15.2 on an IPython notebook). I’m storing each data chunk as a SciPy sparse matrix saved as a Python pickle - certainly not the most efficient format, but with only 12 minutes of runtime that’s good enough.

results

I got the best results using the modified Huber function as the loss function, L1 as the regularization term, and CATMAT/CATSER groups as the categories: 73% of correct classifications (76% with the validation data). With CATMAT/CATSER classes (instead of groups) performance degrades to 62% of correct classifications (66% with the validation data). The other choices - stopwords, number of passes through the data, learning rate, weighing words by their positions, ensuring group-class consistency, etc - didn’t make much difference.

I re-ran the algorithm using subsets of the data and here’s how the percentage of correct classifications change:

As we see, performance seems to plateau after 1 million training samples or so. This is bad news: it means that my having 5 million samples is, by and large, just adding computational cost.

In general the more frequent the group or class, the better the corresponding classifications. The correlation between frequency, on the one hand, and percentage of correct classifications, on the other, is 0.5 for groups and 0.35 for classes. But it’s not like the bulk of the misclassifications is concentrated in a few rare groups or classes. The median frequency (in the test data, which has 759,763 samples) is 187 for classes and 3,400 for groups. So, most groups and classes just don’t appear that often and the misclassifications are pretty spread across them.

I’m not sure algorithms - random forest or neural networks - would help (I may give them a try if time permits). (I did try multinomial logit, just for the fun of it, but the results worsened considerably.)

Fortunately, the SGDClassifier class can produce probability estimates - with the predict_proba method - so we have an uncertainty measure. Right now I’m inspecting the probability distributions to find a threshold above which the classifications are sufficiently correct. Then, when using the classifier to query the data, I can instruct it to discard any samples whose classification probability is lower than that threshold.

Suggestions are welcome!