(last updated: July 19th, 2016)
machine learning / natural language processing:
Prof. Ng’s online class. It’s certainly the best possible intro to machine learning you can find. Prof. Ng is an authority on the subject (he is director of the AI lab at Stanford) and a superb instructor.
Hastie, Tibshirani, and Friedman’s “The elements of statistical learning”. Comprehensive and free.
Prof. Shalizi’s lecture notes. They contain great informal presentations of difficult machine learning subjects and thus are a great companion to machine learning textbooks and courses.
Caltech’s machine learning video library. It’s incredibly comprehensive. If you feel overwhelmed start with this video, where Prof. Abu-Mostafa gives a nice overview of the field.
Manning, Raghavan, and Schütze’s “Introduction to information retrieval”. A draft version is available online. This book is an excellent starting point for text mining. If your interest is in text mining only (and not in search algorithms), you can start with chapter 6, which lays some foundations, then proceed to chapters 13-17.
Prof. Grimmer’s lecture notes. They are from his text analysis course at Stanford. He covers the main text analysis methods, giving both the intuition and the math. He also helps you make sense of the field as a whole – what’s related to what, in what way.
Stanford’s Natural Language Processing syllabus. Lots of great slides and readings. Deals with machine translation, deep learning, answering systems, and more. (Stanford, please make this a Coursera course!)
Katharine Jarmul’s PyCon 2014 talk teaches you how to webscrape with Python. It’s by far the best webscraping-with-Python resource I’ve ever seen.
Prof. Caren’s tutorials teach you the basics of webscraping with Python. No previous exposure to Python is assumed, so it’s a great place to start if you are in a hurry (if you are not, learn Python first, then webscraping).
Selenium lets you “remote control” your browser. It’s tremendously useful when you need to scrape difficult sites that don’t like to be scraped (e.g., LexisNexis and Factiva). In my own research Selenium has saved me months of manual, tedious work.
When I first started I couldn’t find any good tutorial on how to webscrape with Selenium, so I wrote one myself. It’s divided into five parts:
Katharine Jarmul’s recent talk (see above) also covers Selenium (it starts about 2h30m in) and she does a great job, so you should check it out.
Will Larson’s tutorial teaches you the art of compassionate webscraping — i.e., getting the stuff you want without disrupting the website’s operation.
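To give a rough idea of what “compassionate” scraping can look like in code, here is a minimal sketch. It is not taken from the tutorial: the `PoliteFetcher` class, the `my-research-bot` user agent, the two-second default delay, and the stand-in `fake_fetch` function are all assumptions made for illustration. It only shows two habits, namely honoring robots.txt and pausing between requests.

```python
import time
from urllib import robotparser


class PoliteFetcher:
    """Hypothetical helper: wraps any fetch function with a robots.txt
    check and a minimum delay between consecutive requests."""

    def __init__(self, fetch, robots_lines, user_agent="my-research-bot",
                 min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch            # function that takes a URL and returns content
        self.user_agent = user_agent
        self.min_delay = min_delay    # seconds to leave between requests
        self.clock = clock
        self.sleep = sleep
        self.rules = robotparser.RobotFileParser()
        self.rules.parse(robots_lines)  # lines of the site's robots.txt
        self._last_request = None

    def get(self, url):
        # Skip anything the site asks crawlers not to touch.
        if not self.rules.can_fetch(self.user_agent, url):
            return None
        # Wait until at least min_delay seconds have passed since the last request.
        if self._last_request is not None:
            wait = self.min_delay - (self.clock() - self._last_request)
            if wait > 0:
                self.sleep(wait)
        self._last_request = self.clock()
        return self.fetch(url)


# Stand-in for a real download function, so the sketch runs offline.
def fake_fetch(url):
    return "contents of " + url


robots = ["User-agent: *", "Disallow: /private/"]
fetcher = PoliteFetcher(fake_fetch, robots)
allowed = fetcher.get("https://example.com/articles/1")
blocked = fetcher.get("https://example.com/private/1")
```

In real use you would pass in a function that actually downloads the page and feed the class the lines of the site’s own robots.txt.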
WordNet. This dataset groups related words — synonyms, antonyms, hyponyms (as in “chair” being a type of “furniture”), and meronyms (as in “leg” being a part of a “chair”). There are also similar datasets for other languages.
GDELT (Global Data on Events, Location, and Tone). It contains over 200 million georeferenced events starting in 1979.
If you are new to programming, Udacity’s online class is the best place to start. Prof. Evans uses Python to teach you the basics of programming – things like hashing, recursion, and computational cost. The course is self-paced (unlike Coursera courses), so you may complete it in a week or two if you clear your schedule.
If you already know the basics of programming but never used Python, “Python in a Nutshell” is my pick. I would recommend reading it from cover to cover (it only takes a day or two). Otherwise you may waste precious time later on trying to google what a “tuple” is, or how to “unpack a list of lists”. It’s better to learn all the essentials upfront.
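For the record, here is a short sketch of those two examples. The variable names are made up, and “unpacking a list of lists” is read here in three common senses (unpacking inside a loop, flattening, and transposing), which is an assumption about what one might be googling for.

```python
# A tuple is an immutable, ordered sequence.
point = (3, 4)
x, y = point  # tuple unpacking: x gets 3, y gets 4

rows = [["a", 1], ["b", 2], ["c", 3]]

# 1) Unpack each inner list while looping over the outer one.
labels = [label for label, _ in rows]

# 2) Flatten the nested list into a single flat list.
flat = [item for row in rows for item in row]

# 3) Transpose rows into columns with the * (unpacking) operator.
letters, numbers = zip(*rows)
```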
Here’s a flowchart with the most common newbie mistakes.
Once you get the basic stuff out of the way, Problem Solving with Algorithms and Data Structures Using Python is an excellent introduction to, well, algorithms and data structures. Why learn these? Because by carefully designing your algorithms and choosing the right data structures you can make your code run much faster (or run at all, in case your current code exceeds your machine’s processing power or memory size).
To keep up to date with Python tools for data analysis, check the PyData talks on Vimeo.
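A small illustration of the data-structure point, assuming nothing beyond the standard library: checking whether a value is in a list scans the elements one by one, while a set uses hashing. The sizes and repetition counts below are arbitrary, and the exact timings will vary by machine.

```python
import timeit

n = 20_000
ids_list = list(range(n))
ids_set = set(ids_list)
missing = -1  # worst case for the list: every element gets compared

# Same question ("is this value present?"), two data structures.
list_time = timeit.timeit(lambda: missing in ids_list, number=200)
set_time = timeit.timeit(lambda: missing in ids_set, number=200)

print(f"list lookup: {list_time:.4f}s   set lookup: {set_time:.6f}s")
```

The answers are identical; only the cost differs, and the gap widens as the collection grows.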
Here’s a useful list of common mistakes you should avoid when using Python with big data.
Python has great stats libraries. The must-haves are NumPy and pandas. NumPy is great for matrix operations (transpose, multiply, etc.). Pandas is great for data management (merging tables, naming columns, and so on). So together they give you the matrix tools of R with the dataset tools of Stata. On top of that, vectorized NumPy and pandas operations can run at near-C speeds.
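Here is a minimal sketch of that division of labor. The arrays, table contents, and column names are toy values invented for the example.

```python
import numpy as np
import pandas as pd

# NumPy: matrix operations.
a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])
a_t = a.T        # transpose
product = a @ b  # matrix multiplication

# pandas: data management.
left = pd.DataFrame({"id": ["u1", "u2"], "score": [10, 20]})
right = pd.DataFrame({"id": ["u1", "u2"], "grp": ["x", "y"]})
merged = left.merge(right, on="id")                # merge two tables on a key
merged = merged.rename(columns={"grp": "group"})   # rename a column
```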
If you know Matlab, here is a neat equivalence table between Matlab and NumPy.
You’ll probably also want to have SciPy and scikit-learn. SciPy gives you regression, optimization, distributions, advanced linear algebra (e.g., matrix decomposition), and much more. Scikit-learn gives you machine learning: Naïve Bayes, support vector machines, k-nearest neighbors, and so on.
If you have large datasets you may want to look into PyTables as well. It has some tools that let you manage “out-of-core” data, i.e., data that doesn’t fit in memory.
To make the most out of NumPy, pandas, SciPy, scikit-learn, and PyTables you need to have some dependencies installed, like HDF5 and MKL. These tools can be a pain to install, but are worth it – your code will run much faster. If you want a quick-and-dirty solution you can simply download and install Anaconda Python or Enthought Canopy. These are “bundles” that come with everything included (Python itself, all the important modules, and all those low-level tools). There are downsides though (Anaconda issues annoying warnings, Canopy is a pain to use on remote machines, both crash sometimes).
If you have non-English texts, it’s worth learning the intricacies of character encoding. Here is a nice tutorial on that (and here is a bit of historical context). I’ve had to deal with texts in Spanish, Portuguese, and French, and all those accented letters (‘é’, ‘ã’, ‘ü’, etc.) must be handled carefully (so the code doesn’t break or the output doesn’t become unintelligible).
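To make those two failure modes concrete, here is a minimal Python 3 sketch. The sample string is made up; the two codecs, UTF-8 and Latin-1, are just a common pair to confuse.

```python
text = "ação, café, über"

# The same text becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # accented letters take two bytes each
latin1_bytes = text.encode("latin-1")  # every character takes one byte

# Failure mode 1: decoding with the wrong codec silently garbles the output.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Failure mode 2: decoding with the wrong codec breaks the code.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The safe habit: name the encoding explicitly on the way in and on the way out.
round_trip = utf8_bytes.decode("utf-8")
```

The same rule applies to files: passing `encoding="utf-8"` (or whatever the file really uses) to `open()` avoids depending on the platform default.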
Python’s Natural Language Toolkit (NLTK) gives you some tools for text-processing: tokenizing, chunking, etc. Two alternatives worth mentioning are the Stanford CoreNLP toolkit, which has several Python wrappers, and spaCy. If you’re looking for windows and buttons, JFreq is a popular choice, but it chokes on large corpora and I’ve found that it doesn’t handle accented characters well.
If you are into text analysis then you should also check out gensim. It gives you TF-IDF, LSA and LDA transformations, which means that you can do dimensionality reduction, handle synonymy and polysemy, and extract topics. And here is the best part: gensim handles huge datasets right out of the box, with no need to do any low-level coding yourself. I used gensim for one of my dissertation papers and it saved me months of coding.
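If TF-IDF is new to you, the idea fits in a few lines of plain Python. The sketch below is not gensim’s API and does not reproduce its exact weighting or normalization; it uses the textbook formula (term count times the log of total documents over documents containing the term) on three made-up sentences.

```python
import math
from collections import Counter

docs = [
    "the senate passed the budget bill",
    "the court struck down the bill",
    "the team won the final match",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc):
    """Weight each term by its count in the document times log(N / df)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}


weights = [tfidf(doc) for doc in tokenized]
print(weights[0])
```

Words that show up everywhere (like “the”) get a weight of zero, while words that are specific to one document get the highest weights, which is what makes the transformation useful before topic extraction.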
This is domain-specific, but if you’re doing text analysis in political science you should definitely check Prof. Benoit’s from time to time. Also, he just created a website dedicated to text analysis.
In case you are plagued by the curse of too much data, the University of Oklahoma has a workshop series on supercomputing.
Prof. Howe’s online class teaches you MapReduce and some SQL, which may come in very handy if you have tons of data.
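As a taste of the MapReduce idea, here is a single-machine word count in plain Python. It is only a sketch of the programming model (map, shuffle, reduce); the function names and sample documents are made up, and a real framework would run the phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


documents = ["big data big plans", "data beats plans"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(word_counts)
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be split across workers without the workers talking to each other.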
next on the list:
Here’s some stuff I haven’t touched yet but am eager to:
Data Structures and Algorithms in Python (book). It seems to be more comprehensive than Problem Solving with Algorithms and Data Structures Using Python (mentioned above).
The Elements of Computing Systems: Building a Modern Computer from First Principles and Operating System Concepts (books). I don’t have any formal training in computer science, so to me the hardware is, by and large, a mystery. I want to learn the basics.
High Performance Scientific Computing (course). I’ve been using high-performance environments for some time (the Ohio Supercomputer Center and Amazon EC2), learning on the fly, but I feel that there is a lot out there that I don’t know about and that could be useful to me, particularly when it comes to memory management and parallel code.
Paradigms of Computer Programming (course).