model persistence without pickles07 Dec 2015
So, I trained this SVM classifier and I wanted to use it in a web app I built. I used Python for everything, so it seemed straightforward at first: just use the pickle module to save the classifier to disk, then have the app load the pickle. But things got complicated. In the end I found a better way to achieve model persistence, so I thought I should share the experience.
The fundamental problem is that the classifier turned out huge. Not surprising: it was trained with 20 million documents and intended to pick one of 560 possible document categories. The resulting coefficient matrix has dimensions 560 (categories) by 505,938 (unique tokens). That’s a matrix with 283,325,280 cells. When pickled to a file it takes up 8GB of disk space.
I didn’t mind that at first. I thought “fine, so the app will take a few seconds to be ready after I deploy it, no problem”. But the app can’t load an 8GB pickle if there is only, say, 1GB of RAM. I did some tests and realized that I would need a server with at least 16GB of RAM to (barely!) host the app. I looked up server prices on Amazon Web Services and on Google Compute Engine. It would cost me some US$ 200 a month to keep the app alive. Not happening. (Have I mentioned that I live in Brazil and that our currency was massive devalued this year?)
So I gave up on hosting the app. I decided to open source the code instead and let users download and host the app themselves. But that turned my 8GB pickle into a problem. It’s ok to consume your own pickles (well, not really) but it’s not ok to expect other people to consume your pickles. Pickles can have malicious code. And pickles are not guaranteed to work across different versions of the same Python packages.
Now, a model is basically a bunch of coefficients - so why not store it as data? We shouldn’t have to store a model in a pickle or in any format that is not human readable. We can store a model as we store the very data that we used to estimate the model. And that’s what I propose we do.
I used scikit-learn’s stochastic gradient descent class to train my SVM classifier, which I instantiated with the following paramters:
Once the model is trained the coefficients are stored in the
clf.coef_ attribute as a numpy array of dimensions 560 (classes) by 505,938 (unique tokens).
As you can see, extracting the coefficients is trivial: just get
clf.coef_. But how do we store them as data? I toyed with a couple of ideas and in the end I chose HDF5. If you haven’t used it before, an HDF5 file is a “container” inside which you can store arrays. I had used HDF5 before and it’s great for fast retrieval of large amounts of data. To use it from Python you must have pytables installed. You don’t need to call pytables though - pandas has a nice interface to it. Here’s how I did it:
That’s it - we have extracted our coefficients and stored them in an HDF5 file. Here I had 560 categories and 505,938 unique tokens, so my HDF5 file contains 560 pandas DataFrames, each of length 505,938.
We are not done though. Each of the 560 classes has not only 505,938 coefficients but also one intercept. These are stored in the
clf.intercept_ attribute. You can store them with HDF5 as well but with only 560 intercepts I didn’t bother doing that. I just printed
clf.intercept_ to the screen and then copied and pasted it into a .py file. Dirty, I know, but quick and easy. The file looks like this:
Finally we need to extract our class labels. They are in
clf.classes_. Same as with the intercepts: I just printed the array to the screen and then copied and pasted it into a .py file.
Now we have our model nicely stored as data. People can inspect the HDF5 and .py files without (much) risk of executing arbitrary code. Our model is human readable and shareable. Now my app is indeed open source.
Ok, so much for disassembling the model. How do we put it back together?
Quick and easy. Instantiate the model, load the class labels, the coefficients and the intercepts, and plug everything in:
And voilà, we have reconstructed our model. The labels, intercepts and coefficients are in their proper places (i.e., assigned to the proper
clf attributes) and the model is ready to use. And everything runs a lot faster than if we were loading pickles.
Some models are more easily “datafied” than others. “Datafying” an instance of scikit-learn’s TfidfVectorizer’s class, for instance, is a bit tricky. I’ll cover that in another post.