saving TfidfVectorizer without pickles
08 Dec 2015

As promised, here’s how to save a trained instance of scikit-learn’s TfidfVectorizer without using pickles - in other words, how to save it as human-readable, shareable data.
The general idea is in my previous post: a model is a set of coefficients, so you just extract them and save them as you would save any other data (like the very data you used to train the model). That way you avoid the security and maintainability problems of using pickles. You extract the coefficients, save them as data, then later you load them and plug them back in.
Now, that’s easier to do with some models than with others. With scikit-learn’s SGDClassifier, for instance, that’s a breeze. But with TfidfVectorizer that’s a bit tricky. I had to do it anyway so I thought I should write a how-to of sorts.
First we instantiate our TfidfVectorizer:
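A minimal sketch (the corpus here is just a stand-in for whatever documents you actually train on):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# any corpus will do for the example; use your real documents here
corpus = ['the first document',
          'the second document',
          'and the third one']

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
```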
Once we’ve trained the vectorizer it will contain two important attributes: idf_, a numpy array that contains the inverse document frequencies (IDFs); and vocabulary_, a dictionary that maps each unique token to its column number in the TF-IDF matrix.
To extract the IDF array you can just print it to the screen and then copy and paste it into a .py file. The file will look something like this:
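The exact numbers depend on your corpus, of course; the values below are just placeholders to show what the file looks like:

```python
# idfs.py - the IDF array pasted in by hand (values here are illustrative)
import numpy as np

idfs = np.array([1.69314718, 1.28768207, 1.69314718,
                 1.69314718, 1.69314718, 1.0, 1.69314718])
```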
To extract the vocabulary you can do the same, but depending on how many tokens you have this may not be practical. An alternative is to use JSON. Like this:
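Something along these lines:

```python
import json

# save the token-to-column mapping as plain JSON
# (if your scikit-learn version stores numpy integers here, cast them to int first)
with open('vocabulary.json', mode='w') as f:
    json.dump(vectorizer.vocabulary_, f)
```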
The vocabulary is now saved in the vocabulary.json file.
That’s it, we’ve disassembled our vectorizer. So far so good.
Now, it’s when we try to put everything back together that things get tricky.
We start by importing the TfidfVectorizer class. But we can’t instantiate the class right away. Here’s the problem: we are not allowed to assign arbitrary values to the idf_ attribute. If you instantiate the class and then try something like vectorizer.idf_ = idfs you get an AttributeError exception.
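To make that concrete, here’s a sketch of the attempt that fails:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

from idfs import idfs  # the IDF array we saved earlier

vectorizer = TfidfVectorizer()
vectorizer.idf_ = idfs  # AttributeError: there is no setter for idf_
```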
The problem is that the idf_ attribute is kind of “read-only”. I say “kind of” because that’s not exactly true: if you train the vectorizer then idf_ will change (it’ll have the IDFs). But idf_ behaves as read-only if you try to plug the IDFs directly, without training the vectorizer.

That happens because idf_ is defined with a @property decorator and has no corresponding setter method - check TfidfVectorizer’s source code.
I can’t imagine why the scikit-learn folks made that choice. That’s a bunch of smart people with a lot of programming experience, so I imagine they had good reasons. But that choice is getting in the way of proper model persistence, so here’s how we get around it:
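Here’s a sketch of that workaround (it assumes idfs is the array we loaded from idfs.py):

```python
class MyVectorizer(TfidfVectorizer):
    # plug our pre-computed IDFs in as a class attribute;
    # a plain attribute here shadows the read-only idf_ property
    idf_ = idfs

# instantiate it - the pre-computed IDFs are already in place
vectorizer = MyVectorizer()
```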
So, what’s happening here? We are creating a new class - MyVectorizer - which inherits all attributes (and everything else) that TfidfVectorizer has. And we are plugging our IDFs into the MyVectorizer class. When we instantiate MyVectorizer our pre-computed IDFs are already there, in the idf_ attribute. Problem solved.
But we’re not done yet. If you try to use the vectorizer now you’ll get an error:
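For example:

```python
vectorizer.transform(['some document'])
# fails with a "not fitted" error (the exact exception and message
# depend on your scikit-learn version)
```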
So, we’re being told that our vectorizer hasn’t been trained, even though we’ve plugged our pre-computed IDFs. What’s going on?
When we try to use our vectorizer there is a function check_is_fitted that checks, well, whether we have fitted the vectorizer. You’d think it checks the idf_ attribute but it doesn’t. Instead it checks the attribute of an attribute: ._tfidf._idf_diag, which is a sparse matrix made from the IDFs. So we need to plug that matrix into the vectorizer.
We can extract ._tfidf._idf_diag from the trained vectorizer, save it as data, then load and plug it - just like we did with the other attributes. But an easier alternative is to simply compute ._tfidf._idf_diag from our IDFs, using scipy.
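A sketch of that, using scipy.sparse (note that _tfidf._idf_diag is a private attribute, so this relies on scikit-learn’s internals staying as they were when this was written):

```python
import scipy.sparse as sp

# rebuild the diagonal IDF matrix that check_is_fitted looks for
vectorizer._tfidf._idf_diag = sp.spdiags(idfs,
                                         diags=0,
                                         m=len(idfs),
                                         n=len(idfs))
```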
Problem solved. All we need to do now is plug the vocabulary.
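For instance, by loading the vocabulary.json file we saved earlier and assigning it directly:

```python
import json

with open('vocabulary.json', mode='r') as f:
    vocabulary = json.load(f)

vectorizer.vocabulary_ = vocabulary
```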
Now our vectorizer works:
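For example (assuming the IDFs and the vocabulary we plugged in came from the same trained model):

```python
docs = ['a couple of new documents', 'to be transformed']
tfidf_matrix = vectorizer.transform(docs)
print(tfidf_matrix.shape)  # (2, number of tokens in the vocabulary)
```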
And we’re done.