So, Brazilian banks are using predictive models to do property valuation but they are doing it wrong. It’s time for us data folks to step in and cause some disruption.
our hypothetical property
Let’s assume that we want to find the market value of an apartment in Brasília. It’s 120m2, it’s in the Noroeste sector (Brasília is a planned city, so it has sectors instead of neighborhoods), it has three bedrooms, three bathrooms, and two parking spots. The condo fee is R$ 580/month and the property tax is R$ 745/month.
It’s 2019, so there’s no need to go knocking on doors asking people how much they paid for their apartments, how many bedrooms they have, etc. We can simply scrape online listing sites like ZAP, wimoveis, or Viva Real. Any of the three would do (and scraping all three of them would be best). Here I’ll scrape wimoveis for the simple reason that it is the easiest one to scrape.
Granted, that just gives us asking prices, not transaction prices. But the alternative is to go knocking on doors asking people about their apartments. (Unless of course you are a bank and you have tons of mortgage data.)
Here’s the (Python) code I wrote to scrape the result pages:
Every page of results contains up to 20 listings. But it only has summary information for each listing. The full data is in each listing’s own URL. So we need to get the URL of every listing from every page. I use BeautifulSoup for that:
Now we’re finally ready to scrape the listings themselves, with all the data we need (price, m2, pictures, etc).
This will take a couple of days to run, because of all the pictures you’re downloading.
In the end we’ll have over 15k samples. That’s up from the current 25-250 samples that real estate appraisers are using.
Ok, what we have now is a huge mess of HTML and JPG files. The data we need is all buried in those files. We need to extract it.
For now I’ll ignore the JPG files and only use the HTML files.
Hooray, now we’ve put all the (non-image) data in a CSV file with proper column names and everything.
throwing data away
Now that we have all that data it’s time to throw some of it away.
You see, people are lazy. When they list their properties on wimoveis they don’t bother to tick all the boxes - “pool”, “playground”, “A/C”, etc. Whatever they consider relevant they’ll write down in the text field (often with lots of adjectives and exclamation marks). The result is that our CSV file is mostly empty: most of its cells are missing data. This varies according to the feature we’re talking about. But the vast majority of the features have simply too many missing data points to be useful. So let’s clean up a bit.
There! Now we have a clean, usable dataset.
train the model
I tried a few different algorithms to estimate the properties’ asking prices: linear regression, SVM, random forest, boosted trees, neural networks. For each of these algorithms (except linear regression) I tweaked the corresponding parameters a bunch of times (and for neural networks I tried lots of different architectures). The clear winner was boosted trees (which won’t be so surprising to Kaggle competitors).
Just a quick note: we discarded lots of features in the previous step because of missing data. Here we’ll add some of them back. People don’t always tick the “barbecue” box when filling out their listings, but they usually mention it in the text field. So the code below scans the text field looking for certain words.
This gives me a median absolute error of R$ 46k. In proportional terms (i.e., error / asking price) we have a median absolute error of 10% and a mean absolute error of 23%. Which is line with previous work (see here), where the mean absolute error is 25%-30%, and here, where the mean absolute error is 22%.)
We’re not capturing everything here. Say, maybe the property is next door to a police station or to a church or to a loud bar. Maybe there was a murder in the premises. Etc etc. My point is not that these estimates should be final. My point is simply that these estimates are probably closer to the truth than the ones being produced today by professional appraisers all over Brazil.
Alright then, time to appraise our Noroeste apartment. Just append the following lines to the previous code block and run it.
And voilà, we have our point estimate: R$ 978k (That’s about US$ 254k).
We also have a prediction interval with lower and upper bounds: [788k, 1060k]. To produce this interval I used something similar to quantile regression. The lower bound is an estimate of the 25th percentile of the distribution. The upper bound is an estimate of the 75th percentile. The point estimate is an estimate of the 50th percentile (i.e., the median). As we have three different models, the lower and upper bounds are not centered around the point estimate (we actually have three point estimates). More details here.
Here I’m scanning the property descriptions for words like “A/C”, “barbecue”, etc, and featurizing them as dummies. But you can use the texts themselves in the model. Just insert the following code between the line where you shuffle the samples and the line where you instantiate the model:
The TFIDF matrix is too big - we end up with more columns than samples. So we don’t use it directly, we use LSA to reduce the TFIDF matrix to a documentsXtopics matrix of 400 topics.
This improves the performance of the model a bit. But it’s ultimately nonsensical: when you’re appraising a new property you could keep tweaking the text until you get the price you want. So I did this just to inspect which topic vectors would be more relevant (see here), and then which words had more weight in these topics. This helped me decide which words to look for in the text fields (sauna, pool, etc).
I’m still figuring out the best way to use the pictures. I tried using the metadata first: number of pictures, height and width of the pictures, etc. That didn’t improve the model. (I know, shocking. But I like to try the simple things first.)
I also checked whether the dominant colors help us predict the price. To do that I clustered the pixels of every picture of the property. Each pixel is defined by three values: R, G, B, each of which can vary from 0 to 255 and represents the intensity of each of the three primary colors (red, green, and blue). So the pixels exist in the same tridimensional space and therefore we can cluster them. The centroid of each cluster is a dominant color.
Ideally we’d use DBSCAN for this, as the clusters may have widely different sizes and shapes and we don’t even know a priori how many clusters each picture has. But DBSCAN just takes forever to run. So I used k-means instead. I used the elbow technique to find the ideal number of clusters and it turns out that for most images that number was two or three.
That was a massive waste of time. K-means is faster but it still took almost a week to run. And in the end those centroids didn’t improve the model one bit.
A friend who knows a lot more about images than I do suggested that I try something along these lines. I.e., having a branched neural network where I can input both structured features (m2, location, etc) and image features. So that’s what I’m trying right now. It’s tricky though because the number of pictures varies across samples, the pictures are of different sizes, and the pictures aren’t standardized in any way (some listings have five pictures of the kitchen and none of the bedrooms, others have no pictures of kitchen, etc).
The same friend also suggested that I use some pre-trained neural network capable of identifying objects like “window”, “A/C unit”, and so on, and then use the identified objects as features. That’s the next item on my to-do list.
All that said, the truth is that I’m not sure the images will be useful in the end. It’s like with the texts: you could keep taking new pictures of the property until you get the “right” price. I think that’s harder to do with pictures than with texts, but who knows. I need to think more about it. Suggestions are welcome.
“someone must be doing it already!”
You bet. ZAP is doing it. Which makes sense: they have tons of data, so why not use it? In fact just last month they announced the next step: they’ll start buying property themselves, picking the ones that their model suggests are underpriced.
In the US Zillow is doing it. I bet that there are plenty of similar initiatives all over the world.
So I’m not proposing anything new here. Which makes it hard to understand why the heck Brazilian banks are not doing it yet. They have billions at stake.
I know better than to second guess the choices of people with skin in the game. But 70% of all mortgages in Brazil are concentrated in a state-owned bank - Caixa Econômica Federal (CEF). And when the state is involved the incentives are different. It’s not about delivering results but about broadening your mandate, blame-shifting, and securing resources. (If you want to have an idea of how government works, watch HBO’s Chernobyl.)
So it’s not a market failure that we have here, but by and large a state failure. CEF’s bureaucrats do not get punished for using the wrong tool to do property valuation. In a state-owned company there is a lot of noise between your actions and your punishments/rewards. Which helps explain CEF’s long history of incompetence and corruption.
Not to mention that an entire ecosystem has evolved around the status quo. You see, to be contracted by CEF as an appraiser you need to be certified by the Federal Council of Realtors (yes, that exists). There’s an exam for that and there are schools that charge good money to help prepare you for that exam. So, there are lots of people who benefit from the current system, which makes it harder to change.
I guess this is it. I hope this post has inspired you to roll up your sleeves and do your own valuations from now on.