A note on real estate appraisal in Brazil

New paper. Abstract:

Brazilian banks commonly use linear regression to appraise real estate: they regress price on features like area, location, etc, and use the resulting model to estimate the market value of the target property. But Brazilian banks do not test the predictive performance of those models, which for all we know are no better than random guesses. That introduces huge inefficiencies in the real estate market. Here we propose a machine learning approach to the problem. We use real estate data scraped from 15 thousand online listings and use it to fit a boosted trees model. The resulting model has a median absolute error of 8,16%. We provide all data and source code.

insider trading in Brazil's stock market

New manuscript. Abstract:

How much insider trading happens in Brazil’s stock market? Previous research has used the model proposed by Easley et al. [1996] to estimate the probability of insider trading (PIN) for different stocks in Brazil. Those estimates have a number of problems: i) they are based on a factorization that biases the PIN downward, especially for high-activity stocks; ii) they fail to account for boundary solutions, which biases most PIN estimates upward (and a few of them downward); and iii) they are a decade old and therefore based on a very different market (for instance, the number of retail investors grew from 600 thousand in 2011 to 3.5 million in 2021). In this paper I address those three problems and estimate the probability of insider trading for 431 different stocks in the Brazilian stock market, for each quarter from October 2019 to March 2021.

getting historical data from MetaTrader

Getting historical intraday financial data can be a pain, especially for non-US markets. If you have deep pockets you can simply buy the data you need, but for retail investors the cost is prohibitive. If you want historical transaction-level data for the Brazilian stock market, for instance, TickData will sell it to you for about US$ 65000. Hard pass. What to do?

I recently learned about an app called MetaTrader. MetaTrader gives you real-time stock price charts. It is popular with people who do technical analysis - people who buy and sell equities based on certain chart patterns (like a “head and shoulder formation”, for instance). I’ve never bothered with technical analysis and I guess that’s why MetaTrader had escaped my radar until now.

The important thing is: many brokers pay MetaTrader so that their clients can access it, and MetaTrader has historical intraday data for whatever market(s) each broker operates in.

I checked and it turns out that my broker in Brazil has one such deal with MetaTrader. I got a username and password, downloaded the app, and started exploring.

If you’re only interested in a couple of equities you can export the data manually. Go to the View menu and click Symbols.

A new window will open:

The “BMF” and “Bovespa” you see in that window are the names of two exchanges in Brazil (they’ve been merged into a single exchange called B3 but I guess MetaTrader is keeping the old names for now). You will see different names, depending on where your broker operates.

Say you’re interested in tick data. Just click on the Ticks tab, enter the ticker for the equity you want, and click Request. Here I am requesting tick data for BOVA11, an ETF that tracks Brazil’s main stock market index:

The Export Ticks button lets you save the data in a CSV format.

If you want time bars then go to the Bars tab instead.

The data you get depends on what your broker provides MetaTrader. I explored a bit and it looks like lower frequency data requests yield longer time series. When I request 1-minute BOVA11 bars I only get a year or so of data, but when I request 5- or 10-minute BOVA11 bars I get a few more years. That’s a far cry from the 13 years of data TickData sells, but for us DIY wannabe quants that’s enough to play around a bit. And, well, it costs zero dollars. (With TickData you’d need to pay US$ 6750 to access a single year of tick data from Brazil).

That’s all there is to it - if you are only interested in a couple of equities. But what if you want data for hundreds of equities? Or for all equities? Fortunately, MetaTrader has a nice Python API that you can access after pip installing the MetaTrader5 package.

Unfortunately, the MetaTrader5 package only works on Windows. If you want to use it on a macOS or Linux machine you’ll need Wine. I have access to a Windows machine so that wasn’t a problem for me.

Here is a minimal working example of how the Python API works. This snippet requests tick data (trade ticks only - no bid/ask ticks) for BOVA11 between 2021-01-01 and 2021-01-07 and then saves the results in a CSV file. Before you run this script you need to launch the MetaTrader app and log into it using the credentials your broker gave you.

import pandas as pd
import MetaTrader5 as mt5
from datetime import datetime, timedelta

# connect to MetaTrader 5
if not mt5.initialize():
    print('initialize() failed')
# request tick data
ticks = mt5.copy_ticks_range(
    datetime(2021, 1, 1), 
    datetime(2021, 1, 7), 
ticks = pd.DataFrame(ticks)
ticks.to_csv('BOVA11_ticks.csv', index = False)

# shut down connection to MetaTrader 5

If you want data for more equities just loop through the corresponding tickers. If that’s a lot of tickers there is a method symbols_get() that will return all tickers. For instance, the tickers for Brazilian ordinary stocks always have four letters followed by the number 3 - PETR3, HYPE3, etc. If you want only ordinary stocks you can use symbols_get() and then filter out any string that doesn’t have five characters or doesn’t end in 3.

One issue I’ve come across is that sometimes the API returns no data even though a manual search (using the GUI) shows that there is data. That doesn’t happen often but when it does it’s always with highly liquid equities, so I’m guessing if there is too much data to return you get nothing instead. So you may want to loop through shorter intervals - months instead of quarters, or weeks instead of months, or days instead of weeks, depending on how much data you’re requesting.

Another thing to keep in mind is that MetaTrader generates temp files for your requests and they are big as heck. It got to the point where my hard drive ran out of space. I had to use Python’s os to delete those temp files on-the-fly. They are saved to AppData/Roaming/MetaQuotes/Terminal/HASH/bases/BrokerID/ by default. If you want them to be saved elsewhere you have to move the MetaTrader installation folder to the new location and then execute it using cmd, with AlternateLocation/terminal64.exe /portable The temp files will be stored in AlternateLocation/Bases/BrokerID/

Here is a more complete example:

import os
import time
import calendar
import pandas as pd
import MetaTrader5 as mt5
from datetime import datetime, timedelta

# connect to MetaTrader 5
if not mt5.initialize():
    print('initialize() failed')
# get connection status and parameters

# get MetaTrader 5 version

# path to CSVs
path_to_csvs = 'C:/Users/YourUserName/Desktop/ticks/'

# path to temp data (so we can delete it)
path_to_tmp = 'C:/Users/YourUserName/AppData/Roaming/MetaQuotes/Terminal/HASH/bases/BrokerID/'

# get all B3 tickers
symbols = mt5.symbols_get()

# keep only tickers for ordinary stocks
tickers = []
for symbol in symbols:
    ticker = symbol.name
    if not ticker[:4].isalpha():
    if (len(ticker) == 5) and (ticker[-1] == '3'):

# month-years to scrape
months = {
    2019: (10, 11, 12),
    2020: range(1, 13),
    2021: (1, 2, 3)

# loop through tickers
start = time.time()
for i, ticker in enumerate(tickers):

    # loop through month-years
    for year in months.keys():
        for month in months[year]:
            print(' ')
            print(i, 'of', len(tickers), ticker, year, month)

            # set date range
            t0 = datetime(year, month, 1)
            last_day = calendar.monthrange(year, month)[1]
            t1 = datetime(year, month, last_day)

            # request tick data
            ticks = mt5.copy_ticks_range(
            ticks = pd.DataFrame(ticks)

            # log if results are empty
            if ticks.shape[0] == 0:
                with open('log.txt', mode = 'a') as f:
                    l = ticker + ',' + str(year) + ',' + str(month) + '\n'
                    print('empty DataFrame:', l)

            # persist
            ticks['time'] = pd.to_datetime(ticks['time'], unit = 's')
            ticks.columns = [
            ticks.to_csv(path_to_csvs + ticker, index = False)

            # don't over-request

    # delete tmp files
    for fname in os.listdir(path_to_tmp + ticker + '/'):
            os.remove(path_to_tmp + ticker + '/' + fname)

    # how long did it take?
    elapsed = time.time() - start
    print('it took', round(elapsed / 60), 'minutes')

# shut down connection to MetaTrader 5

Finally, a word about the “flags” column (you can see it in the BOVA11 screenshot above). That column encodes a lot of information. For instance, it encodes whether the tick refers to a buy-initiated trade or to a sell-initiated trade. But it’s not as simple as “56 means buy-initiated” or anything like that. It took me some doing to learn how to interpret the flags. Turns out they are bit masks. To extract the information encoded in each flag you need to do a bunch of bitwise operations. For instance, to tell whether a given tick represents a buy- or sell-initiated trade you need to pass the flag through a function like this:

def buy_or_sell(flag):
    see https://www.mql5.com/en/forum/75268
    for explanation on MetaTrader flags
    if (flag & 32) and (flag & 64):
        return 'both'
    elif flag & 32:
        return 'buy'
    elif flag & 64:
        return 'sell'

MetaTrader uses 32 to encode “buy-initiated” and 64 to encode “sell-initiated”. Say the flag for a given tick is 56. The bitwise operation 56 AND 64 returns 0. That means the tick does not refer to a sell-initiated trade. But the bitwise operation 56 AND 32 returns 32. That means the tick refers to a buy-initiated trade. (In some rare cases the tick refers to a trade that was simultaneously initiated by buyer and seller.) The flag 212, on the other hand, returns opposite results: 212 AND 32 returns 0 and 212 AND 64 returns 64 - in other words, a tick with the flag 212 refers to a sell-initiated trade.

Here is information on everything else that each flag encodes.

This is it. Let me know if there are other sources of intraday data I should check out. What are the cool kids using?

Unknown Market Wizards

I just finished Jack Schwager’s Unknown Market Wizards. It’s a collection of interviews with retail traders. The interviewees follow different approaches - fundamental, technical, quant, social media mining - but they share similar attitudes and Schwager does a good job hypothesizing how these attitudes might relate to trading outcomes. On top of that, Schwager is a superb interviewer and he shows the world of retail traders in a way that’s neither the aggregate statistics of academic papers nor the get-rich-quick Youtube ads selling day trading courses. I’ve always thought of traders as people who somehow missed Burton Malkiel’s A Random Walk Down Wall Street. Schwager showed me a much more interesting picture. Here are some highlights:

skill vs luck

The eleven traders in the book all have spectacular performances. We’re talking annualized returns of 20%-300%, sometimes for several years. But the book doesn’t give us enough data to know whether those performances are the result of skill or mere luck. If you ride a big bull market it’s possible to have stellar returns with random bets. Bitcoin has yielded an annualized return of 558% since May 2015. That doesn’t make early Bitcoin adopters market wizards (though some of them certainly are).

“The magnitude by which the traders in this book outperformed market benchmarks over long periods (typically, decade-plus) cannot simply be explained away as ‘luck.’”, Schwager writes (p. 348). To give only two examples, Daljit Dhaliwal never had a negative year and he was profitable in 95% of quarters, and Pavel Krejcí had positive returns in 93% of all quarters. But to tell luck from skill we would need to know who was trading what when and what each monthly return was, then follow an approach like Fama & French’s where they compare actual returns with simulated returns. Schwager doesn’t give us a data appendix, so we can’t do that. (The data is not Schwager’s to publicize but he could have run the numbers himself and showed us the results.)

It’s tempting to guess which of the eleven traders are skilled (as opposed to merely lucky) based on the narrative each of them offers. Several of them emphasize the importance of updating your beliefs based on new information (so that you can limit your losses if the market turns against you, for instance), but without overcorrecting. Some also emphasize the importance of sizing your bets according to how confident you are. These attitudes are typical of Philip Tetlock’s superforecasters.

But they are just narratives. Each trader (except for the one quant guy) is describing his mental model of himself. That mental model is probably a better trader than the actual human behind it. And at times the narratives sound wishy-washy: “I am using my feelings as an input to trading.” (p. 110), “The human emotions that we feel can be used as a signal source.” (p. 193), that sort of thing. (Though for the most part the narratives are pretty Tetlockian.)

In short, it is possible that Schwager selected eleven very lucky people and then spent hours listening to after-the-fact rationalizations. But even if that’s the case (which I don’t know) the book is still immensely enjoyable if you are curious about retail trading.

people who are always wrong can make (you) money

My favorite strategy is one of Jason Shapiro’s:

I watch Fast Money religiously at 5 p.m. EST every day. I can’t tell you how much money I’ve made off of that show. […] There is one guy on the program named Brian Kelly, who I have now watched for years. He is wrong by such a larger percentage than random that it is hard to believe. I will never have a position on if he is recommending it. (p. 69)

I don’t know if that really works or if Shapiro is just rationalizing but I love the idea and now I want to find Fast Money transcripts and backtest it.

shrinking anomalies

Several of the interviewees believe that anomalies are disappearing.

Peter Brandt, talking about how things have changed for chartists since the 1980s:

Large long-term patterns no longer work. Trendlines no longer work. Channels no longer work. Symmetrical triangles no longer work. (p. 37)

Richard Bargh:

there are also far fewer opportunities now than back in 2013 and 2014. Back then, many things were still relatively new, like quantitative easing and forward guidance. There was more uncertainty about what the central banks were going to do and how they were going to do it. Whereas now, the markets have central banks so nailed down that there is not as much opportunity to make money on central banks as there once was. The markets now are good at pricing in events before they happen. I make money by pricing in a surprise, and with fewer surprises, there are fewer opportunities. (p. 100)

Amrit Sall:

increased market automation and high-frequency-trading algorithms have made execution harder and eroded my edge on some very short time frame strategies. (p. 142)

Daljit Dhaliwal:

It is no longer possible to trade for the initial move off the headline because the algos will make the trade before I can. (p. 157)

It makes sense. Data is cheaper, processing power is cheaper, machine learning courses are free; it would be odd if anomalies didn’t disappear.


Another common theme is the importance of waiting for the right trade.

Richard Bargh says you should have a side project:

One thing about event-driven trading is that if there isn’t an event to trade, there is nothing to do, and it is really boring. You can feel like you’re wasting your life. […] I have found that I trade much better when I have a side project. If I have nothing else to do but trade events, and there are no events to trade, my mind runs wild. I need to focus my attention on something; otherwise, I will focus on the wrong things. (p. 104)

Bargh again:

Trading to earn a consistent amount steadily may be an admirable goal, but it is not a realistic one. Market opportunities are sporadic. […] If you try to force consistent profitability, you will be prone to take suboptimal trades, which will often end up reducing your overall profitability. (p. 177)

Amrit Sall:

There will be periods in the markets where opportunities dry up, and there will be nothing to do. In those nothing periods, if you are looking for something to do, that is when you can create real damage to your account. (p. 129)

Daljit Dhaliwal:

you don’t have to take every potential trade. You can wait for a trade where everything lines up in your favor. (p. 154)

Chris Camillo articulated it best:

Looking back, I think a big part of my success when I went back to trading was my ability to look past the noise and be patient. I wasn’t in the industry. It wasn’t my job, and I wasn’t under any pressure to trade. I could go six months without a trade, and I didn’t have to answer to anybody. (p. 239)

(The other day I was listening to a podcast interview with a quant fund manager and he emphasized that “we work a lot, we don’t just create a bot and then go to the beach”. It’s odd that people choose to invest in a quant fund but then expect constant human activity - as long as the fund is making you money what do you care that the staff is at the beach? But apparently investors want busy managers.)

What Camillo says ties in with Marsten Parker’s advice:

Don’t quit your day job. (p. 287)

If you have a stable source of income it’ll probably be easier to avoid overtrading.


One of my favorite interviews in the book is the one with Chris Camillo. He mines social media looking for information that might help him anticipate stock price movements. Like when there was an outbreak of E. Coli in Chipotle stores:

Chipotle had become famous for having long lines at lunchtime. Chipotle was such a trendy brand that it was common for people to tweet about having lunch there. They would also frequently tweet about how they were waiting in line at Chipotle. I was able to gauge real-time foot traffic by monitoring word combinations, such as “Chipotle” plus “lunch,” and “Chipotle” plus “line,” in online conversations. Almost overnight, the mentions of these word combinations dropped by about 50%.

I imagine NLP folks at quant shops reading about Camillo and his Chipotle trade. There they are, fine-tuning BERT and whatnot, and then Camillo with his hard-coded bigrams makes a killing.

Camillo made a business out of that: he created TickerTags, a company that tracks word combinations on social media. As Camillo says, tracking conversations is a way to “go even earlier than trasactional data” (like credit card information).

TickerTags doesn’t seem to have been successful with hedge funds though:

Hedge fund managers want something repeatable and systematic. They wanted to know how often this approach would generate tradable information with high conviction. I couldn’t give them a hard answer. (p. 253)

That is a recurrent theme throughout the book. The trader “knows” that the strategy works but somehow can’t put a number on it. Peter Brandt says if the market closes in a new high or in a new low on a Friday then it is likely to keep moving in that direction on Monday and early Tuesday. But when asked if he ever checked that pattern he replies “I have never statistically analyzed it.” Hhmm.

the one quant guy

Marsten Parker is the only quant in the group. He touches on an interesting point:

I’ve always struggled with how to distinguish between a routine drawdown and a system that has stopped working. (p. 278)

Last month I finished reading Marcos López de Prado’s Advances in Financial Machine Learning. de Prado talks about how the death of a strategy can actually be a good thing:

For instance, a mean-reverting pattern may give way to a momentum pattern. As this transition takes place, most market participants are caught off guard, and they will make costly mistakes. This sort of errors is the basis for many profitable strategies, because the actors on the losing side will typically become aware of their mistake once it is too late. Before they accept their losses, they will act irrationally, try to hold the position, and hope for a comeback. Sometimes they will even increase a losing position, in desperation. Eventually they will be forced to stop loss or stop out. Structural breaks offer some of the best risk/rewards. (p. 249)

de Prado suggests a number of tests to identify structural breaks - for instance, testing if the cumulative forecasting errors look random. I wonder if that’s the sort of thing Parker is doing. He doesn’t get into technical details though, other than to mention that he’s learned about overfitting the hard way, that he uses parameter stability as an indicator of robustness, that your errors will be more serially correlated in real life than in your backtest, and that it’s best to have several simple strategies than one over-optimized strategy. (Also: “The more strategies you run, the easier it is emotionally to turn one off.”, p. 286.)

I love this:

if I knew everything I do now, I might never have tried trading in the first place. (p. 286)

I imagine all the state-of-the-art models that RenTec or AQR must be training, the hordes of Mensa Ph.D.s they recruit, the deep pockets they have to buy credit card data or Twitter’s Firehose, I think of Quantopian’s failure, and it sounds impossible to me that any sort of artisanal, DIY quant approach could succeed. But maybe Parker isn’t just lucky. His interview updated my priors a bit.

loving your trade

The appeal of trading is that it is something you can do on your own - no employees, no bosses, no clients, no students, no reviewers, no drama, no politics, no culture wars.

Jason Shapiro:

I had six people working for me, and I hated it. I didn’t like managing people and being responsible for their success. (p. 72)

Pavel Krejcí:

I wanted to find something where the success or failure would depend only on me, not my colleagues, my boss, or anybody else. If I make money, that’s great; if I lose money, it’s my mistake. (p. 320)

That’s happinness almost by definition.

Alright, any more quotes and I’ll probably get a lawsuit so I’m stopping here. Go buy the book.

machine learning for finance - part 2

In this series of posts I’m trying some of the ideas in the book Advances in Financial Machine Learning, by Marcos López de Prado. Here I tackle an idea from chapter 5: fractional differencing.

the problem

Stock prices are nonstationary - their means and variances change systematically over time. Take for instance the price of BOVA11 (an ETF that tracks Brazil’s main stock market index, Ibovespa):

There is a clear upward trend in BOVA11 prices between 2010 and 2020. Unsurprisingly, the Augmented Dickey-Fuller (ADF) test fails to reject the null hypothesis of nonstationarity (p=0.83).

(When I say that the mean and variance change “systematically” in a nonstationary series I don’t mean that in a stationary series the mean and variance are exactly the same at all times. With any stochastic process, stationary or not, the mean and variance will vary - randomly - over time. What I mean is that in a nonstationary series the parameters of the underlying probability distribution - the distribution that is generating the data - change over time. In other words, in a nonstationary series the mean and variance change over time not just randomly, but also systematically. Before the math police come after me, let me clarify that this a very informal definition. In reality there are different types of stationarity - weak, strong, ergodic - and things are more complicated than just “the mean and variance change over time”. Check William Greene’s Econometric Analysis, chapter 20, for a mathematical treatment of the subject.)

Why does stationarity matter? It matters because it can trip your models. If BOVA11 prices are not stationary then whatever your model learns from 2010 BOVA11 prices is not generalizable to, say, 2020. If you just feed the entire 2010-2020 time series into your model its predictions may suffer.

Most people handle nonstationarity by taking the first-order differences of the series. Say the price series is \(\begin{vmatrix}y_{t0}=100 & y_{t1}=105 & y_{t2}=97\end{vmatrix}\). The first-order differences are \(\begin{vmatrix}y_{t1}-y_{t0}=105-100=5 & y_{t2}-y_{t1}=97-105=-8\end{vmatrix}\).

You can take differences of higher order too. Here the second-order differences would be \(\begin{vmatrix}-8-5=-13\end{vmatrix}\). But taking first-order differences often achieves stationarity, so that’s what most people do with price data.

If you take the first-order differences of the BOVA11 series you get this:

This looks a lot more “stable” than the first plot. Here the ADF test rejects the null hypothesis of nonstationarity (p<0.001).

The problem is, by taking first-order differences you are throwing the baby out with the bathwater. Market data has a low signal-to-noise ratio. The little signal there is is encoded in the “memory” of the time series - the accumulation of shocks it has received over the years. When you take first-order differences you wipe out that memory. To give an idea of the magnitude of the problem, the correlation between BOVA11 prices and BOVA11 returns is a mere 0.06. As de Prado puts it (p. 76):

The dilemma is that returns are stationary, however memory-less, and prices have memory, however they are non-stationary.

what to do?

de Prado’s answer is fractional differencing. Taking first-order differences is “integer differencing”: you are taking an integer \(d\) (in this case 1) and you are differencing the time series \(d\) times. But as de Prado shows, you don’t have to choose between \(d=0\) (prices) and \(d=1\) (returns). \(d\) can be any rational number between 0 and 1. And the lower \(d\) is the less memory you wipe out when you difference the time series.

With an integer \(d\) you just subtract yesterday’s price from today’s price and that’s it, you have your first-order difference. But with a non-integer \(d\) you can’t do that. You need a differencing mechanism that works for non-integer \(d\) values as well. I show one such mechanism in the next section.

some matrix algebra

First you need to abandon scalar algebra and switch to matrix algebra. Let’s not worry about fractional differencing just yet. Let’s just answer this: what sorts of matrices and matrix operations would take a vector \(\begin{vmatrix}x_{t0} & x_{t1} & x_{t2} ... x_{tN}\end{vmatrix}\) and return the \(d\)th-order differences of those \(N\) elements?

Say that your price vector is \(X=\begin{vmatrix}5 & 7 & 11\end{vmatrix}\). The first matrix you need here is an identity matrix of order \(N\). If it’s been a while since your linear algebra classes, an identity matrix is a square matrix (a matrix with the same number of rows and columns) that has 1s in its main diagonal and 0s everywhere else. Here your vector has 3 elements, so your identity matrix needs to be of order 3. Here it is:

\[I = \begin{vmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix}\]

You also need a second matrix, \(B\). Like \(I\), \(B\) is also a square matrix of order \(N\) with 1s in one diagonal and 0s elsewhere. But unlike in \(I\), in \(B\) the first row is all 0s; the 1s are in the subdiagonal, not in the main diagonal:

\[B = \begin{vmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{vmatrix}\]

As it turns out, you can use \(I\) and \(B\) to difference \(X\). The first step is to compute \(I - B\):

\[I - B = \begin{vmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix} - \begin{vmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{vmatrix} = \begin{vmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{vmatrix}\]

Now you raise \(I - B\) to the \(d\)th power. Let’s say you want to find the second-order differences. In that case \(d=2\). Well, if \(d=2\) you could simply multiply \(I - B\) by \(I - B\), like this:

\[(I - B)^2 = \begin{vmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{vmatrix} \times \begin{vmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & -2 & 1 \\ \end{vmatrix}\]

But that is not a very general solution - it only works for an integer \(d\). So instead of multiplying \(I - B\) by itself like that you are going to resort to a little trick you learned back in grade 10:

\[(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k}y^k\]

where \(k = 0, 1, ... n\)

That is called the binomial theorem. If you apply it to \((I - B)^d\) you get

\[(I - B)^d = \sum_{k=0}^{d} \binom{d}{k} I^{d-k} (-B)^k = \sum_{k=0}^{d} \binom{d}{k} (-B)^k\]

where \(k = 0, 1, ... d\)

(In case you’re wondering, \(I^{d-k}\) disappeared because an identity matrix raised to any power is itself, and because an identity matrix multiplied by any other matrix is that other matrix. Hence \(I^{d-k} (-B)^k = I(-B)^k = (-B)^k\))

(Before someone calls the math police: the binomial theorem only extends to matrices when the two matrices are commutative, i.e., when \(XY=YX\). In fractional differencing that is always going to be the case, as \(I\) is an identity matrix and that means \(XI=IX\) for any matrix \(X\). But outside the context of fractional differencing you need to take a look at this paper.)

The binomial expands like this:

\[\sum_{k=0}^{d} \binom{d}{k} (-B)^k = I - dB + \dfrac{d(d-1)}{2!}B^{2} - \dfrac{d(d-1)(d-2)}{3!} B^3 + ...\]

If \(d=2\) that means

\[I - 2B + \dfrac{2(2-1)}{2!} B^2 = \begin{vmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix} - 2 \begin{vmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{vmatrix} + \begin{vmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{vmatrix} \\ = \begin{vmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & -2 & 1 \end{vmatrix}\]

…which is the same result we had obtained by simply multiplying \(I - B\) by itself.

Now you multiply \((I - B)^d\) by your original vector, \(X\). Here this means:

\[(I - B)^2 X = \begin{vmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & -2 & 1 \\ \end{vmatrix} \times \begin{vmatrix}5 & 7 & 11\end{vmatrix} = \begin{vmatrix}5 & -3 & 2\end{vmatrix}\]

Finally, you discard the first \(d\) elements of that product (I’ll get to that in a moment) and what’s left are your second-order differences. That means your vector of second-order differences is \(\begin{vmatrix}2\end{vmatrix}\). Voilà!

That’s a lot of matrix algebra to do what any fourth-grader can do in a matter of seconds. Very well then, let’s use fourth-grade math to check the results. If your vector is \(\begin{vmatrix}5 & 7 & 11\end{vmatrix}\) then the first-order differences are \(\begin{vmatrix}7-5 & 11-7\end{vmatrix}\) \(=\begin{vmatrix}2 & 4\end{vmatrix}\) and the second-order differences are \(\begin{vmatrix}4-2\end{vmatrix}=\begin{vmatrix}2\end{vmatrix}\). Check!

With matrix algebra you get \(\begin{vmatrix}5 & -3 & 2\end{vmatrix}\) instead of \(\begin{vmatrix}2\end{vmatrix}\) because the matrix operations carry the initial \(5\) all the way through, which means that at some point the difference \(2 - 5\) is computed even though it makes no real-world sense. That’s why you need to discard the initial \(d\) elements of the final product.

If you want to reproduce all that in Python here is the code:

import numpy as np

def get_I_and_B(X):
    n = X.shape[0]
    I = np.identity(n)
    row0 = np.zeros((1, n))
    B = np.vstack((row0, I[:n-1]))
    return I, B

def difference_method1(X, d):
    I, B = get_I_and_B(X)
    return np.matmul(np.linalg.matrix_power(I - B, d), X)

def difference_method2(X, d):
    I, B = get_I_and_B(X)
    w0 = I
    w1 = -d * B
    w2 = np.linalg.matrix_power(B, 2)
    W = w0 + w1 + w2
    return np.matmul(W, X)

X = [5, 7, 11]
X = np.array(X)
d = 2

diffs = difference_method1(X, d)
diffs = difference_method2(X, d)

The point of using all that matrix algebra and the binomial theorem is that \((I - B)^d X\) works both for an integer \(d\) and for a non-integer \(d\). Which means you can now compute, say, the 0.42th-order differences of a time series. The only thing that changes with a non-integer \(d\) is that the series you saw above becomes infinite:

\[\sum_{k=0}^{\infty} \binom{d}{k} (-B)^k\]

Finding \(\sum_{k=0}^{\infty} \binom{d}{k} (-B)^k\) can be computationally expensive, so people usually find an approximation instead. There are different ways to go about that and de Prado discusses some alternatives in his book. But at the end of the day you’re still computing \((I - B)^d\), same as you do for integer \(d\) values. Multiply the result by \(X\) and you have your fractional differences. Just remember to discard the first element of the resulting vector (you can’t discard, say, 0.37 elements; you have to round it up to 1).

I thank mathematician Tzanko Matev, whose tutorial helped me understand fractional differencing. If I made any mistakes here they are all mine.

quick note

de Prado arrives at \(\sum_{k=0}^{\infty} \binom{d}{k} (-B)^k\) through a different route that requires fewer steps. Here it is, from page 77:

The way he does it is certainly more concise. But I found that by doing the matrix algebra in a more explicit way, and by including a few more intermediate steps, I understood the whole thing much better.

how to choose d

Alright, so a non-integer \(d<1\) will erase less memory than an integer \(d\geq1\), and now you know the math behind fractional differences. But how do you choose \(d\)? de Prado suggests that you try different values of \(d\), apply the ADF test on each set of results, and pick the lowest \(d\) that passes the test (p<0.05).

Time to work then. de Prado includes Python scripts for everything in his book, so you could simply use that. But here I want to try a more optimized and production-ready implementation. I couldn’t find anything like that for Python. But I found a nifty R package - fracdiff. It’s well documented and it uses an optimization trick that makes it run fast (the optimization is based on a 2014 paper - A Fast Fractional Difference Algorithm, by Andreas Jensen and Morten Nielsen -, with tweaks by Martin Maechler).

I do almost everything in Python these days and I didn’t want to deal with RStudio, dplyr, etc, just for this one thing. So I wrote almost the entire code in Python, and then inside my Python code I have this one string that is my R code, and I use a package called rpy2 to call R from inside Python.

Here is the full code. It reads the BOVA11 dollar bars that I created in my previous post (and saved here for your convenience), generates the plots and statistics I used in the beginning of this post, and finds the best \(d\).

import numpy as np
import pandas as pd
from rpy2 import robjects
from matplotlib import pyplot as plt
from statsmodels.tsa.stattools import adfuller

# load data
df = pd.read_csv('bars.csv')

# fix date field
df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d')

# sort by date
df.index = df['date']
del df['date']
df = df.sort_index()

# plot
df['closing_price'].plot(ylabel = 'price (R$)', xlabel = 'date')
df['closing_price'].diff().dropna().plot(ylabel = 'returns (R$)', xlabel = 'date')

# check if series is stationary
def check_stati(y):
    y = np.array(y)
    p = adfuller(y)[1]
    return pvalue

# prices

# returns

# check correlation
c = np.corrcoef(df['closing_price'].values[1:], df['closing_price'].diff().dropna().values)

# stringify vector of prices
# (so we can put it inside R code)
vector = ', '.join([str(e) for e in list(df['closing_price'])])

# try several values of d
ds = []
d = 0
got_it = False
while d < 1:

    # differentiate!
    rcode = '''
    diffseries(c({vector}), {d})
    '''.format(vector = vector, d = d)
    output = robjects.r(rcode)

    # check correlation between prices and differences
    c = np.corrcoef(output[1:], df['closing_price'].values[1:])[1][0]

    # check if prices are stationary
    p = check_stati(output)

    # appendd and c (to plot later)
    ds.append((d, c))

    if (not got_it) and (p < 0.05):
        print(' ')
        print('d:', d)
        print('c:', c)
        got_it = True

    d += 0.01

Running this code you find that the lowest \(d\) that passes the ADF test is 0.29. Much lower than the \(d=1\) most people use when differencing a price series.

With \(d=0.29\) the correlation between the price series and the differenced series is 0.87. In contrast, the correlation between the price series and the return series (\(d=1\)) was 0.06. In other words, with \(d=0.29\) you are preserving a lot more memory - and therefore signal.

As an addendum, the script also plots how the correlation between prices and \(d\)-order differences decreases as we increase \(d\):

Clearly the lower \(d\) is the more memory you preserve.

This is it! See you on the next post.