# Should You Build Your Own Backtester?

## By Michael Halls-Moore on August 2nd, 2016

This post relates to a talk I gave in April at QuantCon 2016 in New York City. QuantCon was hosted by Quantopian and I was invited to talk about some of the topics discussed on QuantStart. I decided to talk about whether it is worth building your own backtesting system. This post goes into more detail about the subject area and should hopefully leave you wanting to dive into building your own system!

If you wish to view the original slides, they can be found here.

The post is suitable for those who are beginning quantitative trading as well as those who have had some experience with the area. The post discusses the common pitfalls of backtesting, as well as some uncommon ones!

It also looks at the different sorts of backtesting mechanisms as well as the software landscape that implements these approaches. Then we discuss whether it is worth building your own backtester, even with the prevalence of open source tools available today.

Finally, we discuss the ins-and-outs of an event-driven backtesting system, a topic that I've covered frequently on QuantStart in prior posts.

## What Is A Backtest?

A backtest is the application of trading strategy rules to a set of historical pricing data.

That is, if we define a set of mechanisms for entry and exit into a portfolio of assets, and apply those rules to historical pricing data of those assets, we can attempt to understand the performance of this "trading strategy" that might have been attained in the past.

It was once said that "All models are wrong, but some are useful". The same is true of backtests. So what purpose do they serve?

Backtests ultimately help us decide whether it is worth live-trading a set of strategy rules. It provides us with an idea of how a strategy might have performed in the past. Essentially it allows us to filter out bad strategy rules before we allocate any real capital.

It is easy to generate backtests. Unfortunately backtest results are not live trading results. They are instead a model of reality. A model that usually contains many assumptions.

There are two main types of software backtest - the "for-loop" and the "event-driven" systems.

When designing backtesting software there is always a trade-off between accuracy and implementation complexity. The above two backtesting types represent either end of the spectrum for this tradeoff.

## Backtesting Pitfalls

There are many pitfalls associated with backtesting. They all concern the fact that a backtest is just a model of reality. Some of the more common pitfalls include:

• In-Sample Testing - This occurs when you utilise the same data to "train" your trading models as well as to "test" it. It almost always inflates the performance of a strategy beyond that which would be seen in live trading. This is because it has not been validated on unseen data, which will likely differ markedly from the training data. In essence, it is a form of overfitting.
• Survivorship Bias - For stock market indices like the S&P500, a periodic process of listing and de-listing occurs, changing the composition over time. By failing to take into account this changing composition over a backtest, trading strategies are automatically "picking the winners" by virtue of ignoring all the companies that fell out of the index due to low market capitalisation. Hence it is always necessary to use survivorship-bias free data when carrying out longer-term backtests.
• Look-Ahead Bias - Future data can "sneak in" to backtests in very subtle ways. Consider calculating a linear regression ratio over a particular time-frame. If this ratio is then used in the same sample, then we have implicitly brought in future data and thus will have likely inflated performance. Event-driven backtesters largely solve this problem, as we will discuss below.
• Market Regime Change - This concerns the fact that stock market "parameters" are not stationary. That is, the underlying process generating stock movements need not have parameters that stay constant in time. This makes it hard to generalise parametrised models (of which many trading strategies are instances of) and thus performance is likely to be higher in backtests than in live trading.
• Transaction Costs - Many For-Loop backtests do not take into account even basic transaction costs, such as fees or commissions. This is particularly true in academic papers where backtests are largely conducted free of transaction costs. Unfortunately it is all too easy to find strategies that are highly profitable without transaction costs, but make substantial losses when subjected to a real market. Typical costs include spread, market impact and slippage. All of these should be accounted for in realistic backtests.

There are some more subtle issues with backtesting that are not discusssed as often, but are still incredibly important to consider. They include:

• OHLC Data - OHLC data, that is the type of daily data taken from free sites such as Yahoo Finance, is often an amalgamation of multiple exchange feeds. Hence it is unlikely that some of the more extreme values seen (including the High and Low price of the day) would likely be obtained by a live trading system. Such "order routing" needs to be considered as part of a model.
• Capacity Constraints - When backtesting it is easy to utilise an "infinite" pot of money. However, in reality capital, as well as margin, is tightly constrained. It is necessary also to think of Average Daily Volume (ADV) limits, especially for small-cap stocks where it is possible that our trades might indeed move the market. Such "market impact" effects would need to be taken into account for risk management purposes.
• Benchmark Choice - Is the choice of benchmark against which the backtested strategy is being measured a good one? For instance if you are trading commodity futures and are neutral to the S&P500 US equity index, does it really make sense to use the S&P500 as your benchmark? Would a basket of other commodity trading funds make more sense?
• Robustness - By varying the starting time of your strategy within your backtest do the results change dramatically? It should not matter for a longer term strategy whether the backtest is started on a Monday or a Thursday. However, if it is sensitive to the "initial conditions" how can you reliably predict future performance when live trading?
• Overfitting/Bias-Variance Tradeoff - We've discussed this a little above in the In-Sample Testing point. However, overfitting is a broader problem for all (supervised) machine learning methods. The only real way to "solve" this problem is via careful use of cross-validation techniques. Even then, we should be extremely careful that we haven't simply fitted our trading strategies to noise in the training set.
• Psychological Tolerance - Psychology is often ignored in quant finance because (supposedly) it is removed by creating an algorithmic system. However, it always creeps in because quants have a tendency to "tinker" or "override" the system once deployed live. In addition, what may seem tolerable in a backtest, might be stomach-churning in live trading. If your backtested equity curve shows a 50% drawdown at some point in its trading history, could you also ride this through in a live trading scenario?

Much has been written about the problems with backtesting. Tucker Balch and Ernie Chan both consider the issues at length.

## For-Loop Backtest Systems

A For-Loop Backtester is the most straightforward type of backtesting system and the variant most often seen in quant blog posts, purely for its simplicity and transparency.

Essentially the For-Loop system iterates over every trading day (or OHLC bar), performs some calculation related to the price(s) of the asset(s), such as a Moving Average of the close, and then goes long or short a particular asset (often on the same closing price, but sometimes the day after). The iteration then continues. All the while the total equity is being tracked and stored to later produce an equity curve.

Here is the pseudo-code for such an algorithm:

for each trading bar:
do_something_with_prices();
next_bar();


As you can see the design of such a sytem is incredibly simple. This makes it attractive for getting a "first look" at the performance of a particular strategy ruleset.

For-Loop backtesters are straightforward to implement in nearly any programming language and are very fast to execute. The latter advantage means that many parameter combinations can be tested in order to optimise the trading setup.

The main disadvantage with For-Loop backtesters is that they are quite unrealistic. They often have no transaction cost capability unless specifically added. Usually orders are filled immediately "at market" with the midpoint price. As such there is often no accounting for spread.

There is minimal code re-use between the backtesting system and the live-trading system. This means that code often needs to be written twice, introducing the possibility of more bugs.

For-Loop backtesters are prone to Look-Ahead Bias, due to bugs with indexing. For instance, should you have used "i", "i+1" or "i-1" in your panel indexing?

For-Loop backtesters should really be utilised solely as a filtration mechanism. You can use them to eliminate the obviously bad strategies, but you should remain skeptical of strong performance. Further research is often required. Strategies rarely perform better in live trading than they do in backtests!

## Event-Driven Backtest Systems

Event-Driven Backtesters lie at the other end of the spectrum. They are much more akin to live-trading infrastructure implementations. As such, they are often more realistic in the difference between backtested and live trading performance.

Such systems are run in a large "while" loop that continually looks for "events" of differing types in the "event queue". Potential events include:

• Tick Events - Signify arrival of new market data
• Signal Events - Generation of new trading signals
• Order Events - Orders ready to be sent to market broker
• Fill Events - Fill information from the market broker

When a particular event is identified it is routed to the appropriate module(s) in the infrastructure, which handles the event and then potentially generates new events which go back to the queue.

The pseudo-code for an Event-Driven backtesting system is as follows:

while event_queue_isnt_empty():
event = get_latest_event_from_queue();
if event.type == "tick":
else if event.type == "signal":
portfolio.handle_signal(event);
else if event.type == "order":
portfolio.handle_order(event);
else if event.type == "fill":
portfolio.handle_fill(event)
sleep(600);  # Sleep for, say, 10 mins


As you can see there is a heavy reliance on the portfolio handler module. Such a module is the "heart" of an Event-Driven backtesting system as we will see below.

There are many advantages to using an Event-Driven backtester:

• Elimination of Look-Ahead Bias - By virtue of its message-passing design, Event-Driven systems are usually free from Look-Ahead Bias, at least at the trading level. There is the possibility of introducing bias indirectly through a pre-researched model, however.
• Code Re-Use - For live trading it is only necessary to replace the data handler and execution handler modules. All strategy, risk/position management and performance measurement code is identical. This means there are usually far less bugs to fix.
• Portfolio Level - With an Event-Driven system it is much more straightforward to think at the portfolio level. Introducing groups of instruments and strategies is easy, as are hedging instruments.
• "Proper" Risk/Position Management - Can easily modularise the risk and position management. Can introduce leverage and methodologies such as Kelly's Criterion easily. Can also easily include sector exposure warnings, ADV limits, volatility limits and illiquidity warnings.
• Remote Deployment/Monitoring - Modular nature of the code makes it easier to deploy in "the cloud" or to co-locate the software near an exchange on a virtualised system.

While the advantages are clear, there are also some strong disadvantages to using such a complex system:

• Tricky to Code - Building a fully-tested Event-Driven system will likely take weeks or months of full-time work. A corollary of this is that there is always a healthy market for freelance/contract quant developers!
• Require Object-Orientation - A modular design necessitates using object-oriented programming (OOP) principles, and thus a language that can support OOP easily. This does however make unit testing far more straightforward.
• Software Engineering - More likely to require good software engineering expertise and capabilities such as logging, unit testing, version control and continuous integration.
• Slow Execution - The message-passing nature of the code makes it far slower to execute compared to a vectorised For-Loop approach. Multiple parameter combinations can take a long time to calculate on unoptimised codes.

## The Software Landscape

In this section we will consider software (both open source and commercial) that exists for both For-Loop and Event-Driven systems.

For For-Loop backtesters, the main programming languages/software that are used include Python (with the Pandas library), R (and the quantmod library) and MatLab. There are plenty of code snippets to be found on quant blogs. A great list of such blogs can be found on Quantocracy.

The market for Event-Driven systems is much larger, as clients/users often want the software to be capable of both backtesting and live trading in one package.

The expensive commercial offerings include Deltix and QuantHouse. They are often found in quant hedge funds, family offices and prop trading firms.

Cloud-based backtesting and live trading systems are relatively new. Quantopian is an example of a mature web-based setup for both backtesting and live trading.

Institutional quants often also build their own in house software. This is due to a mix of regulatory constraints, investor relations/reporting and auditability.

Retail quants have a choice between using the "cloud+data" approach of Quantopian or "rolling their own" using a cloud vendor such as Amazon Web Services, Rackspace Cloud or Microsoft Azure, along with an appropriate data vendor such as DTN IQFeed or QuantQuote.

In terms of open source software, there are many libraries available. They are mostly written in Python (for reasons I will outline below) and include Zipline (Quantopian), PyAlgoTrade, PySystemTrade (Rob Carver/Investment Idiocy) and QSTrader (QuantStart's own backtester).

One of the most important aspects, however, is that no matter which piece of software you ultimately use, it must be paired with an equally solid source of financial data. Otherwise you will be in a situation of "garbage in, garbage out" and your live trading results will differ substantially from your backtests.

## Programming Languages

While software takes care of the details for us, it hides us from many implementation details that are often crucial when we wish to expand our trading strategy complexity. At some point it is often necessary to write our own systems and the first question that arises is "Which programming language should I use?".

Despite having a background as a quantitative software developer I am not personally interested in "language wars". There are only so many hours in the day and, as quants, we need to get things done - not spend time arguing language design on internet forums!

We should only be interested in what works. Here are some of the main contenders:

### Python

Python is an extremely easy to learn programming language and is often the first language individuals come into contact with when they decide to learn programming. It has a standard library of tools that can read in nearly any form of data imaginable and talk to any other "service" very easily.

It has some exceptional quant/data science/machine learning (ML) libraries in NumPy, SciPy, Pandas, Scikit-Learn, Matplotlib, PyMC3 and Statsmodels. While it is great for ML and general data science, it does suffer a bit for more extensive classical statistical methods and time series analysis.

It is great for building both For-Loop and Event-Driven backtesting systems. In fact, it is perhaps one of the only languages that straightforwardly permits end-to-end research, backtesting, deployment, live trading, reporting and monitoring.

Perhaps its greatest drawback is that it is quite slow to execute when compared to other languages such as C++. However, work is being carried out to improve this problem and over time Python is becoming faster.

### R

R is a statistical programming environment, rather than a full-fledged "first class programming language" (although some might argue otherwise!). It was designed primarily for performing advanced statistical analysis for time series, classical/frequentist statistics, Bayesian statistics, machine learning and exploratory data analysis.

It is widely used for For-Loop backtesting, often via the quantmod library, but is not particularly well suited to Event-Driven systems or live trading. It does however excel at strategy research.

### C++

C++ has a reputation for being extremely fast. Nearly all scientific high-performance computing is carried out either in Fortran or C++. This is its primary advantage. Hence if you are considering high frequency trading, or work on legacy systems in large organisations, then C++ is likely to be a necessity.

Unfortunately it is painful for carrying out strategy research. Due to being statically-typed it is quite tricky to easily load, read and format data compared to Python or R.

Despite its relative age, it has recently been modernised substantially with the introduction of C++11/C++14 and further standards refinements.

### Others?

You may also wish to take a look at Java, Scala, C#, Julia and many of the functional languages. However, my recommendation is to stick with Python, R and/or C++, as the quant trading communities are much larger.

## Should You Write Your Own (Event-Driven) Backtester?

It is a great learning experience to write your own Event-Driven backtesting system. Firstly, it forces you to consider all aspects of your trading infrastructure, not just spend hours tinkering on a particular strategy.

Even if you don't end up using the system for live trading, it will provide you with a huge number of questions that you should be asking of your commercial or FOSS backtesting vendors.

For example: How does your current live system differ from your backtest simulation in terms of:

• Algorithmic execution and order routing?
• Spread, fees, slippage and market impact?
• Risk management and position sizing?

While Event-Driven systems are not quick or easy to write, the experience will pay huge educational dividends later on in your quant trading career.

## Event-Driven Backtest Design 101

How do you go about writing such a system?

The best way to get started is to simply download Zipline, QSTrader, PyAlgoTrade, PySystemTrade etc and try reading through the documentation and code. They are all written in Python (due to the reasons I outlined above) and thankfully Python is very much like reading pseudo-code. That is, it is very easy to follow.

I've also written many articles on Event-Driven backtest design, which you can find here, that guide you through the development of each module of the system. Rob Carver, at Investment Idiocy also lays out his approach to building such systems to trade futures.

Remember that you don't have to be an expert on day #1. You can take it slowly, day-by-day, module-by-module. If you need help, you can always contact me or other willing quant bloggers. See the end of the article for my contact email.

I'll now discuss the modules that are often found in many Event-Driven backtesting systems. While not an exhaustive list, it should give you a "flavour" of how such systems are designed.

### Securities Master Database

This is where all of the historical pricing data is stored, along with your trading history, once live. A professsional system is not just a few CSV files from Yahoo Finance!

Instead, we use a "first class" database or file system, such as PostgreSQL, MySQL, SQL Server or HDF5.

Ideally, we want to obtain and store tick-level data as it gives us an idea of trading spreads. It also means we can construct our own OHLC bars, at lower frequencies, if desired.

We should always be aware of handling corporate actions (such as stock splits and dividends), survivorship bias (stock de-listing) as well as tracking the timezone differences between various exchanges.

Individual/retail quants can compete here as many production-quality database technologies are mature, free and open source. Data itself is becoming cheaper and "democratised" via sites like Quandl.

There are still plenty of markets and strategies that are too small for the big funds to be interested in. This is a fertile ground for retail quant traders.

The trading strategy module in an Event-Driven system generally runs some kind of predictive or filtration mechanism on new market data.

It receives bar or tick data and then uses these mechanisms to produce a trading signal to long or short an asset. This module is NOT designed to produce a quantity, that is carried out via the position-sizing module.

95% of quant blog discussion usually revolves around trading strategies. I personally believe it should be more like 20%. This is because I think it is far easier to increase expected returns by reducing costs through proper risk management and position sizing, rather than chasing strategies with "more alpha".

### Portfolio & Order Management

The "heart" of an Event-Driven backtester is the Portfolio & Order Management system. It is the area which requires the most development time and quality assurance testing.

The goal of this system is to go from the current portfolio to the desired portfolio, while minimising risk and reducing transaction costs.

The module ties together the strategy, risk, position sizing and order execution capabilities of the sytem. It also handles the position calculations while backtesting to mimic a brokerage's own calculations.

The primary advantage of using such a complex system is that it allows a variety of financial instruments to be handled under a single portfolio. This is necessary for insitutional-style portfolios with hedging. Such complexity is very tricky to code in a For-Loop backtesting system.

### Risk & Position Management

Separating out the risk management into its own module can be extremely advantageous. The module can modify, add or veto orders that are sent from the portfolio.

In particular, the risk module can add hedges to maintain market neutrality. It can reduce order sizes due to sector exposure or ADV limits. It can completely veto a trade if the spread is too wide, or fees are too large relative to the trade size.

A separate position sizing module can implement volatility estimation and position sizing rules such as Kelly leverage. In fact, utilising a modular approach allows extensive customisation here, without affecting any of the strategy or execution code.

Such topics are not well-represented in the quant blogosphere. However, this is probably the biggest difference between how institutions and some retail traders think about their trading. Perhaps the simplest way to get better returns is to begin implementing risk management and position sizing in this manner.

### Execution Handling

In real life we are never guaranteed to get a market fill at the midpoint!

We must consider transactional issues such as capacity, spread, fees, slippage, market impact and other algorithmic execution concerns, otherwise our backtesting returns are likely to be vastly overstated.

The modular approach of an Event-Driven system allows us to easily switch-out the BacktestExecutionHandler with the LiveExecutionHandler and deploy to the remote server.

We can also easily add multiple brokerages utilising the OOP concept of "inheritance". This of course assumes that said brokerages have a straightforward Application Programming Interface (API) and don't force us to utilise a Graphical User Interface (GUI) to interact with their system.

One issue to be aware of is that of "trust" with third party libraries. There are many such modules that make it easy to talk to brokerages, but it is necessary to perform your own testing. Make sure you are completely happy with these libraries before committing extensive capital, otherwise you could lose a lot of money simply due to bugs in these modules.

### Performance & Reporting

Retail quants can and should borrow the sophisticated reporting techniques utilised by institutional quants. Such tools include live "dashboards" of the portfolio and corresponding risks, a "backtest equity" vs "live equity" difference or "delta", along with all the "usual" metrics such as costs per trade, the returns distribution, high water mark (HWM), maximum drawdown, average trade latency as well as alpha/beta against a benchmark.

Consistent incremental improvements should be made to this infrastructure. This can really enchance returns over the long term, simply by eliminating bugs and improving issues such as trade latency. Don't simply become fixated on improving the "world's greatest strategy" (WGS).

The WGS will eventually erode due to "alpha decay". Others will eventually discover the edge and will arbitrage away the returns. However, a robust trading infrastructure, a solid strategy research pipeline and continual learning are great ways of avoiding this fate.

Infrastructure optimisation may be more "boring" than strategy development but it becomes significantly less boring when your returns are improved!

### Deployment & Monitoring

Deployment to a remote server, along with extensive monitoring of this remote system, is absolutely crucial for institutional grade systems. Retail quants can and should utilise these ideas as well.

A robust system must be remotely deployed in "the cloud" or co-located near an exchange. Home broadband, power supplies and other factors mean that utilising a home desktop/laptop is too unreliable. Often things fail right at the worst time and lead to substantial losses.

The main issues when considering a remote deployment include; monitoring hardware, such as CPU, RAM/swap, disk and network I/O, high-availability and redundancy of systems, a well thought through backup AND restoration plan, extensive logging of all aspects of the system as well as continuous integration, unit testing and version control.

Remember Murphy's Law - "If it can fail it will fail."

There are many vendors on offer that provide relatively straightforward cloud deployments, including Amazon Web Services, Microsoft Azure, Google and Rackspace. For software engineering tasks vendors include Github, Bitbucket, Travis, Loggly and Splunk, as well as many others.

## Final Thoughts

Unfortunately there is no "quick fix" in quant trading. It involves a lot of hard work and learning in order to be successful.

Perhaps a major stumbling block for beginners (and some intermediate quants!) is that they concentrate too much on the best "strategy". Such strategies always eventually succumb to alpha decay and thus become unprofitable. Hence it is necessary to be continually researching new strategies to add to a portfolio. In essence, the "strategy pipeline" should always be full.

It is also worth investing a lot of time in your trading infrastructure. Spend time on issues such as deployment and monitoring. Always try and be reducing transaction costs, as profitability is as much about reducing costs as it is about gaining trading revenue.

I recommend writing your own backtesting system simply to learn. You can either use it and continually improve it or you can find a vendor and then ask them all of the questions that you have discovered when you built your own. It will certainly make you aware of the limitations of commercially available systems.

Finally, always be reading, learning and improving. There are a wealth of textbooks, trade journals, academic journals, quant blogs, forums and magazines which discuss all aspects of trading. For more advanced strategy ideas I recommend SSRN and arXiv - Quantitative Finance.