
## Abstract

This article presents an automated forecasting system that provides a macroeconomic forecasting approach that some hedge funds may find useful. The authors describe the structure of an econometric forecasting system designed to produce multiequation econometric forecasting models of national macroeconomies. They also describe the functioning of an automatic model-building system that builds the forecasting equation for each series submitted and produces forecasts of the series without human intervention. The automatic model-building system employs information criteria and cross-validation in the equation building process, and it uses Bayesian model averaging to combine forecasts of individual series. The system outperforms standard benchmarks for a variety of macroeconomic datasets. To demonstrate its use, the automatic system is used to build a fixed-income macro trading system.

Global macro hedge funds employ a broad range of strategies and operate in a variety of global markets. A feature they have in common is strategies that are based on the behavior of underlying economic variables. Some global macro funds use *judgmental*, or *heuristic*, strategies, while others base their strategies on rigorous quantitative analysis. In this article, we focus on the latter approach. We present an automatic forecasting system on which global macro trading strategies could be based. The automatic model-building system is designed to build, without human intervention, large econometric forecasting systems for various national macroeconomies that subsequently could serve as the foundation for a global macro trading system. For purposes of demonstration, we employ the system to construct a simple macro bond-trading system. The automatic system uses a variety of methods to build forecasting equations for any number of time series in an economic system and to produce forecasts of those series. The construction methods include Granger causality tests, information criteria, cross-validation, and Bayesian model averaging. We find that the system performs well for a variety of national macroeconomies ranging from developed to emerging economies.

Automatic model-building systems lend themselves to the construction of the kind of forecasting systems used by some global macro hedge funds. They may be useful in cases in which the dataset is so large that it could be prohibitively costly to produce systems with human model builders. In addition, unconstrained by preconceived ideas, automatic systems can provide new ideas about equation construction, which can lead to potential improvements in existing forecasting equations.

Other automatic model-building systems are in use at present. Providing a comprehensive review of these other systems is beyond our scope and intent, but we briefly comment on two such systems because they serve as contrasts to the system presented here. The lasso method, proposed by Tibshirani [1996], is a well-known automatic model-building system based on a shrinkage approach that can be expressed by a single Lagrangian (see Hastie, Tibshirani, and Friedman [2009]). In that sense, lasso creates an econometric equation in a single pass. Sequential reduction is another well-known automatic approach, as exemplified by the general-to-specific (Gets) method, wherein the set of candidate variables in an equation is reduced sequentially by a set of tests rooted in a theory of model building (see Hendry and Krolzig [2005]). Obviously, lasso and Gets represent very different concepts about how to conduct automatic model building.

The automatic model-building system described in this article represents yet another approach. Like lasso and sequential reduction, it seeks to create small, parsimonious forecasting equations from a set of candidate independent variables, and it is based on a set of rules like sequential-reduction methods. However, unlike Gets, the system presented here is not based on any theory of model building. Rather, the goal is to develop an automatic model-building system based on an empirical investigation of model construction methods, a theory-free automatic system—that is, a system not rooted in any notion of a theory of automatic model construction.

The topic of how to construct a theory of automatic model building is not uncontroversial. There is a long history of debate in the literature on the issue of whether such a theory can be constructed (see Campos, Ericsson, and Hendry [2005] for a survey of this literature). We propose a different approach in which the automatic model-building system is constructed using empirical investigation, thereby doing away with the need for a theory of automatic model construction. Using U.S. data, we examine alternative approaches to automatic model building, selecting the combination of methods that produce desirable forecasting results. The methods we examine, taken by themselves, are not new and should be familiar to readers; it is the empirical approach to selecting the combination of methods used to automatically construct forecasting equations that readers may find novel.

Empirically searching for the combination of automatic construction methods introduces the issue of overfitting in that we use the same dataset (i.e., our selected set of U.S. time series) repeatedly in the empirical search and investigation. Bailey et al. [2014] broadly defined overfitting as a situation in which particular observations, rather than a general structure, are targeted; they suggested using hold-out samples to deal with the problem. Using hold-out samples, a model—or in our case, a model-building system—is constructed using one dataset to set the parameters and construct the system with other datasets set aside to be used to test the model or system. Thus, the automatic model-building system was developed using U.S. data exclusively, and we then applied the system to several sets of non-U.S. data, described later, with possibly very different characteristics, to examine the performance and usefulness of the automatic system. In that sense, the multiple sets of non-U.S. data were used as the hold-out samples to evaluate the system.

The system also differs from lasso and sequential reduction in that 1) it builds up the parsimonious model in steps from a set of preselected candidate variables, which means in that sense that it does not select a small model from a larger model, and 2) it combines forecasts from these parsimonious equations through Bayesian model averaging (BMA) with forecasts from other forecasting equations (specifically those that describe autoregressions and modified autoregressions, which constitute alternative representations of the behavior observed in the data) to produce improved forecasts.

The automatic system is generally simpler than Gets in terms of the rules it employs; furthermore, BMA constitutes a key part of the system, and the system is not based on an underlying theory. Rather, we seek to develop a simple set of rules for equation construction that produces desirable forecasting results. We find that BMA can work effectively when the forecasts that are averaged come from very different ways of representing the behavior observed in the data. The focus then is on constructing and combining these alternative representations, or models, of economic variables. This stands in contrast to systems like lasso and Gets, which focus on finding a single model for each variable in the dataset.
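To make the averaging step concrete, the sketch below shows a standard BIC-based approximation to Bayesian model averaging in which each candidate model receives weight proportional to exp(−BIC_i/2). The function names are ours, and this is an illustrative approximation, not necessarily the exact weighting scheme the system employs.

```python
import numpy as np

def bma_weights(bics):
    """Approximate posterior model probabilities: w_i proportional to exp(-BIC_i/2).
    (Illustrative; not necessarily the system's exact weighting.)"""
    b = np.asarray(bics, dtype=float)
    w = np.exp(-(b - b.min()) / 2.0)   # shift by the minimum for numerical stability
    return w / w.sum()

def bma_forecast(forecasts, bics):
    """BMA combination: weighted average of competing forecasts of one series."""
    return float(np.dot(bma_weights(bics), np.asarray(forecasts, dtype=float)))
```

Models with equal scores receive equal weight, and a lower score shifts weight toward that model's forecast.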

The article describes the construction and operation of the system and examines its performance relative to standard benchmarks. The following section discusses data: We apply the system to data from a variety of national economies, nine in all, which report economic data under alternative formats. Models of the nine macroeconomies are automatically constructed. Some are constructed from large, highly disaggregated datasets, while others are built from small aggregated datasets. The subsequent section discusses the structure of the machine-built system, which operates by combining causality tests and forward-selection and backward-deletion methods using information criteria and cross-validation. The next section discusses BMA, which we use to combine forecasts of time series. Then we present forecast results: We find that for the nine datasets in our study, the system outperforms standard benchmarks generally used to test forecasting models. The final section summarizes our findings.

**DATA**

All data for all models presented in this article are monthly in frequency. We applied the system to various macroeconomies representing diverse systems of economic structure and organization. We do not use artificially generated data. Rather, we employ alternative datasets for the United States; France and Germany, which we model as a single economy; Japan; the United Kingdom; Canada; Sweden; Ireland; Greece; and India. Most of these macroeconomies have well-developed financial markets, while the others are emerging-market economies.

The U.S. economy is an obvious choice for model-building purposes. Perhaps less obvious is a single model of France and Germany. However, these two countries use the euro as their common currency so that their markets and time series are based on the same *numéraire*; they each have large, technologically advanced economies and are the two largest economies in the European Union; they employ harmonized regulatory environments and operate within the legal framework of the EU; sharing a common border, they trade with each other with a high degree of specialization and economic and financial integration; and financial capital flows between them in large volumes. Thus, we treated France and Germany as a single economy so that equations for time series in each country may contain both French and German time series as right-hand variables. We also combine France and Germany to subject the automatic model-building system to the problem of creating forecasting equations for what some may consider an unconventional forecasting setup.

Japan is included because of the unique way in which its economy is organized. Its economy, which has been described as “state-directed capitalism,” is dominated by very large, tightly integrated industrial groups that operate with a high degree of cooperation with, and direction from, government authorities. In contrast to some of the other countries examined, Japan has limited domestic natural resources and thus depends on trade to obtain them, but Japan’s trade balance deteriorated significantly in our sample period.

The United Kingdom has a large, service-based economy, but the country remains a major producer of automobiles, and its aerospace industry, dominated by BAE Systems, is one of the largest in the world.

Canada has a unique economic structure. Its economy is technologically advanced, but in contrast with Japan, Canada has large, resource-based sectors. These include agriculture—especially wheat farming—and timber and petroleum production.

Sweden, highly industrialized and intensely engaged in international trade, is a study in contrasts. Income equality is extremely high in the country—by some measures the highest in the world—but wealth is relatively concentrated. Sweden’s public-service sector is very large, but in our sample period, Sweden was engaged in a long-term program to dismantle that sector, at least in part.

Ireland made substantial economic progress in the years prior to our data sample, moving from an agricultural economy to a knowledge-based economy with foreign direct investment from firms such as Intel, Google, and Microsoft. However, in the latter part of our sample, Ireland experienced a severe real estate bubble.

Greece has not evolved successfully. Its economy, which depends heavily on tourism and shipping, suffers from a multitude of economic problems that include low global competitiveness, high government deficits, and an inefficient public service sector.

India is an example of a large, rapidly emerging economy. The data sample for India starts a few years after the country transitioned from a socialist model to a market-driven deregulated economy, and its economic reforms began to show results. We considered using China instead of India as an example of a large emerging economy, but Chinese economic data can be problematic and are often only available with short histories.

The sample period used for the models in the analysis starts in January 1998 with the exceptions of the France–Germany model, which begins in January 1999, and the Greece model, which begins in January 2000. These exceptions are made because many time series in these economies have shorter reporting histories. As described in greater detail in the section titled “Out-of-Sample Forecast Results,” the end-of-sample period for the estimation of each model advances month by month, and out-of-sample forecasts are computed for each model over rolling periods during the three-year period from January 2008 to December 2010. There is an equation for every series in each macroeconomy.

Exhibit 1 shows the number of series in each model in our analysis, broken down by types of data. The economies in the exhibit are ordered by the number of series in the respective datasets. The time series listed in the exhibit are not meant to be a comprehensive list of all series for the respective economies. Instead, our goal is to develop a system that constructs models that perform well even when we deprive the system of a complete data representation of an economy.

For the U.S. model, we employed a large disaggregated dataset. The dataset for the France–Germany model is also relatively large and disaggregated. However, to test the limits of the automatic system, the datasets for some of the other countries in our analysis are intentionally restricted to small, sparse sets of series. For example, the extremely small and aggregated dataset for India, just 14 series, presents the machine system with a unique challenge.

**STRUCTURE OF THE SYSTEM**

**Conversion of Data**

The system begins the construction of the forecasting model by converting all time series to stationary form. Many unit-root tests that can be employed in this conversion are available, including the augmented Dickey–Fuller test, Phillips–Perron test, Elliott–Rothenberg–Stock test, Schmidt–Phillips test, and Kwiatkowski–Phillips–Schmidt–Shin test. Each test has its own merits and shortcomings. We are concerned that overtesting would result if we were to run time series through all these tests. Additionally, one of the main themes of this article is to reduce overfitting. That is why we opt to use an unconventional, simple process to convert series to stationary form as white noise.

The conversion is done in steps. First, the autocorrelation function is computed for the series in as-reported levels at lags of one to 12 periods. Generally, where the underlying series is white noise, the sample autocorrelation coefficients at lags greater than zero are distributed approximately normally with a mean of zero and a standard deviation of 1/√*n*, where *n* is the sample size (Shumway and Stoffer [2006]). Thus, if the maximum of the |ρ_i|, *i* = 1, 2, 3, …, 12, is less than 2/√*n*, the as-reported levels are used in the construction of equations. Otherwise, the levels are differenced, and the test is performed again. If the maximum |ρ_i| of the differenced series is less than 2/√*n*, or if the series contains nonpositive values, first differences of levels are used in the construction of equations, and the conversion process stops for these series. For the remaining series—those with only positive values where the maximum |ρ_i| is greater than or equal to 2/√*n*—the test is performed on logarithms: log levels are examined first and then, if necessary, first differences of logs. Under this method, the broad majority of series in our databases are converted to differences of logs. The following schematic illustrates how the conversion process works.
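For readers who want to experiment, the conversion rule can be sketched in a few lines of Python. The function names, the 12-lag window, and the 2/√n cutoff reflect our reading of the procedure rather than the authors' code.

```python
import numpy as np

def max_abs_acf(x, max_lag=12):
    """Maximum absolute sample autocorrelation over lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = float(centered @ centered)
    rhos = [float(centered[lag:] @ centered[:-lag]) / denom
            for lag in range(1, max_lag + 1)]
    return max(abs(r) for r in rhos)

def convert_to_stationary(x, max_lag=12):
    """Sketch of the conversion rule. Returns (form, converted_series), where
    form is 'level', 'diff', 'log-level', or 'log-diff'."""
    x = np.asarray(x, dtype=float)
    thresh = 2.0 / np.sqrt(len(x))      # two standard deviations under white noise
    if max_abs_acf(x, max_lag) < thresh:
        return "level", x               # as-reported levels pass the test
    dx = np.diff(x)
    if max_abs_acf(dx, max_lag) < thresh or (x <= 0).any():
        return "diff", dx               # stop here for nonpositive series
    lx = np.log(x)
    if max_abs_acf(lx, max_lag) < thresh:
        return "log-level", lx
    return "log-diff", np.diff(lx)      # the most common outcome in practice
```

A strongly periodic, strictly positive series falls through every test and ends up as differences of logs, whereas a series containing nonpositive values stops at first differences of levels.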

Throughout, we followed the practice of developing the global macro forecasting system using only the U.S. economic dataset, holding the other economic databases in reserve for tests of the final version of the system. Therefore, using only the U.S. dataset, we tested alternative methods for converting series to white noise. In one approach, each series containing only positive values was regressed in logarithms against an index of time. Where the parameter on the index of time was statistically significant, the series was converted to logarithms. Then the autocorrelation function was computed at lags of one to 12 periods for both levels (or log levels) and first differences. We summed the absolute values of the ρ_i, *i* = 1, 2, 3, …, 12, and the variable form with the minimum Σ|ρ_i| was chosen as the converted form. This approach, as well as other similar approaches we tested, produced worse overall performance than the approach described earlier.^{1}

**Causality and Variable Selection**

The automatic system is based on the premise that Granger causality tests (Granger [1969]) may be used as building blocks in the system, and the system combines Granger causality, forward selection, and backward deletion to construct macroeconomic models. After conversion of the series as described earlier, the system conducts Granger causality tests on the converted forms of all the series submitted to the system. Granger causality tests are used to prescreen the dataset to obtain a reduced set of candidate series for the selection process that subsequently builds the regression equations. For the U.S. dataset, Granger causality tests were conducted using alternative 0.90, 0.95, and 0.99 critical values. The 0.99 critical value produced equations with fewer parameters overall and better performance, so we used that value in the system. We will return to the issue of equation size and overfitting later. Selection of variables to be included in the equation of each series is conducted using as candidate series only those variables that Granger-cause the dependent variable.

*Bidirectional* Granger causality—in which *x* causes *y,* and *y* causes *x*—receives special treatment in the system. In early versions of the system as applied to the U.S. dataset, bidirectionality was allowed in equation construction. The result was that the performance of the system was degraded, presumably because of the effects of strong interactions between bidirectionally related variables. Where a dataset contains pairs of variables that represent approximately the same thing—such as the industrial production of automobiles index from the Federal Reserve’s industrial-production market groups report and unit automobile production from the Bureau of Economic Analysis—allowing bidirectionality could create loops generating excessive instability in the system of equations. Thus, the system was modified so that where bidirectional Granger relationships arise between two variables, the relationship with the weaker F-statistic is ignored. Thus, at the beginning of the equation-construction process, the machine system produces for each variable a candidate list of unidirectional causal variables.
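The prescreening and the tie-break for bidirectional causality can be sketched as follows. This is a plain restricted-versus-unrestricted OLS F-test with a fixed lag length, a simplification of whatever lag selection the production system uses; the function names are ours.

```python
import numpy as np

def lag_matrix(z, p):
    """Columns z_{t-1}, ..., z_{t-p}, aligned with observations t = p, ..., n-1."""
    n = len(z)
    return np.column_stack([z[p - j - 1 : n - j - 1] for j in range(p)])

def granger_f(y, x, p=2):
    """F-statistic for 'x Granger-causes y' with p lags of each variable:
    restricted model (lags of y only) vs. unrestricted (plus lags of x)."""
    n = len(y)
    target = y[p:]
    X_r = np.column_stack([np.ones(n - p), lag_matrix(y, p)])  # restricted design
    X_u = np.column_stack([X_r, lag_matrix(x, p)])             # add lags of x
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        return float(resid @ resid)
    rss_r, rss_u = rss(X_r), rss(X_u)
    df = len(target) - X_u.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / df)

def unidirectional_edge(y, x, p=2):
    """Where causality runs both ways, keep only the direction with the larger F."""
    f_xy, f_yx = granger_f(y, x, p), granger_f(x, y, p)
    return ("x->y", f_xy) if f_xy >= f_yx else ("y->x", f_yx)
```

When y is driven by lagged x but not vice versa, the F-statistic for x→y dwarfs the reverse statistic, and only the x→y edge survives into the candidate list.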

**Variable Selection Criteria**

Using the U.S. dataset, we considered and tested alternative criteria for forward selection of independent variables from these lists. One of these is the information criterion created by Akaike [1973]. Based on likelihood theory, the criterion, denoted AIC, is defined as follows:

AIC = *n* ln(*RSS*/*n*) + 2*k*  (1)

where *RSS* is the residual sum of squares of the regression equation, *n* is the number of observations, and *k* is the number of parameters. The AIC-based method is intended to generate equations with small *RSS* and *k* relative to *n*, and the equation with the smallest AIC is the best equation according to likelihood theory.

Another approach from Schwarz [1978] is rooted in Bayesian methods in which posterior distributions are employed to derive the Bayesian information criterion (BIC):

BIC = *n* ln(*RSS*/*n*) + *k* ln(*n*)  (2)

As with AIC, the equation with the smallest BIC is the equation selected. In contrast to AIC, BIC penalizes *k* with a weight of ln(*n*) instead of AIC's fixed weight of 2. Thus, the penalty in BIC is larger than that in AIC whenever ln(*n*) > 2, that is, whenever *n* ≥ 8.

Gagné and Dayton [2002] analyzed the performance of AIC and BIC using simulated regressions. Using the *valid predictor ratio* (VPR), which is the ratio of valid predictors to total, or potential, predictors, they found that for VPR < 0.5, BIC performed uniformly better than AIC. For most of the models in our analysis, the majority of variables in the dataset, even with the restrictions based on Granger causality described previously, have a VPR < 0.5. The authors also note that using *n* as the denominator in the term *RSS*/*n* in information criteria introduces bias into these measures because it does not correct for the number of parameters. Thus, we constructed an adjusted version of BIC that we call BICa, in which the denominator *n* in the first term in Equation (2) is replaced with *n* − *k* to yield the following:

BICa = *n* ln(*RSS*/(*n* − *k*)) + *k* ln(*n*)  (3)

In BIC, the extra weight ln(*n*) applied to the penalty term means that equations created with BIC tend to be smaller than equations created with AIC. BICa tends to produce even smaller equations, so the uniformly better performance of BIC-type criteria can be rationalized by the fact that, with many potential descriptors, AIC and its variants are in general likely to select too many variables, resulting in overfitting.
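The three criteria translate directly into code: AIC is *n* ln(*RSS*/*n*) + 2*k*, BIC replaces the fixed penalty weight 2 with ln(*n*), and BICa replaces the denominator *n* with *n* − *k*. A minimal sketch:

```python
import numpy as np

def aic(rss, n, k):
    return n * np.log(rss / n) + 2 * k                 # fixed penalty weight of 2

def bic(rss, n, k):
    return n * np.log(rss / n) + k * np.log(n)         # penalty weight ln(n)

def bic_a(rss, n, k):
    return n * np.log(rss / (n - k)) + k * np.log(n)   # degrees-of-freedom correction
```

For a given fit, BIC exceeds AIC once *n* ≥ 8, and BICa exceeds BIC because *RSS*/(*n* − *k*) > *RSS*/*n*, which is why BICa-selected equations tend to be the smallest.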

We use BICa to construct equations in the automatic system. We compute *RSS* in Equation (3) using both fourfold and tenfold cross-validation. Here, the sample is divided into 4 (10) equally long subsamples, the equation is estimated with 3 (9) of these subsamples, and residuals are computed for the remaining subsample. The process is repeated 3 (9) more times so that out-of-sample residuals are computed for all 4 (10) subsamples. Then *RSS* is computed from these residuals. The purpose of cross-validation is to construct models that are robust out of sample, and we adopted this method as part of the equation-construction process. Each equation is estimated with both fourfold and tenfold cross-validation, and the version with the lowest *RSS* is chosen as the equation for the system. We view cross-validation as an essential part of our system: Within the sample, parameters are pulled toward values that make *RSS* less than it would be if it were computed with cross-validation. Relying on within-sample *RSS* poses a serious risk of overfitting and poor out-of-sample performance.
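A minimal sketch of the cross-validated *RSS* computation, assuming contiguous, equally long folds (the article does not specify exactly how the subsamples are formed, so the fold construction here is our assumption):

```python
import numpy as np

def cv_rss(X, y, n_folds):
    """Cross-validated residual sum of squares using contiguous folds."""
    n = len(y)
    rss = 0.0
    for hold in np.array_split(np.arange(n), n_folds):
        train = np.setdiff1d(np.arange(n), hold)          # estimate on the other folds
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[hold] - X[hold] @ beta                  # out-of-fold residuals
        rss += float(resid @ resid)
    return rss

def best_fold_choice(X, y):
    """Estimate with fourfold and tenfold CV; keep whichever gives the lower RSS."""
    r4, r10 = cv_rss(X, y, 4), cv_rss(X, y, 10)
    return (4, r4) if r4 <= r10 else (10, r10)
```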

We employ forward selection as follows. A distributed lag of each series *x* on the unidirectional list of variables that Granger-causes the dependent variable *y* is appended one by one to a distributed lag of the dependent variable as shown in the following equation:

y_t = α + φ_1 y_{t−1} + … + φ_p y_{t−p} + β_0 x_t + β_1 x_{t−1} + … + β_q x_{t−q} + ε_t  (4)
The independent variable whose equation has the smallest BICa is selected as the first variable to be included in the forward-selection process. The residuals from this equation are then regressed, one by one, against a distributed lag of each of the remaining candidate series on the unidirectional list, and the series with the smallest BICa is selected for inclusion in the equation. Then the residuals from this new equation are regressed in the same way against each of the remaining candidate series, and the series with the smallest BICa is selected for inclusion. This process of adding new variables to the equations is allowed to continue under the condition that the addition of a variable reduces BICa. However, as this process repeats and the equation grows in size, degrees of freedom fall. To prevent degrees of freedom from falling too low for efficient estimation, the forward-selection process stops when the number of parameters is greater than one-fourth of the sample size. At that point, backward deletion begins: The variable in the equation with the *t*-statistic lowest in absolute value is deleted, and if BICa falls, the variable is permanently deleted from consideration for inclusion in the regression equation. This process continues until BICa rises. At that point, the variable just deleted is put back in the equation, and forward selection resumes. The process continues in this way, automatically switching back and forth between forward selection and backward deletion until either 1) all series that Granger-cause the dependent variable have been put into the equation or permanently deleted from consideration, or 2) the combination of forward selection and backward deletion does not improve the equation.^{2}
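The following simplified sketch captures the forward-selection half of the procedure: greedy addition of candidate variables while BICa falls, subject to the one-fourth-of-sample-size cap. It omits the residual-regression device and the switchback into backward deletion, and the helper names are ours.

```python
import numpy as np

def bic_a(rss, n, k):
    """Adjusted BIC: n*ln(RSS/(n-k)) + k*ln(n)."""
    return n * np.log(rss / (n - k)) + k * np.log(n)

def ols_rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_select(candidates, y, max_frac=0.25):
    """Greedy forward selection by BICa -- a simplified sketch of the forward
    step only (no residual regressions, no backward-deletion switchback).
    candidates: dict mapping variable name -> column (1-D array)."""
    n = len(y)
    cols = {"const": np.ones(n), **candidates}
    design = lambda names: np.column_stack([cols[m] for m in names])
    chosen = ["const"]
    best = bic_a(ols_rss(design(chosen), y), n, len(chosen))
    improved = True
    while improved and len(chosen) < max_frac * n:   # cap on parameter count
        improved, winner = False, None
        for name in sorted(set(cols) - set(chosen)):
            score = bic_a(ols_rss(design(chosen + [name]), y), n, len(chosen) + 1)
            if score < best:
                best, winner, improved = score, name, True
        if improved:
            chosen.append(winner)
    return chosen, best
```

With two strong true regressors and one irrelevant one, the routine reliably picks up the true regressors; whether a pure-noise column sneaks in depends on the usual BIC-type trade-off between fit improvement and the ln(*n*) penalty.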

We note in passing that Equation (4) contains the term β_0 x_t, a contemporaneous independent variable, and that, under forward selection, other contemporaneous variables may be added to the equation. These contemporaneous variables may survive backward deletion in the automatic construction process, which means that models built by the automatic system are to be solved as a simultaneous system of equations. We return to this issue in a later section, where we present our preferred way of solving simultaneous systems of equations.

We also note that, with regard to the various macroeconomic datasets, the system of equations created from the France–Germany dataset features feedback from one country to the other (i.e., a French time series may be modeled with both French and German series as causal variables, and a German time series may be modeled with both French and German series). The other datasets are treated as closed systems such that, for example, a Canadian time series is modeled with only Canadian series as causal variables.

In addition, we note that every time series is processed only one time into stationary form, which means that for each model the system of equations is linear in both the parameters and the variables.

**Seasonal Adjustment, Time Trend, and Backward Deletion Issues**

Some time series in our nine datasets are not seasonally adjusted, calling for seasonal dummy variables in equations. However, seasonal dummy variables may be necessary even when all variables in an equation are seasonally adjusted. Economic time series are often adjusted for seasonality by reporting agencies using some variant of a ratio-to-moving-average algorithm. Although other methods can produce seasonally adjusted series, ratio-to-moving-average algorithms are the method of choice for many reporting agencies, not because of optimality considerations, but because they are mathematically robust and simple to apply. Still, they can leave residual seasonality in so-called seasonally adjusted series. This is one reason seasonal dummy variables may turn out to be significant descriptors in regressions containing only seasonally adjusted data. The system described in this article deals with this issue by appending seasonal dummy variables to each equation that emerges from the process of forward selection and backward deletion and by subjecting the resulting equation to backward deletion using BICa and cross-validation. We find that in employing this process seasonal dummy variables appear occasionally in equations constructed by the system.^{3}
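Appending seasonal dummy variables is mechanically simple. A sketch for monthly data follows, with January absorbed by the intercept (an arbitrary but common normalization; the article does not say which month, if any, the authors drop):

```python
import numpy as np

def monthly_dummies(months):
    """11 dummy columns for months 2..12; January is absorbed by the intercept."""
    months = np.asarray(months)
    return np.column_stack([(months == m).astype(float) for m in range(2, 13)])
```

These columns would be appended to an equation's design matrix and then subjected to the same backward-deletion test as any other regressor.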

Mestre and McAdam [2011] examined the issue of “deterministic” shifts or breaks in econometric models, and they suggested that simple intercept adjustment rules have a significant impact on forecast accuracy. They proposed various schemes that center on the idea of using regression residuals at the end of the sample period to construct corrections to the regression intercept when computing forecasts outside the sample period. We tried variations of this scheme with no success and use an alternative approach for intercept correction: Following the appending of seasonal dummy variables and subsequent backward deletion, a time-trend variable is appended to each equation—the inclusion of a time-trend variable corrects for deterministic shifts in the intercept over time—and backward deletion is again applied to determine whether the time-trend variable should be included in the equation. In equations constructed with our U.S. database, we find that the time-trend variable is usually backward-deleted out of the equation.

The *t*-statistics used to select variables for potential elimination in backward deletion are derived from the within-sample residuals, but BICa used to delete those potential variables is computed from the *RSS* obtained from the cross-validation folds. Thus, some equations constructed with the method as described thus far tend 1) to include parameters with weak *t-*statistics and 2) to be large in the number of parameters, possibly an indication of overfitting. Thus, we include an additional step in the equation-construction process in which the equations are subjected to a final adjustment at the end of the process where backward deletion is performed using BICa computed from the within-sample residuals. With this adjustment, weak *t*-statistics are eliminated, and econometric equations are generally smaller.

**Characteristics of the Machine-Built Equations**

The econometric equations produced by the system are at times unlike those produced by human judgment and often have a distinctive appearance. The system converts most time series to differences of logarithms, and the regression equations sometimes feature very short distributed lags of the dependent variable. Sometimes there are one or two key independent variables in an equation with very large *t*-statistics in absolute value that make intuitive sense as causal variables; these variables may appear either as single variables or in the form of distributed lags. In addition, there are sometimes a few other independent variables with typically smaller *t*-statistics that seem to serve as modifiers of the key variables. One or more seasonal dummy variables may be present in some equations. A few equations contain the time-trend variable.

The equations produced by the system sometimes make immediate economic sense, but at other times they may be novel in that they make economic sense only after some reflection. Thus, the system may be useful to forecasters who are looking for new ideas about how to modify and improve existing forecasting systems built with human judgment. Some equations contain independent variables that are difficult to rationalize at all in economic terms. Although it is an infrequent occurrence, a few equations contain only the intercept term; with most series converted to differences of logarithms, such equations mean that the series is forecast to grow at a constant rate. Even though all estimation performed by the system is ordinary least squares, residual serial correlation and heteroskedasticity problems are largely absent in the estimated equations. The attention to conversion to white noise contributes to this effect.

Exhibits 2 to 5 provide a selection of examples of the machine-built equations constructed with the U.S. database. In the first example, we see an equation for the housing sector. In the U.S. dataset, housing starts and building permits are broken down into the four major regions reported by the U.S. Census Bureau: Northeast, Midwest, South, and West. In the example, we see that the machine system models Northeast housing starts as a function of Northeast building permits. In most localities in the United States, a housing start, which is the pouring of the foundation of a structure, must be preceded by the issuance of a building permit, so the machine system has modeled starts on a well-known legal–institutional relationship between starts and permits, and it has correctly linked starts to permits within the same region of the country. The distributed lag on permits reflects the fact that a start may take place over a period of months after the permit is issued. The negative coefficients of the lagged dependent variables reflect the fact that, once a start takes place, the permit cannot be used again, thereby reducing the number of possible starts going forward in time.

In the second example, the machine system models the prime rate as a function of the federal funds rate. Many banks set their prime rate on a cost-plus basis, often using the federal funds rate or a weighted average of the funds rate and, say, large certificate of deposit rates or the London Interbank Offered Rate as the measure of the cost of funds. In the equation, the federal funds rate appears contemporaneously with a coefficient of 0.7874 and lagged one period with a coefficient of 0.1897. The lagged funds rate reflects the sticky adjustment of the prime rate to the funds rate (see Zhu, Chen, and Li [2009]).

In the third example, the machine system models the Conference Board Consumer Confidence Present Situation Index as a function of its Consumer Confidence Future Expectations Index. Interestingly, wholesale trade also appears as a right-hand variable. Wholesale trade may serve as the kind of modifier variable that we mentioned earlier: It appears as a right-hand variable with some frequency in other equations in the U.S. model, suggesting that it may have greater informational content than model builders often realize. Manufactured goods flow through the wholesale sector into the retail sector, and so information from the wholesale sector may serve as an indication of imbalances between the retail and manufacturing sectors.

In the fourth example, the machine system models the Institute for Supply Management Manufacturing Prices Index as a function of price measures often associated with the manufacturing sector.

At a glance, these examples show that the machine system constructs equations that are very different from those using traditional vector autoregression (VAR). Employing traditional VAR with the U.S. database, there would be hundreds of right-hand variables in equations instead of the parsimonious representations seen in the examples. Even where the sample size is greater than the number of independent variables, as would be the case with some of the smaller datasets employed in this article, there can be a large number of right-hand variables, resulting in noisy estimates, unstable predictions, and difficult-to-interpret temporal dependence. Bayesian VAR could alleviate this problem to some extent, but it could still generate equations with too many variables. The number of variables in a dataset emerges as an issue again with seemingly unrelated regression (SUR) estimation, which requires the inversion of the residual covariance (skedasticity) matrix S. The numerical precision of this computation tends to fall as the number of variables in the dataset increases.

**Solving and Forecasting with the System**

Because 1) each series can enter into an equation only in its unique converted form, and 2) the equations are estimated with classical linear regression, the system of equations could be solved with Cramer’s rule. However, the machine system employs the simple and robust Gauss–Seidel method (GS) (see Jeffreys and Jeffreys [1988]) used in various mathematics software packages to solve systems of equations. Under GS, in each forecast period, each dependent variable is set initially at its value in the previous period, and the forecasting equations are computed using that value in regression equations in which it appears as a contemporaneous right-hand variable. This generates updates of the dependent variables. These updated values are substituted for the respective previous values of the contemporaneous right-hand variables, and the forecasting equations are computed again. The process continues until the values of the dependent variables stabilize. Forecasts for the next period then are computed using the same approach. Where the underlying system of equations is well behaved, GS converges quickly to the forecast solution, producing results identical to Cramer’s rule.
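The GS iteration described above can be sketched as follows. The two-equation system and its coefficients are invented for illustration (they are not the article's equations); the sketch shows only the mechanics: each dependent variable starts at its previous-period value, each equation is evaluated in turn using the most recently updated values, and iteration continues until the values stabilize.

```python
# Gauss-Seidel solution of one forecast period for a hypothetical
# simultaneous two-equation system:
#   x = 0.5 * y + 1.0
#   y = 0.3 * x + 2.0

def gauss_seidel_forecast(x_prev, y_prev, tol=1e-10, max_iter=100):
    # Initialize each dependent variable at its previous-period value.
    x, y = x_prev, y_prev
    for _ in range(max_iter):
        x_new = 0.5 * y + 1.0        # equation for x uses current y
        y_new = 0.3 * x_new + 2.0    # equation for y uses the updated x
        if abs(x_new - x) < tol and abs(y_new - y) < tol:
            return x_new, y_new      # values have stabilized
        x, y = x_new, y_new
    return x, y

x_f, y_f = gauss_seidel_forecast(0.0, 0.0)
```

For a well-behaved linear system like this one, the iteration converges to the same solution an exact method (such as Cramer's rule) would give: here x = 2/0.85 and y = 0.3x + 2.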

GS greatly simplifies employing BMA. In addition, we note that GS lends itself to future enhancement of the system. Because of the iterative nature of GS, the system of forecasting equations need not be linear, either in the parameters or the variables. In fact, the equations need not be regression equations. For example, GS can solve a system that combines regression equations and neural networks.

**BAYESIAN MODEL AVERAGING**

In the out-of-sample performance tests that follow, the forecasts from the machine system are tested as standalone forecasts, and they are also combined with forecasts from equations representing autoregressions and modified autoregressions using BMA. BMA, proposed by Leamer [1978], has received attention from various authors, including Min and Zellner [1993]; Raftery, Madigan, and Hoeting [1997]; Hoeting et al. [1999]; Cremers [2002]; and Wright [2008]. Cremers [2002] used it to forecast stock-index excess returns and found that BMA improves forecasts, and Wright [2008] used it to forecast exchange rates out of sample and found that under the right conditions BMA forecasts outperform accepted benchmarks. The mathematics of BMA is discussed in the appendix.

**OUT-OF-SAMPLE FORECAST RESULTS**

Out-of-sample forecasts were performed as follows: January 2008 to December 2010 was selected as the forecast period for each model. We chose this three-year period because it presents a challenge to any forecasting system: It encompasses the onset of a sharp global recession followed by a recovery, making it useful for testing the system’s ability to forecast the economy. The challenge is made more interesting and difficult because the economies in our analysis responded in different ways to the recession, and their recoveries were different. Exhibit 6, which shows industrial production for each country in the three-year period rebased to 1 at the beginning of the period, serves to illustrate the differences in how the countries in our analysis reacted to the recession. For example, although each country was affected by the recession, India recovered relatively quickly in the three-year period, Ireland experienced a gradual slowdown but recovered eventually, Japan experienced a very sharp contraction and managed to stage only a partial turnaround, and Greece did not recover.

To begin the forecast computation, the white-noise forms of the variables, the Granger rankings, and the equations of each macroeconomic model were created using data for the respective sample beginning periods to the end-of-sample period in December 2007. Out-of-sample forecasts of one to six months ahead were computed for January to June 2008, and the resulting root mean square errors (RMSEs) one to six months ahead were stored. Then the sample period was lengthened one month to end in January 2008; new white-noise forms, Granger rankings, and equations were created with that sample; out-of-sample forecasts of one to six months were computed for February to July 2008; and the RMSEs one to six months ahead were stored. This process was repeated, extending the sample end month by one month each time and computing out-of-sample results, through the end of the forecast period.
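The expanding-window bookkeeping above can be sketched in outline. Here `build_model` and `forecast` are stand-ins for the article's full pipeline (white-noise conversion, Granger ranking, equation construction); the skeleton shows only the rolling refit-and-score loop, demonstrated with a toy naive model on invented data.

```python
import math

def expanding_window_rmse(series, build_model, forecast, first_end, horizon=6):
    """Extend the sample end one period at a time, refit, forecast
    1..horizon steps ahead, and collect RMSEs per horizon."""
    sq_errors = {h: [] for h in range(1, horizon + 1)}
    for end in range(first_end, len(series) - horizon):
        model = build_model(series[:end])      # refit on data up to `end`
        preds = forecast(model, horizon)       # forecasts 1..horizon ahead
        for h in range(1, horizon + 1):
            sq_errors[h].append((preds[h - 1] - series[end + h - 1]) ** 2)
    return {h: math.sqrt(sum(e) / len(e)) for h, e in sq_errors.items()}

# Toy stand-ins: a "model" that simply remembers the last observed value
series = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1]
rmses = expanding_window_rmse(series,
                              build_model=lambda s: s[-1],
                              forecast=lambda m, h: [m] * h,
                              first_end=4, horizon=2)
```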

**Alternative Yardsticks and Performance Measurement of the Machine System**

Alternative forecasting approaches were also computed as yardsticks for comparison with the system’s forecasts. In one of these, the forecast is a random walk without drift—that is, the forecast is set equal to the actual value in the last period before the forecast period begins. We call this the *naïve* approach. We place some emphasis on the comparison of the system’s forecasts to those of the naïve model. The naïve approach provides a useful diagnostic in that it is a *flat* forecast, which means that if the system outperforms the naïve yardstick, the system is in a sense “getting right” the direction, up or down, of the time series in the database. We also computed an AR(12) model for each variable in the database and compared its forecasts to those from the system. AR models are often treated as a benchmark comparison for tests of forecasting systems (Stock and Watson [2003, 2008], Diebold and Li [2006], and Elliott and Timmermann [2013]). It has been found that they are hard to beat and frequently produce smaller forecast errors than models with additional independent variables (Pesaran and Timmermann [2005]).

Let *RMSE_S*(*i,j*) denote the RMSE of variable *i* for the forecast *j* months ahead produced by the system and *B*(*i,j*) denote the corresponding RMSE of the naïve benchmark model. We computed the value *D* as follows:

where *n* is the number of variables in the model. Using ratios of RMSEs normalizes results so that RMSEs of individual series do not dominate. A value of *D* greater than zero indicates that the system is outperforming, in an overall sense, the naïve benchmark model. We also computed the expression for *D* substituting the RMSEs of the AR(12) forecasts for the RMSEs of the naïve forecasts.
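The article's equation for *D* is not reproduced in this excerpt. Purely as an illustration, the sketch below assumes one plausible form consistent with the surrounding description: the average over the *n* variables of one minus the RMSE ratio, so that ratios normalize across series and *D* > 0 when the system's RMSEs are smaller than the benchmark's overall. The exact expression should be taken from the article.

```python
def d_statistic(rmse_system, rmse_benchmark):
    """Hypothetical form of D (an assumption, not the article's formula):
    average of 1 - RMSE_S(i, j) / B(i, j) over the n variables.
    Using ratios keeps large-valued series from dominating;
    D > 0 means the system outperforms the benchmark overall."""
    n = len(rmse_system)
    return sum(1.0 - s / b for s, b in zip(rmse_system, rmse_benchmark)) / n

# Toy RMSEs: the system beats the benchmark on two of three series
D = d_statistic([0.8, 1.5, 2.0], [1.0, 1.5, 4.0])
```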

The results for both the naïve and AR models are shown in Exhibit 7 for forecasts of one to six months. In the exhibit, the results for *D* for the system relative to the naïve model are presented under the label “system to naïve,” and the results for *D* for the system relative to the AR model are presented under the label “system to AR.” The results are for the system’s forecasts without BMA, and they show that, although results vary across economies, when averaged over all the economic datasets in our analysis, the system’s forecasts outperform the naïve forecasts and the AR(12) forecasts in all forecast periods.

**Performance Measurement Using BMA**

We also computed forecasts combining the machine forecasts with forecasts from an AR(12) model using BMA. We call these the BMA_{ma} forecasts. The results are shown in Exhibit 8. Comparing the results to those in Exhibit 7, we see that performance is better overall than it was without BMA. BMA_{ma} outperforms both benchmarks for each model in every forecast period with the exception of the naïve model for Ireland in period 6.

Exhibit 9 shows the average Bayesian weights for all variables for each model for all forecast periods for BMA_{ma}. The weight *w_{m}* for the machine model is greater than the weight *w_{a}* for the AR(12) model, and each weight is stable across the nine economic models.

We also combined the forecasts of the machine system with those from an AR(12) model and an AR(1) with appended seasonal dummy variables. We call these the BMA_{mas} forecasts. The results are given in Exhibit 10, which shows that the BMA_{mas} forecasts outperform both the naïve model and the AR benchmark for each macroeconomy in all six forecast periods. Comparing the results in Exhibit 10 to those in Exhibit 8, we find that the BMA_{mas} forecasts also outperform the BMA_{ma} forecasts relative to the naïve model in all but a few forecast periods in a few datasets, and they outperform relative to the AR(12) model in all periods in all datasets.

We note that for the Irish economy, BMA_{mas} lost ground progressively to the naïve model over the six-month forecast period. Referring again to Exhibit 6, which shows industrial production for the various economies, one explanation for this effect may be that the Irish economy, more than the others, tended to oscillate around its initial level of industrial production in the forecast period and did the same with regard to other time series in the dataset for Ireland, so that the naïve model had a built-in advantage. For Ireland, BMA_{mas} may have had an advantage over the naïve model in the shorter forecast periods because it incorporates useful information not available to the naïve model, but the naïve model progressively closed some of the gap as forecast errors in the BMA_{mas} system naturally accumulated with longer forecast periods. Lending weight to this interpretation, Japan, Greece, and India diverged strongly from their initial values over the forecast period, with Japan and Greece diverging below and India diverging above. Thus, the naïve model was at a disadvantage for these economies and performs poorly relative to the BMA_{mas} forecasts as seen in the exhibit.

Let *w_{s}* denote the weight of the forecasts of the AR(1) model with appended seasonal dummy variables. Exhibit 11 shows the average weights for all variables for each model. Each weight is stable across the macroeconomic models, and for every model the largest weight is *w_{m}*.
Exhibit 12 shows the distribution of *w_{m}* for the U.S. model for all variables in all forecast periods. Why does BMA_{mas} perform as well as it does? Does BMA_{mas} beat the AR(12) forecasts, for instance, because it co-opts them through model averaging? We think the answer is more complicated. The largest weights in both BMA_{ma} and BMA_{mas} are for the machine forecasts, and for the BMA_{mas} weights shown in Exhibit 11, the AR(12) weight *w_{a}* averages just 0.280 across the nine models, which is marginally less than the average *w_{s}* and much less than the average *w_{m}*. We think that BMA_{mas} performs as well as it does because it combines forecasts that come from very different ways of representing the data. The machine forecasts allow feedback from variable to variable; in contrast, the AR(12) forecasts contain information only from the dependent variable itself. There is no simultaneity in these forecasts to create potentially undesirable feedback effects and forecast instability, so the AR(12) forecasts serve to anchor BMA_{mas}.

The forecasts from the AR(1) equations with appended seasonal dummy variables provide information about seasonal effects. Of course, seasonal dummy variables are included as candidate series in the machine equations, but other variables may mimic the effects of seasonality and provide information other than seasonality, thus displacing seasonal dummy variables that would otherwise survive the backward deletion process used in the machine system. The forecasts from the AR(1) equations with appended seasonal dummy variables therefore may serve to restore lost information about seasonality to the forecasts using BMA_{mas}. Likewise, the AR(12) terms are included in the original Equation (4), but these terms, like all of those added to the equation under the forward-selection process, are subsequently subjected to backward deletion under the BICa criterion, which, as we have said, tends to produce smaller equations than either BIC or AIC. In some instances under this relatively strong criterion, AR(12) terms that provide useful information to the forecast could be deleted, and so reintroducing that information through BMA_{mas} may improve forecast results.

**Forecast Stability**

Using the BMA_{mas} forecasts, we normalized the RMSEs for each series in each of the models by dividing the RMSE in each forecast period by the RMSE for that series in the one-period-ahead forecast, and we computed the average of these normalized RMSEs for each forecast period as follows:

Exhibit 13 shows the results. Here, we see the progression of normalized RMSEs of the system as the forecast lengthens, and we see that the average normalized RMSEs generally do not exhibit explosive behavior as the forecast period is extended. The normalized RMSEs inflect slightly upward for the model of India, which contains only 14 series.
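The normalization just described can be written out directly: each series' RMSE at horizon *j* is divided by that same series' one-period-ahead RMSE, and the normalized values are then averaged across series per horizon. A minimal sketch with invented RMSE values:

```python
def normalized_rmse_by_horizon(rmse):
    """rmse[i][j] is the RMSE of series i at horizon j+1. Divide each row
    by its one-period-ahead RMSE, then average down columns to get the
    mean normalized RMSE per horizon (horizon 1 averages to 1 by design)."""
    horizons = len(rmse[0])
    norm = [[row[j] / row[0] for j in range(horizons)] for row in rmse]
    return [sum(row[j] for row in norm) / len(norm) for j in range(horizons)]

# Toy RMSEs for two series over three horizons
avg = normalized_rmse_by_horizon([[2.0, 3.0, 4.0],
                                  [0.5, 0.6, 0.7]])
```

A slow, sub-explosive rise of these averages with the horizon is the kind of stability the exhibit illustrates.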

**Directional Performance**

The machine-built equations were constructed to minimize forecast error rather than to forecast direction, but forecast directional properties are important in the implementation of global macro trading strategies. Leitch and Tanner [1991] examined fixed income trading based on interest rate forecasting and found that traditional measures of forecast accuracy such as RMSE are not meaningful indicators of the profitability of forecasts. Leitch and Tanner [1991, p. 588] found instead that the “only substitute criterion for [trading] profits found in the literature that appears to be closely related is directional accuracy.” To see the implications of this statement with the automatic model-building system, we conducted a simple exercise with out-of-sample BMA_{mas} forecasts of the 30-year Treasury bond yield in the U.S. model. We computed one-period-ahead forecasts of the yield for our previously chosen forecast period of January 2008 to December 2010. We then applied a simple rule that, if the forecast called for a decrease in yield, a long position was taken in that month in the Chicago Board of Trade 30-year Treasury bond first-expiring (nearby) contract. If the forecast called for an increase in yield, we computed two possible actions: In one of these, a flat position (i.e., a cash position) was taken; in the other, a short position was taken in the contract. In addition, to allow for the effects of data-reporting lags, we computed the same exercise using two-period-ahead forecasts. Exhibit 14 shows the outcomes compared to a buy-and-hold approach.
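The long/flat and long/short rules can be sketched as follows. The yield forecasts, last observed yields, and contract returns below are toy stand-ins (the article uses CBOT 30-year Treasury futures data); the sketch illustrates only the signal-to-position logic.

```python
def run_rules(yield_forecasts, last_yields, contract_returns):
    """Map one-period-ahead yield forecasts to futures positions.
    A forecast decline in yield -> long the nearby contract; a forecast
    rise -> flat (cash) in one variant, short in the other."""
    wealth_flat, wealth_short = 1.0, 1.0
    for t, r in enumerate(contract_returns):
        forecast_change = yield_forecasts[t] - last_yields[t]
        if forecast_change < 0:          # yield forecast to fall: go long
            wealth_flat *= 1.0 + r
            wealth_short *= 1.0 + r
        else:                            # yield forecast to rise
            wealth_short *= 1.0 - r      # short the contract
            # long/flat variant holds cash: wealth_flat unchanged
    return wealth_flat, wealth_short

# Toy inputs (invented, not the article's data)
forecasts = [3.9, 4.1, 3.8]      # forecast yields
last_obs  = [4.0, 4.0, 4.0]      # last observed yields before each month
returns   = [0.02, -0.01, 0.01]  # monthly returns on the nearby contract
wf, ws = run_rules(forecasts, last_obs, returns)
```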

For the one-period-ahead forecasts, starting with an investment of $1 at the end of December 2007, the buy-and-hold approach generates an investment of $1.05, the long/flat approach generates $1.27, and the long/short approach generates $1.50 at end of the final period, December 2010. For the two-period-ahead forecasts, starting with $1 at the end of January 2008, the buy-and-hold approach generates $1.02, the long/flat approach generates $1.15, and the long/short approach generates $1.28 at the end of December 2010. We therefore see evidence that directional forecasts matter. Exhibit 15 presents performance measures in the form of annualized efficiency for the one- and two-period-ahead forecasts.

We emphasize that this exercise is not presented as a trading strategy. It lacks the analytical rigor and scope to suggest a trading strategy. Our purpose is simply to demonstrate the usefulness of the automatic model-building system.

More broadly, we computed the one-period change in direction, up or down, of all the economic series in each macroeconomic model converted to their as-reported forms for the one- to six-month BMA_{mas} forecasts. Exhibit 16 summarizes the directional results expressed as percentages for all variables and all forecast periods. BMA_{mas} gets the one-period change in direction right 61.8 to 69 percent of the time in the one-period-ahead forecasts, and it gets the one-period change in direction right 54.9 to 62.6 percent of the time in the six-period-ahead forecasts.

**FINDINGS**

This article presents a system for constructing multiequation forecasting systems based on a simple set of rules that could be used to facilitate the implementation of global macro trading strategies. We apply the system to a variety of macroeconomies: the United States, France and Germany combined, Canada, Japan, the United Kingdom, Sweden, Greece, Ireland, and India. The datasets of these nine economies vary substantially in the degree of aggregation and the number of variables. The system combines the use of Granger causality tests, alternating forward selection and backward deletion, information criteria, and cross-validation to construct forecasting equations. BMA is used to combine the machine-built forecasts with forecasts from commonly used autoregression-based models, weighting each model by its posterior probability. Thus, we not only estimate the best model but also incorporate the relative merits of all the candidate models. The weighted-average forecasts generally beat established benchmark performance measures for all of the economies tested in all forecast periods, and the machine forecasts generally receive the highest weight among the candidate models, which we regard as a demonstration of the high explanatory power of such forecasts.

The automatic model-building system described here bears a relationship to other automatic model-building systems, but it differs in that it builds up the small parsimonious representation, and it combines these small equations through Bayesian model averaging with models that constitute alternative representations of the data. The system performs well for economies represented by a large number of variables and also for very small sets of variables. This could be of particular help when conducting research in new markets and economic systems. The workings of the system can be rationalized on the grounds that, given the complexity of economic systems, the best one can do is to build forecasting models based on statistical properties in the data.

**APPENDIX**

**MATHEMATICS OF BMA**

Under the BMA scheme, the forecast is a posterior Bayesian mean. With a set of candidate models *M_{i}*, *i* = 1, 2, …, *n*, for a training dataset *Z*, the posterior mean is:

where *Y* is the variable to be forecast. This Bayesian forecast is a weighted average of the individual forecasts, and the weight is the posterior probability of each model. To calculate the posterior probability, we followed the work done by Ripley [1996]. With a uniform prior distribution Pr(*M_{i}*) = 1/*n* over the model space, Pr(*M_{i}*|*Z*) is proportional to:

where *θ* represents the *k* parameters in the model and L(*θ*|*Z*, *M_{i}*) is the likelihood function. Under general regularity conditions near the MLE, using the saddle-point approximation or Laplace's method, the likelihood function can be approximated as:

where V(·) is the estimated variance-covariance matrix of the MLE. Using the improper prior dθ, the integration is related to the multivariate normal distribution, and the normalization constant in the integration can be obtained by means of:

where |.| is the determinant of a matrix. Subsequently, we can get:

where V_{1}(·)^{-1} is the information matrix. In addition, we have:

Taking the log of this expression and multiplying it by –2, we obtain:

Because asymptotically the last two terms are dominated by the term of order ln(*m*), we arrive at the BIC criterion by dropping them. Let *w_{i}* denote the weight of the forecast of model *i*. Then the posterior probability of each model *M_{i}* can be approximated by:

where BIC_{i} (BIC_{j}) is the BIC for model *i* (*j*). In our forecast tests, these BICs are computed for the sample period preceding the forecast interval using 10-fold cross-validation.
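This last approximation, the standard BIC-based weight w_{i} = exp(−BIC_{i}/2) / Σ_{j} exp(−BIC_{j}/2), can be implemented directly; subtracting the minimum BIC before exponentiating keeps the computation numerically stable (the shift cancels in the ratio). A sketch with illustrative BIC values, not the article's:

```python
import math

def bma_weights(bics):
    """Approximate posterior model probabilities from BICs:
    w_i = exp(-BIC_i / 2) / sum_j exp(-BIC_j / 2).
    Shift by min(BIC) first for numerical stability; the shift cancels."""
    b0 = min(bics)
    unnorm = [math.exp(-(b - b0) / 2.0) for b in bics]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Toy BICs: the lowest-BIC model receives the largest weight
w = bma_weights([100.0, 102.0, 110.0])
```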

## ENDNOTES

1. In early versions of the system as applied to the U.S. dataset, the system was allowed to difference a second, third, and fourth time. In rare instances, the system selected these higher degrees of differencing. The result was that the system of equations at times exhibited instability in its forecasts of these series, presumably because of the accelerative effects of excessive differencing, and performance suffered. Thus, the system presented here does not compute beyond first differences and is limited to four types of series: *y_t*, Δ*y_t*, ln(*y_t*), and Δln(*y_t*).

2. Using the U.S. dataset, we also tested a version of the system in which forward selection is performed not with the residuals of the existing equation but with variables appended to an equation in which the dependent variable is the series that is being modeled. Performance was slightly degraded with this system, and it was abandoned in favor of the version described earlier.

3. Using the U.S. dataset, we examined an alternative method in which the seasonal dummies were put in the equation first, and then the variables from the list of series that Granger-cause the dependent variable were appended using forward selection and backward deletion as described previously. There was no discernible overall difference in performance in forecast tests, and we adopted the practice of appending seasonal dummies after constructing the equation from the list of Granger-causal variables.

- © 2016 Pageant Media Ltd