E-commerce behaves differently from the traditional retailing industry. Customers’ online purchase experience requires less seller-buyer interaction, and customer satisfaction depends on factors that are outside the realm of the online selling company. A huge customer base, changing seasonal patterns, a lack of historical data for new products, disruption in social and behavioral norms, shifting customer buying criteria, and disruptions in demand patterns from competitors make forecasting a very difficult task. Determining which forecast model to apply is difficult, and measures of forecast accuracy are not always reliable.
E-COMMERCE IN PERSPECTIVE
The e-commerce industry is growing considerably. According to Statista.com, a reliable online statistics company, business-to-consumer (B2C) e-commerce sales are expected to grow 15.6% worldwide this year and approximately 13% in 2016. The number of online stores is also growing as improvements in technology change mobility, data interchange, and accessibility in today’s agile and dynamic business environment. Existing e-tailers[i] like Amazon.com and Alibaba.com[ii] are constantly looking for ways to stay ahead in this race to achieve and maintain competitive advantage. As a result, the level of competitiveness in this industry is rapidly changing. Furthermore, one aspect of great concern throughout the entire e-commerce industry is the challenge posed by uncertainty in demand planning and inventory management. Uncertainty is not a new trend; it has challenged replenishment and sourcing managers for a long time. The e-commerce industry, on the other hand, is fairly new and rapidly growing. The old rules that apply to traditional retailing practices are not as applicable to the e-tailer business as might be thought. Therefore, demand is a priority on the agendas of managers at all levels for online sellers. An important question to answer is, “What forecasting model could more accurately tackle the uncertainties in this industry?” The major challenges in forecasting demand for e-commerce include the size of the customer base, changing seasonal patterns, a lack of historical data for new products, disruption in social and behavioral norms, shifting customer buying criteria, and disruptions in demand patterns from competitors (Forrester Consulting).
The era of digitalization has brought many benefits for buyers and opportunities for sellers. On the seller’s side, these opportunities do not come without hardships and challenges. The e-commerce industry has the advantage of great accessibility, almost an omnipresence, which makes the predictive analytics side of today’s business environment almost unmanageable. Nonetheless, some companies have managed to come up with very clever strategies to gain market share in this endless pool of potential customers. CRM systems, search engine optimization, click and collect, and even pay-per-click marketing strategies are all part of the predictive analytics activities that translate into competitive advantage and core competencies. Everywhere there is a computer or a smartphone with Internet access, there is the possibility of one or several customers. That is where CRMs can play an important role. E-tailers store personalized data organized by purchasing patterns. It does not matter what computer an online buyer uses; IP addresses are not always taken into consideration (at least not as a key factor to locate customers). Instead, e-tailers rely on other cues like the customer’s first and last name, address, country, credit card information, etc. There is a current trend of telecommunication and marketing companies integrating with CRMs in efforts to improve customer service and increase data collection capabilities. Avaya IP Office is one example that has been integrated with CRMs (avaya.com). Along with personal information, CRMs keep records of items bought by every new and repeat customer. This is not enough, though; customers’ changing buying patterns are too unpredictable for a company to collect the necessary information on its customer base, especially on a global scale. Yes, forecasts can be made from historical data on regular customers based on these patterns, but what about new customers?
A time series forecast model lacks the ability to account for exogenous variables that influence buying decisions in the short run. The answer, then, is social networking and media marketing (DeMers). How many times do we run into a pop-up window while surfing the net that reads, “sign in with Google [or] sign in with Facebook”? What do you think that means? E-tailers, blogs, and even websites like Forbes.com and Quora.com, as well as the giants Amazon.com and Alibaba.com, all use this marketing device to access customers’ information. From a demand forecasting standpoint, what they are doing is merely filling in those independent variables that are not otherwise available for new customers. Information is all around us, and social networking is a paradise for online sellers (Carroll).
Before going further in this analysis of forecasting demand in e-commerce, we need to acknowledge a simple fact: “forecasts are almost always wrong, if not always” (Production and Inventory Management). This phrase, popular within the community of econometricians and forecasting experts, points out the sad but real paradox that surrounds this necessary business practice. Two questions then arise: How wrong is a forecast, and what model minimizes the errors associated with the forecast? The differences in supply chain structure between the traditional retailing business and its younger brother, e-commerce, make it much more important for managers to predict demand as accurately as ‘humanly’ possible. Customer satisfaction levels in online sales do not depend on the human interaction between a customer and a sales representative, or even a cashier and the consumer. The buying experience in online sales depends more on other factors, like online product availability and short delivery times, among others. This is truly what customers want in an online purchase experience. All these factors depend on an accurate forecast of demand.
Forecasting demand and inventory levels accurately is a challenge. Measures of forecast accuracy are as important and as useful as the forecast itself. As mentioned before, to determine future e-commerce demand, we should not rely solely on historical data.
A solution to forecasting demand in a more accurate manner would be to integrate different factors from marketing and qualitative forecasting into a multiple regression analysis. By combining research methods like data mining with the forecaster’s experience, and by studying economic parameters at the micro (short-term) and macro (long-term) levels, independent variables can be drawn to build a good model. However, there is only so much a forecaster and an expert can learn from these marketing strategies about the behavior of an irregular customer. Once again, the customer base is too big and the buying criteria are too spread out across many regions. It is almost impossible to create a model that encapsulates so many independent variables and outcomes. A solution to this problem would be to create subgroups according to specific criteria, but by doing so we run into the problem of multi-collinearity[iii] (Hanke and Wichern, 297). In addition, a multiple regression forecast under the conditions posed by e-commerce demand could use ‘dummy variables’ to set the boundaries between the bias of qualitative forecasts and the dependent variable (Hanke and Wichern, 297-300). When a multiple regression analysis is combined with qualitative forecasting techniques, the indicator or dummy variables can nullify coefficients that are not significant to the model (Hanke and Wichern, 293). Nevertheless, as mentioned before, this is very difficult because of the dependency on qualitative methods for determining the regressors, or independent variables, and because of the very large customer base.
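The multi-collinearity concern raised above can be illustrated with a minimal sketch that screens candidate regressors by their pairwise correlation. The regressor names, the numbers, and the 0.9 cut-off are all hypothetical assumptions made for illustration; none of them come from the data used in this paper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_collinear_pairs(regressors, threshold=0.9):
    """Return pairs of regressors whose |r| meets an assumed threshold."""
    names = list(regressors)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_r(regressors[names[i]], regressors[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], round(r, 3)))
    return flagged

# Hypothetical candidate regressors (illustrative values only).
regressors = {
    "site_visits": [10, 12, 15, 19, 24, 30],
    "ad_clicks": [21, 25, 31, 39, 49, 61],   # exactly 2 * site_visits + 1
    "avg_discount": [5, 3, 6, 2, 7, 4],
}

flagged = flag_collinear_pairs(regressors)
print(flagged)
```

In this toy example only the pair that moves in lockstep is flagged, which is the situation in which one of the two regressors would normally be dropped or combined before fitting the regression.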
A solution to some of the problems of forecasting demand for e-commerce mentioned before might be obtained by applying another forecasting model. We have established that time series analysis is not the best forecasting technique because of the lack of historical data on new customers, the seasonal differences across regions, and the extremely large customer base. We have also recognized that multiple regression analysis is not a very effective method of forecasting future e-commerce demand because of an unrealistic dependency on the qualitative selection of independent variables; we will continue to support this claim. The forecast model that I offer next is a regression with time series data (Hanke and Wichern, 339-367). This is a combination of both time series and regression analysis. It takes the best of both models to predict future demand and leaves us only with the problem of autocorrelation. Why is this model better than the others? Time series models, including Holt-Winters and the exponentially weighted moving average, do not include the effects of external factors as causal models do. The opposite applies to causal models; they do not take historical patterns like seasonality, cyclicality, trend, and level very seriously. The challenges of forecasting demand for e-commerce apply alternately to both time series and causal models. However, if we combine both, we can reduce or pool the risk in a way that minimizes the forecasting error and optimizes measures of accuracy, like mean absolute percentage error [MAPE], absolute percentage error [APE], mean square error [MSE], and the correlation coefficient.[iv] Nevertheless, a model like this, capable of integrating time series data and regression analysis, is sadly going to keep a few weaknesses from each model. One such defect, autocorrelation, applies specifically to its regression element (Hanke and Wichern, 347).
Autocorrelation brings a series of problems, the first being the omitted variable or model specification error (Hanke and Wichern, 348). The solution to this challenge would be to improve the model specification, or simply to find the missing variable. This part is not that simple, because the variable may not be available or may not be quantifiable. We say that it is not quantifiable when assumptions about relevant regressors are drawn from a qualitative standpoint. We already stated that quantitative forecasting and independent variable selection are very hard tasks when forecasting demand for e-commerce, because of the lack of historical data on new and prospective customers and all the other factors previously mentioned. The same problem applies to the model specification solution for autocorrelation in customer demand for e-commerce products. The second problem concerns very highly auto-correlated data, which is addressed by regression with differences (Hanke and Wichern, 350). In regression with time series data, we may run into very highly auto-correlated data. A solution would be, instead of running a regression on the dependent and independent variables themselves, to use the difference between the dependent variable at time $t$ and itself lagged one period, $Y_t - Y_{t-1}$. This solution also requires differencing the predictors, which in our case are lagged values of the series such as $Y_{t-1}$ and $Y_{t-k}$; we will see later on how this is not completely a bad circumstance (Hanke and Wichern, 350). The third problem with autocorrelation, or serial correlation, is the possibility of having auto-correlated errors, which are treated with what is known as generalized differences (Hanke and Wichern, 354). This condition is present in a regression analysis with time series data when $Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t$ and $\varepsilon_t = \rho\,\varepsilon_{t-1} + \nu_t$, where $\rho$ is the correlation between consecutive errors (Hanke and Wichern, 340).
$Y_t$: actual demand for period $t$
$\beta_0$: intercept coefficient
$\beta_1$: slope coefficient
$X_t$: regressor, in this case a second series or the variable $Y_t$ lagged $k$ times
$\varepsilon_t$: error at time $t$ for big samples or a population
$\nu_t$: independent error following a normal distribution with mean zero, $\nu_t \sim N(0, \sigma_\nu^2)$ (Hanke and Wichern, 340).
When the variance of the error term $u_i$ depends on $X_i$, the error term is said to be heteroskedastic. That is, the variance of the conditional distribution is not constant but increases or decreases with every observation $X_i$. In such circumstances, it becomes more difficult to conduct a test statistic without mathematically manipulating the error term. According to FIGURE 1, the error term is indeed heteroskedastic and will interfere with the Durbin-Watson test statistic. We will see why this is a problem later on when testing for autocorrelation.
FIGURE 1. Conditional distribution of the error term and Heteroskedasticity
The solution to this problem of auto-correlated errors, in the available data for e-commerce demand, is to take the correlation between two consecutive errors into the equation through generalized differences: $Y'_t = \beta_0(1-\rho) + \beta_1 X'_t + \nu_t$, where $Y'_t = Y_t - \rho Y_{t-1}$, $X'_t = X_t - \rho X_{t-1}$, and $\rho$ is the coefficient depicting the correlation between consecutive errors in the e-commerce demand forecast (Hanke and Wichern, 354).
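The generalized-differences equation can be derived in a few lines. The sketch below assumes the first-order error process $\varepsilon_t = \rho\,\varepsilon_{t-1} + \nu_t$, with $\rho$ denoting the correlation between consecutive errors, and writes $Y'_t = Y_t - \rho Y_{t-1}$ and $X'_t = X_t - \rho X_{t-1}$ for the transformed series.

```latex
\begin{align*}
Y_t - \rho Y_{t-1}
  &= (\beta_0 + \beta_1 X_t + \varepsilon_t)
     - \rho\,(\beta_0 + \beta_1 X_{t-1} + \varepsilon_{t-1}) \\
  &= \beta_0 (1 - \rho) + \beta_1 (X_t - \rho X_{t-1})
     + (\varepsilon_t - \rho\,\varepsilon_{t-1}) \\
Y'_t &= \beta_0 (1 - \rho) + \beta_1 X'_t + \nu_t
\end{align*}
```

The transformed error $\nu_t$ is independent by assumption, so ordinary least squares can be applied to the primed variables. In the special case $\rho = 1$ the intercept term vanishes and the transformation reduces to simple first differences.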
These are the three possible problems, with their corresponding solutions, for a regression analysis of time series data. All of them are relatively big challenges for the forecasting manager who uses this model, and the solutions, although available, are complicated in nature and sometimes unrealistic. However, when modeling data such as that available for e-commerce, this model might not be our choice but rather a last resort when all else has failed. In the first part of this paper we mentioned the difficulties, or rather the impracticality, of using a standard multiple regression analysis on exogenous regressors or independent variables. We have also established that standard time series models like moving averages, exponentially weighted moving averages, and even the (standard or additive) Holt-Winters Model[v] are not feasible for forecasting demand for e-commerce because of factors like the very large customer base, seasonal patterns changing simultaneously across regions, and rapidly changing customer buying criteria (Hanke and Wichern, 126-136). Therefore, given all these challenges, it is left up to me to prove that the most feasible model is a regression analysis on time series data. For this, an important step is to review the test statistics, check the degree of autocorrelation in the available e-commerce demand data, and hope that it passes the Durbin-Watson test[vi] (Hanke and Wichern, 344-347). There is one hiccup in this respect, though. I apologize for the suspense up until this moment; hopefully, dear reader, you have realized by now that we cannot do a regression analysis on time series data if there is no independent variable or exogenous regressor associated with or predicting the demand for e-commerce. We have already concluded that the use of qualitative methods to find relevant independent variables or regressors is very difficult or unrealistic.
Therefore, at this point, we are going to rule out the third model, regression with time series data using regressors. Not all is gloomy news, though. The description of the problems of autocorrelation in this last model, and of their solutions, has shed light on a model that might be our last hope in finding a solution to the challenges of forecasting demand for e-commerce. This final model is called an autoregressive model and is built on the idea that autocorrelation is not too bad after all and can be used as a predicting factor for this type of data. Therefore, the autocorrelation shown in Figures 2 and 4 will allow us to run an efficient autoregressive model that will predict future demand for e-commerce at the macro level, at least more efficiently than the rest of the models explained here.
DURBIN-WATSON HYPOTHESIS TEST (TESTING FOR AUTOCORRELATION)
H0: $\rho = 0$ (there is no autocorrelation)
H1: $\rho > 0$ (there is significant autocorrelation)
(Hanke and Wichern, 344-347)
One way to determine the type of autocorrelation is to calculate the Durbin-Watson test statistic and then find the upper and lower bounds in the DW test bounds table (Hanke and Wichern, 344-345).
To make things easier and more understandable, we are going to use the already calculated DW test statistic provided in Figure 2, EViews Regression Analysis on United States B2C e-commerce sales from 2002 to 2013 (in billions). Then, by looking at the Durbin-Watson test bounds table[vii], we see that for a sample size [n] of eleven observations and [k], lag 1, the upper bound is roughly [dU] = 1.36 and the lower bound is [dL] = 1.08.[viii] It is possible to have an inconclusive Durbin-Watson test for autocorrelation. If the DW statistic falls between the lower and upper bounds, as it does in this case, then we cannot conclude that there is indeed autocorrelation. The inability to obtain a conclusive DW test is mainly because the forecast exhibits a heteroskedastic error term. However, we can still test the lag 1 residual autocorrelation coefficient, $r_1(e)$, at the 5% significance level. This is our last resort to prove there is autocorrelation. If $r_1(e)$ falls within $0 \pm 2/\sqrt{n}$, we say there is no autocorrelation. Since the residual autocorrelation coefficient does not fall within this interval, we can safely conclude there is autocorrelation and can build our model on the data available for e-commerce sales from 2002 to 2013 (Hanke and Wichern, 344).
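The two checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the EViews computation behind Figure 2: the residuals are hypothetical numbers, the bounds 1.08 and 1.36 are the ones quoted in the text, and the band of two standard errors, $2/\sqrt{n}$, is assumed as the usual approximate 5% criterion for the lag 1 residual autocorrelation.

```python
import math

def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def dw_decision(dw, d_lower, d_upper):
    """One-sided test for positive lag 1 autocorrelation using table bounds."""
    if dw < d_lower:
        return "positive autocorrelation"
    if dw > d_upper:
        return "no autocorrelation"
    return "inconclusive"

def lag1_residual_autocorrelation(residuals):
    """r1(e) = sum of e_t * e_(t-1) / sum of e_t squared."""
    num = sum(residuals[t] * residuals[t - 1]
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def r1_is_significant(residuals):
    """True when r1(e) lies outside 0 +/- 2/sqrt(n) (assumed 5% band)."""
    n = len(residuals)
    return abs(lag1_residual_autocorrelation(residuals)) > 2 / math.sqrt(n)

# Hypothetical residuals with slowly changing sign (illustration only).
residuals = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]

dw = durbin_watson(residuals)
r1 = lag1_residual_autocorrelation(residuals)
print(round(dw, 4), dw_decision(dw, 1.08, 1.36))
print(round(r1, 4), r1_is_significant(residuals))
```

Residuals that keep the same sign for several periods give a small DW statistic and a large $r_1(e)$, which is the pattern that signals positive autocorrelation. A DW value between the two bounds would instead return "inconclusive", the case discussed in the text.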
FIGURE 2: EViews Regression Analysis on United States B2C e-commerce sales from 2002 to 2013 (in billions)[ix]
FIGURE 3: Actual vs. Forecast U.S. B2C Sales from 2002 to 2014 with Polynomial order 6 trend-lines.[x]
FIGURE 4: EViews Correlogram showing the autocorrelation between U.S. e-commerce sales and Lag_1 of the same data.[xi]
Figure 4 shows that there is indeed autocorrelation in the 2002 to 2014 sales data for the e-commerce industry in the U.S. We see that the first bar, depicting the autocorrelation between U.S_sales and U.S_sales_lag_1, is significant at the 95% confidence level. Therefore, our final autoregressive model would look like this:
$\hat{Y}_t = b_0 + b_1 Y_{t-1} = \$346.37$ (billions)
The error term does not appear in the forecast because, under the same ordinary least squares [OLS] principle as in a standard linear regression model, it is assumed to be 0 or asymptotically close to 0 (Hanke and Wichern, 357).
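A first-order autoregressive forecast of this kind can be sketched as an ordinary least squares regression of the series on its own lag 1 values. The series below is hypothetical and is generated so that it follows an exact lag 1 relationship; it is not the Statista sales data, and the coefficients 10 and 1.1 are arbitrary choices for illustration.

```python
def fit_ar1(series):
    """Least squares fit of Y_t = b0 + b1 * Y_(t-1); returns (b0, b1)."""
    x = series[:-1]   # lagged values Y_(t-1)
    y = series[1:]    # current values Y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def forecast_next(series, b0, b1):
    """One-step-ahead forecast; the error term is taken at its expected value of 0."""
    return b0 + b1 * series[-1]

# Hypothetical series built from Y_t = 10 + 1.1 * Y_(t-1), starting at 100.
series = [100.0]
for _ in range(4):
    series.append(10.0 + 1.1 * series[-1])

b0, b1 = fit_ar1(series)
next_value = forecast_next(series, b0, b1)
print(round(b0, 4), round(b1, 4), round(next_value, 3))
```

Because one observation is lost to the lag, a series of twelve annual values leaves eleven usable pairs, which matches the sample size used for the Durbin-Watson bounds earlier.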
TABLE 1: Actual vs. Forecast U.S. B2C Sales from 2002 to 2014 data, calculations and forecast accuracy measures.
Columns: Autoregressive Model Forecast; Abs. Value Error Term; Abs % Error
FORECAST ACCURACY MEASURES
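Mean absolute deviation = MAD = $\frac{1}{n}\sum_{t=1}^{n} |Y_t - \hat{Y}_t|$ (Hanke and Wichern, 82)
Mean absolute percentage error = MAPE = $\frac{1}{n}\sum_{t=1}^{n} \frac{|Y_t - \hat{Y}_t|}{|Y_t|}$ = 0.47 (Hanke and Wichern, 83)
Mean Square Error = MSE = $\frac{1}{n}\sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2$ = 1767.50 (Hanke and Wichern, 82)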
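The three accuracy measures can be computed with a short sketch like the one below. The actual and forecast values are hypothetical and chosen only so the arithmetic is easy to follow; they are not the figures from Table 1. Note that MAPE is returned here as a fraction, which is an assumption of this sketch.

```python
def mad(actual, forecast):
    """Mean absolute deviation: average of |Y_t - forecast_t|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    """Mean square error: average of squared forecast errors."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical values (illustration only).
actual = [100.0, 120.0, 150.0, 200.0]
forecast = [110.0, 115.0, 160.0, 180.0]

print(mad(actual, forecast))             # absolute errors 10, 5, 10, 20
print(round(mape(actual, forecast), 4))
print(mse(actual, forecast))             # squared errors 100, 25, 100, 400
```

Because MSE squares each error, the single error of 20 contributes 400 of the 625 total, which shows why this measure reacts much more strongly to large misses than MAD or MAPE do.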
Then, how accurate is this autoregressive model? The MSE tells us how large the error is in magnitude, and since it is squared this measure tends to be very big for larger samples. In this case, our sample is rather small, so it makes us wonder how good this model is. On the other hand, the mean absolute percentage error is quite small: less than 1. This is usually good news and often tempers the concern raised by the MSE. In my opinion, measures of bias tend to be more important for small samples than measures of magnitude, because we are not accounting for degrees of freedom as we do in standard linear regression models. However, as I have mentioned before, forecasts are only as good as the results they yield, and the criteria for evaluating the results depend on the data and the forecaster. We have already explained the characteristics and challenges that the demand for e-commerce presents in terms of data collection and reliability. Therefore, we expected a rather large forecast accuracy measure in terms of magnitude and prefer a small measure of bias. This is in fact what the three forecast accuracy measures are telling us about the autoregressive model of demand for e-commerce. Nonetheless, we cannot definitively assert how good the forecast really is in the long run, because of the short-term scope of this model. The autoregressive random process described and calculated here has the same short-term, or micro-focused, characteristic of the time series models that we have been trying to refute.
Finally, I would like to leave this discussion about the challenges that forecasting demand for e-commerce presents for decision makers in today’s globalized, sparse, and at the same time interconnected world with an open-ended question: Is there really an accurate way of predicting demand in a fast-paced, evolving, and innovative industry like e-commerce? “Accurate” is a rather broad term in forecasting, and probably every replenishment manager would quote or make up a different definition for it. Let us come back, one more time, to the quote at the beginning: “forecasts are almost always wrong, if not always” (Carroll). But truly, what makes a forecast right or wrong, less accurate, or closer to the forecaster’s or decision maker’s expectations is really the results; what might work for some people might not work for others. In an ideal world we would be able to find independent variables that predict e-commerce demand in an unbiased and accurate manner, but that is unlikely today. Yes, companies are achieving real breakthroughs in that area with CRMs and other marketing strategies, but that is not enough to achieve the real competitive advantage that they are looking for. Amazon looks promisingly close to that target, but there is a long stretch still ahead of it. Furthermore, the challenge is even bigger today with trends like omni-channel commerce. Disruption in social behavior and customer buying criteria means more competition as e-tailers fight to achieve higher market share in this stochastic business model.
Carroll, Matthew. Forecasting Revenue & Expenses for an E-Commerce Startup: Sales Build. 20 Feb 2015.
DeMers, Jayson. The Top 10 Benefits Of Social Media Marketing. 11 August 2014. 20 Feb 2015.
Forrester Consulting. Customer Desires Vs. Retailer Capabilities: Minding The Omni-Channel Commerce Gap. Forrester Research, Inc. accenture.com, January 2014.
Hanke, John E. and Dean W Wichern. Business Forecasting. Ed. Eric Svendsen. Ninth Edition. Upper Saddle River: Pearson Prentice Hall, n.d.
"Production and Inventory Management." American Production and Inventory Management 27.1-2 (1986): 95.
statista.com. Annual B2C e-commerce sales in the United States from 2002 to 2013 (in billion U.S. dollars). statista.com. 23 Feb 2015.
i I will use this term on occasion, interchangeably, both for companies that sell products solely via electronic transactions and for companies that practice omni-channel marketing and operations, like Walmart Inc. and its division Walmart.com.
ii URL’s to companies mentioned in this paper:
iii Multi-collinearity is a situation in which independent variables in a multiple regression model are highly correlated with each other. This will bias the forecast towards these inter-correlated variables and underestimate the rest of the regressors.
iv The correlation coefficient measures the strength of the correlation between two variables in a linear regression model, or among more variables in a multiple regression analysis (37).
v Alternatives to the Holt-Winters Model, like multiplicative components with different variability across the data series (167).
vi The Durbin-Watson test statistic is used to prove that positive lag_1 autocorrelation does not exist (343-344).
vii This table can be found in any Business Forecasting textbook. For this paper I used Business Forecasting by John E. Hanke and Dean W. Wichern; see Works Cited for more information.
viii Since the table only includes bounds for samples equal to or greater than 15, I will use the bounds for n=15.
ix Data source statista.com URL: http://www.statista.com/statistics/271449/annual-b2c-e-commerce-sales-in-the-united-states/
x This chart uses a naïve forecast for actual demand in 2014; for the purpose of measuring forecast accuracy, the actual sales volume for 2014 is set equal to the autoregressive forecast for that period.