Exploring Data

Learning Objectives

  • Understand how to visualize and explore financial time series data
  • Learn key R packages and functions for financial analysis
  • Develop intuition for patterns in financial data

Why This Matters

  • Good data exploration helps identify potential modeling approaches
  • Visual analysis reveals patterns that statistics might miss
  • Essential skill for any financial analyst

Outline

  • package tsfe and class datasets
  • ts objects and ts function
  • Visual trends and patterns
  • Distributional properties of asset returns
  • Time series decomposition

Rethinking visualisation

  • Charts are not meant merely to be seen; they are intended to be read
  • They are not just images but visual arguments.
  • Data doesn’t speak for itself.
  • Data visualisations need to be shown and explained

ts objects and ts function

A time series is stored in a ts object in R:

  • a list of numbers
  • information about the times at which those numbers were recorded.

Example

x <- c(123,39,78,52,110)
yr <- 2012:2016
knitr::kable(data.frame(Year=yr,Observation=x), booktabs=TRUE)
Year Observation
2012 123
2013 39
2014 78
2015 52
2016 110

ts objects and ts function

For observations that are more frequent than once per year, add a frequency argument.

E.g., monthly data stored as a numerical vector z:

y <- ts(z, frequency=12, start=c(2003, 1))

ts(data, frequency, start)

Type of data   frequency             start example
Annual         1                     1995
Quarterly      4                     c(1995,2)
Monthly        12                    c(1995,9)
Daily          7 or 365.25           1 or c(1995,234)
Weekly         52.18                 c(1995,23)
Hourly         24 or 168 or 8,766    1
Half-hourly    48 or 336 or 17,532   1
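
For example, a quarterly series starting in 1995 Q2 (here z is assumed to be a numeric vector of quarterly observations) would be stored as:

z_q <- ts(z, frequency = 4, start = c(1995, 2))  # quarterly data beginning in 1995 Q2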

Class package (pre-loaded in Q-RaP)

library(tidyverse); library(tidyquant); library(DT)

DT provides the datatable() function for interactive table visualisation

tidyquant loads:

  • tq_transmute function (for transforming between time frequencies)
  • tq_mutate function (create return series)

tidyverse loads many packages, including:

  • ggplot2 plotting package
  • dplyr data wrangling package
library(fpp2)

This loads:

  • forecast package (for forecasting functions)
  • fma package (for lots of time series data)
  • expsmooth package (for more time series data)

Some class data

# Good practice to clear all objects before loading data
rm(list = ls())
library(tsfe) # includes class datasets
print(data(package='tsfe'))

Programmatically accessing data from the web

  • Download data using tq_get
  • Create a monthly series using tq_transmute
# Download FTSE 100 index data starting from 2016
ftse <- tq_get("^FTSE", from="2016-01-01")

# Convert daily data to monthly frequency
# tq_transmute is used to transform the data while selecting specific columns
ftse_m <- tq_transmute(ftse,
                     select = adjusted,    # Select adjusted closing prices
                     mutate_fun = to.monthly)  # Convert to monthly frequency

# Create a time series object from the monthly data
# Start date is set to January 2016 with monthly frequency (freq=12)
ftse_m_ts <- ts(ftse_m$adjusted, 
                start=c(2016,1),  # Starting year and month
                freq=12)          # 12 periods per year (monthly data)
  • A ts object has built-in plotting functionality.
  • Quick visualisation of the trend in the monthly FTSE 100 index price:
autoplot(tsfe::ftse_m_ts) + theme_tq()

Is a table a good visualisation?

  • Quarterly earnings per share data for Carnival PLC.

  • Using DT to create an interactive table


tsfe::ni_hsales %>% datatable()

Rethinking visualisation using ggplot2

  • ggplot2 is based on The Grammar of Graphics, the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms—visual marks that represent data points.

Example: Exchange rate time series

tsfe::usuk_rate %>%  # Data
  ggplot(aes(x=date, y=price )) + # Coordinate system
  geom_line(colour="pink") # geom

Your turn

  • Use tq_get to download the CBOE VIX Index from 2016-01-01 using the symbol ^VIX
  • Create a time series object using the VIX index price.
  • Plot this daily VIX Index Price using autoplot.

Hint: daily financial time series do not have a regular frequency to supply to the ts() function, so leave this argument blank.
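
One possible sketch of a solution (assuming the ^VIX symbol is available from Yahoo Finance via tq_get):

# Download daily VIX index data from 2016 onwards
vix <- tq_get("^VIX", from = "2016-01-01")
# Daily financial data have no regular frequency, so the frequency argument is left blank
vix_ts <- ts(vix$adjusted)
# Plot the daily VIX index level
autoplot(vix_ts) + theme_tq()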

Distributional properties of asset returns

Why normal?

  • This is a plot of a normal distribution with mean equal to zero and variance equal to one.
require(grDevices) 
require(graphics)
ggplot(data = data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, 
                n = 101, 
                args = list(mean = 0, sd =1)
                ,lwd=1,colour="red") + 
  ylab("") +
  scale_y_continuous(breaks = NULL) +
  labs(title=expression(y[it] %~% N(0,1)),
       x=expression(y[it])) +
  theme(plot.title = element_text(hjust = 0.5,size=8))

Why normal?

  • Named after Carl Friedrich Gauss, the normal (or Gaussian) distribution is the most commonly used distributional assumption in statistical analysis.

This is because:

  • Easy to calculate with
  • Common in nature
  • A conservative assumption when only a mean and variance are known

Importance of simulation

The importance of simulation

  • Simulation of random variables is important in applied statistics for several reasons.
  • First, we use probability models to mimic variation in the world, and the tools of simulation can help us better understand how this variation plays out.

Patterns of randomness are notoriously contrary to normal human thinking—our brains don’t seem to be able to do a good job understanding that random swings will be present in the short term but average out in the long run—and in many cases simulation is a big help in training our intuitions about averages and variation. (Gelman et al. 2021)
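
As a small illustrative sketch of this point (the sample size and seed are arbitrary), the running mean of simulated draws swings around in the short run but settles close to the true mean in the long run:

set.seed(123)
draws <- rnorm(10000, mean = 0, sd = 1)           # simulate from a distribution with known mean 0
running_mean <- cumsum(draws) / seq_along(draws)  # average after each additional draw
c(after_10 = running_mean[10],
  after_100 = running_mean[100],
  after_10000 = running_mean[10000])              # noisy early on, close to 0 eventually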

The importance of simulation

  • Second, we can use simulation to approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures.

  • Third, regression models are not deterministic; they produce probabilistic predictions.

The importance of simulation

  • Simulation is the most convenient and general way to represent uncertainties in forecasts.

Throughout this course and in our practice, we use simulation for all these reasons;

  • In this lecture we introduce the basic ideas and the tools required to perform simulations in R.

  • Suppose your company is being IPO’d at a starting price of £40.

  • You want to know the future price of the stock in 200 days.

  • You have been told Monte Carlo simulation can help predict stock market futures

  • Can you create a number of possible future price paths and find average price in 200 days?

  • How many futures should you create?

  • What assumptions should we make about these futures?

Monte Carlo Simulation

  • The coding for simulation becomes cleaner if we express the steps for a single simulation in an R function.
# Function to simulate a single price path
path_sim <- function(){
    days <- 200                           # Simulation horizon
    
    # Generate daily returns with slight upward drift
    # mean=1.001 implies 0.1% average daily return
    # sd=0.005 represents daily volatility of 0.5%
    changes <- rnorm(days, mean=1.001, sd=0.005)
    
    # Calculate cumulative price path starting from £40
    # cumprod() calculates the cumulative product of the changes
    sample.path <- cumprod(c(40, changes))
    
    # Return the final price after 200 days
    closing.price <- sample.path[days+1]  # +1 because we add the opening price
    return(closing.price)
}

# Run the simulation 10,000 times to generate possible future prices
number_of_possible_futures <- 10000
mc.closing <- replicate(number_of_possible_futures, path_sim())

Visualizing Monte Carlo Results

Monte Carlo Simulation Results
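
A minimal sketch of how the simulated distribution might be visualised (using the mc.closing vector created above; the number of bins and the reference line are illustrative choices):

# Histogram of the 10,000 simulated closing prices after 200 days
data.frame(price = mc.closing) %>%
  ggplot(aes(x = price)) +
  geom_histogram(bins = 50, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = median(mc.closing), colour = "red") +  # median simulated price
  labs(title = "Simulated closing prices after 200 days",
       x = "Closing price (£)", y = "Count")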

Enhanced Monte Carlo Results

Key Improvements in Enhanced Simulation

  • More realistic assumptions:
    • Uses actual trading days (252) instead of calendar days
    • Creates fat tails using mixture of normal distributions
    • Annual parameters based on typical market behavior
    • Models more extreme market events
  • Uses log-returns instead of simple returns
  • Provides more comprehensive risk assessment
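
A minimal sketch of an enhanced simulation along these lines (the drift, volatility and mixture settings below are illustrative assumptions, not calibrated estimates):

# Function to simulate one price path using log-returns and a fat-tailed mixture
path_sim_enhanced <- function(){
  days  <- 252                        # actual trading days in a year
  mu    <- 0.07 / days                # assumed 7% annual drift, expressed per trading day
  sigma <- 0.20 / sqrt(days)          # assumed 20% annual volatility, expressed per trading day
  # Mixture of normals: a small share of high-volatility days creates fat tails
  extreme <- runif(days) < 0.05       # ~5% of days drawn from the wider distribution
  log_ret <- ifelse(extreme,
                    rnorm(days, mean = mu, sd = 4 * sigma),
                    rnorm(days, mean = mu, sd = sigma))
  # Closing price is the starting price times the exponential of the summed log-returns
  40 * exp(sum(log_ret))
}

mc.closing.enhanced <- replicate(10000, path_sim_enhanced())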

Summarising simulations

  • Simulations are a versatile way to summarise a probability model, predictions from a fitted regression or uncertainty about parameters of a fitted model (the probabilistic equivalent of estimates and standard errors)
  • One useful way to summarise the location of the distribution is the median; a useful summary of the variation is the median absolute deviation, rescaled so it is comparable to the standard deviation (mad sd).

Median statistics

We typically prefer median-based summaries because they are more computationally stable, and we rescale the median-based summary of variation as described above so as to be comparable to the standard deviation, which we already know how to interpret in usual statistical practice.

cat("median = ",median(mc.closing),
    "mad sd = ",mad(mc.closing),
    "mean = ",mean(mc.closing),
    "sd = ",sd(mc.closing))
median =  48.75303 mad sd =  3.441515 mean =  48.85153 sd =  3.466317

Why normal?

Practical reasons

  • Processes that produce normal distributions include:
    • Addition of many independent random variables
    • Products of small deviations
    • Logarithms of products
  • Many processes are approximately normal
  • Logarithms are just magnitudes
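
A small simulation sketch of the first of these points (the sample sizes are arbitrary): sums of many independent random variables look approximately normal, shown in the style of the density-versus-normal plots used later in this lecture.

set.seed(42)
# Each observation is the sum of 100 independent uniform random variables
sums <- replicate(10000, sum(runif(100)))
ggplot(data.frame(x = sums), aes(x)) +
  geom_density() +                                  # empirical density of the sums
  stat_function(fun = dnorm,                        # normal with matching mean and sd
                args = list(mean = mean(sums), sd = sd(sums)),
                colour = "red")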

Understanding Normal Distributions: Two Perspectives

The Ontological Perspective: “What Actually Exists”

  • Ontology asks: “What is the nature of reality?”
  • In financial markets, this means understanding how normal distributions naturally arise:

Key Points:

  • Many small, independent changes add up to normal distributions
  • Like many traders making small price impacts
  • The final shape emerges from natural processes
  • Similar to how water finds its level naturally

The Epistemological Perspective: “What Can We Know?”

  • Epistemology asks: “How do we know what we know?”
  • In finance, this means understanding our limitations in knowing market behavior:

Key Points:

  • When we only know the mean and variance:
    • Normal distribution is the most conservative assumption
    • “Conservative” means making the fewest additional assumptions
    • This is called the “maximum entropy” principle
  • Like betting on a horse race when you only know the average speed

Why These Perspectives Matter in Finance

  • Ontological (Reality):
    • Helps understand how market prices actually form
    • Explains why some patterns naturally emerge
    • Shows limits of our ability to predict
  • Epistemological (Knowledge):
    • Guides our modeling choices
    • Helps avoid overconfident predictions
    • Explains why we use normal distributions even when we know they’re not perfect

Why normal asset returns?

  • Financial time series processes considered in this course:
  • Return series
  • Financial statement information
  • Volatility processes
  • Extreme events
  • Multivariate series

Why assume normal asset returns?

  • Normality assumption allows for asset returns properties to be tractable.
  • Tractable mean and variance
    • They provide information about the long-term return and risk, respectively.
  • Tractable symmetrical properties
    • Symmetry has important implications in holding short and long financial positions in risk management.
  • Tractable kurtosis properties
    • Kurtosis is related to volatility forecasting, efficiency in estimation and tests, etc.

Simple model of asset returns

  • To engineer a statistical model of asset returns we need to make some assumptions about the data story or data generating process.

\[ \{R_{it}|i=1,\dots N;t=1,\dots,T\} \stackrel{i.i.d}\sim N (m_1,m_2) \]

  • A traditional assumption in financial analysis is that simple returns are independently and identically distributed (iid) as normal with a fixed mean \(m_1\) and variance \(m_2\).

Why normal asset returns?

  • The previous assumption is unrealistic in a number of ways:
  1. The lower bound of a simple return is -1, but a normal distribution has no lower bound.
  2. Multiperiod simple returns \(R_{it}[k]\) are not normally distributed, as they are products of one-period returns.
  3. Empirically, asset returns tend to have heavy tails or positive excess kurtosis.

Another model of asset returns

\[ \{r_{it}|i=1,\dots N ;t=1,\dots,T\} \stackrel{i.i.d}\sim N (\mu,\sigma^2) \]

  • Another common assumption is that log returns are iid as normal with mean \(\mu\) and variance \(\sigma^2\).
  • As the sum of a finite number of iid normal random variables is normal, \(r_t[k]\) is also normally distributed.
  • There is also no lower bound for \(r_t\)
  • However, the lognormal assumption is not consistent with positive excess kurtosis.

Are stock returns distributed normally?

  • The following code loads Glencore Plc asset prices from Yahoo Finance, converts daily adjusted prices to monthly log returns, and creates a monthly time series object.

Ugly code

glen <- tq_get("GLEN.L")
glen_m <- tq_transmute(glen,
                       select = adjusted,
                       mutate_fun = monthlyReturn,
                       type="log",
                       col_rename = "log_return")
glen_m_ts <- ts(glen_m$log_return,
                frequency=12,start=c(2011,5))

Is piping (%>%) code more readable?

  • literate programming
glen <- tq_get("GLEN.L")
glen_m <- glen %>%
  tq_transmute(select = adjusted,
               mutate_fun = monthlyReturn,
               type="log",
               col_rename = "log_return")
glen_m_ts <- glen_m$log_return %>%
  ts(frequency=12,start=c(2011,5))

Are stock returns distributed normally?

Visual Arguments

  • We can use ggplot to visualise the empirical distribution, superimposing what the returns would look like if they were normally distributed.

    • Only two parameters (a mean and a variance) are required to create a hypothetical normal distribution of a returns series

ggplot code

# Download Glencore stock data from Yahoo Finance
glen <- tq_get("GLEN.L")

# Calculate monthly log returns
glen_m <- glen %>%
    tq_transmute(
        select = adjusted,           # Select adjusted closing prices
        mutate_fun = monthlyReturn,  # Convert to monthly returns
        type = "log",               # Calculate log returns
        col_rename = "log_return"    # Name the new column
    )

# Create a time series object from the returns
# Starting from May 2011 with monthly frequency
glen_m_ts <- glen_m$log_return %>%
    ts(frequency=12, start=c(2011,5))

# Create density plot comparing returns to normal distribution
glen_m %>% 
    ggplot(aes(x=log_return)) +
    geom_density() +                # Plot the actual return distribution
    stat_function(                  # Add normal distribution line
        fun=dnorm,
        args=list(
            mean = mean(glen_m$log_return),     # Use actual mean
            sd=sd(glen_m$log_return)    # Use actual standard deviation
        ),
        col="red"
    )

Inference from the plot

  • What patterns are revealed?
  • The normal distribution (red) is superimposed over the empirical density of the monthly log returns.
  • Compared to the normal, the distribution of the returns has longer tails and a higher central peak.
  • In statistical terms we say the distribution is leptokurtic, or fat-tailed.

Are stock returns distributed normally?

Quantile-quantile plot
  • A quantile-quantile plot is a graphical tool to compare two distributions.
glen_m %>%
  ggplot(aes(sample=100*log_return)) +
  stat_qq(distribution = stats::qnorm) +
  stat_qq_line(distribution = stats::qnorm,
               colour="red") +
  labs(title = "Quantile-quantile plot of glencore stock returns")

Inference from plot

  • This plot compares the quantiles of a normal distribution (the straight red line) to the quantiles of the data (the points).
  • If the points lie close to the line, the data are probably normally distributed.
  • While the returns and the normal distribution are similar between +12.5% and -12.5%, outside these limits the returns behave non-normally.

Should we care if asset returns are normally distributed?

  • In regressions the assumption of normality of model errors is one of the least important
  • For the purpose of estimating the regression line it is barely important at all
  • Diagnosing the normality of errors is not recommended unless the model is being used to predict individual data points.

Is normality of errors important?

  • If the distribution of errors is of interest, perhaps because of predictive goals, this should be distinguished from the distribution of the data, \(y\).
  • A regression model does not assume or require that predictors are normally distributed
  • Furthermore, the normal distribution on the outcome refers to the regression errors, not the raw data.
  • Depending on the structure of the predictors, it is possible for data \(y\) to be far from normally distributed even when coming from a linear regression model.
  • See Gelman et al., (2020) Chapter 11 for more detail

Null hypothesis significant testing

  • Null hypothesis significance tests (NHST) are models.
  • We assume an underlying data story with distributional properties which then allows us to create p-values based on null hypothesis.
  • In practice they are often misused to create ‘bright line’ acceptance or rejection decisions about underlying theoretical questions.
  • In applied statistics their misuse is well understood

Read Bailey, D. H. & López de Prado, M., Finance is Not Excused: Why Finance Should Not Flout Basic Principles of Statistics, SSRN Electronic Journal (2021). doi:10.2139/ssrn.3895330

A NHST for normality

  • The Shapiro-Wilk test is a test of the null hypothesis that the data are normally distributed.
  • The test statistic is \(W\); values close to 1 are consistent with normality, smaller values indicate departures from it.
  • The test is sensitive to the tails of the distribution and is recommended for small samples.
shapiro.test(glen_m$log_return)

    Shapiro-Wilk normality test

data:  glen_m$log_return
W = 0.94734, p-value = 0.000121

Ljung-Box test

  • The Ljung-Box test is a test of the null hypothesis that the data are independently distributed, i.e. that there is no autocorrelation up to the chosen lag.
  • The test statistic is \(Q\); large values indicate autocorrelation in the series.
  • Here the test is applied to the first 5 lags of the monthly log returns.
Box.test(glen_m$log_return,lag = 5, type="Ljung-Box")

    Box-Ljung test

data:  glen_m$log_return
X-squared = 2.0245, df = 5, p-value = 0.8457

Are stock returns distributed normally?

Statistical inference

  • The two tests give different results!
  • Assuming the models underpinning the previous NHST are valid, we can reject the Null that Glencore monthly log returns are normally distributed at even the 1% significance level.

Practical inference

  • Assuming normality to model the middle of the data is probably ok
  • Modelling extreme observations may need richer distributional tools.

Heavy Tail Statistical Distributions

Student’s t
Very similar to the normal but wider and lower.
Stable
A generalisation of the normal; stable under addition, so it can be used with log returns; captures excess kurtosis well but has infinite variance, which conflicts with finance theory.
Scale mixture
A combination of a number of normals.
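
A minimal sketch comparing a Student's t density with the standard normal, in the style of the earlier dnorm plot (the 3 degrees of freedom are chosen purely for illustration):

ggplot(data.frame(x = c(-5, 5)), aes(x)) +
  stat_function(fun = dnorm, colour = "red") +                     # standard normal
  stat_function(fun = dt, args = list(df = 3), colour = "blue") +  # Student's t with 3 df
  labs(title = "Normal (red) vs Student's t with 3 df (blue)", y = "")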

Visual explorations

Rethinking time plots

  • One limitation with time plots is that the simple passage of time is not a good explanatory variable.
  • There are occasional exceptions where there is a clear mechanism driving the financial time series.

Descriptive chronology is not causal explanation - Edward Tufte (2015), The Visual Display of Quantitative Information, p. 37

Time plot

glen_m %>% ggplot(aes(x=date,y=log_return)) + 
  geom_line()

Seasonal plots

glen_m_ts %>% ggseasonplot(year.labels=TRUE,
                           year.labels.left=TRUE) + 
  ylab("") +
  ggtitle("Seasonal plot: Glencore returns")

Inference from the plot

  • Data plotted against the individual “seasons” in which the data were observed. (In this case a “season” is a month.)
    • Something like a time plot except that the data from each season are overlapped.
    • Enables the underlying seasonal pattern to be seen more clearly, and also allows any substantial departures from the seasonal pattern to be easily identified.
    • In R: ggseasonplot()

Polar plot

glen_m_ts %>% ggseasonplot(polar=TRUE) + ylab("")

Subseries plots

ggsubseriesplot(glen_m_ts) + ylab("") +
  ggtitle("Subseries plot: Glencore returns")
  • Data for each season collected together in time plot as separate time series.
  • Enables the underlying seasonal pattern to be seen clearly, and changes in seasonality over time to be visualized.
  • In R: ggsubseriesplot()

Seasonal or cyclic?

Time series patterns

Trend
pattern exists when there is a long-term increase or decrease in the data.
Seasonal
pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week).
Cyclic
pattern exists when data exhibit rises and falls that are not of fixed period (duration usually of at least 2 years).

Time series components

Differences between seasonal and cyclic patterns:

  • seasonal pattern constant length; cyclic pattern variable length
  • average length of cycle longer than length of seasonal pattern
  • magnitude of cycle more variable than magnitude of seasonal pattern

Time series patterns

ni_hsales_ts <- ts(tsfe::ni_hsales$`Total Verified Sales`, start = c(2005,1), frequency = 4)
autoplot(ni_hsales_ts) +
  ggtitle("Northern Ireland Quarter House Sales") +
  xlab("Year") + ylab("Total Verified Sales")

Time series patterns

autoplot(carnival_eps_ts) +
  ggtitle("Quarterly EPS for Carnival Plc") +
  xlab("Year") + ylab("")

Time series patterns

tsfe::usuk_rate %>%
  ggplot(aes(x=date,y=price)) +
  geom_line(colour="darkgreen") +
  labs(title=" Time Plot of GBP:USD",
       x="",
       y="Value of £1 in Dollars")

Seasonal or cyclic?

Time series patterns

  • seasonal pattern constant length; cyclic pattern variable length
  • average length of cycle longer than length of seasonal pattern
  • magnitude of cycle more variable than magnitude of seasonal pattern

The timing of peaks and troughs is predictable with seasonal data, but unpredictable in the long term with cyclic data.

Lag plots and autocorrelation

Example: Earnings per share

gglagplot(tsfe::carnival_eps_ts)

Lagged scatterplots

  • Each plot shows \(y_t\) plotted against \(y_{t-k}\) for different values of \(k\).
  • The autocorrelations are the correlations associated with these scatterplots.

Autocorrelation

  • One of the most important data properties that financial time series models exploit

  • Covariance and correlation: measure extent of linear relationship between two variables (\(y\) and \(X\)).

  • Autocovariance and autocorrelation: measure linear relationship between lagged values of a time series \(y\).

  • We measure the relationship between:

  • \(y_{t}\) and \(y_{t-1}\)

  • \(y_{t}\) and \(y_{t-2}\)

  • \(y_{t}\) and \(y_{t-3}\)

  • etc.

Autocorrelation

  • We denote the sample autocovariance at lag \(k\) by \(c_k\) and the sample autocorrelation at lag \(k\) by \(r_k\). Then define

\[c_k = \frac{1}{T}\sum_{t=k+1}^T (y_t-\bar{y})(y_{t-k}-\bar{y})\]

where \(\bar{y}\) is the sample mean of the \(y_t\).

\[r_{k} = c_k/c_0\]

  • \(r_1\) indicates how successive values of \(y\) relate to each other
  • \(r_2\) indicates how \(y\) values two periods apart relate to each other
  • \(r_k\) is almost the same as the sample correlation between \(y_t\) and \(y_{t-k}\).
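
A minimal sketch of these formulas in R, checking a hand-rolled \(r_1\) against the built-in acf() function (lag 1 is an arbitrary choice):

y <- as.numeric(tsfe::carnival_eps_ts)
n <- length(y); ybar <- mean(y); k <- 1
c0 <- sum((y - ybar)^2) / n                                   # sample autocovariance at lag 0
ck <- sum((y[(k + 1):n] - ybar) * (y[1:(n - k)] - ybar)) / n  # sample autocovariance at lag k
r1_manual <- ck / c0
r1_acf <- acf(y, lag.max = 1, plot = FALSE)$acf[2]            # first element is lag 0
c(manual = r1_manual, acf = r1_acf)                           # the two values should agree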

Autocorrelation

Results for first 9 lags for Carnival earnings data:

\(r_1\) \(r_2\) \(r_3\) \(r_4\) \(r_5\) \(r_6\) \(r_7\) \(r_8\) \(r_9\)
0.103 -0.109 0.069 0.839 0.051 -0.143 0.016 0.707 -0.021

Autocorrelation Function (ACF) plot

Results for first 9 lags for Carnival earnings data:

ggAcf(tsfe::carnival_eps_ts)

Autocorrelation inference

  • \(r_{1}\) is positive but small, indicating only weak correlation between successive values of the series.
  • \(r_{4}\) is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be 4 quarters apart and the troughs tend to be 4 quarters apart.
  • \(r_2\) is more negative than for the other lags because troughs tend to be 2 quarters behind peaks.
  • Together, the autocorrelations at lags 1, 2, \(\dots\), make up the autocorrelation function (ACF).
  • The plot is known as a correlogram.

ACF plots Trend and seasonality

  • When data have a trend, the autocorrelations for small lags tend to be large and positive.
  • When data are seasonal, the autocorrelations will be larger at the seasonal lags (i.e., at multiples of the seasonal frequency)
  • When data are trended and seasonal, you see a combination of these effects.

Any autocorrelation?

ggAcf(carnival_eps_ts)

Discussion

  • Time plot shows clear trend and seasonality.

  • The same features are reflected in the ACF.

    • The slowly decaying ACF indicates trend.
    • The ACF peaks at lags 4, 8, 12, 16, 20, \(\dots\), indicate seasonality of length 4.

Your turn

  • Use tq_get() to download the GLEN.L stock price from Yahoo Finance
  • Using this series, create a time series object (hint: use frequency=1)
  • Explore this time series for trends and autocorrelation

Signal and the noise

  • One important stylised fact of financial time series data is a low \(\frac{Signal}{Noise}\) ratio, which changes over time (it is dynamic).
  • The signal is the sense you want your model to capture and predict; the noise is the nonsense you do not want your model to capture, as it is unpredictable.

Noise makes trading in financial markets possible, and thus allows us to observe prices in financial assets. [But] noise also causes markets to be somewhat inefficient…. Most generally, noise makes it very difficult to test either practical or academic theories about the way that financial or economic markets work. We are forced to act largely in the dark - Fisher Black, Noise, Journal of Finance, 41(3), 1986, p. 529

Example: White noise

set.seed(1222)
wn <- ts(rnorm(36))
autoplot(wn)

Example: White noise

\(r_{1}\) \(r_{2}\) \(r_{3}\) \(r_{4}\) \(r_{5}\) \(r_{6}\) \(r_{7}\) \(r_{8}\) \(r_{9}\) \(r_{10}\)
0.132 -0.032 0.074 -0.098 0.043 0.339 0.151 0.097 -0.102 -0.073
  • Sample autocorrelations for white noise series

  • We expect each autocorrelation to be close to zero.

Sampling distribution of autocorrelations

Sampling distribution of \(r_k\) for white noise data is asymptotically N(0,\(1/T\)).

  • For a white noise series, we expect 95% of the \(r_k\) to lie within \(\pm 1.96/\sqrt{T}\).
  • If this is not the case, the series is probably not WN.
  • Common to plot lines at \(\pm 1.96/\sqrt{T}\) when plotting ACF.
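
A minimal sketch checking the white noise series wn (defined above) against these bounds:

n_obs <- length(wn)                                   # 36 observations
bound <- 1.96 / sqrt(n_obs)                           # approximate 95% limits for white noise
r_wn  <- acf(wn, lag.max = 10, plot = FALSE)$acf[-1]  # sample autocorrelations, lag 0 dropped
which(abs(r_wn) > bound)                              # lags breaching the bounds (few expected by chance)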

Your turn

  • You can compute the daily changes in the Google stock price using
dgoog <- diff(goog)
  • Does dgoog look like white noise?
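
One possible sketch (assuming goog holds the Google adjusted closing price downloaded with tq_get and stored as a ts object):

goog_raw <- tq_get("GOOG", from = "2016-01-01")  # assumed symbol and start date
goog  <- ts(goog_raw$adjusted)                   # daily prices, no regular frequency
dgoog <- diff(goog)                              # daily price changes
ggAcf(dgoog)                                     # white noise if all spikes lie within the bounds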

Financial Data Forecastability

  • Predictability of an event or quantity depends on several factors
  1. How well we understand the factors that contribute to it
  2. How much data are available
  3. Whether the forecast can affect the thing we are trying to forecast

Prediction and EMH

  • Crudely, the efficient market hypothesis (EMH) implies that returns from speculative assets are unforecastable
  • The overpowering logic of the EMH is:
    • If returns are forecastable, there should exist a money machine producing unlimited wealth
  • Based on random walk theory, changes in price are defined as white noise as follows: \[\begin{align*} p_t &= p_{t-1} + a_t \\ p_t-p_{t-1} &= a_t \end{align*}\] where \(a_t\) is a white noise error term.
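
A minimal sketch of this data story: a simulated random walk built by accumulating white noise shocks (the starting price and shock scale are arbitrary):

set.seed(1)
a <- rnorm(250, mean = 0, sd = 1)   # white noise shocks a_t
p <- 40 + cumsum(a)                 # p_t = p_{t-1} + a_t, starting from a price of 40
autoplot(ts(p)) + ylab("Simulated price")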

Prediction and EMH

  • High quality models identify the signal from the noise in financial data.
  • The signal is the regular pattern that is likely to repeat.
  • The noise is the irregular pattern which occurs by chance and is unlikely to repeat.
  • Overfitting or data snooping can result in your model capturing both signal and noise.
  • Overfitted models usually produce poor predictions and inferences.

Time Series Modelling

Time Series Modelling

  • Financial time series data often exhibit patterns, trends, and fluctuations
  • Appropriate modelling and processing techniques are required to extract meaningful insights
  • Two commonly used approaches:
    • ARIMA (Autoregressive Integrated Moving Average) modelling
    • Smoothing techniques

ARIMA Modelling

  • Class of statistical models widely used for time series forecasting and analysis
  • Combines autoregressive (AR) and moving average (MA) components
  • Handles non-stationarity through differencing

Key Aspects of ARIMA Modelling

  1. Stationarity
    • Assumes time series is stationary (constant mean, variance, and autocorrelation)
    • Differencing is applied to achieve stationarity if data is non-stationary
  2. Autocorrelation
    • Captures the data’s autocorrelation structure
    • Future values are influenced by past values and/or past errors
  3. Model Identification
    • Specified by three parameters: p (AR order), d (differencing degree), q (MA order)
    • Determined through iterative model identification, estimation, and diagnostic checking
  4. Forecasting
    • Generates forecasts for future periods once an appropriate ARIMA model is identified and estimated

Suitability of ARIMA Models

  • Capturing underlying patterns and dynamics of time series data
  • Handling trends, seasonality, and autocorrelation structures
  • Widely used in finance for forecasting stock prices, exchange rates, and economic indicators

Smoothing Techniques

  • Reduce noise or irregularities in time series data
  • Reveal the underlying trend or signal
  • Apply filters or weighted averages to smooth out fluctuations
  • Do not explicitly model the autocorrelation structure

Standard Smoothing Techniques

  1. Moving Averages (Simple, Exponential, Weighted)
  2. Savitzky-Golay Filter
  3. Lowess (Locally Weighted Scatterplot Smoothing)
  4. Kalman Filter

Suitability of Smoothing Techniques

  • Extracting the underlying trend or signal from noisy data
  • Preprocessing step before further analysis or visualization
  • Focus on denoising the data and revealing the underlying trend or signal

Choosing the Right Approach

  • Depends on the specific objectives and characteristics of the financial time series data
  • ARIMA models:
    • Forecasting future values
    • Accounting for autocorrelation and capturing underlying patterns
  • Smoothing techniques:
    • Denoising the data and revealing the underlying trend or signal

Combining Approaches

  • ARIMA modelling and smoothing techniques can be combined
  • Used in conjunction with other techniques (decomposition methods, machine learning algorithms)
  • Helps gain deeper insights into financial time series data

Financial time series smoothing

  • Financial time series data often exhibit noise, irregularities, and fluctuations
  • Smoothing techniques help reduce random variations and reveal underlying trends
  • This presentation explores various smoothing methods used in financial time series analysis
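
The smoothing examples that follow all operate on an object called aapl_prices. A minimal sketch of how it might be constructed (assuming Apple data is available from Yahoo Finance via tidyquant, and converting the adjusted close to an xts series):

library(xts)                                              # xts time series objects
aapl <- tq_get("AAPL", from = "2020-01-01")               # assumed start date
aapl_prices <- xts(aapl$adjusted, order.by = aapl$date)   # adjusted close as an xts series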

Simple Moving Average (SMA)

  • Basic smoothing technique that calculates the average of a fixed number of data points
  • Formula: \(SMA(t) = \frac{y(t) + y(t-1) + ... + y(t-n+1)}{n}\)
  • Widely used in technical analysis for identifying trends and generating trading signals
  • Advantages: Easy to understand and implement, effective for removing high-frequency noise
  • Limitations: Introduces a lag, sensitive to outliers, may distort underlying patterns

Simple Moving Average (SMA)

# Simple Moving Average (SMA)
sma_20 <- SMA(aapl_prices, n = 20)
# Plot SMA
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(sma_20), y = as.numeric(sma_20)), color = "red") +
  labs(title = "Simple Moving Average (SMA)",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "red"), labels = c("Price", "SMA")) +
  theme_minimal()

Exponential Moving Average (EMA)

  • Assigns exponentially decreasing weights to older data points
  • Formula: \(EMA(t) = \alpha \times y(t) + (1 - \alpha) \times EMA(t-1)\)
  • Commonly used in technical analysis, forecasting, and signal processing
  • Advantages: Responds quickly to changes, less lag than SMA, less sensitive to outliers
  • Limitations: Requires tuning the smoothing parameter (\(\alpha\))

Exponential Moving Average (EMA)

# Exponential Moving Average (EMA)
ema_20 <- EMA(aapl_prices, n = 20)
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(ema_20), y = as.numeric(ema_20)), color = "blue") +
  labs(title = "Exponential Moving Average (EMA)",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "blue"), labels = c("Price", "EMA")) +
  theme_minimal()

Weighted Moving Average (WMA)

  • Assigns different weights to data points within the window
  • Formula: \(WMA(t) = \frac{w_1 \times y(t) + w_2 \times y(t-1) + ... + w_n \times y(t-n+1)}{w_1 + w_2 + ... + w_n}\)
  • Used in technical analysis, signal processing, and trend analysis
  • Advantages: Allows for customized weighting schemes, can better capture underlying patterns
  • Limitations: Requires careful selection of weights, inappropriate weights can distort the series

Weighted Moving Average (WMA)

wma_custom <- WMA(aapl_prices, n = 5, wts = c(0.1, 0.2, 0.3, 0.2, 0.2))
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(wma_custom), y = as.numeric(wma_custom)), color = "green") +
  labs(title = "Weighted Moving Average (WMA)",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "green"), labels = c("Price", "WMA")) +
  theme_minimal()

Savitzky-Golay Filter

  • Performs local polynomial regression on a moving window of data points
  • Widely used in signal processing, spectroscopy, and financial time series analysis
  • Advantages: Preserves data features (peaks and valleys), handles noisy data effectively
  • Limitations: Computationally expensive, choice of polynomial order and window size affects results

Savitzky-Golay Filter

library(signal)  # provides sgolayfilt()
sg_filter <- sgolayfilt(as.numeric(aapl_prices), p = 3, n = 21)
# Plot Savitzky-Golay Filter
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(sg_filter)), color = "purple") +
  labs(title = "Savitzky-Golay Filter",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "purple"), labels = c("Price", "SG Filter")) +
  theme_minimal()

Lowess (Locally Weighted Scatterplot Smoothing)

  • Non-parametric regression technique that fits a low-degree polynomial to localized subsets of data
  • Useful for identifying non-linear trends and patterns in financial time series data
  • Advantages: Effective for handling non-linear relationships, robust to outliers, captures complex patterns
  • Limitations: Computationally intensive, sensitive to the choice of smoothing parameters

Lowess (Locally Weighted Scatterplot Smoothing)

# Lowess Smoothing
lowess_smooth <- lowess(as.numeric(aapl_prices))  # lowess() expects a plain numeric vector
# Plot Lowess Smoothing
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(lowess_smooth$y)), color = "orange") +
  labs(title = "Lowess Smoothing",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "orange"), labels = c("Price", "Lowess")) +
  theme_minimal()

Kalman Filter

  • Recursive algorithm that estimates the actual state of a dynamic system from noisy observations
  • Widely used in finance for portfolio optimization, risk management, and forecasting
  • Advantages: Optimal for linear systems with Gaussian noise, handles missing data, provides state estimates and uncertainties
  • Limitations: Assumes linear system model and Gaussian noise, performance degrades for non-linear or non-Gaussian systems

Kalman Filter

library(dlm)  # provides dlmSmooth() and dlmModPoly()

# Apply the Kalman smoother to the price series
# dlmModPoly(1) specifies a local level (random walk plus noise) model
# dV is the observation variance (noise in price measurements)
# dW is the system variance (noise in the underlying price evolution)
s <- dlmSmooth(aapl_prices, 
               dlmModPoly(1, 
                          dV = 15100,  # Observation variance
                          dW = 1470))  # System variance

# Create a data frame for plotting
# Combining original prices and Kalman filter estimates
data_df <- data.frame(
    Date = index(aapl_prices),                 # Time index
    Price = as.numeric(aapl_prices),           # Original prices
    Kalman = as.numeric(dropFirst(s$s))        # Smoothed estimates
)

# Create visualization comparing original prices to Kalman filter estimates
data_df |> 
    ggplot(aes(x = Date)) +
    # Plot original price series
    geom_line(aes(y = Price, color = "Price")) +
    # Plot Kalman filter estimates
    geom_line(aes(y = Kalman, color = "Kalman")) +
    # Add labels and formatting
    labs(
        title = "Apple Inc. (AAPL) Stock Prices",
        subtitle = "Comparison of raw prices vs Kalman filter smoothing",
        x = "Time",
        y = "Adjusted Close"
    ) +
    scale_color_manual(
        name = "Series", 
        values = c("Price" = "black", "Kalman" = "blue")
    ) +
    theme_minimal()

Quantile-Quantile Plot Creation

# Create Q-Q plot to compare returns distribution to normal distribution
glen_m %>%
    ggplot(aes(sample = 100 * log_return)) +  # Multiply by 100 to show as percentage
    # Add Q-Q points
    stat_qq(
        distribution = stats::qnorm,  # Compare against normal distribution
        color = "blue",
        alpha = 0.6
    ) +
    # Add reference line
    stat_qq_line(
        distribution = stats::qnorm,
        colour = "red"
    ) +
    # Add labels and formatting
    labs(
        title = "Q-Q Plot of Glencore Stock Returns",
        subtitle = "Comparing returns distribution to normal distribution",
        x = "Theoretical Quantiles",
        y = "Sample Quantiles (%)"
    ) +
    theme_minimal()

Seasonal Plot Creation

# Create seasonal plot to visualize patterns within years
glen_m_ts %>% 
    ggseasonplot(
        year.labels = TRUE,          # Show labels for each year
        year.labels.left = TRUE,     # Place year labels on left side
        col = rainbow(10)            # Use different color for each year
    ) + 
    # Add labels and formatting
    labs(
        title = "Seasonal Plot: Glencore Returns",
        subtitle = "Monthly patterns across different years",
        x = "Month",
        y = "Returns"
    ) +
    theme_minimal() +
    # Adjust theme for better readability
    theme(
        legend.position = "right",
        axis.text.x = element_text(angle = 45, hjust = 1)
    )

Understanding Autocorrelation

  • Autocorrelation measures how related a time series is with itself at different time lags
  • Think of it like comparing today’s price with:
    • Yesterday’s price (lag 1)
    • Last week’s price (lag 5)
    • Last month’s price (lag 20)
  • Values range from -1 to 1:
    • Close to 1: Strong positive relationship
    • Close to -1: Strong negative relationship
    • Close to 0: No relationship

What is White Noise?

  • White noise is a sequence of random values with:
    • Constant mean (usually 0)
    • Constant variance
    • No correlation between values at different times
  • Important because:
    • It’s the simplest type of random process
    • Used to model “pure randomness” in financial markets
    • Helps identify when a pattern is truly significant

Introduction to Kalman Filtering

  • Named after Rudolf Kálmán
  • Think of it like a GPS system:
    • Combines noisy measurements with predictions
    • Updates estimates as new data arrives
    • Gets better over time
  • In finance, used for:
    • Estimating true price trends
    • Filtering out market noise
    • State space modeling

Understanding Q-Q Plots

  • Q-Q (Quantile-Quantile) plots help assess if data follows a normal distribution
  • How to read them:
    • Points following the diagonal line → data is approximately normally distributed
    • Points below the line in the left tail and above it in the right tail → heavier tails than normal
    • The opposite pattern → lighter tails than normal
  • Why important in finance:
    • Many models assume normal distribution
    • Helps identify potential risk in extreme events
    • Guides choice of statistical methods

Seasonal Patterns in Financial Data

  • Many financial time series show recurring patterns
  • Common types:
    • Daily patterns (market open/close)
    • Weekly patterns (weekend effect)
    • Monthly patterns (end-of-month effect)
    • Yearly patterns (tax-loss selling)
  • Understanding seasonality helps:
    • Better timing of trades
    • Risk management
    • Performance attribution

Comparing Smoothing Techniques

Simple Methods

  • Moving Averages
    • Easy to understand
    • Lag in signals
    • Equal weights to all points

Advanced Methods

  • Kalman Filter
    • Adaptive to changes
    • Handles uncertainty
    • More complex to implement

When to Use What?

  • Simple MA: Quick trend analysis
  • EMA: More responsive trading signals
  • Kalman: Complex state estimation
  • Seasonal: Recurring patterns