Exploring Data

Learning Objectives

  • Understand how to visualize and explore financial time series data
  • Learn key R packages and functions for financial analysis
  • Develop intuition for patterns in financial data

Why This Matters

  • Good data exploration helps identify potential modeling approaches
  • Visual analysis reveals patterns that statistics might miss
  • Essential skill for any financial analyst

Outline

  • package tsfe and class datasets
  • ts objects and ts function
  • Visual trends and patterns
  • Distributional properties of asset returns
  • Time series decomposition

Rethinking visualisation

  • Charts are not meant merely to be seen; they are intended to be read
  • They are not just images but visual arguments.
  • Data doesn’t speak for itself.
  • Data visualisations need to be shown and explained

ts objects and ts function

A time series is stored in a ts object in R:

  • a list of numbers
  • information about the times at which those numbers were recorded.

Example

x <- c(123,39,78,52,110)
yr <- 2012:2016
knitr::kable(data.frame(Year=yr,Observation=x), booktabs=TRUE)
Year Observation
2012 123
2013 39
2014 78
2015 52
2016 110

ts objects and ts function

For observations that are more frequent than once per year, add a frequency argument.

E.g., monthly data stored as a numerical vector z:

y <- ts(z, frequency=12, start=c(2003, 1))

ts(data, frequency, start)

Type of data   frequency             start example
Annual         1                     1995
Quarterly      4                     c(1995,2)
Monthly        12                    c(1995,9)
Daily          7 or 365.25           1 or c(1995,234)
Weekly         52.18                 c(1995,23)
Hourly         24 or 168 or 8,766    1
Half-hourly    48 or 336 or 17,532   1
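
For example, a quarterly series starting in 1995 Q2 (here z is assumed to be a numeric vector of quarterly observations) would be stored as:

z_q <- ts(z, frequency = 4, start = c(1995, 2))  # quarterly data beginning in 1995 Q2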

Class package (pre-loaded in Q-RaP)

library(tidyverse); library(tidyquant); library(DT)

DT provides the datatable() function for interactive table visualisation

tidyquant loads:

  • tq_transmute function (for transforming between time frequencies)
  • tq_mutate function (create return series)

tidyverse loads many packages, including:

  • ggplot2 plotting package
  • dplyr data wrangling package
library(fpp2)

This loads:

  • forecast package (for forecasting functions)
  • fma package (for lots of time series data)
  • expsmooth package (for more time series data)

Some class data

# Good practice to clear all objects before loading data
rm(list = ls())
library(tsfe) # includes class datasets
print(data(package='tsfe'))

Programmatically accessing data from the web

  • Download data using tq_get
  • Create a monthly series using tq_transmute
# Download FTSE 100 index data starting from 2016
ftse <- tq_get("^FTSE", from="2016-01-01")

# Convert daily data to monthly frequency
# tq_transmute is used to transform the data while selecting specific columns
ftse_m <- tq_transmute(ftse,
                     select = adjusted,    # Select adjusted closing prices
                     mutate_fun = to.monthly)  # Convert to monthly frequency

# Create a time series object from the monthly data
# Start date is set to January 2016 with monthly frequency (freq=12)
ftse_m_ts <- ts(ftse_m$adjusted, 
                start=c(2016,1),  # Starting year and month
                freq=12)          # 12 periods per year (monthly data)
  • A ts object has built-in plotting functionality.
  • Quick visualisation of the trend in the monthly FTSE 100 index price:
autoplot(tsfe::ftse_m_ts) + theme_tq()

Is a table a good visualisation?

  • Quarterly earnings per share data for Carnival PLC.

  • Using DT to create an interactive table


tsfe::ni_hsales %>% datatable()

Rethinking visualisation using ggplot2

  • ggplot2 is based on The Grammar of Graphics, the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms—visual marks that represent data points.

Example: Exchange rate time series

tsfe::usuk_rate %>%  # Data
  ggplot(aes(x=date, y=price )) + # Coordinate system
  geom_line(colour="pink") # geom

Your turn

  • Use tq_get to download the CBOE VIX Index from 2016-01-01 using the symbol ^VIX
  • Create a time series object using the VIX index price.
  • Plot this daily VIX Index Price using autoplot.

Hint: daily financial time series do not have a regular frequency to supply to the ts() function, so leave this argument blank.
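
One possible sketch of a solution (assuming the ^VIX symbol is available from Yahoo Finance via tq_get):

# Download daily VIX index data from 2016 onwards
vix <- tq_get("^VIX", from = "2016-01-01")
# Daily financial data have no regular frequency, so the frequency argument is left blank
vix_ts <- ts(vix$adjusted)
# Plot the daily VIX index level
autoplot(vix_ts) + theme_tq()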

Distributional properties of asset returns

Why normal?

  • This is a plot of a normal distribution with mean equal to zero and variance equal to one.
require(grDevices) 
require(graphics)
ggplot(data = data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, 
                n = 101, 
                args = list(mean = 0, sd =1)
                ,lwd=1,colour="red") + 
  ylab("") +
  scale_y_continuous(breaks = NULL) +
  labs(title=expression(y[it] %~% N(0,1)),
       x=expression(y[it])) +
  theme(plot.title = element_text(hjust = 0.5,size=8))

Why normal?

  • Named after Carl Friedrich Gauss, the normal (or Gaussian) distribution is the most commonly used distributional assumption in statistical analysis.

This is because:

  • Easy to calculate with
  • Common in nature
  • A conservative assumption when only a mean and variance are known

Importance of simulation

The importance of simulation

  • Simulation of random variables is important in applied statistics for several reasons.
  • First, we use probability models to mimic variation in the world, and the tools of simulation can help us better understand how this variation plays out.

Patterns of randomness are notoriously contrary to normal human thinking—our brains don’t seem to be able to do a good job understanding that random swings will be present in the short term but average out in the long run—and in many cases simulation is a big help in training our intuitions about averages and variation. (Gelman et al. 2021)
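
As a small illustrative sketch of this point (the sample size and seed are arbitrary), the running mean of simulated draws swings around in the short run but settles close to the true mean in the long run:

set.seed(123)
draws <- rnorm(10000, mean = 0, sd = 1)           # simulate from a distribution with known mean 0
running_mean <- cumsum(draws) / seq_along(draws)  # average after each additional draw
c(after_10 = running_mean[10],
  after_100 = running_mean[100],
  after_10000 = running_mean[10000])              # noisy early on, close to 0 eventually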

The importance of simulation

  • Second, we can use simulation to approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures.

  • Third, regression models are not deterministic; they produce probabilistic predictions.

The importance of simulation

  • Simulation is the most convenient and general way to represent uncertainties in forecasts.

Throughout this course and in our practice, we use simulation for all these reasons;

  • In this lecture we introduce the basic ideas and the tools required to perform simulations in R.

  • Suppose your company is being IPO’d at a starting price of £40.

  • You want to know the future price of the stock in 200 days.

  • You have been told Monte Carlo simulation can help predict stock market futures

  • Can you create a number of possible future price paths and find average price in 200 days?

  • How many futures should you create?

  • What assumptions should we make about these futures?

Monte Carlo Simulation

  • The coding for simulation becomes cleaner if we express the steps for a single simulation in an R function.
# Function to simulate a single price path
path_sim <- function(){
    days <- 200                           # Simulation horizon
    
    # Generate daily returns with slight upward drift
    # mean=1.001 implies 0.1% average daily return
    # sd=0.005 represents daily volatility of 0.5%
    changes <- rnorm(days, mean=1.001, sd=0.005)
    
    # Calculate cumulative price path starting from £40
    # cumprod() calculates the cumulative product of the changes
    sample.path <- cumprod(c(40, changes))
    
    # Return the final price after 200 days
    closing.price <- sample.path[days+1]  # +1 because we add the opening price
    return(closing.price)
}

# Run the simulation 10,000 times to generate possible future prices
number_of_possible_futures <- 10000
mc.closing <- replicate(number_of_possible_futures, path_sim())

Visualizing Monte Carlo Results

Monte Carlo Simulation Results
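
A minimal sketch of how the simulated distribution might be visualised (using the mc.closing vector created above; the number of bins and the reference line are illustrative choices):

# Histogram of the 10,000 simulated closing prices after 200 days
data.frame(price = mc.closing) %>%
  ggplot(aes(x = price)) +
  geom_histogram(bins = 50, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = median(mc.closing), colour = "red") +  # median simulated price
  labs(title = "Simulated closing prices after 200 days",
       x = "Closing price (£)", y = "Count")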

Enhanced Monte Carlo Results

Key Improvements in Enhanced Simulation

  • More realistic assumptions:
    • Uses actual trading days (252) instead of calendar days
    • Creates fat tails using mixture of normal distributions
    • Annual parameters based on typical market behavior
    • Models more extreme market events
  • Uses log-returns instead of simple returns
  • Provides more comprehensive risk assessment
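
A minimal sketch of an enhanced simulation along these lines (the drift, volatility and mixture settings below are illustrative assumptions, not calibrated estimates):

# Function to simulate one price path using log-returns and a fat-tailed mixture
path_sim_enhanced <- function(){
  days  <- 252                        # actual trading days in a year
  mu    <- 0.07 / days                # assumed 7% annual drift, expressed per trading day
  sigma <- 0.20 / sqrt(days)          # assumed 20% annual volatility, expressed per trading day
  # Mixture of normals: a small share of high-volatility days creates fat tails
  extreme <- runif(days) < 0.05       # ~5% of days drawn from the wider distribution
  log_ret <- ifelse(extreme,
                    rnorm(days, mean = mu, sd = 4 * sigma),
                    rnorm(days, mean = mu, sd = sigma))
  # Closing price is the starting price times the exponential of the summed log-returns
  40 * exp(sum(log_ret))
}

mc.closing.enhanced <- replicate(10000, path_sim_enhanced())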

Summarising simulations

  • Simulations are a versatile way to summarise a probability model, predictions from a fitted regression or uncertainty about parameters of a fitted model (the probabilistic equivalent of estimates and standard errors)
  • One useful way to summarise the location of the distribution is the median; a useful summary of the variation is the median absolute deviation, rescaled so it is comparable to the standard deviation (mad sd).

Median statistics

We typically prefer median-based summaries because they are more computationally stable, and we rescale the median-based summary of variation as described above so as to be comparable to the standard deviation, which we already know how to interpret in usual statistical practice.

cat("median = ",median(mc.closing),
    "mad sd = ",mad(mc.closing),
    "mean = ",mean(mc.closing),
    "sd = ",sd(mc.closing))
median =  48.75303 mad sd =  3.441515 mean =  48.85153 sd =  3.466317

Why normal?

Practical reasons

  • Processes that produce normal distributions include:
    • Addition of many independent random variables
    • Products of small deviations
    • Logarithms of products
  • Many processes are approximately normal
  • Logarithms are just magnitudes
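
A small simulation sketch of the first of these points (the sample sizes are arbitrary): sums of many independent random variables look approximately normal, shown in the style of the density-versus-normal plots used later in this lecture.

set.seed(42)
# Each observation is the sum of 100 independent uniform random variables
sums <- replicate(10000, sum(runif(100)))
ggplot(data.frame(x = sums), aes(x)) +
  geom_density() +                                  # empirical density of the sums
  stat_function(fun = dnorm,                        # normal with matching mean and sd
                args = list(mean = mean(sums), sd = sd(sums)),
                colour = "red")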

Understanding Normal Distributions: Two Perspectives

The Ontological Perspective: “What Actually Exists”

  • Ontology asks: “What is the nature of reality?”
  • In financial markets, this means understanding how normal distributions naturally arise:

Key Points:

  • Many small, independent changes add up to normal distributions
  • Like many traders making small price impacts
  • The final shape emerges from natural processes
  • Similar to how water finds its level naturally

The Epistemological Perspective: “What Can We Know?”

  • Epistemology asks: “How do we know what we know?”
  • In finance, this means understanding our limitations in knowing market behavior:

Key Points:

  • When we only know the mean and variance:
    • Normal distribution is the most conservative assumption
    • “Conservative” means making the fewest additional assumptions
    • This is called the “maximum entropy” principle
  • Like betting on a horse race when you only know the average speed

Why These Perspectives Matter in Finance

  • Ontological (Reality):
    • Helps understand how market prices actually form
    • Explains why some patterns naturally emerge
    • Shows limits of our ability to predict
  • Epistemological (Knowledge):
    • Guides our modeling choices
    • Helps avoid overconfident predictions
    • Explains why we use normal distributions even when we know they’re not perfect

Why normal asset returns?

  • Financial time series processes considered in this course:
  • Return series
  • Financial statement information
  • Volatility processes
  • Extreme events
  • Multivariate series

Why assume normal asset returns?

  • Normality assumption allows for asset returns properties to be tractable.
  • Tractable mean and variance
    • They provide information about the long-term return and risk, respectively.
  • Tractable symmetrical properties
    • Symmetry has important implications in holding short and long financial positions in risk management.
  • Tractable kurtosis properties
    • Kurtosis is related to volatility forecasting, efficiency in estimation and tests, etc.

Simple model of asset returns

  • To engineer a statistical model of asset returns we need to make some assumptions about the data story or data generating process.

\[ \{R_{it}|i=1,\dots N;t=1,\dots,T\} \stackrel{i.i.d}\sim N (m_1,m_2) \]

  • A traditional assumption in financial analysis is that simple returns are independently and identically distributed (iid) as normal with a fixed mean \(m_1\) and variance \(m_2\).

Why normal asset returns?

  • The previous assumption is unrealistic in a number of ways:
  1. The lower bound of a simple return is -1, but a normal distribution has no lower bound.
  2. Multiperiod simple returns \(R_{it}[k]\) are not normally distributed, as they are products of one-period returns.
  3. Empirically, asset returns tend to have heavy tails or positive excess kurtosis.

Another model of asset returns

\[ \{r_{it}|i=1,\dots N ;t=1,\dots,T\} \stackrel{i.i.d}\sim N (\mu,\sigma^2) \]

  • Another common assumption is that log returns are iid as normal with mean \(\mu\) and variance \(\sigma^2\).
  • As the sum of a finite number of iid normal random variables is normal, \(r_t[k]\) is also normally distributed.
  • There is also no lower bound for \(r_t\)
  • However, the lognormal assumption is not consistent with positive excess kurtosis.

Are stock returns distributed normally?

  • The following code loads Glencore Plc asset prices from Yahoo Finance, converts daily adjusted prices to monthly log returns, and creates a monthly time series object.

Ugly code

glen <- tq_get("GLEN.L")
glen_m <- tq_transmute(glen,
                       select = adjusted,
                       mutate_fun = monthlyReturn,
                       type="log",
                       col_rename = "log_return")
glen_m_ts <- ts(glen_m$log_return,
                frequency=12,start=c(2011,5))

Is piping (%>%) code more readable?

  • literate programming
glen <- tq_get("GLEN.L")
glen_m <- glen %>%
  tq_transmute(select = adjusted,
               mutate_fun = monthlyReturn,
               type="log",
               col_rename = "log_return")
glen_m_ts <- glen_m$log_return %>%
  ts(frequency=12,start=c(2011,5))

Are stock returns distributed normally?

Visual Arguments

  • We can use ggplot to visualise the empirical distribution, superimposing what the returns would look like if they were normally distributed.

    • Only two parameters (a mean and a variance) are required to create a hypothetical normal distribution of a returns series

ggplot code

# Download Glencore stock data from Yahoo Finance
glen <- tq_get("GLEN.L")

# Calculate monthly log returns
glen_m <- glen %>%
    tq_transmute(
        select = adjusted,           # Select adjusted closing prices
        mutate_fun = monthlyReturn,  # Convert to monthly returns
        type = "log",               # Calculate log returns
        col_rename = "log_return"    # Name the new column
    )

# Create a time series object from the returns
# Starting from May 2011 with monthly frequency
glen_m_ts <- glen_m$log_return %>%
    ts(frequency=12, start=c(2011,5))

# Create density plot comparing returns to normal distribution
glen_m %>% 
    ggplot(aes(x=log_return)) +
    geom_density() +                # Plot the actual return distribution
    stat_function(                  # Add normal distribution line
        fun=dnorm,
        args=list(
            mean = mean(glen_m$log_return),     # Use actual mean
            sd=sd(glen_m$log_return)    # Use actual standard deviation
        ),
        col="red"
    )

Inference from the plot

  • What patterns are revealed?
  • The normal distribution (red) is superimposed over the empirical density of the monthly log returns.
  • Compared to the normal, the distribution of the returns has longer tails and a higher central peak.
  • In statistical terms we say the distribution is leptokurtic, or fat-tailed.

Are stock returns distributed normally?

Quantile-quantile plot
  • A quantile-quantile plot is a graphical tool to compare two distributions.
glen_m %>%
  ggplot(aes(sample=100*log_return)) +
  stat_qq(distribution = stats::qnorm) +
  stat_qq_line(distribution = stats::qnorm,
               colour="red") +
  labs(title = "Quantile-quantile plot of glencore stock returns")

Inference from plot

  • This plot compares the quantiles of a normal distribution (the straight red line) to the quantiles of the data (the points).
  • If the points lie close to the line, the data are probably normally distributed.
  • While the returns and the normal distribution are similar between +12.5% and -12.5%, outside these limits the returns behave non-normally.

Should we care if asset returns are normally distributed?

  • In regressions the assumption of normality of model errors is one of the least important
  • For the purpose of estimating the regression line it is barely important at all
  • Diagnosing the normality of errors is not recommended unless the model is being used to predict individual data points.

Is normality of errors important?

  • If the distribution of errors is of interest, perhaps because of predictive goals, this should be distinguished from the distribution of the data, \(y\).
  • A regression model does not assume or require that predictors are normally distributed
  • Furthermore, the normal distribution on the outcome refers to the regression errors, not the raw data.
  • Depending on the structure of the predictors, it is possible for data \(y\) to be far from normally distributed even when coming from a linear regression model.
  • See Gelman et al., (2020) Chapter 11 for more detail

Null hypothesis significant testing

  • Null hypothesis significance tests (NHST) are models.
  • We assume an underlying data story with distributional properties which then allows us to create p-values based on null hypothesis.
  • In practice they are often misused to create ‘bright line’ acceptance or rejection decisions about underlying theoretical questions.
  • In applied statistics their misuse is well understood

Read Bailey, D. H. & López de Prado, M., Finance is Not Excused: Why Finance Should Not Flout Basic Principles of Statistics, SSRN Electronic Journal (2021). doi:10.2139/ssrn.3895330

A NHST for normality

  • The Shapiro-Wilk test is a test of the null hypothesis that the data are normally distributed.
  • The test statistic is \(W\); values close to 1 are consistent with normality, smaller values indicate departures from it.
  • The test is sensitive to the tails of the distribution and is recommended for small samples.
shapiro.test(glen_m$log_return)

    Shapiro-Wilk normality test

data:  glen_m$log_return
W = 0.94734, p-value = 0.000121

Ljung-Box test

  • The Ljung-Box test is a test of the null hypothesis that the data are independently distributed, i.e. that there is no autocorrelation up to the chosen lag.
  • The test statistic is \(Q\); large values indicate autocorrelation in the series.
  • Here the test is applied to the first 5 lags of the monthly log returns.
Box.test(glen_m$log_return,lag = 5, type="Ljung-Box")

    Box-Ljung test

data:  glen_m$log_return
X-squared = 2.0245, df = 5, p-value = 0.8457

Are stock returns distributed normally?

Statistical inference

  • The two tests give different results!
  • Assuming the models underpinning the previous NHST are valid, we can reject the Null that Glencore monthly log returns are normally distributed at even the 1% significance level.

Practical inference

  • Assuming normality to model the middle of the data is probably ok
  • Modelling extreme observations may need richer distributional tools.

Heavy Tail Statistical Distributions

Student’s t
Very similar to the normal but wider and lower.
Stable
A generalisation of the normal; stable under addition, so it can be used with log returns; captures excess kurtosis well but has infinite variance, which conflicts with finance theory.
Scale mixture
A combination of a number of normals.
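
A minimal sketch comparing a Student's t density with the standard normal, in the style of the earlier dnorm plot (the 3 degrees of freedom are chosen purely for illustration):

ggplot(data.frame(x = c(-5, 5)), aes(x)) +
  stat_function(fun = dnorm, colour = "red") +                     # standard normal
  stat_function(fun = dt, args = list(df = 3), colour = "blue") +  # Student's t with 3 df
  labs(title = "Normal (red) vs Student's t with 3 df (blue)", y = "")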

Visual explorations

Rethinking time plots

  • One limitation with time plots is that the simple passage of time is not a good explanatory variable.
  • There are occasional exceptions where there is a clear mechanism driving the financial time series.

Descriptive chronology is not causal explanation - Edward Tufte (2015), The Visual Display of Quantitative Information, p. 37

Time plot

glen_m %>% ggplot(aes(x=date,y=log_return)) + 
  geom_line()

Seasonal plots

glen_m_ts %>% ggseasonplot(year.labels=TRUE,
                           year.labels.left=TRUE) + 
  ylab("") +
  ggtitle("Seasonal plot: Glencore returns")

Inference from the plot

  • Data plotted against the individual “seasons” in which the data were observed. (In this case a “season” is a month.)
    • Something like a time plot except that the data from each season are overlapped.
    • Enables the underlying seasonal pattern to be seen more clearly, and also allows any substantial departures from the seasonal pattern to be easily identified.
    • In R: ggseasonplot()

Polar plot

glen_m_ts %>% ggseasonplot(polar=TRUE) + ylab("")

Subseries plots

ggsubseriesplot(glen_m_ts) + ylab("") +
  ggtitle("Subseries plot: Glencore returns")
  • Data for each season collected together in time plot as separate time series.
  • Enables the underlying seasonal pattern to be seen clearly, and changes in seasonality over time to be visualized.
  • In R: ggsubseriesplot()

Seasonal or cyclic?

Time series patterns

Trend
pattern exists when there is a long-term increase or decrease in the data.
Seasonal
pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week).
Cyclic
pattern exists when data exhibit rises and falls that are not of fixed period (duration usually of at least 2 years).

Time series components

Differences between seasonal and cyclic patterns:

  • seasonal pattern constant length; cyclic pattern variable length
  • average length of cycle longer than length of seasonal pattern
  • magnitude of cycle more variable than magnitude of seasonal pattern

Time series patterns

ni_hsales_ts <- ts(tsfe::ni_hsales$`Total Verified Sales`, start = c(2005,1), frequency = 4)
autoplot(ni_hsales_ts) +
  ggtitle("Northern Ireland Quarter House Sales") +
  xlab("Year") + ylab("Total Verified Sales")

Time series patterns

autoplot(carnival_eps_ts) +
  ggtitle("Quarterly EPS for Carnival Plc") +
  xlab("Year") + ylab("")

Time series patterns

tsfe::usuk_rate %>%
  ggplot(aes(x=date,y=price)) +
  geom_line(colour="darkgreen") +
  labs(title=" Time Plot of GBP:USD",
       x="",
       y="Value of £1 in Dollars")

Seasonal or cyclic?

Time series patterns

  • seasonal pattern constant length; cyclic pattern variable length
  • average length of cycle longer than length of seasonal pattern
  • magnitude of cycle more variable than magnitude of seasonal pattern

The timing of peaks and troughs is predictable with seasonal data, but unpredictable in the long term with cyclic data.

Lag plots and autocorrelation

Example: Earnings per share

gglagplot(tsfe::carnival_eps_ts)

Lagged scatterplots

  • Each plot shows \(y_t\) plotted against \(y_{t-k}\) for different values of \(k\).
  • The autocorrelations are the correlations associated with these scatterplots.

Autocorrelation

  • One of the most important data properties that financial time series models exploit

  • Covariance and correlation: measure extent of linear relationship between two variables (\(y\) and \(X\)).

  • Autocovariance and autocorrelation: measure linear relationship between lagged values of a time series \(y\).

  • We measure the relationship between:

  • \(y_{t}\) and \(y_{t-1}\)

  • \(y_{t}\) and \(y_{t-2}\)

  • \(y_{t}\) and \(y_{t-3}\)

  • etc.

Autocorrelation

  • We denote the sample autocovariance at lag \(k\) by \(c_k\) and the sample autocorrelation at lag \(k\) by \(r_k\). Then define

\[c_k = \frac{1}{T}\sum_{t=k+1}^T (y_t-\bar{y})(y_{t-k}-\bar{y})\]

where \(\bar{y}\) is the sample mean of the \(y_t\).

\[r_{k} = c_k/c_0\]

  • \(r_1\) indicates how successive values of \(y\) relate to each other
  • \(r_2\) indicates how \(y\) values two periods apart relate to each other
  • \(r_k\) is almost the same as the sample correlation between \(y_t\) and \(y_{t-k}\).
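
A minimal sketch of these formulas in R, checking a hand-rolled \(r_1\) against the built-in acf() function (lag 1 is an arbitrary choice):

y <- as.numeric(tsfe::carnival_eps_ts)
n <- length(y); ybar <- mean(y); k <- 1
c0 <- sum((y - ybar)^2) / n                                   # sample autocovariance at lag 0
ck <- sum((y[(k + 1):n] - ybar) * (y[1:(n - k)] - ybar)) / n  # sample autocovariance at lag k
r1_manual <- ck / c0
r1_acf <- acf(y, lag.max = 1, plot = FALSE)$acf[2]            # first element is lag 0
c(manual = r1_manual, acf = r1_acf)                           # the two values should agree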

Autocorrelation

Results for first 9 lags for Carnival earnings data:

\(r_1\) \(r_2\) \(r_3\) \(r_4\) \(r_5\) \(r_6\) \(r_7\) \(r_8\) \(r_9\)
0.103 -0.109 0.069 0.839 0.051 -0.143 0.016 0.707 -0.021

Autocorrelation Function (ACF) plot

Results for first 9 lags for Carnival earnings data:

ggAcf(tsfe::carnival_eps_ts)

Autocorrelation inference

  • \(r_{1}\) is positive but small, indicating only weak correlation between successive values of the series.
  • \(r_{4}\) is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be 4 quarters apart and the troughs tend to be 4 quarters apart.
  • \(r_2\) is more negative than for the other lags because troughs tend to be 2 quarters behind peaks.
  • Together, the autocorrelations at lags 1, 2, \(\dots\), make up the autocorrelation function (ACF).
  • The plot is known as a correlogram.

ACF plots Trend and seasonality

  • When data have a trend, the autocorrelations for small lags tend to be large and positive.
  • When data are seasonal, the autocorrelations will be larger at the seasonal lags (i.e., at multiples of the seasonal frequency)
  • When data are trended and seasonal, you see a combination of these effects.

Any autocorrelation?

ggAcf(carnival_eps_ts)

Discussion

  • Time plot shows clear trend and seasonality.

  • The same features are reflected in the ACF.

    • The slowly decaying ACF indicates trend.
    • The ACF peaks at lags 4, 8, 12, 16, 20, \(\dots\), indicate seasonality of length 4.

Your turn

  • Use tq_get() to download the GLEN.L stock price from Yahoo Finance
  • Using this series, create a time series object (hint: use frequency=1)
  • Explore this time series for trends and autocorrelation

Signal and the noise

  • One important stylised fact of financial time series data is a low \(\frac{Signal}{Noise}\) ratio, which changes over time (it is dynamic).
  • The signal is the sense you want your model to capture and predict; the noise is the nonsense you do not want your model to capture, as it is unpredictable.

Noise makes trading in financial markets possible, and thus allows us to observe prices in financial assets. [But] noise also causes markets to be somewhat inefficient…. Most generally, noise makes it very difficult to test either practical or academic theories about the way that financial or economic markets work. We are forced to act largely in the dark - Fisher Black, Noise, Journal of Finance, 41(3), 1986, p. 529

Example: White noise

set.seed(1222)
wn <- ts(rnorm(36))
autoplot(wn)

Example: White noise

\(r_{1}\) \(r_{2}\) \(r_{3}\) \(r_{4}\) \(r_{5}\) \(r_{6}\) \(r_{7}\) \(r_{8}\) \(r_{9}\) \(r_{10}\)
0.132 -0.032 0.074 -0.098 0.043 0.339 0.151 0.097 -0.102 -0.073
  • Sample autocorrelations for white noise series

  • We expect each autocorrelation to be close to zero.

Sampling distribution of autocorrelations

Sampling distribution of \(r_k\) for white noise data is asymptotically N(0,\(1/T\)).

  • For a white noise series, we expect 95% of the \(r_k\) to lie within \(\pm 1.96/\sqrt{T}\).
  • If this is not the case, the series is probably not WN.
  • Common to plot lines at \(\pm 1.96/\sqrt{T}\) when plotting ACF.
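
A minimal sketch checking the white noise series wn (defined above) against these bounds:

n_obs <- length(wn)                                   # 36 observations
bound <- 1.96 / sqrt(n_obs)                           # approximate 95% limits for white noise
r_wn  <- acf(wn, lag.max = 10, plot = FALSE)$acf[-1]  # sample autocorrelations, lag 0 dropped
which(abs(r_wn) > bound)                              # lags breaching the bounds (few expected by chance)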

Your turn

  • You can compute the daily changes in the Google stock price using
dgoog <- diff(goog)
  • Does dgoog look like white noise?
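
One possible sketch (assuming goog holds the Google adjusted closing price downloaded with tq_get and stored as a ts object):

goog_raw <- tq_get("GOOG", from = "2016-01-01")  # assumed symbol and start date
goog  <- ts(goog_raw$adjusted)                   # daily prices, no regular frequency
dgoog <- diff(goog)                              # daily price changes
ggAcf(dgoog)                                     # white noise if all spikes lie within the bounds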

Financial Data Forecastability

  • Predictability of an event or quantity depends on several factors
  1. How well we understand the factors that contribute to it
  2. How much data are available
  3. Whether the forecast can affect the thing we are trying to forecast

Prediction and EMH

  • Crudely, the efficient market hypothesis (EMH) implies that returns from speculative assets are unforecastable
  • The overpowering logic of the EMH is:
    • If returns are forecastable, there should exist a money machine producing unlimited wealth
  • Based on random walk theory, changes in price are defined as white noise as follows: \[\begin{align*} p_t &= p_{t-1} + a_t \\ p_t-p_{t-1} &= a_t \end{align*}\] where \(a_t\) is a white noise error term.
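
A minimal sketch of this data story: a simulated random walk built by accumulating white noise shocks (the starting price and shock scale are arbitrary):

set.seed(1)
a <- rnorm(250, mean = 0, sd = 1)   # white noise shocks a_t
p <- 40 + cumsum(a)                 # p_t = p_{t-1} + a_t, starting from a price of 40
autoplot(ts(p)) + ylab("Simulated price")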

Prediction and EMH

  • High quality models identify the signal from the noise in financial data.
  • The signal is the regular pattern that is likely to repeat.
  • The noise is the irregular pattern which occurs by chance and is unlikely to repeat.
  • Overfitting or data snooping can result in your model capturing both signal and noise.
  • Overfitted models usually produce poor predictions and inferences.

Time Series Modelling

Time Series Modelling

  • Financial time series data often exhibit patterns, trends, and fluctuations
  • Appropriate modelling and processing techniques are required to extract meaningful insights
  • Two commonly used approaches:
    • ARIMA (Autoregressive Integrated Moving Average) modelling
    • Smoothing techniques

ARIMA Modelling

  • Class of statistical models widely used for time series forecasting and analysis
  • Combines autoregressive (AR) and moving average (MA) components
  • Handles non-stationarity through differencing

Key Aspects of ARIMA Modelling

  1. Stationarity
    • Assumes time series is stationary (constant mean, variance, and autocorrelation)
    • Differencing is applied to achieve stationarity if data is non-stationary
  2. Autocorrelation
    • Captures the data’s autocorrelation structure
    • Future values are influenced by past values and/or past errors
  3. Model Identification
    • Specified by three parameters: p (AR order), d (differencing degree), q (MA order)
    • Determined through iterative model identification, estimation, and diagnostic checking
  4. Forecasting
    • Generates forecasts for future periods once an appropriate ARIMA model is identified and estimated

Suitability of ARIMA Models

  • Capturing underlying patterns and dynamics of time series data
  • Handling trends, seasonality, and autocorrelation structures
  • Widely used in finance for forecasting stock prices, exchange rates, and economic indicators

Smoothing Techniques

  • Reduce noise or irregularities in time series data
  • Reveal the underlying trend or signal
  • Apply filters or weighted averages to smooth out fluctuations
  • Do not explicitly model the autocorrelation structure

Standard Smoothing Techniques

  1. Moving Averages (Simple, Exponential, Weighted)
  2. Savitzky-Golay Filter
  3. Lowess (Locally Weighted Scatterplot Smoothing)
  4. Kalman Filter

Suitability of Smoothing Techniques

  • Extracting the underlying trend or signal from noisy data
  • Preprocessing step before further analysis or visualization
  • Focus on denoising the data and revealing the underlying trend or signal

Choosing the Right Approach

  • Depends on the specific objectives and characteristics of the financial time series data
  • ARIMA models:
    • Forecasting future values
    • Accounting for autocorrelation and capturing underlying patterns
  • Smoothing techniques:
    • Denoising the data and revealing the underlying trend or signal

Combining Approaches

  • ARIMA modelling and smoothing techniques can be combined
  • Used in conjunction with other techniques (decomposition methods, machine learning algorithms)
  • Helps gain deeper insights into financial time series data

Financial time series smoothing

  • Financial time series data often exhibit noise, irregularities, and fluctuations
  • Smoothing techniques help reduce random variations and reveal underlying trends
  • This presentation explores various smoothing methods used in financial time series analysis
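
The smoothing examples that follow all operate on an object called aapl_prices. A minimal sketch of how it might be constructed (assuming Apple data is available from Yahoo Finance via tidyquant, and converting the adjusted close to an xts series):

library(xts)                                              # xts time series objects
aapl <- tq_get("AAPL", from = "2020-01-01")               # assumed start date
aapl_prices <- xts(aapl$adjusted, order.by = aapl$date)   # adjusted close as an xts series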

Simple Moving Average (SMA)

  • Basic smoothing technique that calculates the average of a fixed number of data points
  • Formula: \(SMA(t) = \frac{y(t) + y(t-1) + ... + y(t-n+1)}{n}\)
  • Widely used in technical analysis for identifying trends and generating trading signals
  • Advantages: Easy to understand and implement, effective for removing high-frequency noise
  • Limitations: Introduces a lag, sensitive to outliers, may distort underlying patterns

Simple Moving Average (SMA)

# Simple Moving Average (SMA)
sma_20 <- SMA(aapl_prices, n = 20)
# Plot SMA
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(sma_20), y = as.numeric(sma_20)), color = "red") +
  labs(title = "Simple Moving Average (SMA)",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "red"), labels = c("Price", "SMA")) +
  theme_minimal()

Exponential Moving Average (EMA)

  • Assigns exponentially decreasing weights to older data points
  • Formula: \(EMA(t) = \alpha \times y(t) + (1 - \alpha) \times EMA(t-1)\)
  • Commonly used in technical analysis, forecasting, and signal processing
  • Advantages: Responds quickly to changes, less lag than SMA, less sensitive to outliers
  • Limitations: Requires tuning the smoothing parameter (\(\alpha\))

Exponential Moving Average (EMA)

# Exponential Moving Average (EMA)
ema_20 <- EMA(aapl_prices, n = 20)
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(ema_20), y = as.numeric(ema_20)), color = "blue") +
  labs(title = "Exponential Moving Average (EMA)",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "blue"), labels = c("Price", "EMA")) +
  theme_minimal()

Weighted Moving Average (WMA)

  • Assigns different weights to data points within the window
  • Formula: \(WMA(t) = \frac{w_1 \times y(t) + w_2 \times y(t-1) + ... + w_n \times y(t-n+1)}{w_1 + w_2 + ... + w_n}\)
  • Used in technical analysis, signal processing, and trend analysis
  • Advantages: Allows for customized weighting schemes, can better capture underlying patterns
  • Limitations: Requires careful selection of weights, inappropriate weights can distort the series

Weighted Moving Average (WMA)

wma_custom <- WMA(aapl_prices, n = 5, wts = c(0.1, 0.2, 0.3, 0.2, 0.2))
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(wma_custom), y = as.numeric(wma_custom)), color = "green") +
  labs(title = "Weighted Moving Average (WMA)",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "green"), labels = c("Price", "WMA")) +
  theme_minimal()

Savitzky-Golay Filter

  • Performs local polynomial regression on a moving window of data points
  • Widely used in signal processing, spectroscopy, and financial time series analysis
  • Advantages: Preserves data features (peaks and valleys), handles noisy data effectively
  • Limitations: Computationally expensive, choice of polynomial order and window size affects results

Savitzky-Golay Filter

library(signal)  # provides sgolayfilt()
sg_filter <- sgolayfilt(as.numeric(aapl_prices), p = 3, n = 21)
# Plot Savitzky-Golay Filter
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(sg_filter)), color = "purple") +
  labs(title = "Savitzky-Golay Filter",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "purple"), labels = c("Price", "SG Filter")) +
  theme_minimal()

Lowess (Locally Weighted Scatterplot Smoothing)

  • Non-parametric regression technique that fits a low-degree polynomial to localized subsets of data
  • Useful for identifying non-linear trends and patterns in financial time series data
  • Advantages: Effective for handling non-linear relationships, robust to outliers, captures complex patterns
  • Limitations: Computationally intensive, sensitive to the choice of smoothing parameters

Lowess (Locally Weighted Scatterplot Smoothing)

# Lowess Smoothing
lowess_smooth <- lowess(as.numeric(aapl_prices))  # lowess() expects a plain numeric vector
# Plot Lowess Smoothing
ggplot() +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(aapl_prices)), color = "black") +
  geom_line(aes(x = index(aapl_prices), y = as.numeric(lowess_smooth$y)), color = "orange") +
  labs(title = "Lowess Smoothing",
       x = "Time",
       y = "Adjusted Close") +
  scale_color_manual(name = "Series", values = c("black", "orange"), labels = c("Price", "Lowess")) +
  theme_minimal()

Kalman Filter

  • Recursive algorithm that estimates the actual state of a dynamic system from noisy observations
  • Widely used in finance for portfolio optimization, risk management, and forecasting
  • Advantages: Optimal for linear systems with Gaussian noise, handles missing data, provides state estimates and uncertainties
  • Limitations: Assumes linear system model and Gaussian noise, performance degrades for non-linear or non-Gaussian systems

Kalman Filter

library(dlm)  # provides dlmSmooth() and dlmModPoly()

# Apply the Kalman smoother to the price series
# dlmModPoly(1) specifies a local level (random walk plus noise) model
# dV is the observation variance (noise in price measurements)
# dW is the system variance (noise in the underlying price evolution)
s <- dlmSmooth(aapl_prices, 
               dlmModPoly(1, 
                          dV = 15100,  # Observation variance
                          dW = 1470))  # System variance

# Create a data frame for plotting
# Combining original prices and Kalman filter estimates
data_df <- data.frame(
    Date = index(aapl_prices),                 # Time index
    Price = as.numeric(aapl_prices),           # Original prices
    Kalman = as.numeric(dropFirst(s$s))        # Smoothed estimates
)

# Create visualization comparing original prices to Kalman filter estimates
data_df |> 
    ggplot(aes(x = Date)) +
    # Plot original price series
    geom_line(aes(y = Price, color = "Price")) +
    # Plot Kalman filter estimates
    geom_line(aes(y = Kalman, color = "Kalman")) +
    # Add labels and formatting
    labs(
        title = "Apple Inc. (AAPL) Stock Prices",
        subtitle = "Comparison of raw prices vs Kalman filter smoothing",
        x = "Time",
        y = "Adjusted Close"
    ) +
    scale_color_manual(
        name = "Series", 
        values = c("Price" = "black", "Kalman" = "blue")
    ) +
    theme_minimal()

Quantile-Quantile Plot Creation

# Create Q-Q plot to compare returns distribution to normal distribution
glen_m %>%
    ggplot(aes(sample = 100 * log_return)) +  # Multiply by 100 to show as percentage
    # Add Q-Q points
    stat_qq(
        distribution = stats::qnorm,  # Compare against normal distribution
        color = "blue",
        alpha = 0.6
    ) +
    # Add reference line
    stat_qq_line(
        distribution = stats::qnorm,
        colour = "red"
    ) +
    # Add labels and formatting
    labs(
        title = "Q-Q Plot of Glencore Stock Returns",
        subtitle = "Comparing returns distribution to normal distribution",
        x = "Theoretical Quantiles",
        y = "Sample Quantiles (%)"
    ) +
    theme_minimal()

Seasonal Plot Creation

# Create seasonal plot to visualize patterns within years
glen_m_ts %>% 
    ggseasonplot(
        year.labels = TRUE,          # Show labels for each year
        year.labels.left = TRUE,     # Place year labels on left side
        col = rainbow(10)            # Use different color for each year
    ) + 
    # Add labels and formatting
    labs(
        title = "Seasonal Plot: Glencore Returns",
        subtitle = "Monthly patterns across different years",
        x = "Month",
        y = "Returns"
    ) +
    theme_minimal() +
    # Adjust theme for better readability
    theme(
        legend.position = "right",
        axis.text.x = element_text(angle = 45, hjust = 1)
    )

Understanding Autocorrelation

  • Autocorrelation measures how related a time series is with itself at different time lags
  • Think of it like comparing today’s price with:
    • Yesterday’s price (lag 1)
    • Last week’s price (lag 5)
    • Last month’s price (lag 20)
  • Values range from -1 to 1:
    • Close to 1: Strong positive relationship
    • Close to -1: Strong negative relationship
    • Close to 0: No relationship

What is White Noise?

  • White noise is a sequence of random values with:
    • Constant mean (usually 0)
    • Constant variance
    • No correlation between values at different times
  • Important because:
    • It’s the simplest type of random process
    • Used to model “pure randomness” in financial markets
    • Helps identify when a pattern is truly significant

Introduction to Kalman Filtering

  • Named after Rudolf Kálmán
  • Think of it like a GPS system:
    • Combines noisy measurements with predictions
    • Updates estimates as new data arrives
    • Gets better over time
  • In finance, used for:
    • Estimating true price trends
    • Filtering out market noise
    • State space modeling

Understanding Q-Q Plots

  • Q-Q (Quantile-Quantile) plots help assess if data follows a normal distribution
  • How to read them:
    • Points following the diagonal line → data is approximately normally distributed
    • Points below the line in the left tail and above it in the right tail → heavier tails than normal
    • The opposite pattern → lighter tails than normal
  • Why important in finance:
    • Many models assume normal distribution
    • Helps identify potential risk in extreme events
    • Guides choice of statistical methods

Seasonal Patterns in Financial Data

  • Many financial time series show recurring patterns
  • Common types:
    • Daily patterns (market open/close)
    • Weekly patterns (weekend effect)
    • Monthly patterns (end-of-month effect)
    • Yearly patterns (tax-loss selling)
  • Understanding seasonality helps:
    • Better timing of trades
    • Risk management
    • Performance attribution

Comparing Smoothing Techniques

Simple Methods

  • Moving Averages
    • Easy to understand
    • Lag in signals
    • Equal weights to all points

Advanced Methods

  • Kalman Filter
    • Adaptive to changes
    • Handles uncertainty
    • More complex to implement

When to Use What?

  • Simple MA: Quick trend analysis
  • EMA: More responsive trading signals
  • Kalman: Complex state estimation
  • Seasonal: Recurring patterns