Toolkit

Advanced Financial Data Analytics

Barry Quinn

2024-02-01

Outline

Credibility and reproducible analytics
Data Collection
Data Processing
Growth Mindset

Reproducible financial analytics

What does it mean for a data analysis to be “reproducible”?

Near-term goals:

Are the tables and figures reproducible from the code and data?
Does the code actually do what you think it does?
In addition to what was done, is it clear why it was done?

Long-term goals:

Can the code be used for other data?
Can you extend the code to do other things?

Toolkit for reproducibility

Scriptability $\rightarrow$ R
Literate programming (code, narrative, output in one place) $\rightarrow$ Quarto
- Version control $\rightarrow$ Git / GitHub{.incremental}
- ~~Version control~~ $\rightarrow$ Git / GitHub

R packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data¹
As of September 2020, there are over 16,000 R packages available on CRAN (the Comprehensive R Archive Network)²
We’re going to work with a small (but important) subset of these!

¹ Wickham and Bryan, R Packages. ² CRAN contributed packages.

Tour: R and Posit IDE

A short list (for now) of R essentials

Functions are (most often) verbs, followed by what they will be applied to in parentheses:

```{r eval=FALSE}
do_this(to_this)
do_that(to_this, to_that, with_those)
```

Packages are installed with the install.packages function and loaded with the library function, once per session:

```{r eval=FALSE}
install.packages("package_name")
library(package_name)
```

R essentials (continued)

Columns (variables) in data frames are accessed with $:

```{r eval=FALSE}
dataframe$var_name
```

Object documentation can be accessed with ?

```{r eval=FALSE}
?mean
```

Tidyverse.org

The tidyverse is an opinionated collection of R packages designed for data science
All packages share an underlying philosophy and a common grammar

quarto.org

Quarto and the various packages that support it enable R users to write their code and prose in reproducible computational documents
We will generally refer to Quarto documents (with .qmd extension), e.g. “Do this in your Quarto document” and rarely discuss loading the Quarto

Tour: Quarto

Environments

The environment of your Quarto document is separate from the

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

Environments (Your turn)

Go to the Q-RaP and create a new project in the Spaces name FIN7028….
Click here to access Q-RaP

First, run the following in the console

x <- 2
x * 3

All looks good, eh?

Then, add the following in an R chunk in your R Markdown document

x * 3

What happens? Why the error?

Quarto help

Quarto CheatSheet

In the IDE go to the dropdown menu

Help -> Markdown Quick Reference

How will we use Quarto?

Every lab / homework / project / mock /exam etc. is an Quarto document
You’ll always have a template Quarto document to start with
The amount of scaffolding in the template will decrease over the semester

What’s with all the hexes?

Mitchell O’Hara-Wild, useR! 2018 feature wall

Reading Data In (Your Turn)

Visit JKP Factors.
Select the desired options for your data download, such as the region/country (e.g., World), theme/factor, data frequency (daily), and weighting method.
Click the ‘Download’ button to download the data in CSV format.

Reading Data In (Your Turn)

Once you have downloaded the CSV file, you can load it into R using the following R code:

data <- read.csv("path_to_your_downloaded_file.csv")

APIs and Databases

APIs and Databases are important in finance because of:

Real-Time Data Access: APIs enable efficient retrieval of live financial data from markets and institutions, essential for timely analysis and decision-making.
Data Accuracy and Automation: APIs ensure direct access to accurate source data, reducing errors, and automating data collection for efficiency.
Structured Data Management: Databases provide structured storage and management of large volumes of financial data, enabling quick retrieval and manipulation.
Data Integrity and Security: Databases ensure data integrity and security, crucial for maintaining the reliability and confidentiality of financial information.
Historical Data Analysis: Databases are key for storing and analyzing historical financial data, aiding in trend analysis and predictive modeling.
Scalability and Integration: Both APIs and databases offer scalability for growing data volumes and integrate seamlessly with analytical tools for advanced financial analysis.

APIs and Databases (Your Turn)

Connecting to a financial API to fetch real-time stock data.

```{r}
# Assuming a package like tidyquant is installed
library(tidyquant)
library(tidyr)
library(janitor)

symbol <- "AAPL"
start_date <- as.Date("2020-01-01")
end_date <- Sys.Date()

# Get stock data
appl<-tq_get(symbol, from = start_date, to = end_date)
```

Data transformations play a crucial role in preparing raw financial data for analysis, modeling, visualisation, and presentation purposes.

Scaling Numerical Variables

Scaling numerical variables involves normalizing the range of variables to facilitate comparisons across disparate measures.
Two widely used scaling methods include standardization and normalization.
Standardization: Converts variables to zero-centered distributions with unit variance.
Normalization: Scales features between defined intervals (e.g., 0 to 1).

R Implementation Scaling

Implement scaling using functions found in the scale() and tsfe::rescale() functions, both part of the built-in base package.

Example in R:

```{R}
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)
std_x <- scale(x)
tsfe::rescale_variable(x,0,1)
```

  [1] 0.82656575 0.45994575 0.63567929 0.68676787 0.64347133 0.54680118
  [7] 0.85318897 0.54897278 0.94919789 0.55502326 0.81404831 1.00000000
 [13] 0.30384654 0.51409797 0.54165000 0.68735271 0.51306305 0.06375976
 [19] 0.10466872 0.81693552 0.50882312 0.22951560 0.53433979 0.79696508
 [25] 0.92585766 0.48536919 0.51817382 0.23295239 0.65404554 0.44568430
 [31] 0.65316534 0.70040013 0.76295368 0.45156878 0.66254175 0.24169419
 [37] 0.41832230 0.40573671 0.10964232 0.57374327 0.60591836 0.49851603
 [43] 0.71050024 0.42926114 0.30774440 0.64887874 0.41322087 0.84041925
 [49] 0.48518413 0.69108348 0.62787527 0.41843974 0.86534972 0.68866886
 [55] 0.58390250 0.61928118 0.69556115 0.58391618 0.00000000 0.62085933
 [61] 0.49734602 0.60198483 0.67710093 0.83201648 0.42914991 0.81360756
 [67] 0.63051231 0.76359814 0.74129067 0.70343832 0.36933122 0.54981991
 [73] 0.68499800 0.38630093 0.46408788 0.67694425 0.71239721 0.65474069
 [79] 0.39913246 0.35859925 0.85341342 0.61575273 0.58365241 0.54400330
 [85] 0.34069154 0.68281583 0.52577449 0.53228678 0.74368052 0.72254817
 [91] 0.83057314 0.47671255 0.69007977 0.83038261 0.35651430 0.40386446
 [97] 0.35254634 0.29052139 0.58205049 0.69062066

Logarithmic Transformation

Applying logarithmic transformations helps mitigate skewness issues prevalent in certain types of financial data (i.e., exponential growth patterns).

Benefits of Logarithmic Transformation

Stabilises variances.
Linearises relationships among variables.

R Implementation

Utilise the log() function to implement logarithmic transformations.

Example in R:

```{R}
set.seed(42)
y <- exp((rnorm(100, mean = 1, sd = 1)))
log_y <- log(y + 1)  # Adding a constant prevents taking logs of negative numbers
```

Differencing Time Series Data

Differencing is a technique often applied to stationarize nonstationary time series data.

Stationarity in Time Series

Stationarity implies consistent statistical properties throughout the entire dataset.
Subtract consecutive observations to compute returns.

R Implementation

Leverage the lag() and diff() functions to execute differencing.

Example in R:

```{R}
set.seed(42)
closing_prices <- cumprod(rnorm(100, mean = 0.01, sd = 0.01))
returns <- (closing_prices-lag(closing_prices)) / lag(closing_prices)
head(returns)
```

[1] 0 0 0 0 0 0

Binning Continuous Variables

Binning continuous variables categorizes quantitative values into distinct intervals or bins.

Strategies for Binning

Equal width
Equal frequency
Clustering algorithms

R Implementation

Employ the cut() and findInterval() functions to implement basic forms of binning.

Example in R:

```{R}
set.seed(42)
age <- runif(1000, min = 0, max = 100)
age_binned <- cut(age, breaks = seq(0, 100, by = 10), labels = FALSE)
```

Merging Multiple Datasets

Merging multiple datasets enables integration of complementary pieces of information scattered across various sources.

Types of Merges

Horizontal merges: Matching keys shared among records.
Vertical merges: Appending rows from one database onto another.

R Implementation

Apply the merge(), bind_rows(), bind_cols()

Example in R:

```{R}
library(dplyr)
set.seed(42)
dataset1 <- data.frame(id = sample(1:5, size = 5, replace = TRUE), x = runif(5))
dataset2 <- data.frame(id = sample(1:5, size = 5, replace = TRUE), y = runif(5))
merged_dataset <- merge(dataset1, dataset2, by = "id")
stacked_dataset <- bind_rows(dataset1, dataset2)
```

Changing the Shape of DataFrames: Long to Wide Using R

Sometimes it becomes necessary to change the shape of a dataset from long to wide in financial data.

Tools in R

tidyr
tidyquant
janitor

R Implementation

First, ensure you have installed and loaded the necessary packages.

Example in R:

```{r}
library(tidyquant)
library(tidyr)
library(janitor)
```

Retrieve the historical stock data using the tidyquant API.

Next, retrieve the historical stock data using the tidyquant API:

```{r}
tickers <- c("AAPL", "MSFT")
start_date <- as.Date("2020-01-01")
end_date <- Sys.Date()
financial_data <- tq_get(tickers, from = start_date, to = end_date)
print(head(financial_data))
```

# A tibble: 6 × 8
  symbol date        open  high   low close    volume adjusted
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 AAPL   2020-01-02  74.1  75.2  73.8  75.1 135480400     73.2
2 AAPL   2020-01-03  74.3  75.1  74.1  74.4 146322800     72.4
3 AAPL   2020-01-06  73.4  75.0  73.2  74.9 118387200     73.0
4 AAPL   2020-01-07  75.0  75.2  74.4  74.6 108872000     72.7
5 AAPL   2020-01-08  74.3  76.1  74.3  75.8 132079200     73.8
6 AAPL   2020-01-09  76.8  77.6  76.6  77.4 170108400     75.4

Converting Long to Wide

Use the pivot_wider() function to convert the data to a wide structure.

Example in R:

```{R}
financial_data_wide <- financial_data |>
  mutate(date = ymd(date)) |>
  select(date,symbol,adjusted) |>
  pivot_wider(names_from =symbol, values_from = adjusted) |>
  clean_names() |>
  remove_empty(which = "rows") |>
  relocate(date, .before = everything())

print(head(financial_data_wide))
```

# A tibble: 6 × 3
  date        aapl  msft
  <date>     <dbl> <dbl>
1 2020-01-02  73.2  155.
2 2020-01-03  72.4  153.
3 2020-01-06  73.0  153.
4 2020-01-07  72.7  152.
5 2020-01-08  73.8  154.
6 2020-01-09  75.4  156.

Toolkit

Outline

Reproducible financial analytics

Toolkit for reproducibility

R packages

Tour: R and Posit IDE

A short list (for now) of R essentials

R essentials (continued)

Tour: Quarto

Environments

Environments (Your turn)

Quarto help

How will we use Quarto?

What’s with all the hexes?

Reading Data In (Your Turn)

Reading Data In (Your Turn)

APIs and Databases

APIs and Databases (Your Turn)

Scaling Numerical Variables

R Implementation Scaling

Logarithmic Transformation

Benefits of Logarithmic Transformation

R Implementation

Differencing Time Series Data

Stationarity in Time Series

R Implementation

Binning Continuous Variables

Strategies for Binning

R Implementation

Merging Multiple Datasets

Types of Merges

R Implementation

Changing the Shape of DataFrames: Long to Wide Using R

Tools in R

R Implementation

Converting Long to Wide

Recap Question