Toolkit

Advanced Financial Data Analytics

Barry Quinn

2024-02-01

Outline

  • Credibility and reproducible analytics
  • Data Collection
  • Data Processing
  • Growth Mindset

Reproducible financial analytics

What does it mean for a data analysis to be “reproducible”?

Near-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability \(\rightarrow\) R
  • Literate programming (code, narrative, output in one place) \(\rightarrow\) Quarto
  • Version control \(\rightarrow\) Git / GitHub

R packages

  • Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data1

  • As of September 2020, there were over 16,000 R packages available on CRAN (the Comprehensive R Archive Network)2

  • We’re going to work with a small (but important) subset of these!

1 Wickham and Bryan, R Packages. 2 CRAN contributed packages.

Tour: R and Posit IDE

A short list (for now) of R essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
```{r eval=FALSE}
do_this(to_this)
do_that(to_this, to_that, with_those)
```
  • Packages are installed with the install.packages function and loaded with the library function, once per session:
```{r eval=FALSE}
install.packages("package_name")
library(package_name)
```

R essentials (continued)

  • Columns (variables) in data frames are accessed with $:
```{r eval=FALSE}
dataframe$var_name
```
  • Object documentation can be accessed with ?
```{r eval=FALSE}
?mean
```

Tidyverse.org

  • The tidyverse is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar

quarto.org

  • Quarto and the various packages that support it enable R users to write their code and prose in reproducible computational documents

  • We will generally refer to Quarto documents (with the .qmd extension), e.g. “Do this in your Quarto document”, and rarely discuss the Quarto tooling itself

Tour: Quarto

Environments

The environment of your Quarto document is separate from the Console.

  • Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

Environments (Your turn)

  • Go to the Q-RaP and create a new project in the Space named FIN7028….
  • Click here to access Q-RaP
  • First, run the following in the console:
```{r eval=FALSE}
x <- 2
x * 3
```
  • All looks good, eh?
  • Then, add the following in an R chunk in your Quarto document:
```{r eval=FALSE}
x * 3
```

What happens? Why the error?

Quarto help

  • In the IDE, go to the dropdown menu: Help -> Markdown Quick Reference

How will we use Quarto?

  • Every lab / homework / project / mock / exam etc. is a Quarto document
  • You’ll always have a template Quarto document to start with
  • The amount of scaffolding in the template will decrease over the semester

What’s with all the hexes?

Mitchell O’Hara-Wild, useR! 2018 feature wall

Reading Data In (Your Turn)

  1. Visit JKP Factors.
  2. Select the desired options for your data download, such as the region/country (e.g., World), theme/factor, data frequency (daily), and weighting method.
  3. Click the ‘Download’ button to download the data in CSV format.

Reading Data In (Your Turn)

  • Once you have downloaded the CSV file, you can load it into R using the following code:
```{r eval=FALSE}
# Replace the placeholder path with the location of your downloaded file
data <- read.csv("path_to_your_downloaded_file.csv")
```
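A tidyverse-style alternative is sketched below; the file path is the same placeholder, and this assumes the readr and janitor packages are installed:

```{r eval=FALSE}
library(readr)
library(janitor)

# Read the downloaded JKP factor file and standardise the column names
jkp_factors <- read_csv("path_to_your_downloaded_file.csv") |>
  clean_names()

head(jkp_factors)
```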

APIs and Databases

  • APIs and Databases are important in finance because of:
  1. Real-Time Data Access: APIs enable efficient retrieval of live financial data from markets and institutions, essential for timely analysis and decision-making.

  2. Data Accuracy and Automation: APIs ensure direct access to accurate source data, reducing errors, and automating data collection for efficiency.

  3. Structured Data Management: Databases provide structured storage and management of large volumes of financial data, enabling quick retrieval and manipulation.

  4. Data Integrity and Security: Databases ensure data integrity and security, crucial for maintaining the reliability and confidentiality of financial information.

  5. Historical Data Analysis: Databases are key for storing and analyzing historical financial data, aiding in trend analysis and predictive modeling.

  6. Scalability and Integration: Both APIs and databases offer scalability for growing data volumes and integrate seamlessly with analytical tools for advanced financial analysis.

APIs and Databases (Your Turn)

  • Connecting to a financial data API to fetch stock price data.
```{r}
# Assuming the tidyquant package is installed
library(tidyquant)
library(tidyr)
library(janitor)

symbol <- "AAPL"
start_date <- as.Date("2020-01-01")
end_date <- Sys.Date()

# Get daily stock price data for AAPL
aapl <- tq_get(symbol, from = start_date, to = end_date)
```
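Having fetched the data through the API, the database side can be sketched as follows: a minimal illustration, assuming the DBI and RSQLite packages are installed (the file and table names are illustrative).

```{r eval=FALSE}
library(DBI)
library(RSQLite)

# Store the fetched prices in a local SQLite database (illustrative file/table names)
con <- dbConnect(RSQLite::SQLite(), "financial_data.sqlite")
dbWriteTable(con, "stock_prices", aapl, overwrite = TRUE)

# Retrieve the stored data with SQL, then close the connection
dbGetQuery(con, "SELECT * FROM stock_prices LIMIT 5")
dbDisconnect(con)
```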

Data transformations play a crucial role in preparing raw financial data for analysis, modeling, visualisation, and presentation purposes.

Scaling Numerical Variables

  • Scaling numerical variables involves normalizing the range of variables to facilitate comparisons across disparate measures.

  • Two widely used scaling methods include standardization and normalization.

  • Standardization: Converts variables to zero-centered distributions with unit variance.

  • Normalization: Scales features between defined intervals (e.g., 0 to 1).
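In symbols, the two methods can be written as follows (using the sample mean \(\bar{x}\), sample standard deviation \(s\), and the observed minimum and maximum):

\[
z_i = \frac{x_i - \bar{x}}{s}, \qquad
x_i^{*} = \frac{x_i - \min(x)}{\max(x) - \min(x)}
\]

Standardised values \(z_i\) have mean 0 and unit variance; normalised values \(x_i^{*}\) lie in \([0, 1]\).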

R Implementation Scaling

Implement scaling using the base scale() function (standardisation) and the rescale_variable() function from the tsfe package (normalisation).

Example in R:

```{r}
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)  # simulated variable
std_x <- scale(x)                   # standardisation: mean 0, unit variance
tsfe::rescale_variable(x, 0, 1)     # normalisation to the [0, 1] interval
```
  [1] 0.82656575 0.45994575 0.63567929 0.68676787 0.64347133 0.54680118
  [7] 0.85318897 0.54897278 0.94919789 0.55502326 0.81404831 1.00000000
 [13] 0.30384654 0.51409797 0.54165000 0.68735271 0.51306305 0.06375976
 [19] 0.10466872 0.81693552 0.50882312 0.22951560 0.53433979 0.79696508
 [25] 0.92585766 0.48536919 0.51817382 0.23295239 0.65404554 0.44568430
 [31] 0.65316534 0.70040013 0.76295368 0.45156878 0.66254175 0.24169419
 [37] 0.41832230 0.40573671 0.10964232 0.57374327 0.60591836 0.49851603
 [43] 0.71050024 0.42926114 0.30774440 0.64887874 0.41322087 0.84041925
 [49] 0.48518413 0.69108348 0.62787527 0.41843974 0.86534972 0.68866886
 [55] 0.58390250 0.61928118 0.69556115 0.58391618 0.00000000 0.62085933
 [61] 0.49734602 0.60198483 0.67710093 0.83201648 0.42914991 0.81360756
 [67] 0.63051231 0.76359814 0.74129067 0.70343832 0.36933122 0.54981991
 [73] 0.68499800 0.38630093 0.46408788 0.67694425 0.71239721 0.65474069
 [79] 0.39913246 0.35859925 0.85341342 0.61575273 0.58365241 0.54400330
 [85] 0.34069154 0.68281583 0.52577449 0.53228678 0.74368052 0.72254817
 [91] 0.83057314 0.47671255 0.69007977 0.83038261 0.35651430 0.40386446
 [97] 0.35254634 0.29052139 0.58205049 0.69062066

Logarithmic Transformation

  • Applying logarithmic transformations helps mitigate skewness issues prevalent in certain types of financial data (e.g., series with exponential growth patterns).

Benefits of Logarithmic Transformation

  • Stabilises variances.
  • Linearises relationships among variables.
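For instance, if a series grows exponentially, \(y_t = y_0 e^{g t}\), taking logs turns the growth path into a straight line in \(t\):

\[
\log y_t = \log y_0 + g\,t
\]

which is one reason log-transformed prices and index levels are often easier to model and plot.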

R Implementation

Utilise the log() function to implement logarithmic transformations.

Example in R:

```{r}
set.seed(42)
y <- exp(rnorm(100, mean = 1, sd = 1))  # strictly positive, right-skewed data
log_y <- log(y + 1)  # adding 1 guards against taking the log of values at or near zero
```

Differencing Time Series Data

  • Differencing is a technique often applied to stationarize nonstationary time series data.

Stationarity in Time Series

  • Stationarity implies statistical properties (e.g., mean and variance) that are stable over time.
  • Differencing consecutive observations, or computing period-on-period returns, removes trends from price series.
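In symbols, for a price series \(P_t\), the first difference and the simple (period-on-period) return are:

\[
\Delta P_t = P_t - P_{t-1}, \qquad r_t = \frac{P_t - P_{t-1}}{P_{t-1}}
\]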

R Implementation

Leverage the dplyr::lag() and diff() functions to execute differencing (note that base R's stats::lag() does not shift the elements of a plain vector).

Example in R:

```{r}
library(dplyr)  # for dplyr::lag(); base stats::lag() would leave the vector unshifted
set.seed(42)
# Simulate closing prices as a cumulative product of (1 + small random returns)
closing_prices <- cumprod(1 + rnorm(100, mean = 0.001, sd = 0.01))
returns <- (closing_prices - dplyr::lag(closing_prices)) / dplyr::lag(closing_prices)
head(returns)  # the first return is NA because there is no earlier observation
```

Binning Continuous Variables

Binning continuous variables categorizes quantitative values into distinct intervals or bins.

Strategies for Binning

  • Equal width
  • Equal frequency
  • Clustering algorithms

R Implementation

Employ the cut() and findInterval() functions to implement basic forms of binning.

Example in R:

```{r}
set.seed(42)
age <- runif(1000, min = 0, max = 100)
# Equal-width bins: ten intervals of 10 years each
age_binned <- cut(age, breaks = seq(0, 100, by = 10), labels = FALSE)
```
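Equal-frequency bins can be sketched in the same way, using sample quantiles as the break points (a minimal illustration reusing the simulated age variable):

```{r}
# Equal-frequency bins: break points at the quartiles of the data
quartile_breaks <- quantile(age, probs = seq(0, 1, by = 0.25))
age_quartiles <- cut(age, breaks = quartile_breaks, labels = FALSE, include.lowest = TRUE)
table(age_quartiles)  # each bin holds roughly the same number of observations
```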

Merging Multiple Datasets

Merging multiple datasets enables integration of complementary pieces of information scattered across various sources.

Types of Merges

  • Horizontal merges (joins): match records on key columns shared across datasets.
  • Vertical merges (stacking): append the rows of one dataset onto another.

R Implementation

Apply the merge(), bind_rows(), and bind_cols() functions (the latter two from dplyr).

Example in R:

```{r}
library(dplyr)
set.seed(42)
dataset1 <- data.frame(id = sample(1:5, size = 5, replace = TRUE), x = runif(5))
dataset2 <- data.frame(id = sample(1:5, size = 5, replace = TRUE), y = runif(5))
merged_dataset <- merge(dataset1, dataset2, by = "id")  # horizontal merge on the shared key
stacked_dataset <- bind_rows(dataset1, dataset2)        # vertical merge; unmatched columns filled with NA
```

Changing the Shape of DataFrames: Long to Wide Using R

Sometimes it becomes necessary to change the shape of a dataset from long to wide in financial data.

Tools in R

  • tidyr
  • tidyquant
  • janitor

R Implementation

First, ensure you have installed and loaded the necessary packages.

Example in R:

```{r}
library(tidyquant)  # tq_get()
library(tidyr)      # pivot_wider()
library(janitor)    # clean_names(), remove_empty()
library(dplyr)      # mutate(), select(), relocate()
library(lubridate)  # ymd()
```

Next, retrieve the historical stock data using the tidyquant API:

```{r}
tickers <- c("AAPL", "MSFT")
start_date <- as.Date("2020-01-01")
end_date <- Sys.Date()
financial_data <- tq_get(tickers, from = start_date, to = end_date)
print(head(financial_data))
```
# A tibble: 6 × 8
  symbol date        open  high   low close    volume adjusted
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 AAPL   2020-01-02  74.1  75.2  73.8  75.1 135480400     73.2
2 AAPL   2020-01-03  74.3  75.1  74.1  74.4 146322800     72.4
3 AAPL   2020-01-06  73.4  75.0  73.2  74.9 118387200     73.0
4 AAPL   2020-01-07  75.0  75.2  74.4  74.6 108872000     72.7
5 AAPL   2020-01-08  74.3  76.1  74.3  75.8 132079200     73.8
6 AAPL   2020-01-09  76.8  77.6  76.6  77.4 170108400     75.4

Converting Long to Wide

Use the pivot_wider() function to convert the data to a wide structure.

Example in R:

```{r}
financial_data_wide <- financial_data |>
  mutate(date = ymd(date)) |>
  select(date, symbol, adjusted) |>
  pivot_wider(names_from = symbol, values_from = adjusted) |>
  clean_names() |>
  remove_empty(which = "rows") |>
  relocate(date, .before = everything())

print(head(financial_data_wide))
```
# A tibble: 6 × 3
  date        aapl  msft
  <date>     <dbl> <dbl>
1 2020-01-02  73.2  155.
2 2020-01-03  72.4  153.
3 2020-01-06  73.0  153.
4 2020-01-07  72.7  152.
5 2020-01-08  73.8  154.
6 2020-01-09  75.4  156.

Recap Question