Advanced Financial Data Analytics
2024-02-01
What does it mean for a data analysis to be “reproducible”?
Near-term goals:
Long-term goals:
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data1
As of September 2020, there are over 16,000 R packages available on CRAN (the Comprehensive R Archive Network)2
We’re going to work with a small (but important) subset of these!
1 Wickham and Bryan, R Packages. 2 CRAN contributed packages.
install.packages
function and loaded with the library
function, once per session:$
:?
Quarto and the various packages that support it enable R users to write their code and prose in reproducible computational documents
We will generally refer to Quarto documents (with .qmd
extension), e.g. “Do this in your Quarto document” and rarely discuss loading the Quarto
The environment of your Quarto document is separate from the
OR
Help -> Markdown Quick Reference
Mitchell O’Hara-Wild, useR! 2018 feature wall
Real-Time Data Access: APIs enable efficient retrieval of live financial data from markets and institutions, essential for timely analysis and decision-making.
Data Accuracy and Automation: APIs ensure direct access to accurate source data, reducing errors, and automating data collection for efficiency.
Structured Data Management: Databases provide structured storage and management of large volumes of financial data, enabling quick retrieval and manipulation.
Data Integrity and Security: Databases ensure data integrity and security, crucial for maintaining the reliability and confidentiality of financial information.
Historical Data Analysis: Databases are key for storing and analyzing historical financial data, aiding in trend analysis and predictive modeling.
Scalability and Integration: Both APIs and databases offer scalability for growing data volumes and integrate seamlessly with analytical tools for advanced financial analysis.
Data transformations play a crucial role in preparing raw financial data for analysis, modeling, visualisation, and presentation purposes.
Scaling numerical variables involves normalizing the range of variables to facilitate comparisons across disparate measures.
Two widely used scaling methods include standardization and normalization.
Standardization: Converts variables to zero-centered distributions with unit variance.
Normalization: Scales features between defined intervals (e.g., 0 to 1).
Implement scaling using functions found in the scale()
and tsfe::rescale()
functions, both part of the built-in base
package.
Example in R:
```{R}
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)
std_x <- scale(x)
tsfe::rescale_variable(x,0,1)
```
[1] 0.82656575 0.45994575 0.63567929 0.68676787 0.64347133 0.54680118
[7] 0.85318897 0.54897278 0.94919789 0.55502326 0.81404831 1.00000000
[13] 0.30384654 0.51409797 0.54165000 0.68735271 0.51306305 0.06375976
[19] 0.10466872 0.81693552 0.50882312 0.22951560 0.53433979 0.79696508
[25] 0.92585766 0.48536919 0.51817382 0.23295239 0.65404554 0.44568430
[31] 0.65316534 0.70040013 0.76295368 0.45156878 0.66254175 0.24169419
[37] 0.41832230 0.40573671 0.10964232 0.57374327 0.60591836 0.49851603
[43] 0.71050024 0.42926114 0.30774440 0.64887874 0.41322087 0.84041925
[49] 0.48518413 0.69108348 0.62787527 0.41843974 0.86534972 0.68866886
[55] 0.58390250 0.61928118 0.69556115 0.58391618 0.00000000 0.62085933
[61] 0.49734602 0.60198483 0.67710093 0.83201648 0.42914991 0.81360756
[67] 0.63051231 0.76359814 0.74129067 0.70343832 0.36933122 0.54981991
[73] 0.68499800 0.38630093 0.46408788 0.67694425 0.71239721 0.65474069
[79] 0.39913246 0.35859925 0.85341342 0.61575273 0.58365241 0.54400330
[85] 0.34069154 0.68281583 0.52577449 0.53228678 0.74368052 0.72254817
[91] 0.83057314 0.47671255 0.69007977 0.83038261 0.35651430 0.40386446
[97] 0.35254634 0.29052139 0.58205049 0.69062066
Utilise the log()
function to implement logarithmic transformations.
Example in R:
Leverage the lag()
and diff()
functions to execute differencing.
Example in R:
Binning continuous variables categorizes quantitative values into distinct intervals or bins.
Employ the cut()
and findInterval()
functions to implement basic forms of binning.
Example in R:
Merging multiple datasets enables integration of complementary pieces of information scattered across various sources.
Apply the merge()
, bind_rows()
, bind_cols()
Example in R:
```{R}
library(dplyr)
set.seed(42)
dataset1 <- data.frame(id = sample(1:5, size = 5, replace = TRUE), x = runif(5))
dataset2 <- data.frame(id = sample(1:5, size = 5, replace = TRUE), y = runif(5))
merged_dataset <- merge(dataset1, dataset2, by = "id")
stacked_dataset <- bind_rows(dataset1, dataset2)
```
Sometimes it becomes necessary to change the shape of a dataset from long to wide in financial data.
tidyr
tidyquant
janitor
First, ensure you have installed and loaded the necessary packages.
Example in R:
Retrieve the historical stock data using the tidyquant API.
Next, retrieve the historical stock data using the tidyquant API:
```{r}
tickers <- c("AAPL", "MSFT")
start_date <- as.Date("2020-01-01")
end_date <- Sys.Date()
financial_data <- tq_get(tickers, from = start_date, to = end_date)
print(head(financial_data))
```
# A tibble: 6 × 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2020-01-02 74.1 75.2 73.8 75.1 135480400 73.2
2 AAPL 2020-01-03 74.3 75.1 74.1 74.4 146322800 72.4
3 AAPL 2020-01-06 73.4 75.0 73.2 74.9 118387200 73.0
4 AAPL 2020-01-07 75.0 75.2 74.4 74.6 108872000 72.7
5 AAPL 2020-01-08 74.3 76.1 74.3 75.8 132079200 73.8
6 AAPL 2020-01-09 76.8 77.6 76.6 77.4 170108400 75.4
Use the pivot_wider()
function to convert the data to a wide structure.
Example in R:
```{R}
financial_data_wide <- financial_data |>
mutate(date = ymd(date)) |>
select(date,symbol,adjusted) |>
pivot_wider(names_from =symbol, values_from = adjusted) |>
clean_names() |>
remove_empty(which = "rows") |>
relocate(date, .before = everything())
print(head(financial_data_wide))
```
# A tibble: 6 × 3
date aapl msft
<date> <dbl> <dbl>
1 2020-01-02 73.2 155.
2 2020-01-03 72.4 153.
3 2020-01-06 73.0 153.
4 2020-01-07 72.7 152.
5 2020-01-08 73.8 154.
6 2020-01-09 75.4 156.