This is an example of equity analysis I’ve conducted that examines correlations between search-term interest (via Google Trends data) and company revenue. The goal is to find correlations strong enough to generate a revenue estimate during a given quarter, before the company actually reports that financial data to the rest of the market.

For this example, I will predict the revenue that Indian online travel company MakeMyTrip (NASDAQ:MMYT) will report for its fiscal 2021 Q4 using Google Trends data for the travel search term “flight”. To accomplish this, I will pull the past 4+ years of Google Trends data as well as the historical financials for MMYT.
library(formatR)
library(ggplot2)
library(tidyverse)
library(gtrendsR)
library(lubridate)
#Search Variables
#Note that only "flight" will be used in this example
searchterms <- c("hotel",
                 "makemytrip",
                 "flight")

#Location and Time
gtrends_query <- gtrends(searchterms,
                         geo = "IN",
                         time = "2017-01-01 2021-03-31",
                         gprop = "web")
#Extract the search interest data from Google Trends
interest <- gtrends_query %>%
  pluck("interest_over_time") %>%
  mutate(hits = as.numeric(hits))

interest %>%
  ggplot(aes(date, hits, color = keyword)) +
geom_line()+
scale_color_manual(values=c("#83c8c7","#936ed1", "#799fcc"))+
theme(legend.position = "bottom",
legend.background = element_rect(fill="#303443"),
legend.key = element_rect(fill = "#303443", colour = "#303443"),
legend.title = element_text(colour = "azure"),
legend.text = element_text(colour = "azure"),
plot.background = element_rect(fill = "#1f212a"),
panel.background = element_rect(fill = "#303443"),
panel.grid.major = element_line(size = 0.05, linetype = 'solid',
colour = "#b2b7ce"),
panel.grid.minor = element_line(size = 0.00, linetype = 'solid',
colour = "#b2b7ce"),
plot.title = element_text(colour = "azure"),
axis.text = element_text(colour = "azure"),
axis.title.x = element_blank(),
axis.title.y = element_text(colour = "azure"))+
labs(title = "India Search Interest Over Time",
x = "Date",
y = "Search Interest")
The visual above shows the dramatic impact that COVID-19 related lockdowns beginning in March 2020 had on travel-related search interest. Recall that only the search term “flight” will be used for the analysis in this example.
#Map Google Trends dates onto MMYT fiscal quarters, keeping only "flight"
res <- interest %>%
  select(date, hits, keyword) %>%
  filter(keyword == "flight") %>%
  mutate(date = quarter(date, with_year = TRUE, fiscal_start = 4))

#Get average search interest by quarter
trends_data <- res %>%
  group_by(date) %>%
  summarize(average_hits = round(mean(hits), 2))
#Sample Output
as.data.frame(trends_data)
## date average_hits
## 1 2017.4 46.38
## 2 2018.1 49.23
## 3 2018.2 49.77
## 4 2018.3 56.29
## 5 2018.4 51.67
## 6 2019.1 49.15
## 7 2019.2 44.71
## 8 2019.3 46.15
## 9 2019.4 46.69
## 10 2020.1 46.54
## 11 2020.2 49.31
## 12 2020.3 53.69
## 13 2020.4 49.62
## 14 2021.1 25.38
## 15 2021.2 27.62
## 16 2021.3 34.92
## 17 2021.4 39.54
Note that the average_hits during fiscal 2021 Q4 is 39.54. I’ll come back to this value once we are ready to use our model for prediction.
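Rather than reading this value off the table, it can also be pulled programmatically. A minimal sketch, assuming the `trends_data` frame built above:

```r
# Pull the most recent fiscal quarter in the data (2021.4) and its
# average search interest (39.54 per the table above)
latest_quarter <- trends_data %>%
  filter(date == max(date))

latest_quarter
```

Filtering on `max(date)` avoids comparing the fractional quarter value against a hard-coded literal.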
#Get Data
mmyt_fd <- read.csv("/Users/brenton/Documents/Data Portfolio/MMYT Revenue.csv",
                    header = TRUE)
#Adjust periods to follow MMYT fiscal quarters
fd <- mmyt_fd %>%
  mutate(
    period = ym(period),
    period = quarter(period, with_year = TRUE, fiscal_start = 4))
#Sample Output
fd
## period revenue
## 1 2021.3 56806
## 2 2021.2 21052
## 3 2021.1 6361
## 4 2020.4 104946
## 5 2020.3 146889
## 6 2020.2 117957
## 7 2020.1 141737
## 8 2019.4 120177
## 9 2019.3 124815
## 10 2019.2 103609
## 11 2019.1 137410
## 12 2018.4 157806
## 13 2018.3 172477
## 14 2018.2 152917
## 15 2018.1 192056
## 16 2017.4 120033
## 17 2017.3 123248
## 18 2017.2 83109
## 19 2017.1 121226
## 20 2016.4 88000
merged_data <- trends_data %>%
  inner_join(fd, by = c("date" = "period")) %>%
  arrange(desc(date))
#Sample Output
as.data.frame(merged_data)
## date average_hits revenue
## 1 2021.3 34.92 56806
## 2 2021.2 27.62 21052
## 3 2021.1 25.38 6361
## 4 2020.4 49.62 104946
## 5 2020.3 53.69 146889
## 6 2020.2 49.31 117957
## 7 2020.1 46.54 141737
## 8 2019.4 46.69 120177
## 9 2019.3 46.15 124815
## 10 2019.2 44.71 103609
## 11 2019.1 49.15 137410
## 12 2018.4 51.67 157806
## 13 2018.3 56.29 172477
## 14 2018.2 49.77 152917
## 15 2018.1 49.23 192056
## 16 2017.4 46.38 120033
merged_data %>%
  ggplot(aes(x = average_hits, y = revenue)) +
geom_point(color = "#936ed1") +
geom_smooth(method = "lm", se = FALSE, size = .5, color = "royalblue")+
theme(
plot.background = element_rect(fill = "#1f212a"),
panel.background = element_rect(fill = "#303443"),
panel.grid.major = element_line(size = 0.05, linetype = 'solid',
colour = "#b2b7ce"),
panel.grid.minor = element_line(size = 0.00, linetype = 'solid',
colour = "#b2b7ce"),
plot.title = element_text(colour = "azure"),
axis.text = element_text(colour = "azure"),
axis.title.x = element_text(colour = "azure"),
axis.title.y = element_text(colour = "azure"))+
labs(title = "Quarterly Revenue vs India Search Interest Over Time",
x = "Search Interest",
y = "Revenue (Thousands)")
## `geom_smooth()` using formula 'y ~ x'
linear_model <- merged_data %>%
  lm(revenue ~ average_hits, data = .)
summary(linear_model)
##
## Call:
## lm(formula = revenue ~ average_hits, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35030 -5351 -2202 4430 54196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129349.0 27142.8 -4.765 0.000302 ***
## average_hits 5427.8 587.2 9.243 2.45e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19820 on 14 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8491
## F-statistic: 85.43 on 1 and 14 DF, p-value: 2.454e-07
The result of the linear regression is an R^2 of .86 and a p-value far below the standard threshold of .05, indicating a strong and statistically significant relationship between the “flight” search term and MakeMyTrip’s revenue.
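These headline figures can also be extracted from the model object directly rather than read off the printed summary. A small sketch using base R accessors:

```r
model_summary <- summary(linear_model)

# Coefficient of determination (0.8592 per the summary above)
model_summary$r.squared

# p-value of the average_hits slope (2.45e-07 per the summary above)
model_summary$coefficients["average_hits", "Pr(>|t|)"]
```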
#Add the predicted values alongside the actuals
merged_data %>%
  mutate(predicted = linear_model$fitted.values) %>%
  as.data.frame()
## date average_hits revenue predicted
## 1 2021.3 34.92 56806 60188.349
## 2 2021.2 27.62 21052 20565.717
## 3 2021.1 25.38 6361 8407.539
## 4 2020.4 49.62 104946 139976.389
## 5 2020.3 53.69 146889 162067.363
## 6 2020.2 49.31 117957 138293.784
## 7 2020.1 46.54 141737 123258.895
## 8 2019.4 46.69 120177 124073.059
## 9 2019.3 46.15 124815 121142.069
## 10 2019.2 44.71 103609 113326.098
## 11 2019.1 49.15 137410 137425.343
## 12 2018.4 51.67 157806 151103.293
## 13 2018.3 56.29 172477 176179.534
## 14 2018.2 49.77 152917 140790.553
## 15 2018.1 49.23 192056 137859.563
## 16 2017.4 46.38 120033 122390.454
This output shows the revenue MMYT actually reported alongside the model’s fitted values for each historical quarter.
#Create a data frame to use as the input for prediction
#Recall that 39.54 is the average search interest for fiscal 2021 Q4
prediction_df <- data.frame(average_hits = 39.54)
#Predict current quarter revenue
predict(linear_model, prediction_df )
## 1
## 85264.59
This output shows a predicted fiscal 2021 Q4 revenue of 85264.59 (in thousands) from the linear model.
From the final output above, the predicted revenue for the latest quarter came in at 85265, while the actual revenue MMYT reported on their earnings release (months after this data was gathered) came in at 79221, an error of roughly 7.6%. This shows how powerful even a simple single-variable linear model can be as a starting point for building a predictive model. That said, I’ve built far stronger models in the past by combining multivariate analysis of Google Trends search terms with other publicly accessible data sources.
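As a closing check, the gap between the model’s prediction and the actual reported figure of 79221 can be expressed as a percentage error. A short sketch:

```r
predicted_revenue <- predict(linear_model, prediction_df)
actual_revenue <- 79221

# Absolute percentage error of the out-of-sample prediction (~7.6%)
round(100 * (predicted_revenue - actual_revenue) / actual_revenue, 1)
```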