This is an example of equity analysis I’ve conducted that examines correlations between search-term interest (via Google Trends data) and company revenue. The goal is to find correlations strong enough to generate a revenue estimate during a given quarter, before the company actually reports that financial data to the rest of the market.

For this example, I will predict the revenue that Indian online travel company MakeMyTrip (NASDAQ:MMYT) will report for its fiscal 2021 Q4 using Google Trends data for the travel search term “flight”. To accomplish this, I will pull the past 4+ years of Google Trends data as well as the historical financials for MMYT.
library(formatR)
library(ggplot2)
library(tidyverse)
library(gtrendsR)
library(lubridate)
#Search Variables
#Note that only "flight" will be used in this example
searchterms <- c("hotel",
                 "makemytrip",
                 "flight")

#Location and Time
gtrends_query <- gtrends(searchterms,
                         geo = "IN",
                         time = "2017-01-01 2021-03-31",
                         gprop = "web")
#Extract the search interest data from Google Trends
interest <- gtrends_query %>%
  pluck("interest_over_time") %>%
  mutate(hits = as.numeric(hits))

interest %>%
  ggplot(aes(date, hits, color = keyword)) +
geom_line()+
scale_color_manual(values=c("#83c8c7","#936ed1", "#799fcc"))+
theme(legend.position = "bottom",
legend.background = element_rect(fill="#303443"),
legend.key = element_rect(fill = "#303443", colour = "#303443"),
legend.title = element_text(colour = "azure"),
legend.text = element_text(colour = "azure"),
plot.background = element_rect(fill = "#1f212a"),
panel.background = element_rect(fill = "#303443"),
panel.grid.major = element_line(size = 0.05, linetype = 'solid',
colour = "#b2b7ce"),
panel.grid.minor = element_line(size = 0.00, linetype = 'solid',
colour = "#b2b7ce"),
plot.title = element_text(colour = "azure"),
axis.text = element_text(colour = "azure"),
axis.title.x = element_blank(),
axis.title.y = element_text(colour = "azure"))+
labs(title = "India Search Interest Over Time",
x = "Date",
y = "Search Interest")
The visual above shows the dramatic impact that COVID-19 related lockdowns beginning in March 2020 had on travel-related search interest. Recall that only the search term “flight” will be used for the analysis in this example.
#Map Google Trends dates onto MMYT fiscal quarters, keeping only "flight"
res <- interest %>%
  select(date, hits, keyword) %>%
  filter(keyword == "flight") %>%
  mutate(date = quarter(date, with_year = TRUE, fiscal_start = 4))

#Get average search interest by quarter
trends_data <- res %>%
  group_by(date) %>%
  summarize(average_hits = round(mean(hits), 2))
#Sample Output
as.data.frame(trends_data)
## date average_hits
## 1 2017.4 46.38
## 2 2018.1 49.23
## 3 2018.2 49.77
## 4 2018.3 56.29
## 5 2018.4 51.67
## 6 2019.1 49.15
## 7 2019.2 44.71
## 8 2019.3 46.15
## 9 2019.4 46.69
## 10 2020.1 46.54
## 11 2020.2 49.31
## 12 2020.3 53.69
## 13 2020.4 49.62
## 14 2021.1 25.38
## 15 2021.2 27.62
## 16 2021.3 34.92
## 17 2021.4 39.54
Note that the average_hits during fiscal 2021 Q4 is 39.54. I’ll come back to this value once we are ready to use our model for prediction.
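Rather than reading this value off the table, it can also be pulled programmatically. A minimal sketch, assuming the `trends_data` frame built above:

```r
# Pull the most recent fiscal quarter in the data (2021.4) and its
# average search interest (39.54 per the table above)
latest_quarter <- trends_data %>%
  filter(date == max(date))

latest_quarter
```

Filtering on `max(date)` avoids comparing the fractional quarter value against a hard-coded literal.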
#Get Data
mmyt_fd <- read.csv("/Users/brenton/Documents/Data Portfolio/MMYT Revenue.csv",
                    header = TRUE)
#Adjust periods to follow MMYT fiscal quarters
fd <- mmyt_fd %>%
  mutate(
    period = ym(period),
    period = quarter(period, with_year = TRUE, fiscal_start = 4))
#Sample Output
fd
## period revenue
## 1 2021.3 56806
## 2 2021.2 21052
## 3 2021.1 6361
## 4 2020.4 104946
## 5 2020.3 146889
## 6 2020.2 117957
## 7 2020.1 141737
## 8 2019.4 120177
## 9 2019.3 124815
## 10 2019.2 103609
## 11 2019.1 137410
## 12 2018.4 157806
## 13 2018.3 172477
## 14 2018.2 152917
## 15 2018.1 192056
## 16 2017.4 120033
## 17 2017.3 123248
## 18 2017.2 83109
## 19 2017.1 121226
## 20 2016.4 88000
merged_data <- trends_data %>%
  inner_join(fd, by = c("date" = "period")) %>%
  arrange(desc(date))
#Sample Output
as.data.frame(merged_data)
## date average_hits revenue
## 1 2021.3 34.92 56806
## 2 2021.2 27.62 21052
## 3 2021.1 25.38 6361
## 4 2020.4 49.62 104946
## 5 2020.3 53.69 146889
## 6 2020.2 49.31 117957
## 7 2020.1 46.54 141737
## 8 2019.4 46.69 120177
## 9 2019.3 46.15 124815
## 10 2019.2 44.71 103609
## 11 2019.1 49.15 137410
## 12 2018.4 51.67 157806
## 13 2018.3 56.29 172477
## 14 2018.2 49.77 152917
## 15 2018.1 49.23 192056
## 16 2017.4 46.38 120033
merged_data %>%
  ggplot(aes(x = average_hits, y = revenue)) +
geom_point(color = "#936ed1") +
geom_smooth(method = "lm", se = FALSE, size = .5, color = "royalblue")+
theme(
plot.background = element_rect(fill = "#1f212a"),
panel.background = element_rect(fill = "#303443"),
panel.grid.major = element_line(size = 0.05, linetype = 'solid',
colour = "#b2b7ce"),
panel.grid.minor = element_line(size = 0.00, linetype = 'solid',
colour = "#b2b7ce"),
plot.title = element_text(colour = "azure"),
axis.text = element_text(colour = "azure"),
axis.title.x = element_text(colour = "azure"),
axis.title.y = element_text(colour = "azure"))+
labs(title = "Quarterly Revenue vs India Search Interest Over Time",
x = "Search Interest",
y = "Revenue (Thousands)")
## `geom_smooth()` using formula 'y ~ x'
linear_model <- merged_data %>%
  lm(revenue ~ average_hits, data = .)
summary(linear_model)
##
## Call:
## lm(formula = revenue ~ average_hits, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35030 -5351 -2202 4430 54196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129349.0 27142.8 -4.765 0.000302 ***
## average_hits 5427.8 587.2 9.243 2.45e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19820 on 14 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8491
## F-statistic: 85.43 on 1 and 14 DF, p-value: 2.454e-07
The result of the linear regression is an R^2 of .86 and a p-value far below the standard threshold of .05, indicating a strong and statistically significant relationship between the “flight” search term and MakeMyTrip’s revenue.
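These headline figures can also be extracted from the model object directly rather than read off the printed summary. A small sketch using base R accessors:

```r
model_summary <- summary(linear_model)

# Coefficient of determination (0.8592 per the summary above)
model_summary$r.squared

# p-value of the average_hits slope (2.45e-07 per the summary above)
model_summary$coefficients["average_hits", "Pr(>|t|)"]
```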
#Add the predicted values alongside the actuals
merged_data %>%
  mutate(predicted = linear_model$fitted.values) %>%
  as.data.frame()
## date average_hits revenue predicted
## 1 2021.3 34.92 56806 60188.349
## 2 2021.2 27.62 21052 20565.717
## 3 2021.1 25.38 6361 8407.539
## 4 2020.4 49.62 104946 139976.389
## 5 2020.3 53.69 146889 162067.363
## 6 2020.2 49.31 117957 138293.784
## 7 2020.1 46.54 141737 123258.895
## 8 2019.4 46.69 120177 124073.059
## 9 2019.3 46.15 124815 121142.069
## 10 2019.2 44.71 103609 113326.098
## 11 2019.1 49.15 137410 137425.343
## 12 2018.4 51.67 157806 151103.293
## 13 2018.3 56.29 172477 176179.534
## 14 2018.2 49.77 152917 140790.553
## 15 2018.1 49.23 192056 137859.563
## 16 2017.4 46.38 120033 122390.454
This output shows the revenue MMYT actually reported alongside the model’s fitted values for each historical quarter.
#Create a data frame to use as the input for prediction
#Recall that 39.54 is the average search interest for fiscal 2021 Q4
prediction_df <- data.frame(average_hits = 39.54)
#Predict current quarter revenue
predict(linear_model, prediction_df )
## 1
## 85264.59
This output shows a predicted fiscal 2021 Q4 revenue of 85264.59 (in thousands) from the linear model.
From the final output above, the predicted revenue for the latest quarter came in at 85265, while the actual revenue MMYT reported on their earnings release (months after this data was gathered) came in at 79221, an error of roughly 7.6%. This shows how powerful even a simple single-variable linear model can be as a starting point for building a predictive model. That said, I’ve built far stronger models in the past by combining multivariate analysis of Google Trends search terms with other publicly accessible data sources.
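As a closing check, the gap between the model’s prediction and the actual reported figure of 79221 can be expressed as a percentage error. A short sketch:

```r
predicted_revenue <- predict(linear_model, prediction_df)
actual_revenue <- 79221

# Absolute percentage error of the out-of-sample prediction (~7.6%)
round(100 * (predicted_revenue - actual_revenue) / actual_revenue, 1)
```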