Factor Investing Using Alternative Data

1. INTRODUCTION

Traditional trading algorithms select stocks on the basis of fundamental and technical analysis, but in an efficient market these approaches rarely generate excess returns. As the volume of new information grows, alternative and unstructured data are transforming the financial industry, notably in crowdfunding, insurance, and investment.

Alternative data, which come from unconventional sources or are unstructured in nature, therefore play an increasingly important role in today’s investment strategies. They span many sources and categories, such as text, images, digital footprints from social media and mobile devices, and emerging data from the Internet of Things (Cong et al., 2019). Although some investors already treat alternative data as a strong driver of stock price movements (Jagtiani and Lemieux, 2019), there are very few mature theories or models describing how to develop investment strategies with alternative data factors.

This paper therefore treats alternative data as one of the most important factors for predicting stock returns and excess earnings. Stock price changes are ultimately driven by people’s expectations of, or confidence in, the corresponding company. To gauge those expectations, it is useful to know how popular a particular stock is. Search engines have become a primary source where millions of people look for answers to their questions. Search activity also records the “zero moment of truth”, the moment at which people look for and evaluate information. We therefore select one particular alternative data source, the Google search engine, and try to extract the popularity information behind millions of everyday search requests and use it to support investment decisions.

In this paper, we first examine the relationship between alternative data from search engines and stock returns. We then add the alternative data to a traditional factor model to test whether it improves the model’s predictive ability.

2. LITERATURE REVIEW

In early research, alternative data was used in intrusion detection systems to distinguish between legitimate and illegitimate activities (Warrender et al., 1999). Bollen et al. (2011) studied the relationship between sentiment indicators implied by Twitter and the DJIA (Dow Jones Industrial Average). They extracted mood-related keywords from Twitter and processed them into six factors: Calm, Alert, Certainty, Enthusiasm, Kindness, Happy. Orbital Insight (Mittal, 2020) predicted Wal-Mart’s sales by counting the cars in store parking lots and estimated global oil reserves by analyzing shadow areas in satellite images of oil refining plants and storage sites. Green et al. (2019) used data from Glassdoor.com to study the relationship between employees’ evaluations of their employers and stock returns, and found that changes in the evaluation score can predict the expected return of a stock. Da, Huang, and Jin (2021) used the weekly scores submitted by retail investors on the Forcerank app to study the negative correlation between investors’ excessive extrapolation of beliefs and future stock returns.

However, due to data limitations, the sample in that study covers only February 2016 to December 2017 and involves fewer than 300 stocks and about 1,000 users. The authors also emphasize that the investors in the sample cannot represent all trades in these stocks. In contrast, this paper first examines the relationship between search-engine data and stock returns, and then combines the alternative data with a traditional factor model so that stock returns are predicted in a single step.

3. HYPOTHESIS DEVELOPMENT

Alternative data can be used to guide investment. First, compared with traditional factor analysis based on historical transaction data, market data, and the like, alternative data can be acquired through more channels, in larger volumes, and in richer types (Kolanovic and Krishnamachari, 2017). Second, alternative data can reflect information that does not exist in traditional data (Monk et al., 2018), such as the degree of social attention, consumer evaluations, and other external factors. Such information is embedded in alternative data such as search trends (Choi and Varian, 2012) and online chat histories. For these reasons, alternative data can be collected, wrangled, and analyzed to guide investment behavior, capturing external market factors that traditional analytical methods cannot obtain.

In addition, the influence of alternative data differs across industries. Alternative data can reflect latent factors that traditional analysis misses, and different industries have different sensitivities to the market (Pérez et al., 2020). For example, the degree of social attention may affect emerging industries more strongly than stable industries. This paper analyzes this effect across the 11 sectors of the Global Industry Classification Standard (GICS).

Finally, combining alternative data with traditional data when constructing the prediction model should give better results (Hasan et al., 2020). Alternative data provide external market information that traditional data lack (Monk et al., 2019), which gives a more accurate picture of how the market is currently functioning. Information is the cornerstone of prediction: the more accurately the influencing factors are described, the more accurate the predictions will be. Adding alternative data should therefore improve forecasts of equity returns.

4. DATA

4.1 Alternative Data Source

We use Google Trends data as the input for our alternative data factor; it reflects how much attention society pays to specific terms. For each topic related to the constituents of the S&P 500 index, we have weekly “interest over time” data. The raw data are rescaled, using a method such as min-max standardization, to obtain an interest score that represents search interest relative to the highest point on the chart for a given region and time period. A value of 100 is the peak popularity of the term. A value of 50 means the term is half as popular as at its peak. A score of 0 means there is not enough data for the term. These data can be combined into a comprehensive sentiment index, from which we calculate the standardized alternative data factor and examine its relationship with the returns of individual stocks.
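The rescaling step can be illustrated with a minimal sketch. It assumes the generic min-max form (minimum mapped to 0, maximum mapped to 100); the function name `min_max_scale` and the sample counts are hypothetical, and Google Trends' own scaling may differ in detail.

```python
def min_max_scale(values, upper=100.0):
    """Rescale values linearly so the minimum maps to 0 and the maximum to `upper`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series carries no relative information; map it to zeros.
        return [0.0 for _ in values]
    return [upper * (v - lo) / (hi - lo) for v in values]


# Hypothetical weekly search counts for one topic (not real data).
raw_counts = [120, 300, 210, 480, 120]
interest = min_max_scale(raw_counts)
print(interest)
```

In this sketch the week with the highest count receives a score of 100, and every other week is expressed relative to that peak.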

4.2 Data Source for Other Factors

We obtain the data for the market, size, and value factors from the Kenneth R. French Data Library. We also obtain adjusted close prices for all S&P 500 constituents through the Intrinio API, and then calculate weekly returns to match the weekly popularity scores obtained from Google Trends.
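As a rough illustration of the weekly-return calculation, the sketch below groups daily adjusted closes by ISO week and computes simple week-over-week returns. The function name `weekly_returns`, the choice of the last close in each week, and the sample prices are all assumptions made for this example; it does not call the Intrinio API.

```python
from datetime import date


def weekly_returns(daily_prices):
    """Turn (date, adjusted_close) pairs into simple week-over-week returns.

    Returns a list of ((iso_year, iso_week), return) tuples, using the last
    available adjusted close in each ISO week.
    """
    last_close = {}
    for day, price in sorted(daily_prices):
        iso = day.isocalendar()
        # Later days in the same week overwrite earlier ones.
        last_close[(iso[0], iso[1])] = price
    weeks = sorted(last_close)
    return [
        (week, last_close[week] / last_close[prev] - 1.0)
        for prev, week in zip(weeks, weeks[1:])
    ]


# Hypothetical adjusted close prices for one stock (not real data).
prices = [
    (date(2021, 1, 4), 100.0),
    (date(2021, 1, 8), 110.0),
    (date(2021, 1, 11), 105.0),
    (date(2021, 1, 15), 121.0),
    (date(2021, 1, 22), 96.8),
]
returns = weekly_returns(prices)
print(returns)
```

Keying both the returns and the popularity scores by ISO week is one simple way to align the two series before modeling.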

4.3 Data Wrangling

We can take several steps to clean the Google Trends data and to calculate a “Popularity” indicator for each company as inputs for our alternative data factor:

Obtain the list of company names for all constituents of the S&P 500 index.
Use the Google Trends API to obtain the top 25 keywords for each company (e.g. the 25 most frequently searched keywords for Amazon include Amazon Prime, Amazon Video, and Amazon Stock).
Obtain the search count for each keyword (topic).
Sum the search counts of the 25 keywords for each company to obtain the company’s aggregated popularity score.
Use the popularity score as the alternative data factor and analyze its relationship with weekly stock returns.
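The aggregation step in the list above can be sketched as follows. The nested-dictionary layout, the function name `popularity_scores`, and the sample counts are hypothetical; the sketch takes the keyword counts as given and does not query Google Trends.

```python
from collections import defaultdict


def popularity_scores(keyword_counts):
    """Sum search counts across keywords for each company and week.

    `keyword_counts` maps company -> keyword -> {week: count}.
    Returns company -> {week: aggregated popularity score}.
    """
    scores = {}
    for company, per_keyword in keyword_counts.items():
        totals = defaultdict(float)
        for weekly_counts in per_keyword.values():
            for week, count in weekly_counts.items():
                totals[week] += count
        scores[company] = dict(totals)
    return scores


# Hypothetical counts for three of a company's keywords (not real data).
sample = {
    "Amazon": {
        "Amazon Prime": {"2021-W01": 80, "2021-W02": 60},
        "Amazon Video": {"2021-W01": 30, "2021-W02": 45},
        "Amazon Stock": {"2021-W01": 10, "2021-W02": 25},
    }
}
scores = popularity_scores(sample)
print(scores)
```

With real data, each company would carry 25 keyword series rather than three, and the counts would need to be on a comparable scale before summing.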

5. FACTOR MODELING WITH ALTERNATIVE DATA FACTORS

5.1 One-factor Model

First, we build the simplest one-factor model to examine the relationship between stock returns and a company’s popularity score, and to see in which industry the coefficient estimate on the alternative data factor is most significant. For each GICS industry, the popularity score is the independent variable and the sign of the stock return (positive or negative) is the dependent variable in a logistic regression model (Julien, 2019). Framing the problem as classification rather than regression is expected to give better predictive accuracy here. Although logistic regression does not capture the exact quantitative relationship between the independent and dependent variables, the strength of their association can be judged from the prediction accuracy. The classifier could alternatively be a kernel SVM or a neural network (Youhe et al., 2018); these methods classify nonlinear problems well. KNN, decision trees, and similar classifiers produce decision regions with regular spatial shapes and are not suitable for this case. The model is evaluated by cross-validation on the dataset to ensure that the experimental results are stable.

The accuracy obtained from the evaluation reflects how strongly the alternative data factor affects each industry. If the accuracy is close to 1, the popularity score and the sign of returns are closely related. If the accuracy is close to 0.5 (the accuracy expected from purely random predictions when positive and negative returns are equally frequent), the two are almost unrelated.
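A dependency-free sketch of this evaluation is given below: a one-feature logistic regression fitted by gradient descent and scored by k-fold cross-validated accuracy. The function names, learning rate, number of epochs, and the synthetic data are assumptions made for illustration; in practice a statistics or machine learning library would normally be used, and the same routine would be run once per industry.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def cv_accuracy(xs, ys, k=5, seed=0):
    """Mean held-out accuracy over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = fit_logistic([xs[i] for i in train], [ys[i] for i in train])
        hits = sum((sigmoid(w * xs[i] + b) >= 0.5) == (ys[i] == 1) for i in fold)
        fold_scores.append(hits / len(fold))
    return sum(fold_scores) / k


# Synthetic, perfectly informative data: standardized popularity scores and
# return signs (1 = positive weekly return). Real data would be far noisier.
xs = [-1.95 + 0.1 * i for i in range(40)]
ys = [1 if x > 0 else 0 for x in xs]
accuracy = cv_accuracy(xs, ys)
print(accuracy)
```

On this artificial sample the accuracy should be close to 1 because the label is fully determined by the feature; on real returns, values only modestly above 0.5 would be the more plausible outcome.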

5.2 Multi-Factor Model

Next, we generate a comprehensive “popularity” factor from a list of keywords related to the selected companies and use it as the main alternative data independent variable. We collect all related search keywords and their Google Trends statistics for the past year. As the dependent variable, we use the stock prices of all S&P 500 constituents over the same time horizon. We then build a “popularity” regression model that includes both traditional factors and the alternative data factor, along with a baseline regression model that includes only the traditional factors. We choose the ElasticNet regression method (De Mol et al., 2009), which performs well when there are multiple, correlated independent variables. The two models are trained in the same way on the same dataset, and their prediction errors are compared to assess performance. This shows whether adding alternative data improves prediction performance.
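One way to write down the two specifications and the fitting criterion is sketched below. The notation is an assumption introduced here, not taken from the paper: $r_{i,t} - r_{f,t}$ is the weekly excess return of stock $i$, $\mathrm{MKT}_t$, $\mathrm{SMB}_t$ and $\mathrm{HML}_t$ are the market, size and value factors, $\mathrm{POP}_{i,t}$ is the popularity score, and $\lambda$ and $\rho$ are the overall penalty strength and the L1 mixing weight of the elastic net. Using excess returns rather than price levels as the dependent variable is likewise an assumption of this sketch.

```latex
\begin{align}
\text{Baseline:}\quad
r_{i,t} - r_{f,t} &= \alpha_i + \beta_i^{M}\,\mathrm{MKT}_t + \beta_i^{S}\,\mathrm{SMB}_t
  + \beta_i^{V}\,\mathrm{HML}_t + \varepsilon_{i,t} \\
\text{Augmented:}\quad
r_{i,t} - r_{f,t} &= \alpha_i + \beta_i^{M}\,\mathrm{MKT}_t + \beta_i^{S}\,\mathrm{SMB}_t
  + \beta_i^{V}\,\mathrm{HML}_t + \beta_i^{P}\,\mathrm{POP}_{i,t} + \varepsilon_{i,t} \\
\text{Elastic net:}\quad
\hat{\beta} &= \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda\left( \rho\,\lVert \beta \rVert_1 + \frac{1-\rho}{2}\,\lVert \beta \rVert_2^2 \right)
\end{align}
```

Under this formulation, both models are fitted with the same $\lambda$ and $\rho$ selection procedure, and the comparison reduces to whether the augmented model has the smaller out-of-sample prediction error.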

6. EXPECTED RESULTS

6.1 Results for One-Factor Model

In the one-factor model, we judge the impact of alternative data on different industries by the accuracy of the model in each industry. We expect stock returns to be more sensitive to the alternative data factor in some industries than in others. In some industries, stock returns fluctuate more when people search more for the company and its related topics; in others, stock returns remain relatively stable regardless of people’s interest in the company.

6.2 Results for Multi-Factor Model

In our modified multi-factor model, we consider both traditional factors and the alternative data factor. The result shows whether alternative data can improve the model’s predictive ability when the other factors are held the same. We expect predictive power to become stronger when alternative data is used as an additional factor, and to improve further as the alternative data used becomes more diverse.

7. CRITICAL DISCUSSION

First, a sound investment decision based on the alternative data we collect presupposes that the alternative data factor is closely correlated with the investment outcome. Whether the data from websites are unbiased therefore deserves emphasis: errors in the selected websites and data sources carry over into the alternative data.

Second, data can become biased during production, and data from different sources can lead to different results (Amenc et al., 2003). The reliability of alternative data therefore needs to be analyzed, and cross-validation against data from different sources should be considered.

Third, the alternative data in this paper may combine seemingly related factors without explaining the underlying logic of events. Addressing this requires linking quantitative analysis with macro analysis, so that quantitative and qualitative analyses complement each other.

Fourth, alternative data drawn from a wide range of sources may not be representative, and indiscriminately adding more alternative data can weaken predictive ability. When selecting data, we should therefore reason about whether each alternative dataset is sensible and useful for investment, rather than including all available data as independent variables at once.

Last, the stability of alternative data is a major factor affecting its use. Changes in data format, API revisions, and even changes of data source can make the use of alternative data unsustainable. Fortunately, many data providers have recognized the importance and commercial value of data and have begun to establish data service platforms (Belissent, 2017), which will facilitate future studies on alternative data.

8. CONCLUSIONS

In this report, we developed a factor-modeling investment strategy for identifying attractive investments. Based on this strategy, we suggest using alternative data and people’s search interest to assist investment decision making. The results of the one-factor model are expected to reflect the impact of alternative data in different industries, and the multi-factor regression model is expected to show that alternative data can improve predictive power. Although issues remain in the acquisition and stability of alternative data, its application to improving the accuracy of stock return predictions deserves attention.

9. REFERENCES

Cong, L., Li, B. and Zhang, Q. (2019) Alternative Data for FinTech and Business Intelligence. SSRN Electronic Journal.

Warrender, C., Forrest, S. and Pearlmutter, B. (1999) Detecting intrusions using system calls: Alternative data models. In Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No. 99CB36344) (pp. 133-145). IEEE.

Mittal, P. (2020) Automatic Classification of Retinal Pathology in Optical Coherence Tomography Scan Images Using Convolutional Neural Network. Journal of Advanced Research in Dynamical and Control Systems, 12(SP3), pp.936–942.

Green, T.C., Huang, R., Wen, Q. and Zhou, D. (2019) Crowdsourced employer reviews and stock returns. Journal of Financial Economics, 134(1), pp.236–251.

Jagtiani, J. and Lemieux, C. (2019) The roles of alternative data and machine learning in fintech lending: evidence from the LendingClub consumer platform. Financial Management, 48(4), pp.1009-1029.

Da, Z., Huang, X. and Jin, L. (2021) Extrapolative beliefs in the cross-section: What can we learn from the crowds? Journal of Financial Economics, forthcoming.

Bollen, J., Mao, H. and Zeng, X. (2011) Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp.1–8.

Monk, A.H.B., Prins, M. and Rook, D. (2018) Rethinking Alternative Data in Institutional Investment. SSRN Electronic Journal. [online] Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3193805.

Kung, S.Y. (2014) Kernel methods and machine learning. Cambridge ; New York: Cambridge University Press.

Kolanovic, M. and Krishnamachari, R.T. (2017) Big data and AI strategies: Machine learning and alternative data approach to investing. JP Morgan Global Quantitative & Derivatives Strategy Report.

Belissent, J. (2017) The Age of Alt: Data Commercialization Brings Alternative Data To Market. [online] Forrester. Available at: https://www.forrester.com/blogs/17-06-22-the_age_of_alt_data_commercialization_brings_alternative_data_to_market/ (Accessed 4 Jan. 2022).

Pérez, A., García de los Salmones, M. del M. and López-Gutiérrez, C. (2020) Market reactions to CSR news in different industries. Corporate Communications: An International Journal, 25(2), pp.243–261.

Youhe, K. (有贺康顕), Zhongshan, X., Xi, L. and Liu, J. (2018) Ji qi xue xi ying yong xi tong she ji [Machine learning application system design]. Beijing: China Electric Power Press (Zhong Guo Dian Li Chu Ban She).

Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep learning. Cambridge, Massachusetts: The MIT Press.

De Mol, C., De Vito, E. and Rosasco, L. (2009) Elastic-net regularization in learning theory. Journal of Complexity, 25(2), pp.201-230.

Monk, A., Prins, M. and Rook, D. (2019) Rethinking alternative data in institutional investment. The Journal of Financial Data Science, 1(1), pp.14-31.

Choi, H. and Varian, H. (2012) Predicting the Present with Google Trends. Economic Record, 88, pp.2–9.

Hasan, Md.M., Popp, J. and Oláh, J. (2020) Current landscape and influence of big data on finance. Journal of Big Data, [online] 7(1). Available at: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00291-z.

Julien, H. (2019) Basic biostatistics for medical and biomedical practitioners. London, United Kingdom: Academic Press.

Amenc, N., Martellini, L. and Vaissié, M. (2003) Benefits and risks of alternative investment strategies. Journal of Asset Management, 4(2), pp.96-118.