Content analysis has emerged as a useful tool for conducting corporate social responsibility (CSR) research. However, popular content analysis approaches alone are less effective for analyzing big data, which may include millions (or billions) of rows of text.
Today, corporate reports and sustainability disclosures are increasingly available in digital formats (e.g., the Global Reporting Initiative, or GRI, and CSRHUB). In addition, large volumes of CSR-related communication and information flow through social media platforms (e.g., Twitter, Facebook, and blogs).
As CSR-related data grows in volume, computational text analysis emerges as an effective tool for quickly processing, visualizing, and analyzing large CSR datasets. Computational text analysis uses natural language processing (NLP) techniques and advanced machine learning (ML) algorithms, which are often borrowed from computer science and statistics.
Supervised Machine Learning
Two types of machine learning-based approaches are considered for advanced content analysis: supervised and unsupervised. To use supervised ML-based text analysis, researchers need text corpora with known categories or labels, from which they build predictive models using algorithms such as Support Vector Machine (SVM) and Naïve Bayes.
Text classification (e.g., deciding whether a stakeholder’s Facebook post about a company’s CSR is positive or negative) is a popular application of supervised machine learning. The key challenge with the supervised approach is that known categories (that is, a training dataset) may not be readily available in the research data. Preparing a training dataset to build the predictive models takes a great deal of time, especially for big data analysis.
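To make the supervised workflow concrete, here is a minimal sketch of a Naïve Bayes text classifier written with only the Python standard library. The six labeled posts are invented for illustration, and the helper names (tokenize, train_naive_bayes, predict) are hypothetical; a real study would need a far larger hand-labeled training set and would more likely use an established ML library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def train_naive_bayes(labeled_posts):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_posts:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocabulary.add(tok)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocabulary:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical hand-labeled posts (assumed data, not from any real dataset).
training_posts = [
    ("Proud of their new recycling program", "positive"),
    ("Great support for local schools", "positive"),
    ("Love the clean energy commitment", "positive"),
    ("Their pollution record is terrible", "negative"),
    ("Greenwashing and empty promises again", "negative"),
    ("Terrible treatment of factory workers", "negative"),
]

model = train_naive_bayes(training_posts)
print(predict("Great recycling commitment", model))
print(predict("Terrible pollution and empty promises", model))
```

The labeling step is the expensive part: every row in training_posts stands for a post that a person had to read and categorize before the model could learn anything.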
Unsupervised Machine Learning
In contrast, unsupervised machine learning approaches do not need data with known categories (a training dataset). Unsupervised text analysis takes raw texts as input, preprocesses and transforms them through NLP techniques (e.g., removing stopwords, converting words to numbers) and, finally, attempts to discover categories or topics in the text data using advanced statistical algorithms.
This unsupervised approach can be scaled to large datasets and has proved attractive to social science researchers. For example, traditional clustering algorithms reveal similarities between documents or texts, and more recent algorithms, such as Latent Dirichlet Allocation (LDA) and Correlated Topic Model (CTM), uncover latent topics from large amounts of text data.
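The preprocessing steps mentioned above (removing stopwords and converting words to numbers) might look like the following sketch. The stopword list is a tiny illustrative one, the two sentences are made up, and the function names are hypothetical; the numeric form shown is a simple bag-of-words count vector, which is only one of several possible representations.

```python
from collections import Counter

# A tiny illustrative stopword list; real projects use much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
             "for", "our", "we"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def build_vocabulary(tokenized_docs):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({tok for doc in tokenized_docs for tok in doc})
    return {word: idx for idx, word in enumerate(words)}


def to_count_vector(tokens, vocabulary):
    """Turn a token list into word counts aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical raw texts (assumed data).
docs = [
    "We are reducing emissions in our supply chain.",
    "The supply chain audit is a priority for the board.",
]

tokenized = [preprocess(d) for d in docs]
vocabulary = build_vocabulary(tokenized)
matrix = [to_count_vector(toks, vocabulary) for toks in tokenized]

print(list(vocabulary))
for row in matrix:
    print(row)
```

Each row of matrix is one document and each column is one word, which is the kind of numeric input that clustering and topic modeling algorithms typically expect.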
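As a rough illustration of how similarity between documents can be measured without any labels, the sketch below computes cosine similarity on word counts. This is not a full clustering algorithm, only the kind of pairwise similarity score that many clustering methods build on; the three posts and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase a text and count its alphabetic tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return Counter(cleaned.split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical unlabeled posts (assumed data).
posts = [
    "Solar energy cuts carbon emissions",
    "Carbon emissions fall with solar energy",
    "Volunteers donate food to local schools",
]

vectors = [term_counts(p) for p in posts]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])

print(f"post 0 vs post 1: {sim_01:.2f}")
print(f"post 0 vs post 2: {sim_02:.2f}")
```

The first two posts share most of their words and score high, while the third shares none and scores zero, so a clustering algorithm working from these scores would tend to group the first two together.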
Topic Modeling
Topic modeling is an unsupervised machine learning-based content analysis technique that automatically discovers latent structure in large text corpora. In topic modeling, a document is treated as a collection of words that mixes multiple topics in different proportions. For example, a social media post may be largely about natural environments, while also touching on health and supply chain (i.e., 70% natural environments, 20% health, and 10% supply chain).
A growing number of social enterprise systems rely on machine learning algorithms, and this is just the beginning of adapting machine learning techniques for CSR.
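The mixture idea can be sketched numerically. In the example below, the 70/20/10 proportions come from the post described above, while the three topic-word distributions are invented for illustration. The code only shows how a document's word probabilities follow from its topic proportions; it does not perform the inference step that algorithms such as LDA use to estimate topics and proportions from raw text.

```python
# Hypothetical topic-word distributions (assumed values; each sums to 1).
topics = {
    "natural environments": {"forest": 0.5, "water": 0.4, "safety": 0.1},
    "health": {"clinic": 0.6, "safety": 0.3, "water": 0.1},
    "supply chain": {"supplier": 0.7, "audit": 0.3},
}

# Topic proportions for one post, matching the 70/20/10 example.
doc_topic_proportions = {
    "natural environments": 0.7,
    "health": 0.2,
    "supply chain": 0.1,
}


def word_probabilities(proportions, topic_word):
    """Mix the topic-word distributions using the document's proportions."""
    probs = {}
    for topic, weight in proportions.items():
        for word, p in topic_word[topic].items():
            # A word can belong to several topics, so contributions add up.
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs


probs = word_probabilities(doc_topic_proportions, topics)
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

Words such as "water" and "safety" receive probability from more than one topic, which is why a single post can be partly about several themes at once.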
This article is part of a series on ‘Artificial Intelligence for Social Good and CSR’