The Language of Brands in Social Media

Methodology

We base our approach to identifying topics of conversation on topic modeling using LDA, a generative probabilistic model that clusters words into groups (namely, "topics") on the basis of the other words to which they relate. This results in groups of words that tend to have similar or related meanings. More specifically, according to LDA, each document contains a mixture of a small set of topics, and each (observed) word within the document is attributable to one of the (unobserved) topics. The topics underlying these words are latent dimensions. With this model, the researcher can infer topics (words along with their probability of belonging to a given topic) on the basis of the statistical distributions of words across documents. The words in a topic constitute the topic definition: the model posteriors that indicate the probability of a topic given a word, p(topic|word).
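The inference step described above can be sketched on a toy corpus. The snippet below uses scikit-learn's LDA implementation as an illustrative stand-in (the analysis in this article uses MALLET, not scikit-learn; the documents and topic count here are invented for illustration). Each document is reduced to word counts, and LDA recovers a per-document topic mixture.

```python
# Minimal LDA sketch (illustrative stand-in; the article's analysis uses MALLET).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "great coffee and friendly service at the cafe",
    "the cafe serves great espresso and coffee",
    "new phone has a fast processor and bright screen",
    "the phone screen and battery are excellent",
]

# Represent each document as word counts, dropping common stop words.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA with two latent topics on the toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is one document's inferred mixture over the two topics; rows sum to 1.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2)
```

The `doc_topics` matrix is the document-level analogue of the topic probabilities that the framework later averages up to the brand level.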

Beyond topic modeling, our interest is in examining how different topics correlate with different brands, and we use two approaches to construct these relationships. The first assesses average topic probabilities (averaged at the brand level); we propose various applications stemming from this approach. The second involves a significance testing framework that identifies the significantly positively and negatively correlated topics for each brand; we use the term "differential language analysis" to refer to this approach. A typical application of differential language analysis is correlating language usage in UGC with known characteristics of the people posting, such as age, gender, or personality, with the ultimate goal of predicting these characteristics on the basis of language. For example, leveraging distinctive words, phrases, and topics, computational linguists can predict gender, age, personality, emotion, sentiment, and flirting. Monroe, Colaresi, and Quinn examine how political conflicts can be understood through the language used by Republicans and Democrats. In the present context, we use differential language analysis to identify the topics that most distinguish brands and brand characteristics; this in turn forms the basis for identifying the set of topics uniquely related to each brand.

The framework used here can be described in distinct stages, namely (1) preprocessing and tokenization of the data, (2) generating and extracting topics using LDA, and (3) linking topics and brands using both average topic probabilities and significance tests based on logistic regressions. We briefly outline each of these stages next, as they pertain to our empirical data analysis. Figure 1 provides a graphical summary of the key steps in this approach.

Figure 1. Overview of framework for mapping online brand image using differential language analysis of social media conversations.


Preprocessing and Tokenization

Preprocessing is a key step before topic modeling. Typically, this involves extracting words or phrases (sequences of one to three words that occur together more often than by chance). In our case, given that we examined Twitter data (restricted to 140 characters per message), we elected to focus primarily on 1-grams (single words) rather than phrases. We used Schwartz et al.'s social media tokenizer, which appropriately handles extra-linguistic social media content such as hashtags, emoticons, and "at-mentions" of other users.
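To make the tokenization step concrete, the following is a simplified, illustrative stand-in for a social-media-aware tokenizer (it is not the Schwartz et al. implementation): ordinary word splitting would break hashtags, at-mentions, and emoticons apart, so the pattern matches those units first.

```python
import re

# Simplified social media tokenizer (illustrative; not the Schwartz et al. tool).
# Emoticons, hashtags, and at-mentions are matched before plain words so they
# survive as single tokens.
TOKEN_RE = re.compile(
    r"""
    [:;=][-o]?[)(DPp]      # emoticons such as :) ;-) =D
    | \#\w+                # hashtags
    | @\w+                 # at-mentions
    | \w+(?:'\w+)?         # words, allowing simple contractions
    """,
    re.VERBOSE,
)

def tokenize(message: str) -> list:
    """Lowercase the message and return its tokens in order."""
    return TOKEN_RE.findall(message.lower())

print(tokenize("Loving my new #latte at @CafeCo :)"))
# → ['loving', 'my', 'new', '#latte', 'at', '@cafeco', ':)']
```

A generic word tokenizer would have split `#latte` into `#` and `latte` and discarded `:)`, losing exactly the extra-linguistic content that matters in social media text.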

In the second step, we calculate each brand's use of a topic, or the probability of a topic's usage at the brand level. We follow Griffiths and Steyvers, who use a generative model and envision each document as produced by choosing a distribution over topics. Equivalently, each document can be viewed as having a latent structure consisting of a set of topics, and LDA uncovers the topics within each document (Equations 5 and 7 of Griffiths and Steyvers highlight this use of LDA).

In our setting, each message pertaining to a brand can be viewed as a mixture of topics. We derive the topic probabilities for each message, and averaging these probabilities across all of a brand's messages yields the average brand–topic probability. To some extent, this conceptual treatment of brand–topic relationships is akin to author–topic models, where brands (instead of authors) are linked to various topics.
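The averaging step can be sketched in a few lines (toy numbers; in the actual analysis the message-level rows come from LDA run over brand messages):

```python
import numpy as np

# Message-level topic probabilities (rows: messages; columns: topics).
# Toy values standing in for LDA output.
message_topics = np.array([
    [0.7, 0.2, 0.1],   # message 1, brand A
    [0.5, 0.4, 0.1],   # message 2, brand A
    [0.1, 0.1, 0.8],   # message 3, brand B
])
brands = np.array(["A", "A", "B"])

# Average topic probability per brand: the brand-topic probability.
brand_topic = {b: message_topics[brands == b].mean(axis=0)
               for b in np.unique(brands)}
print(brand_topic["A"])  # approximately [0.6, 0.3, 0.1]
```

Each brand's vector is simply the mean of its messages' topic mixtures, so brands discussed across many messages accumulate a stable topic profile.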

We also use a second approach to assess the brand–topic relationships by estimating the significant positive and negative topic correlates of each brand. To do this, we conduct significance testing to identify which topics significantly predict that a message pertains to a given brand through a series of logistic regressions, where the outcome of interest is whether a given message pertains to a particular brand name (vs. not) and the message-level topic probability serves as the independent variable, as summarized next:

\operatorname{Pr}(\text{is brand } x)=\frac{e^{\beta_0+\beta_1\left(\text{topic}_k\right)}}{1+e^{\beta_0+\beta_1\left(\text{topic}_k\right)}}

(1)
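One such regression can be sketched as follows on synthetic data (the coefficients and sample are invented for illustration; scikit-learn with a very weak penalty stands in for an unpenalized fit, and the full analysis additionally requires p-values, which the toolkit described below supplies):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of Equation 1: the probability of topic k in a message predicts
# whether the message pertains to brand x. Data here are synthetic.
rng = np.random.default_rng(0)

n = 500
topic_k = rng.uniform(0, 1, n)                        # message-level topic probability
true_logit = -1.0 + 3.0 * topic_k                     # assumed beta_0 and beta_1
is_brand_x = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Very large C makes the fit approximately unpenalized.
model = LogisticRegression(C=1e6).fit(topic_k.reshape(-1, 1), is_brand_x)
print(model.coef_[0][0])  # estimated beta_1; positive means topic k predicts brand x
```

In the full framework, one regression of this form is fit for every brand–topic pair, and the sign and significance of each estimated slope determine the positively and negatively correlated topics.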

We use LDA outputs from MALLET as well as the Schwartz et al. differential language analysis toolkit to assess differences in linguistic features across groups or factors. A key advantage of the Schwartz et al. toolkit is that it applies the Benjamini and Hochberg correction for false discovery rates by default. Given the large number of regressions needed to estimate the brand–topic relationships, this correction is a critical step in ensuring the validity of the significance tests.
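The Benjamini–Hochberg step-up procedure that the toolkit applies can be sketched directly (a minimal implementation for illustration; the p-values below are invented):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at false discovery rate q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * q, then reject hypotheses 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected

# Toy p-values from five hypothetical brand-topic regressions.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60]))
# → [ True  True False False False]
```

Note that 0.039 and 0.041 would pass an uncorrected 0.05 threshold but fail here: with thousands of brand–topic regressions, many such marginal "discoveries" would be false positives, which is why the correction matters.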