The Language of Brands in Social Media

Brand managers have only recently started using social media listening to gain insight into consumer brand perceptions. Consequently, the impact of social media listening on strategic brand decision-making is not yet well understood. This research demonstrates the effectiveness of social media listening (consumers' own "language") in describing brand image. We compare existing research on social media data collection with the social media conversations tracked in Figure 1 of this article and use this comparison to evaluate the power of social listening as a consumer insight tool.

Model Estimation and Findings

The data set contains 134,953 tweets about specific brands (i.e., including a brand's Twitter handle), drawn from a larger corpus of 1.2 million tweets collected over a three-week period beginning in mid-February 2015 using the Twitter application programming interface (API). We identified messages that referenced brand names and explicitly included brand Twitter handles. Our data set of messages referenced 193 brand names in a range of categories, including airlines, banks, beer, health and beauty, beverages, clothing, coffee, computers, consumer/household goods, credit cards, online and brick-and-mortar retailers, entertainment, fast food/restaurants, food, grocery, hotels, insurance, shoes, television shows, soft drinks, sports, telecommunications, and travel. An initial frequency analysis of the 1.2 million tweets demonstrated substantial differences in the total number of messages across brands. This situation poses a challenge for topic modeling because the topic model we estimate is likely to be dominated by a few brands with a large volume of tweets. Therefore, we used random samples of 2,000 tweets per brand, which we then subjected to the series of preprocessing steps described subsequently.
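A minimal sketch of this per-brand downsampling, assuming the corpus has been loaded into a pandas DataFrame with hypothetical "brand" and "text" columns (the column names and file name are illustrative, not taken from the study's implementation), is as follows:

```python
# Minimal sketch: cap each brand at 2,000 randomly sampled tweets so that
# high-volume brands do not dominate the topic model.
# The column names "brand" and "text" are illustrative assumptions.
import pandas as pd

def sample_per_brand(tweets: pd.DataFrame, n: int = 2000, seed: int = 42) -> pd.DataFrame:
    """Draw up to n random tweets per brand."""
    return (
        tweets.groupby("brand", group_keys=False)
              .apply(lambda g: g.sample(n=min(n, len(g)), random_state=seed))
    )

# Example usage (hypothetical file name):
# balanced = sample_per_brand(pd.read_csv("brand_tweets.csv"))
```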

To reduce noise in the data (i.e., to ensure that a tweet was actually about a particular brand), we restricted our focus to tweets that listed a defined Twitter handle pertaining to the brand. We also considered an alternative approach that involved searching for tweets containing the brand name itself. However, this process also yielded messages in which certain brand names that double as common English words (e.g., "visa," "tide") were used in senses unrelated to the brand. Identifying the brand-related tweets would then involve manually combing through vast numbers of social media messages or separately developing a brand disambiguator. Instead, our approach of using the Twitter handle accurately identified brand-related tweets while minimizing the need for a human coder to comb through the social media posts to remove unrelated posts. Thus, we believe our approach is justifiable because it scales to new or larger data sets.

Our results are based on brand name handles contained within the tweets, but the topic models excluded the actual brand names and handles themselves, to concentrate on the context in which the brand appeared and not to have the name itself skew the results (i.e., putting all brands on equal footing). After the preprocessing steps, the average sample size per brand was around 700 tweets (SD = 286). We then estimated topic models based on this overall sample (N = 134,953).


Data Processing

Constraints on the number of tweets per user

Twitter bots and trolls can skew results, and identifying which posts originated from legitimate users and which were posted by trolls or bots is difficult. One potential way to pinpoint these accounts is to identify Twitter users who have a high volume of posts within a certain period. Because our data set was already constrained to a three-week period, we removed all tweets beyond the tenth tweet in the data set for any given user, as sketched below. Although this removal may not eliminate bots entirely, it substantially mitigates the issue without completely removing legitimate users who post frequently (i.e., we compromised between Type I and Type II errors in identifying bots).
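A minimal sketch of this per-user cap, assuming a pandas DataFrame with hypothetical "user_id" and "created_at" columns, is as follows:

```python
# Minimal sketch: keep at most ten tweets per user (in chronological order)
# to limit the influence of bots and trolls. Column names are illustrative.
import pandas as pd

def cap_tweets_per_user(tweets: pd.DataFrame, max_per_user: int = 10) -> pd.DataFrame:
    """Retain the first max_per_user tweets for each user and drop the rest."""
    ordered = tweets.sort_values(["user_id", "created_at"])
    return ordered.groupby("user_id", group_keys=False).head(max_per_user)
```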


Filtering tweets with special characters or URLs and sponsored tweets

We removed 1-grams consisting of special characters (e.g., "%@%") and tweets that included URLs (matching "%http%" or "%<URL>%"). We also removed any sponsored tweets or promotional messages originating from the brand (i.e., messages indicating that the tweet is sponsored or is an ad).
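A sketch of this filtering step is shown below; the URL pattern and the sponsored-content markers are illustrative assumptions rather than the exact rules used in the study:

```python
# Minimal sketch: drop tweets containing URLs or sponsored/promotional markers,
# and drop 1-grams made up entirely of special characters (e.g., "%@%").
import re

URL_PATTERN = re.compile(r"https?://\S+|<URL>", re.IGNORECASE)
SPONSORED_MARKERS = ("#ad", "#sponsored", "promoted")  # assumed markers

def keep_tweet(text: str) -> bool:
    """Return False for tweets that contain URLs or sponsored markers."""
    if URL_PATTERN.search(text):
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in SPONSORED_MARKERS)

def clean_tokens(tokens: list[str]) -> list[str]:
    """Keep only 1-grams that contain at least one alphanumeric character."""
    return [t for t in tokens if re.search(r"[a-z0-9]", t, re.IGNORECASE)]
```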


Removing stop words

Stop words are commonly occurring words such as "the," "is," "at," "which," and "on." Because our concern was with content rather than style, and consistent with standard approaches to LDA, we removed these words from the data before extracting topics.
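For example, the removal can rely on a standard stop-word list such as NLTK's English list (an illustrative choice; any comparable list could be substituted):

```python
# Minimal sketch: remove common function words before topic extraction.
# Requires nltk.download("stopwords") once.
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop common stop words (e.g., 'the', 'is', 'at') before topic extraction."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```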


Removing brand names

We removed the actual brand names present in the tweets, as these names have no direct relevance to our assessment of topics. To do this, we searched for all instances of the brand names in the tweets identified for each brand and removed these words. The removal of brand names also has an important empirical rationale. Because brand names are the criteria for tweet collection, they appear extremely frequently in the data, do not fall within the Dirichlet distributions assumed by the LDA model, and essentially become outliers in the distribution of word counts. To demonstrate this further, we ran the model again including brand names and found that the resulting topics were less meaningful. For brevity, we do not present these results, but they are available on request.
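A sketch of this step, assuming a hypothetical set of name variants and handles for each brand, is as follows:

```python
# Minimal sketch: strip brand names and handles from a tweet's tokens so the
# brand terms themselves cannot dominate the estimated topics.
def remove_brand_terms(tokens: list[str], brand_terms: set[str]) -> list[str]:
    """Drop tokens that match any known brand name or handle (case-insensitive)."""
    lowered_terms = {term.lower() for term in brand_terms}
    return [t for t in tokens if t.lower() not in lowered_terms]

# Example usage with hypothetical brand terms:
# tokens = remove_brand_terms(tokens, {"delta", "@delta"})
```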


Creating a topic file

After filtering out the brand names, we focused on extracting topics from a pooled sample consisting of all brands. Because LDA requires the number of topics to be specified in advance, we estimated multiple topic models with 25, 50, 100, 200, 250, 500, 1,000, and 2,000 topics. The 100-topic model generated the best solution, based on perplexity, coherence scores, and a qualitative assessment of the topics generated.
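A sketch of this model comparison, using the gensim library as an illustrative implementation (the article's description is tool-agnostic), is as follows:

```python
# Minimal sketch: fit one LDA model per candidate number of topics on the
# pooled, preprocessed corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

TOPIC_COUNTS = (25, 50, 100, 200, 250, 500, 1000, 2000)

def fit_topic_models(token_lists, topic_counts=TOPIC_COUNTS):
    """token_lists: list of token lists, one per tweet, after preprocessing."""
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    models = {
        k: LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                    passes=5, random_state=42)
        for k in topic_counts
    }
    return dictionary, corpus, models
```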


Model selection using perplexity and coherence

To identify the optimal number of topics, we calculated perplexity and coherence scores across a range of models estimated with different numbers of topics (25, 50, 100, 200, 250, 500, 1,000, and 2,000). We calculated the perplexity score using the following formula:

\mathrm{PP}(\mathrm{W})=\mathrm{P}\left(\mathrm{w}_1, \mathrm{w}_2, \ldots, \mathrm{w}_{\mathrm{N}}\right)^{-1 / \mathrm{N}}=\sqrt[\mathrm{N}]{\frac{1}{\mathrm{P}\left(\mathrm{w}_1, \mathrm{w}_2, \ldots, \mathrm{w}_{\mathrm{N}}\right)}} \quad (2)

Alternatively, this specification can be rewritten as PP(W) = 2^{-ℓ}, where ℓ is the average log (base 2) probability of the corpus.

To calculate this, we first estimated a topic model by specifying a number of topics (e.g., 100, 200). We then read all the messages to calculate N (the total corpus size, or number of messages). From this topic model estimation, we outputted the probabilities associated with each message and computed log[p(w_1, w_2, ..., w_N)] = log[p(w_1) × p(w_2) × ... × p(w_N)] = log[p(w_1)] + log[p(w_2)] + ... + log[p(w_N)], where N is the number of messages and w_1 is the first message.
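A sketch of this calculation, using gensim's per-word likelihood bound as an illustrative implementation, is as follows:

```python
# Minimal sketch: gensim reports a per-word log (base 2) likelihood bound, so
# perplexity follows directly as PP(W) = 2^(-bound), matching Equation 2.
import numpy as np

def perplexity(model, corpus) -> float:
    """Return the perplexity of a fitted gensim LdaModel on a corpus."""
    bound = model.log_perplexity(corpus)  # average log2 likelihood per word
    return float(np.power(2.0, -bound))

# Example usage:
# scores = {k: perplexity(m, corpus) for k, m in models.items()}
```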

In general, lower perplexity scores indicate a better model. We provide the perplexity scores across the range of topic models in Figure WA1 of the Web Appendix. Examining these scores, the 200-topic model has the lowest perplexity (18.0), identical to that of the 250-topic model (18.0), with the 100-topic model providing a similar, though slightly higher, score (18.3). Because the 100-topic model is a more parsimonious representation without a meaningful trade-off in perplexity, we chose the 100-topic model based on this analysis.

A second metric we examined is the coherence score, or topic coherence, which is based on the semantic coherence of the words within a topic. Research has found ways to automatically measure topic coherence with humanlike accuracy using scores based on pointwise mutual information. We calculated the average coherence score of the topic models across the various numbers of topics (see Web Appendix Figure WA2), considering two measures of topic coherence: C_v and C_umass. Based on C_umass, the 50-topic model had the best topic coherence (closest to zero, at –980.6), followed by the 100-topic model (–996.7). Optimizing across the two metrics of perplexity and coherence required some trade-offs; we deemed the 100-topic model to offer the best combination of the two.
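A sketch of the coherence comparison, again using gensim as an illustrative implementation, is as follows:

```python
# Minimal sketch: compute C_v and C_umass coherence for each fitted model.
from gensim.models import CoherenceModel

def coherence_scores(model, token_lists, corpus, dictionary):
    """Return (C_v, C_umass) coherence for a fitted LDA model."""
    c_v = CoherenceModel(model=model, texts=token_lists, dictionary=dictionary,
                         coherence="c_v").get_coherence()
    c_umass = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary,
                             coherence="u_mass").get_coherence()
    return c_v, c_umass

# Example usage:
# coherence_by_k = {k: coherence_scores(m, token_lists, corpus, dictionary)
#                   for k, m in models.items()}
```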


Visualizing the data

We used word clouds to visualize the topics. Clusters of semantically related words form topics, and each brand is associated with the topics to varying degrees. Within the word cloud, the size of each word corresponds to its prevalence within the topic (i.e., p(word|topic)).
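A sketch of this visualization, using the wordcloud package as an illustrative choice, is as follows:

```python
# Minimal sketch: size each word in proportion to p(word | topic).
from wordcloud import WordCloud

def topic_word_cloud(model, topic_id: int, top_n: int = 50) -> WordCloud:
    """Build a word cloud for one topic of a fitted gensim LdaModel."""
    weights = dict(model.show_topic(topic_id, topn=top_n))  # {word: probability}
    return WordCloud(width=800, height=400,
                     background_color="white").generate_from_frequencies(weights)

# Example usage (hypothetical topic ID):
# topic_word_cloud(models[100], topic_id=17).to_file("topic_17.png")
```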


Evaluating brand positioning using topic probabilities

A straightforward approach to assessing the relationships between topics and brands is to use the topic probabilities averaged at the brand level. We conduct a correlation analysis of the various brands using the average topic probability across the 100 topics. We specify these correlations as one approach that managers can use to identify brands that are similar or dissimilar on the basis of social media topics.
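A sketch of this brand-level correlation analysis, assuming a document-by-topic probability matrix with a hypothetical "brand" column, is as follows:

```python
# Minimal sketch: average each brand's topic probabilities across its tweets,
# then correlate brands on these 100-dimensional topic profiles.
import pandas as pd

def brand_similarity(doc_topics: pd.DataFrame) -> pd.DataFrame:
    """doc_topics: one row per tweet, a 'brand' column, and one column per topic.
    Returns a brand-by-brand correlation matrix."""
    brand_profiles = doc_topics.groupby("brand").mean()  # brands x topics
    return brand_profiles.T.corr()                       # brands x brands
```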


Assessing lexical differences across brands using significance testing

One of the goals of our research is to use social media language and topics to predict the presence or absence of brand names in a conversation. After uncovering topics and connecting these topics with brands, we use Schwartz et al.'s software to conduct regressions of topic use across brands, via significance testing (see Appendix A for details on this step). We then use this analysis of positive and negative topic correlates of a given brand and describe applications for brand management derived from this analysis.
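As a simplified stand-in for this step (not Schwartz et al.'s software itself), the following sketch tests, for a focal brand, which topics are used significantly more or less in tweets mentioning that brand:

```python
# Simplified stand-in: point-biserial correlations between a binary brand
# indicator and each topic's usage, with a conservative Bonferroni correction.
import numpy as np
from scipy import stats

def topic_correlates(doc_topics: np.ndarray, is_brand: np.ndarray, alpha: float = 0.05):
    """doc_topics: (n_tweets, n_topics) topic probabilities.
    is_brand: 0/1 indicator that a tweet mentions the focal brand.
    Returns a list of (topic_id, correlation, p_value) for significant topics."""
    n_topics = doc_topics.shape[1]
    results = []
    for k in range(n_topics):
        r, p = stats.pointbiserialr(is_brand, doc_topics[:, k])
        results.append((k, float(r), float(p)))
    return [(k, r, p) for k, r, p in results if p < alpha / n_topics]
```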


Results

Given our open-vocabulary approach, we present words associated with each of the 100 topic IDs. We do not focus on labeling these topics in this article, though there are approaches available to do so using the incidence of words within a topic. Rather, our focus is on using these topics of conversation as a way to identify brand similarities and differences. However, a few general observations are worth noting regarding the types of topics described:

  1. Positive or negative incidents: Some of the topics identified described a news story or feature, some of which were positive (e.g., topic 17) and others negative (e.g., topic 21).
  2. Activities and lifestyles: Topics were also identified by different activities and lifestyles, including music and entertainment (topic 82) and holidays/special occasions (topic 73).