Developing Insights from Social Media

While formal market research has historically been the main source of consumer-centric information, social listening is now often considered the most insightful. Review the model in Figure 1, which shows the schematic of social listening, and its application to the social media platform Twitter. List the advantages and disadvantages of social listening for strategic insight, paying particular attention to the error factors discovered in this study.

Twitter is a social media sharing site considered a snapshot of consumer and industry sentiment. Review Figure 1 in this article, then identify how the collective opinion fits into the signal of opportunity. Pay attention to the secondary retweets that measure the interest level of those who have chosen to follow the tweet trails.

Discovering social trends in a target audience

Methodology

We present the details of how to discover a target audience of Twitter users and their collective voice from raw Twitter data. First, in order to identify candidate users that meet certain criteria, we explore available Twitter resources for data collection and existing approaches to user profiling. Next, we discuss enriching user profiles utilizing hashtags in the tweets posted by the target users. Lastly, we present developing topical and social insights from the collective voice of the target users.

Before we go into details, we first present the formal model of the data space that we analyze in this paper. Our Twitter data space can be denoted as $\mathcal{U} \times \mathcal{T} \times \mathcal{H}$, where $\mathcal{U}$ is a set of users on Twitter, $\mathcal{T}$ is a set of tweets created by the users, and $\mathcal{H}$ is a set of hashtags used in the tweets by the users. This implies that a user $u \in \mathcal{U}$ creates a tweet $t \in \mathcal{T}$ using a set of hashtags $\mathcal{H}_{u,t} \subset \mathcal{H}$.

User profiling is an essential component of our approach: it defines the user attributes needed for a study and populates the attribute values for each user. We define the profile of a Twitter user u∈U as a set of tuples, each consisting of an attribute and its value, where the value p(u, a) of an attribute a∈A for user u is computed by a user profiling function p, as in Eq. (1):

\begin{equation} P_{u}=\{(a,p(u,a)) \mid a\in A, u\in U\}, \tag{1} \end{equation}

where A is a set of user attributes. Determining the user profiling function p for each user attribute is the goal of the user profiling phase.
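One way to read Eq. (1) in code is as a mapping from each attribute a to its value p(u, a). The following is a minimal sketch under that reading, not the authors' implementation: the attribute names and the three profiling functions in `PROFILERS` are hypothetical, and the input is assumed to be a parsed Twitter User object held as a dictionary.

```python
from typing import Any, Callable, Dict

# A raw user record, assumed to be a parsed Twitter User object.
User = Dict[str, Any]

# Hypothetical profiling functions p(u, a), one per attribute a in A.
PROFILERS: Dict[str, Callable[[User], Any]] = {
    "location": lambda u: u.get("location") or None,
    "popularity": lambda u: u.get("followers_count", 0),
    "sociability": lambda u: u.get("friends_count", 0),
}


def build_profile(user: User) -> Dict[str, Any]:
    """Return P_u as a mapping {a: p(u, a)} over all attributes a in A."""
    return {attr: p(user) for attr, p in PROFILERS.items()}


example_user = {"location": "Sweden", "followers_count": 1200, "friends_count": 300}
profile = build_profile(example_user)
```

Attributes that cannot be read directly from a field would need a more elaborate function in place of the simple lookups used here.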

Fig. 1 The flow map of our unified scheme for developing social insights from the collective voice of target users

Figure 1 illustrates the flow of our unified scheme for developing social insights from the collective voice of target users. First, in the user profiling stage, attributes of Twitter users, such as demographic and other personal attributes, are identified. When some user attributes are missing because no suitable data or method is available, researchers can develop a customized solution for the specific user profiling task, for example a supervised machine learning model that uses hashtags as prediction features. Second, once the user profiling phase is completed, researchers select only the users of interest based on the identified user attributes. Finally, researchers proceed to develop topical and social insights from the collective voice of these target users.


User profiling

In general, sampling Twitter users is less common than sampling tweets because the Twitter API offers limited functionality for collecting users. For this reason, we begin with a large pool of random tweets, which are much easier to collect via the Twitter API, as mentioned in the "Introduction" section. Each collected tweet contains author information describing the user who created it. Some attributes of the users in the pool are already known or can be easily acquired, while others must be inferred or are difficult or even impossible to identify. It is worth noting that the raw user data collected via the Twitter API already provides useful information about users. Table 2 lists native Twitter objects and their fields along with the user attributes that can be derived from them. The Twitter API provides several types of objects encoded in JavaScript Object Notation (JSON), of which the User and Tweet objects are the most useful for user profiling.

Table 2 Summary of the user attributes derivable from native Twitter objects

| Object | Field | Description | Derivable user attributes |
| --- | --- | --- | --- |
| User | name | Name of the user | Name, gender, age, race/ethnicity |
| User | location | User-defined location for the account's profile | Location |
| User | url | URL provided by the user in association with their profile | Web site, blog, or other social media accounts |
| User | description | User-defined description of their account | Demographics, expertise, hobbies, interests, personality traits, political orientation |
| User | verified | Whether Twitter has verified that the account of public interest is authentic | Popularity |
| User | followers_count | Number of users following the account | Popularity |
| User | friends_count | Number of users the account is following | Sociability |
| User | listed_count | Number of public lists that the user is a member of | Popularity |
| User | favourites_count | Number of tweets the user has liked in the account's lifetime | Posting activeness |
| User | statuses_count | Number of tweets (including retweets) issued by the user | Posting activeness |
| User | created_at | UTC datetime that the user account was created on Twitter | Account age |
| User | profile_image_url_https | HTTPS-based URL pointing to the user's profile image | Gender, age, race/ethnicity |
| User | followers* | List of users following the account | Network |
| User | friends* | List of users the account is following | Network |
| Tweet | created_at | UTC time when the tweet was created | Behavior |
| Tweet | text | Actual text of the status update | Demographics, expertise, interests, personality traits, political orientation |
| Tweet | coordinates | Geographic location of the tweet as longitude and latitude coordinates | Location, behavior |
| Tweet | place | Known place as city, state, or country | Location, behavior |
| Tweet | reply_count | Number of times the tweet has been replied to | Popularity |
| Tweet | retweet_count | Number of times the tweet has been retweeted by other users | Popularity |
| Tweet | favorite_count | Number of times the tweet has been liked by other users | Popularity |
| Tweet | lang | Machine-detected language of the tweet | Language |
| Tweet | retweeted_status | Original tweet object if the tweet is a retweet | Typical tweet or retweet |


A User object, which describes an individual user on Twitter, has several fields that can be directly used as user attributes, such as name, location, and url, while the other fields can be analyzed to infer new attributes. For example, from the description field that has a user-defined description or bio of an account, one can infer many different types of user attributes, such as demographic attributes (e.g., age, education, gender, location, marital status, language, occupation, and race/ethnicity) and other personal attributes (e.g., expertise, hobbies, interests, personality traits, and political orientation), depending on the information included in the text of the field. A wide range of natural language processing (NLP) and text mining techniques can be applied to this field. The other fields in a User object can be good indicators of the account's popularity, sociability, or activeness. For example, the followers_count and the listed_count fields indicate how popular the account is, while the friends_count field indicates how sociable the account is. One may want to compare the followers_count to the friends_count, to see if there is a large or small gap between the two fields. For example, celebrities tend to have a very large number of followers but a smaller number of friends, whereas spam accounts or bot accounts tend to have many friends but few followers.

The favourites_count and the statuses_count fields can be used to measure how active the account is: the former counts the tweets the user has liked, and the latter counts the tweets (including retweets) the user has posted. The created_at field can be used to calculate the account age in days, months, or years, which can be combined with other fields for normalization. For example, users who have been on Twitter for ten years would probably have more followers or have posted more tweets than those who have just joined. In this case, one may need to divide the number of followers or the number of statuses by the account age, so that the indicators are comparable across users.
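The normalization by account age can be sketched as follows. This is an illustrative example, not part of the original study: it assumes that created_at arrives as a string in the layout "Wed Oct 10 20:19:24 +0000 2018", that month and day names are parsed under an English locale, and the function names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed layout of created_at, e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def account_age_days(created_at: str, now: datetime) -> float:
    """Age of the account in days at the reference time `now`."""
    created = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return (now - created).total_seconds() / 86400.0


def per_day(count: int, created_at: str, now: datetime) -> float:
    """Divide a lifetime count by the account age (at least one day)."""
    return count / max(account_age_days(created_at, now), 1.0)


now = datetime(2020, 1, 1, tzinfo=timezone.utc)
veteran = {"followers_count": 3650, "created_at": "Fri Jan 01 00:00:00 +0000 2010"}
newcomer = {"followers_count": 365, "created_at": "Tue Jan 01 00:00:00 +0000 2019"}

veteran_rate = per_day(veteran["followers_count"], veteran["created_at"], now)
newcomer_rate = per_day(newcomer["followers_count"], newcomer["created_at"], now)
```

In this made-up example the raw follower counts differ tenfold, while the per-day rates are nearly equal, which is the effect the normalization is meant to expose.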

A profile image from the profile_image_url_https field can be used to infer the gender, age, or race/ethnicity of the user by applying image analysis techniques. The followers field contains the list of users following the account, while the friends field contains the list of users the account is following; together they describe the relationship network of the user. Note that these two fields, each marked with an asterisk in Table 2, are not actually fields of the User object: the Twitter API provides them separately. We nevertheless link them to the User object, as we believe they should also be treated as user attributes. The two fields provide direct information about who the followers and friends of a user are. The verified field is a unique feature of Twitter, which indicates whether Twitter has verified that an account of public interest is authentic. A verified account has a blue verified badge on Twitter. This can serve as another indicator of the user's popularity or authority.

A Tweet object describes an individual tweet posted by a user. An individual tweet cannot be used directly as a user attribute because it carries limited information. When aggregated, however, tweets can be a powerful source for understanding the user. While a Tweet object has a number of fields, the bottom half of Table 2 lists those that can be used to infer user attributes. The text field is the most important of these, as it provides the raw tweet text written by the user. It is worth noting that tweet text is limited to 280 characters (the limit was increased from 140 to 280 in 2017), which is why Twitter is called a micro-blogging service. The short text has its own pros and cons. In some cases, tweet text might be too short to convey meaningful information for analysis, while in other cases a single short tweet can carry enough information to understand the user. At the same time, the short format is what has encouraged people to post freely on Twitter. From a Big Data perspective, the more tweet text we have for a user, the better we can understand the user. The text field can be used to infer most of the demographic and personal attributes mentioned earlier. As with the description field of a User object, this field can benefit from text analysis techniques.

The created_at, coordinates, and place fields can bring a temporal or geo-spatial aspect to the study. While every tweet has a value in its created_at field, not all tweets have values in the coordinates and place fields; this depends on whether the user has activated location sharing in their application. As discussed earlier, only a small fraction of tweets are geo-tagged or geo-referenced. The three fields reply_count, retweet_count, and favorite_count are considered good indicators of the popularity of the tweet, which can also translate into the popularity of the user. The lang field gives the machine-detected language of a tweet; aggregated over a user's tweets, it indicates which language the user primarily uses or is able to use. It is also worth noting that users can retweet other users' tweets, and those retweets are counted as the user's tweets although they were originally created by others (users can also add their own comments to the original tweet when retweeting). If we analyze tweets to understand the user, however, those retweets may be of little help, because they were not originally written by the user. In this case, retweets can be identified by the presence of the retweeted_status field and excluded from the analysis, so that only the tweets written by the user are considered.
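As a rough sketch of these two steps, the snippet below drops retweets by checking for the retweeted_status field and then takes the most frequent lang value among the remaining tweets. The tweets are assumed to be dictionaries parsed from JSON; the function names and the sample timeline are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, List, Optional

Tweet = Dict[str, Any]


def original_tweets(tweets: List[Tweet]) -> List[Tweet]:
    """Keep only tweets that carry no retweeted_status, i.e. are not retweets."""
    return [t for t in tweets if not t.get("retweeted_status")]


def primary_language(tweets: List[Tweet]) -> Optional[str]:
    """Most frequent machine-detected language, or None if no tweet has one."""
    langs = Counter(t["lang"] for t in tweets if t.get("lang"))
    return langs.most_common(1)[0][0] if langs else None


timeline = [
    {"text": "Loving the new trailer!", "lang": "en"},
    {"text": "Fika dags", "lang": "sv"},
    {"text": "Back at my desk", "lang": "en"},
    {"text": "RT @someone: Grande nouvelle", "lang": "fr",
     "retweeted_status": {"text": "Grande nouvelle", "lang": "fr"}},
]

authored = original_tweets(timeline)
language = primary_language(authored)
```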

The Twitter objects and their associated fields listed in Table 2 provide insight into some heuristics for user profiling before attempting to apply advanced methodologies. In particular, the description field of a User object can be directly used to extract various user attributes like gender, location, occupation, and so on. The following description from a Twitter user account, which is open to the public, is a good example:

Senior Narrative Designer @UbiMassive - cats, books, games and scones - Brit in Sweden - opinions all mine - She/her.

This short bio tells much about the user, such as gender, occupation, hobbies, nationality, and location. The phrase “She/her” suggests that the user is female; she is a narrative designer at a game company; she likes cats, books, games, and scones; she is British; she lives in Sweden. While not all Twitter users describe themselves in such detail, it is apparent that the description field can serve as a primary source for understanding users. To extract the right information from the description text, string pattern matching with regular expressions can be employed.
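To illustrate how regular expressions could pull attributes out of such a bio, here is a small sketch applied to the example description. The three patterns and the output keys are assumptions chosen to fit this one bio; real descriptions vary widely and would need broader, validated rules.

```python
import re
from typing import Dict, Optional

# Illustrative patterns only, written for the example bio.
PRONOUN_RE = re.compile(r"\b(she/her|he/him|they/them)\b", re.IGNORECASE)
MENTION_RE = re.compile(r"@(\w+)")
LIVES_IN_RE = re.compile(r"\bin ([A-Z][a-z]+)\b")


def parse_bio(description: str) -> Dict[str, Optional[str]]:
    """Extract a few attributes from a bio; missing ones are returned as None."""
    pronoun = PRONOUN_RE.search(description)
    mention = MENTION_RE.search(description)
    place = LIVES_IN_RE.search(description)
    return {
        "pronouns": pronoun.group(1).lower() if pronoun else None,
        "mentioned_account": mention.group(1) if mention else None,
        "location": place.group(1) if place else None,
    }


bio = ("Senior Narrative Designer @UbiMassive - cats, books, games and scones"
       " - Brit in Sweden - opinions all mine - She/her.")
attributes = parse_bio(bio)
```

Mapping the extracted pronouns to a gender attribute, or the mentioned account to an employer, is a further inference step that this sketch deliberately leaves out.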

If the approaches relying on raw user attributes provided by Twitter are too simple for a research study, one should consider employing the advanced user profiling techniques listed in Table 1. As described in the "Related literature" section, previous works have explored different ways of profiling Twitter users. When applying these methodologies, note that different methodologies use different data for user profiling. For example, to identify the location of a user, some methodologies consider only tweet text, whereas others also use the follow relationships of users or tweet context. Note also that methodologies targeting the same user attribute do not always yield the same outcome, as each was designed to address its own research questions. Depending on the objectives of the study, a subset of the user attributes listed in Table 2 can be considered in user profiling. For the market research project example mentioned in the "Introduction" section, the researchers need to focus only on user attributes such as age, gender, and interest, and should examine which methodologies fit the data they currently have. Once user profiling has been performed over all users in the data pool, the researchers can select only the users that meet the criteria set for the study. This initial set of selected users can be analyzed further to arrive at the final set of target users.

Customized user profiling

If the user profiling task has properly populated all the user attributes needed, we can move on to selecting target users based on those attributes. In many cases, however, no suitable resources exist for some user attributes, leaving their values missing. This can happen when (1) there are no available resources at all, (2) the existing resources do not fit the data we have, or (3) the performance of the available resources is not satisfactory.

To resolve this issue, we propose to consider developing a customized solution to a specific user profiling task, especially if it is a supervised machine learning problem. For example, suppose we want to classify each Twitter user by their political orientation, i.e., conservative or liberal. While there are some available resources for political orientation classification, as listed in Table 1, one might find that those existing resources do not work well with the recent Twitter data. This leads us to consider developing our own political orientation classifier as long as we can make labeled data that can be used for training and testing machine learning models. Inspired by the observation that some Twitter users explicitly share their political orientation in their bio, we can collect a set of those users and label them as conservative or liberal. We then can use the labeled data as training data and test data for machine learning by selecting a set of features for prediction. Specifically, we propose to utilize hashtags as the features for political orientation prediction, based on the idea that conservatives and liberals are believed to be interested in different topics to some extent, thereby using somewhat different hashtags. Once a machine learning model is built, one can apply the model to populate the values in the target user attribute. While we cannot say that this approach would work for all user profiling tasks, we believe that it can work for supervised machine learning tasks, such as classification and regression, and that it can be a good complement to the existing user profiling solutions. We call this phase text-based customized user profiling, as opposed to the primary user profiling performed in the first phase, as this customized user profiling task can complement what is missing from the primary user profiling task.
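The labeling step, in which users who state their political orientation in their bio are collected as training data, might look like the sketch below. The cue words are hypothetical and deliberately naive: a real study would curate and validate them, and would need to handle negations such as "not a liberal", which this sketch does not.

```python
import re
from typing import Optional

# Hypothetical cue words; not taken from the original study.
LABEL_PATTERNS = {
    "conservative": re.compile(r"\b(conservative|republican)\b", re.IGNORECASE),
    "liberal": re.compile(r"\b(liberal|democrat)\b", re.IGNORECASE),
}


def label_from_bio(description: str) -> Optional[str]:
    """Return a label only when exactly one orientation is mentioned."""
    hits = [label for label, pattern in LABEL_PATTERNS.items()
            if pattern.search(description)]
    return hits[0] if len(hits) == 1 else None


bios = {
    "u1": "Proud conservative, dad of two",
    "u2": "Liberal. Coffee addict.",
    "u3": "Cats and books",
    "u4": "Neither liberal nor conservative",
}
# Keep only the users for whom a single, unambiguous label was found.
labeled = {uid: label for uid, bio in bios.items()
           if (label := label_from_bio(bio)) is not None}
```

Users whose bios mention both orientations or neither are left unlabeled rather than guessed, which keeps the training data smaller but cleaner.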

In order to utilize hashtags as features for prediction, we first need to collect the tweets posted by users and mine hashtags from them. The Twitter API allows researchers to retrieve up to the 3200 most recent tweets of a user account, as long as the account is set to public. Alternatively, one can consider web scraping to retrieve more than 3200 tweets from an account, although, unlike an API, this option does not provide easy access to the data in a structured form. While all words in tweets are meaningful in one way or another, we focus on hashtags in particular. A hashtag is a word prefixed with a hash (#) symbol, such as #metoo, #nowplaying, and #earthday. Hashtags were originally introduced on Twitter and have been used to index keywords or topics on social media, which allows users to easily follow topics of interest. The goal of a hashtag is thus to facilitate the search and aggregation of messages related to the same topic. With the wide adoption of hashtags, a number of studies have investigated their use on Twitter. Tsur et al. attempt to predict the spread of thoughts and ideas, called memes, using hashtags. Ferragina et al. address hashtag relatedness and classification.

One of the reasons why we focus on hashtags, instead of all words or phrases in tweets, is that they are easy to handle. Because users explicitly mark a hashtag with the hash symbol and a hashtag cannot contain spaces, hashtags are easy to extract from text and aggregate. In fact, the Twitter API provides the list of hashtags identified in a tweet as a Hashtag object, so API users do not have to extract hashtags themselves, which would otherwise require a text analysis technique such as regular expressions. The main drawback of using hashtags is their sparsity; as pointed out by Godin et al., not all tweets have hashtags and not all users use hashtags. Nevertheless, this sparsity can be overcome when a large number of hashtags are aggregated, mainly because a hashtag tends to be adopted by a significant number of users who want to join a virtual community interested in a certain topic.
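Both routes to the hashtags of a tweet can be sketched in a few lines. The example assumes that a parsed tweet exposes API-identified hashtags under `entities.hashtags`, each with a `text` key, as in the v1.1 JSON layout; if that structure is absent it falls back to a simple regular expression. Lowercasing the tags is a design choice of this sketch, made so that #EarthDay and #earthday are counted together.

```python
import re
from typing import Any, Dict, List

# Simple fallback pattern: a hash symbol followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtags_of(tweet: Dict[str, Any]) -> List[str]:
    """Return the lowercased hashtags of a tweet, without the # prefix."""
    entities = tweet.get("entities") or {}
    if "hashtags" in entities:
        # Assumed layout: a list of objects, each with a "text" key.
        return [h["text"].lower() for h in entities["hashtags"]]
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet.get("text", ""))]


from_text = hashtags_of({"text": "Happy #EarthDay everyone #nowplaying"})
from_entities = hashtags_of({
    "text": "Speaking up #MeToo",
    "entities": {"hashtags": [{"text": "MeToo"}]},
})
```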

Once all hashtags are extracted from tweets, they are aggregated such that the total frequency for each hashtag is calculated. Based on the hashtag frequency, one can have a hashtag popularity ranking sorted by frequency in descending order. This hashtag ranking can be a basis for researchers to manually select top-k popular hashtags that will be used as features for prediction, where k can be determined empirically. When top-k hashtags are selected as features, their frequencies are the values that should be put into the machine learning model. This way, one can build a model that is able to predict the value of a user attribute for a user. Building a machine learning model should always be followed by evaluating the model performance, using commonly used machine learning metrics.
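The aggregation and feature construction described above can be sketched with the standard library alone. The user names and hashtag lists are made up, and k = 2 is chosen only to keep the example small; in practice k would be determined empirically, as noted in the text.

```python
from collections import Counter
from typing import Dict, List


def top_k_hashtags(user_hashtags: Dict[str, List[str]], k: int) -> List[str]:
    """Rank hashtags by total frequency over all users and keep the top k."""
    totals: Counter = Counter()
    for tags in user_hashtags.values():
        totals.update(tags)
    return [tag for tag, _ in totals.most_common(k)]


def feature_vector(tags: List[str], vocabulary: List[str]) -> List[int]:
    """Frequency of each selected hashtag in one user's tweets."""
    counts = Counter(tags)
    return [counts[tag] for tag in vocabulary]


user_hashtags = {
    "alice": ["earthday", "earthday", "nowplaying"],
    "bob": ["nowplaying", "nowplaying", "nowplaying", "metoo"],
}

vocabulary = top_k_hashtags(user_hashtags, k=2)
features = {user: feature_vector(tags, vocabulary)
            for user, tags in user_hashtags.items()}
```

The resulting vectors, paired with the labels of the users, can then be passed to a classifier of choice and evaluated on held-out users.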

Discovering social trends

Once user profiling is completed and all values of the user attributes needed for the study are populated, one can select the target users of interest using those attributes. For the market research example mentioned earlier, the researchers can simply select the users in their pool who are young, female, and interested in fashion. Once the target users have been identified, researchers can proceed with an in-depth analysis of the collective voice of these users. While this final phase depends entirely on the objectives of the study, i.e., what the researchers want to know about their target audience, we focus on hashtags from a topical perspective, to discover popular or rising topics, and on relationship networks from a social perspective, to identify influencers.
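Selecting target users then amounts to filtering the populated profiles by a set of criteria. In the sketch below the profiles, the attribute names, and the cut-off of 30 years for "young" are all assumptions made for illustration; users with a missing value for any required attribute are excluded.

```python
from typing import Any, Callable, Dict, List

Profile = Dict[str, Any]
Criteria = Dict[str, Callable[[Any], bool]]


def select_users(profiles: Dict[str, Profile], criteria: Criteria) -> List[str]:
    """Return the ids of users whose profile satisfies every criterion."""
    return [
        uid for uid, profile in profiles.items()
        if all(profile.get(attr) is not None and test(profile[attr])
               for attr, test in criteria.items())
    ]


profiles = {
    "u1": {"age": 22, "gender": "female", "interests": {"fashion", "music"}},
    "u2": {"age": 45, "gender": "female", "interests": {"fashion"}},
    "u3": {"age": 19, "gender": "male", "interests": {"fashion"}},
    "u4": {"age": 24, "gender": "female", "interests": None},  # missing value
}

criteria: Criteria = {
    "age": lambda age: age < 30,  # hypothetical cut-off for "young"
    "gender": lambda gender: gender == "female",
    "interests": lambda interests: "fashion" in interests,
}

target_users = select_users(profiles, criteria)
```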

Popular hashtags among the target users can be captured in the same way that we identified popular hashtags for the customized user profiling. A simple frequency ranking over tweets suffices for finding popular hashtags, while more advanced techniques may be needed to detect how hashtag trends change over time. Influencers in a social network can be identified as well, based on the network structure among the target users. A variety of centrality measures, such as degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality, can be applied, as previously mentioned in the "Related literature" section.
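As one concrete instance of these measures, the sketch below computes in-degree centrality on the follow network restricted to the target users: the number of target users following an account, divided by the number of other target users. The edge list and user ids are made up, and choosing in-degree over the other measures is an assumption made for simplicity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def in_degree_centrality(edges: Iterable[Tuple[str, str]],
                         users: Set[str]) -> Dict[str, float]:
    """Share of the other target users that follow each target user.

    Each edge is a (follower, followed) pair; edges that involve users
    outside the target set, and self-loops, are ignored.
    """
    followers = defaultdict(set)
    for follower, followed in edges:
        if follower in users and followed in users and follower != followed:
            followers[followed].add(follower)
    if len(users) < 2:
        return {u: 0.0 for u in users}
    return {u: len(followers[u]) / (len(users) - 1) for u in users}


target_users = {"a", "b", "c", "d"}
follow_edges = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b"), ("x", "a")]

scores = in_degree_centrality(follow_edges, target_users)
top_influencer = max(scores, key=scores.get)
```

Closeness, betweenness, and eigenvector centrality require path or spectral computations and are usually taken from a graph library rather than written by hand.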