Developing Insights from Social Media

Case studies

Having established a procedure to identify a target audience in Twitter and discover social trends from their collective voice, we now move on to two in-depth case studies that demonstrate how a research study can benefit from our approach. The first case study will provide details on the market research project example we have mentioned throughout this paper, while the second case study performs a comparative analysis on the effects of political orientation on a gender issue.

Table 3 Monthly data statistics of the pools of random Twitter users and tweets

Month	User count	Tweet count
12/2021	22,569,110	133,387,546
11/2021	21,876,935	129,462,997
10/2021	22,175,272	133,334,050
09/2021	21,446,941	127,009,377
08/2021	21,708,191	133,447,209
07/2021	21,979,242	133,358,039
06/2021	21,611,226	128,414,906
05/2021	22,651,068	133,741,215
04/2021	22,138,958	129,235,713
03/2021	22,441,309	133,544,952
02/2021	21,529,017	120,703,913
01/2021	22,317,570	133,754,300
12/2020	21,107,115	120,627,976
11/2020	21,950,691	129,635,445
10/2020	21,889,317	133,221,211
09/2020	22,344,474	128,867,950
08/2020	22,643,060	133,302,754
07/2020	22,930,209	133,609,303
06/2020	22,419,694	128,885,150
05/2020	23,554,291	133,389,857
04/2020	23,420,878	129,330,138
03/2020	23,400,803	133,474,640
02/2020	21,260,800	125,029,995
01/2020	22,275,681	133,666,464
Total (Unique)	138,845,242	3,132,435,100

To identify two sets of target users for the case studies, we first need a large pool of random Twitter users and tweets, from which target users are extracted by user profiling. To that end, we rely on the Twitter Streaming API, mentioned in "Introduction" section, for large-scale data collection. The API allows users to filter real-time tweets by a set of keywords. Because the API operates in real time, collection begins at the moment the API is called, and matching tweets then stream continuously to the calling machine. While the normal use of the API is to specify keywords of interest, specifying extremely general words such as stop words (e.g., 'a', 'an', and 'the') as keywords is a commonly used trick to collect random tweets. Each Tweet object from the API contains a User object, which describes the user who created the tweet, as described in "User profiling" section. This means that we can collect a set of random users from a set of random tweets. As shown in Table 3, we collected real-time tweets for two years, from January 2020 to December 2021, which yields a large-scale collection of approximately 3.1 billion (3,132,435,100) unique English tweets posted by approximately 138.8 million (138,845,242) unique users. Note that, while these pools of users and tweets are large enough to be called Big Data, they do not always have to be this large. Smaller sets of random users and tweets could suffice depending on the objectives of the study, although smaller data sets are more susceptible to the under-coverage error mentioned in "Introduction" section.
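The derivation of a random user pool from a random tweet pool can be sketched as follows. This is a minimal illustration, assuming each collected tweet has already been parsed into a dictionary with an `id` field and an embedded `user` dictionary; the function name `pool_statistics` and the sample records are hypothetical and not part of the original pipeline.

```python
from typing import Iterable, Tuple


def pool_statistics(tweets: Iterable[dict]) -> Tuple[int, int]:
    """Return (unique user count, unique tweet count) for a collection of tweets."""
    user_ids = set()
    tweet_ids = set()
    for tweet in tweets:
        tweet_ids.add(tweet["id"])         # de-duplicate tweets by their ID
        user_ids.add(tweet["user"]["id"])  # the embedded User object identifies the author
    return len(user_ids), len(tweet_ids)


# Hypothetical sample: four records, one of which is a duplicate delivery.
sample_tweets = [
    {"id": 1, "user": {"id": 10}},
    {"id": 2, "user": {"id": 10}},
    {"id": 2, "user": {"id": 10}},
    {"id": 3, "user": {"id": 11}},
]
unique_users, unique_tweets = pool_statistics(sample_tweets)
```

Counting by ID rather than by record is what makes the totals in Table 3 "unique" counts in this sketch.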

Young women's fashion market research

Marketers want to know what their customers are currently interested in and who the influencers among them are, so that they can gain insights into new business opportunities and focus their marketing efforts and resources on specific people who could influence others. In this in-depth case study, we aim to first identify young, female users on Twitter who are interested in fashion and then discover popular topics and influential users among them.

To find the target audience of female users interested in fashion, we begin by searching our random pool of tweets for tweets that have the hashtags #fashion and #style. As mentioned earlier, each Tweet object has a User object that indicates the user who created the tweet, which allows us to identify all users in our pool who have ever used the two hashtags. Here, we assume that using the hashtags indicates interest in the topic. This step can be understood as a simplified implementation of the interests attribute in Table 1. Note that one can consider adding further search terms similar to #fashion and #style, such as #beauty and #clothing. The search finds 111,913 users in total. Using the Twitter API, we further check whether each of these users still has a valid, public account, which leaves 89,437 users (Footnote 10). Next, we remove users who have posted fewer than 100 tweets in total, based on the idea that we need at least 100 tweets to understand a user by their tweets. This results in 51,276 users in total, i.e., |U| = 51,276. We then collect up to 3200 of the most recent tweets from each user using the Twitter API, which totals 107,002,581 tweets, i.e., |T| = 107,002,581.
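The user selection described above can be sketched as a simple filter. The sketch assumes tweets are dictionaries carrying `entities.hashtags` and `user.statuses_count` fields in the style of the Tweet and User objects, that using either of the two hashtags counts as a match (the text does not say whether both are required), and that account validity has already been checked elsewhere and is supplied as a set of IDs. The function and variable names are hypothetical.

```python
from typing import Iterable, Set

SEED_HASHTAGS = {"fashion", "style"}  # search terms, compared case-insensitively
MIN_TWEETS = 100                      # minimum number of posted tweets per user


def select_interested_users(tweets: Iterable[dict], valid_ids: Set[int]) -> Set[int]:
    """Return IDs of valid users who used a seed hashtag and posted enough tweets."""
    selected = set()
    for tweet in tweets:
        tags = {h["text"].lower() for h in tweet["entities"]["hashtags"]}
        if not tags & SEED_HASHTAGS:
            continue  # tweet does not mention any seed hashtag
        user = tweet["user"]
        if user["id"] in valid_ids and user["statuses_count"] >= MIN_TWEETS:
            selected.add(user["id"])
    return selected


# Hypothetical sample records.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Fashion"}]}, "user": {"id": 1, "statuses_count": 250}},
    {"entities": {"hashtags": [{"text": "style"}]}, "user": {"id": 2, "statuses_count": 40}},
    {"entities": {"hashtags": [{"text": "food"}]}, "user": {"id": 3, "statuses_count": 900}},
    {"entities": {"hashtags": [{"text": "style"}]}, "user": {"id": 4, "statuses_count": 500}},
]
target_pool = select_interested_users(sample_tweets, valid_ids={1, 2, 3})
```

In the sample, user 2 is dropped for having too few tweets, user 3 for not using a seed hashtag, and user 4 for not being in the set of valid accounts.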

After finding users interested in fashion and collecting their recent tweets, the next step is to identify each user's gender and age, which will allow us to select young female users. Before applying a gender classification solution, we first remove organization accounts, based on the belief that organizations do not represent our target customers. Note that researchers may want to include organization accounts if they believe organizations are worth considering in their study. In this case study, we are only interested in individuals, especially young female users. This step can be considered an implementation of the account type attribute in Table 1. To identify organization accounts, we leverage two open source solutions: Humanizr and M3-Inference. Humanizr examines the tweets of a user, along with the user information in the tweets, to determine whether the account in question belongs to an individual person or represents an organization, while M3-Inference uses the profile image, name, screen name, and bio of a user, as already stated in "Related literature" section. We consider an account to be an organization account when at least one of the two solutions classifies it as an organization, which also covers the case where the two solutions disagree. Otherwise, the account is considered an individual account. In our data, approximately 22% (11,195 out of 51,276) of the accounts turn out to be organization accounts, which is higher than 9.4%. We remove those organization accounts, which leaves 40,081 users who are believed to be individual accounts.
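The rule for combining the two account-type classifiers amounts to a logical OR, as the following sketch illustrates. It does not call Humanizr or M3-Inference; the label strings "organization" and "individual" are assumed placeholders for whatever the two solutions actually return.

```python
def is_organization(humanizr_label: str, m3_label: str) -> bool:
    """An account is an organization if at least one classifier says so."""
    return "organization" in (humanizr_label, m3_label)


# Share of organization accounts reported in the text, as a percentage.
organization_share = round(11_195 / 51_276 * 100, 1)
remaining_individuals = 51_276 - 11_195
```

This choice favors removing borderline accounts over keeping them, which fits the stated goal of retaining only individuals.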

For gender identification, we utilize a Python library called gender-guesser, which employs a statistical approach to gender classification based on the first name of a person, as well as the M3-Inference solution already used for the account type. The gender-guesser solution returns one of six classes: "unknown", "androgynous", "male", "female", "mostly_male", or "mostly_female". Here, we merge "mostly_male" into "male" and "mostly_female" into "female", for simplicity. As mentioned in "User profiling" section, a User object has the name field that allows users to specify their name. As not all users provide their exact full name, it is possible that there is no first name in the field. Furthermore, even if a first name is specified by the user, there is no guarantee that it is recognized by the solution, which is especially true for non-English names. The M3-Inference solution returns either "female" or "male" for a user. To merge the outcomes from the two solutions, we (1) label the user as "conflict" when one solution returns "female" and the other "male" and (2) label the user with the class predicted by the second solution when the first solution returns "unknown" or "androgynous" and the second solution returns "male" or "female". When the two solutions agree, the user keeps the agreed label. This results in 24,886 females, 13,910 males, and 1285 conflicts. We disregard the conflicts in our data.
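The merging rules above can be written as a small function. This sketch takes the two predicted labels as plain strings and does not invoke either library; the function name is hypothetical, and accepting "andy" as an alias of "androgynous" is an assumption about how the name-based library may abbreviate that class.

```python
COLLAPSE = {"mostly_male": "male", "mostly_female": "female"}
UNDECIDED = {"unknown", "androgynous", "andy"}  # "andy" is an assumed alias


def merge_gender(name_based: str, profile_based: str) -> str:
    """Combine a first-name-based label with a profile-based "male"/"female" label."""
    first = COLLAPSE.get(name_based, name_based)  # fold the "mostly_*" classes
    if first in UNDECIDED:
        return profile_based   # rule (2): fall back to the second solution
    if first != profile_based:
        return "conflict"      # rule (1): the two solutions disagree
    return first               # both solutions agree
```

For example, `merge_gender("mostly_female", "female")` yields "female", while `merge_gender("male", "female")` yields "conflict".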

For the age attribute, we continue to rely on the M3-Inference solution, which returns for each user one of the four age levels: ≤18, (18, 30), [30, 40), [40, 99). From our data, the solution results in 6,011 users for 18 or under, 12,994 for 19 to 29, 10,641 for 30 to 39, and 10,435 for 40 or above.

Now that we know all four user attributes needed for this study, i.e., interest, account type, gender, and age, we can select, from the users interested in fashion, those who are young and female. For the age attribute specifically, we define young women as those in the following two age classes: (18, 30) and [30, 40). This entire selection process of target users results in 16,011 users, who form the final target audience for this study, and 31,506,037 tweets posted by the users, which will be further analyzed. As a reminder, we identify these 16,011 young females out of all 51,276 users.
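Once every user carries the four attributes, selecting the target audience is a conjunction of conditions, as in the sketch below. The profile dictionaries, their field names, and the age-class labels "19-29" and "30-39" (standing in for the classes (18, 30) and [30, 40)) are hypothetical.

```python
YOUNG_AGE_CLASSES = {"19-29", "30-39"}  # assumed labels for (18, 30) and [30, 40)


def is_target_user(profile: dict) -> bool:
    """True for individual accounts labeled female in one of the young age classes."""
    return (
        profile["account_type"] == "individual"
        and profile["gender"] == "female"
        and profile["age_class"] in YOUNG_AGE_CLASSES
    )


# Hypothetical profiles produced by the earlier profiling steps.
profiles = [
    {"user_id": 1, "account_type": "individual", "gender": "female", "age_class": "19-29"},
    {"user_id": 2, "account_type": "individual", "gender": "male", "age_class": "30-39"},
    {"user_id": 3, "account_type": "individual", "gender": "female", "age_class": "40+"},
    {"user_id": 4, "account_type": "individual", "gender": "conflict", "age_class": "19-29"},
]
target_ids = [p["user_id"] for p in profiles if is_target_user(p)]
```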

Table 4 Top-50 popular hashtags from the tweets posted by the young female users interested in fashion

Rank	Hashtag	Frequency	Rank	Hashtag	Frequency
1	#poshmark	4,993,200	26	#fitness	19,223
2	#shopmycloset	3,748,873	27	#nature	19,137
3	#fashion	2,351,297	28	#model	18,601
4	#style	1,569,501	29	#nyc	18,409
5	#giveaway	79,435	30	#summer	18,025
6	#love	73,497	31	#quote	17,928
7	#etsy	71,360	32	#tbt	17,591
8	#win	67,961	33	#blog	17,575
9	#shehnaazgill	57,871	34	#shopping	17,510
10	#beauty	54,519	35	#sidharthshukla	17,205
11	#handmade	48,495	36	#design	16,366
12	#art	39,531	37	#life	16,261
13	#vintage	36,432	38	#gifts	16,178
14	#jewelry	31,311	39	#sale	16,084
15	#ad	31,195	40	#covid19	16,066
16	#ootd	29,427	41	#sweepstakes	16,019
17	#beautiful	28,299	42	#android	15,903
18	#photography	27,432	43	#food	15,695
19	#travel	25,743	44	#mayward	15,661
20	#christmas	24,971	45	#androidgames	15,294
21	#makeup	24,902	46	#cute	15,289
22	#music	22,309	47	#health	15,187
23	#ebay	21,732	48	#sexy	14,926
24	#gameinsight	21,027	49	#tiktok	14,921
25	#repost	20,442	50	#contest	14,897

We now proceed with the last step, gaining insights into popular topics and influential users among the young women interested in fashion. To discover popular topics, we look at the popular hashtags used in their tweets. When extracting hashtags from tweets, we exclude hashtags that are used exclusively, or predominantly, by a single user. Specifically, a hashtag is excluded if the most contributing user accounts for a share of 0.5 or more of its total frequency. We also exclude non-English hashtags. Table 4 presents the top-50 popular hashtag ranking. All the hashtags on this ranking provide us with direct or indirect insights into young female users' interests in the fashion domain. For example, the first-, second-, and seventh-ranked hashtags #poshmark, #shopmycloset, and #etsy clearly show how popular shopping on Poshmark and Etsy is among young women. Other hashtags on the ranking are also intriguing, such as #handmade, #vintage, #jewelry, #ootd (meaning outfit of the day), #makeup, and #fitness, to name a few. Marketers can draw on these popular hashtags for ideas for their marketing strategies.
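The single-user exclusion rule can be illustrated with the following sketch, which assumes the hashtags of each user have already been extracted into a list per user. The function name, the input layout, and the tie-breaking by hashtag name are assumptions; the non-English filter is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def rank_hashtags(
    user_hashtags: Dict[str, List[str]], max_share: float = 0.5, top_n: int = 50
) -> List[Tuple[str, int]]:
    """Rank hashtags by total frequency, dropping those dominated by one user."""
    per_tag: Dict[str, Counter] = defaultdict(Counter)  # hashtag -> uses per user
    for user, tags in user_hashtags.items():
        for tag in tags:
            per_tag[tag.lower()][user] += 1

    ranking = []
    for tag, by_user in per_tag.items():
        total = sum(by_user.values())
        top_share = max(by_user.values()) / total  # share of the top contributor
        if top_share >= max_share:
            continue  # used exclusively or predominantly by a single user
        ranking.append((tag, total))
    ranking.sort(key=lambda item: (-item[1], item[0]))
    return ranking[:top_n]


# Hypothetical input: "#mybrand" is dominated by u1 and is therefore excluded.
sample = {
    "u1": ["#ootd", "#mybrand", "#mybrand", "#mybrand"],
    "u2": ["#OOTD", "#mybrand"],
    "u3": ["#ootd"],
}
top_hashtags = rank_hashtags(sample)
```

Note that with a threshold of 0.5, a hashtag used once each by only two users is also excluded, so a hashtag needs at least three contributing users to survive.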

Table 5 Top-50 popular user mentions from the tweets posted by the young female users interested in fashion

Rank	User	Frequency	Rank	User	Frequency
1	@poshmarkapp	4,917,306	26	@jeffreestar	12,816
2	@ebay	194,975	27	@sidharth_shukla	12,591
3	@youtube	141,356	28	@rubidilaik	12,017
4	@etsy	89,344	29	@potus	12,014
5	@realdonaldtrump	54,010	30	@hwanniepromotes	11,672
6	@ishehnaaz_gill	48,226	31	@ladyincrypto	10,052
7	@missufe	33,847	32	@weareoneexo	9932
8	@chitaglorya__	29,150	33	@barackobama	9855
9	@bts_twt	28,034	34	@originalfunko	9700
10	@maymayentrata07	27,304	35	@gemhostofficial	9549
11	@bloglovin	20,669	36	@colorstv	9385
12	@zazzle	18,945	37	@nytimes	8983
13	@pledis_17	18,372	38	@taylorswift13	8809
14	@joebiden	17,717	39	@cashapp	8526
15	@pulte	17,515	40	@shill_ronin	8336
16	@blackpink	16,611	41	@bang_garr	8062
17	@eyehinakhan	16,395	42	@prctiu	7762
18	@sof1azara03	16,147	43	@influenster	7589
19	@davelackie	14,343	44	@elonmusk	7452
20	@fineartamerica	14,292	45	@perduechicken	7404
21	@etsysocial	14,251	46	@netflix	7366
22	@barber_edward_	14,115	47	@colourpopco	7242
23	@cnn	13,872	48	@thesecret	7191
24	@amazon	13,285	49	@kamalaharris	7187
25	@giveawayhost	13,275	50	@taegiveaway	7171

Regarding the influential actors, we take two approaches. The first is simply to identify which user accounts are mentioned the most in the tweets; these can be considered the popular users in this virtual community. Table 5 presents the top-50 popular user mention ranking from the tweets posted by the same young female users interested in fashion. The user @poshmarkapp is the most mentioned user account, which confirms that shopping on Poshmark is very popular. Note that not all the user accounts listed on this ranking belong to the young female users in our target audience. They are simply the accounts mentioned most frequently by the target audience, and some of them are outside it.
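The first approach is a frequency count over user mentions, sketched below under the assumption that mentions are recovered from the tweet text with a regular expression. The pattern, the function name, and the sample tweets are illustrative only.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

MENTION = re.compile(r"@(\w+)", re.ASCII)  # "@" followed by letters, digits, or "_"


def top_mentions(texts: Iterable[str], top_n: int = 50) -> List[Tuple[str, int]]:
    """Return the most frequently mentioned accounts, case-insensitively."""
    counts: Counter = Counter()
    for text in texts:
        counts.update("@" + name.lower() for name in MENTION.findall(text))
    return counts.most_common(top_n)


# Hypothetical tweet texts.
sample_texts = [
    "Just listed on @PoshmarkApp",
    "New closet finds via @poshmarkapp and @eBay",
    "Thanks @ebay and @PoshmarkApp",
]
mention_ranking = top_mentions(sample_texts, top_n=2)
```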

The second approach to identifying influencers is to leverage two commonly used measures: eigenvector centrality from network theory and the retweet h-index, which is an adaptation of the traditional Hirsch index to retweets in Twitter data. For the eigenvector centrality measure, we first collect followers and followees data using the Twitter API mentioned in "User profiling" section, identify mutually following pairs of the young female users, and then build an undirected network graph. The network has 9809 nodes, which means that 9809 users out of 16,011 are connected to at least one other user. This network is more connected than expected, considering that the users do not share many attributes: they only share the interest in fashion, the gender, and the age class. We finally apply the eigenvector centrality algorithm to the network graph, which favors users who are connected with other well-connected users in the network. This results in a centrality score for each user in the graph. It turns out that most users have very low centrality scores, whereas only a few have high centrality scores. We believe that this is a good example of the existence of influencers in a certain domain. For the retweet h-index measure, we use the retweet count stored in each Tweet object, i.e., how many times the tweet has been retweeted by other users. This results in a retweet h-index value for each user.
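Both measures can be sketched in plain Python. The centrality function below uses a basic power iteration with Euclidean normalization, which is one common way to compute eigenvector centrality and not necessarily the implementation used in this study. The retweet h-index follows the usual Hirsch definition applied to retweets, taken here as an assumption: the largest h such that h of a user's tweets were each retweeted at least h times. All names and sample data are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Set, Tuple


def mutual_graph(follows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected graph that keeps only mutually following pairs."""
    directed = set(follows)  # (follower, followee) pairs
    graph: Dict[str, Set[str]] = {}
    for a, b in directed:
        if a != b and (b, a) in directed:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def eigenvector_centrality(
    graph: Dict[str, Set[str]], iterations: int = 200, tol: float = 1e-9
) -> Dict[str, float]:
    """Power iteration: a node scores high when its neighbors score high."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Adding the node's own score avoids oscillation on bipartite graphs.
        new = {n: scores[n] + sum(scores[m] for m in graph[n]) for n in graph}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - scores[n]) for n in graph) < tol:
            return new
        scores = new
    return scores


def retweet_h_index(retweet_counts: List[int]) -> int:
    """Largest h such that h tweets were each retweeted at least h times."""
    h = 0
    for rank, count in enumerate(sorted(retweet_counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


# Hypothetical follow pairs: "e" is followed by "a" but does not follow back.
follow_pairs = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"),
                ("b", "d"), ("d", "b"), ("a", "e")]
network = mutual_graph(follow_pairs)
centrality = eigenvector_centrality(network)
example_h = retweet_h_index([10, 8, 5, 4, 3])
```

In the sample, "b" sits at the center of the mutual-follow network and receives the highest centrality, while "e" never enters the graph because the follow relation is one-directional.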

Table 6 Top-25 influential actors among the young female users interested in fashion sorted by centrality (left side) and h-index (right side), respectively, in descending order

Rank	User	Centrality	User	H-Index
1	@jacquelinerline	0.124	@makeupbyshaniah	191
2	@ofresell	0.105	@nikkitamboli	177
3	@captaincouture1	0.099	@c**********s	174
4	@heliapichardo	0.098	@m********x	171
5	@bethpaintings	0.098	@josinaanderson	161
6	@katewinstyle	0.097	@alissawahid	156
7	@trixie8181	0.095	@janeyellene	140
8	@pinkpretty16	0.094	@salmahayek	140
9	@lashea_hudnall	0.094	@g*************1	137
10	@amyposhboutique	0.091	@rubidilaikofc	135
11	@msmaverick2	0.09	@megastyleph	133
12	@micely6391	0.088	@maliibumiitch	123
13	@peanutandjojos	0.088	@ari_maj1	118
14	@chelleztreasure	0.088	@nikkisamonas	116
15	@emmasattic98	0.088	@rubiholiccs	114
16	@suzcat12	0.087	@emilykschrader	112
17	@jazziesposhmark	0.087	@famnikki	111
18	@poshmarkrebekah	0.086	@ivy_ferguson	108
19	@lifesshortbuyit	0.085	@s*************s	107
20	@shadowdogdesign	0.08	@sayyess2thejess	105
21	@rendon_patsy	0.077	@aquiboni	102
22	@krista47005550	0.076	@life_breakdown	102
23	@boondockfinds	0.075	@shivandi	98
24	@voudaux	0.075	@hinakhanstan	96
25	@michelleroseg33	0.073	@a************o	93

Table 6 presents the top-25 influential user ranking sorted by centrality (left side) and h-index (right side), respectively, in descending order. The number one user on the centrality ranking is Jacqueline Line (screen name @JacquelineRLine), a popular user on Poshmark who has 367K followers at the time of writing and whose timeline is filled with tweets on various fashion items. On the other hand, the number one user on the retweet h-index ranking is Shaniah (screen name @makeupbyshaniah), a popular makeup artist and YouTuber who has 115.4K followers at the time of writing. As shown in the table, the two influencer rankings present completely different users, which implies that the two measures capture different perspectives on influence.

It is worth further analyzing this case study from the perspective of the Total Twitter Error framework mentioned in "Introduction" section, which helps us to evaluate potential errors in the study. As the study relies entirely on the pool of random Twitter users and tweets to identify people interested in fashion, it is not free from the under-coverage error. In other words, the Twitter users found clearly do not represent all people in the world who are interested in fashion. Here, we make the strong assumption that we are only interested in Twitter users and that our study targets only those people in the social media world. We do not believe that this assumption is unreasonable, as many people interested in fashion use Twitter and hold conversations there. Again, whether the assumption is acceptable depends on the objectives of the study. On the other hand, the 16,011 young female users found are by no means a small sample, as it would be challenging to gather this number of human subjects or respondents in traditional surveys. In addition, we identified and removed organization accounts, which helped to reduce the over-coverage error in our data. In terms of the query error, while we could have added hashtags other than #fashion and #style when identifying users interested in fashion, we believe that the two hashtags alone are representative of the interest in fashion. Lastly, there is room for the interpretation error, given that the user profiling solutions used are imperfect. To minimize the potential interpretation error, we (1) chose solutions that demonstrate good performance in their papers and (2) used more than one solution for the same attribute whenever possible.

One limitation of this case study is that we could not compare the trends observed on Twitter with observable indicators from outside Twitter, which would be ideal. To the best of our knowledge, there are no external data sets that can be mapped to our topic and user rankings for cross-evaluation. Addressing this limitation is left for future research.

Me Too movement reaction: conservatives vs. liberals

The second case study aims to answer the question of whether political orientation, i.e., conservative vs. liberal, affects people's reaction to a gender-related issue. We choose the recent Me Too movement as a prominent gender-related topic and compare how differently conservatives and liberals react to the same issue. To define the target audience for this case study, we take the same approach as in the previous case study on young women interested in fashion: identifying the Twitter users in our pool who have ever used the #metoo hashtag in their tweets. Again, we assume that using the hashtag indicates interest in the topic. From our pool, 68,116 users are identified as those who (1) have ever used the #metoo hashtag, (2) still have valid and public accounts on Twitter, and (3) have posted at least 100 tweets. Formally, |U| = 68,116. We then collect up to 3200 recent tweets for each of the users, which totals 188,806,239 tweets, or formally |T| = 188,806,239.

The next step is to partition the users into two groups: conservatives and liberals. To that end, we opt to develop our own hashtag-based political orientation classifier fitted to our Twitter data, for the same reason stated in "Customized user profiling" section. Specifically, we collect another set of users who can be easily labeled as "conservative" or "liberal" and use the hashtags of those users as the features for political orientation prediction. We again search our random pool of users and tweets for users who describe themselves in their bio as "proud republican", "proud conservative", "proud democrat", or "proud liberal", based on the observation that these expressions are a common way of expressing one's political orientation and can thus serve as a strong indicator of it. In this way, we find the users who have the proud republican or proud conservative expressions in their bio and label them as "conservative". Similarly, we label those who describe themselves as a proud democrat or proud liberal as "liberal". We further check that the users (1) still have valid and public accounts on Twitter and (2) have posted at least 100 tweets. This leaves 5,740 users in total, of which 4717 are labeled as "liberal" and 1023 as "conservative". We then collect up to 3200 recent tweets of the users, which results in 12,299,722 tweets in total. From the collected tweets, we extract the top-1000 popular hashtags, which will be used as the features for prediction. As in the first case study, hashtags exclusively used by a single user are excluded. Table 7 presents the top-50 popular hashtags. As shown in the table, most of the hashtags are directly or indirectly related to politics, which is a clear indication that the labeled users collected for machine learning are interested in politics. Many of the hashtags on the ranking appear to be discriminative between the two classes, conservative and liberal, such as #trump and #bidenharris2020.
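The bio-based labeling of training users can be sketched as a phrase lookup. The four phrases come from the text; the case-insensitive matching and the decision to leave a bio unlabeled when it matches both sides are assumptions made for this illustration, as is the function name.

```python
from typing import Optional

CONSERVATIVE_PHRASES = ("proud republican", "proud conservative")
LIBERAL_PHRASES = ("proud democrat", "proud liberal")


def label_from_bio(bio: str) -> Optional[str]:
    """Return "conservative", "liberal", or None when the bio gives no clear label."""
    text = bio.lower()
    is_conservative = any(phrase in text for phrase in CONSERVATIVE_PHRASES)
    is_liberal = any(phrase in text for phrase in LIBERAL_PHRASES)
    if is_conservative == is_liberal:
        return None  # no phrase found, or phrases from both sides (assumed ambiguous)
    return "conservative" if is_conservative else "liberal"
```

For instance, `label_from_bio("Mom, teacher, Proud Democrat")` returns "liberal", whereas a bio without any of the four phrases returns `None`.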

Table 7 Top-50 popular hashtags from the tweets posted by the users labeled as "conservative" or "liberal"

Rank	Hashtag	Frequency	Rank	Hashtag	Frequency
1	#covid19	12,753	26	#imwithher	2541
2	#trump	10,706	27	#strongertogether	2524
3	#resist	6515	28	#biden2020	2502
4	#maga	6223	29	#trumpvirus	2382
5	#fbrparty	5979	30	#tiktok	2292
6	#bidenharris2020	5941	31	#trump2020	2266
7	#potus	5796	32	#resisters	2262
8	#fbr	5274	33	#buildbackbetter	2248
9	#backfiretrump	4943	34	#votebluetosaveamerica	2200
10	#vote	4915	35	#florida	2165
11	#breaking	4684	36	#traitortrump	2161
12	#fbi	4564	37	#lockhimup	2157
13	#theresistance	4061	38	#trumpcrimefamily	2153
14	#moscowmitch	3801	39	#poshmark	2133
15	#coronavirus	3752	40	#biden	2083
16	#mitchplease	3429	41	#trumprussia	2067
17	#gop	3157	42	#auschwitz	1954
18	#blacklivesmatter	3073	43	#scotus	1904
19	#smartnews	2826	44	#demdebate	1895
20	#voteblue	2770	45	#giveaway	1854
21	#newprofilepic	2706	46	#resistance	1840
22	#demvoice1	2631	47	#georgia	1834
23	#covid	2591	48	#texas	1826
24	#gh	2570	49	#txlege	1815
25	#impeachtrump	2546	50	#sotu	1777

In our data, there are more samples labeled "liberal" (4717) than "conservative" (1023). To avoid any potential bias in the classifier, we transform this unbalanced data set into a balanced one by undersampling, i.e., randomly selecting as many "liberal" samples as there are "conservative" samples. Next, we randomly split this balanced data set into 80% training data (1636 samples) and 20% test data (410 samples). Then, to build a classification model, we apply widely used classification algorithms to the training data, such as k-Nearest Neighbors, Logistic Regression, Random Forest, XGBoost, Support Vector Machines, Neural Networks, and Deep Neural Networks, tuning the hyper-parameters of each to obtain its best performance. Lastly, we evaluate each model on the test data.
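The undersampling and the 80/20 split can be sketched with the standard library alone. The sketch assumes a simple random (non-stratified) split and rounds the test size up, which happens to reproduce the reported 1636 and 410 samples; the seed, the function name, and the placeholder user IDs are hypothetical.

```python
import math
import random
from typing import List, Tuple

Sample = Tuple[int, str]  # (user ID, label)


def undersample_and_split(
    liberal: List[int], conservative: List[int], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Balance the two classes by undersampling, then split into train and test."""
    rng = random.Random(seed)
    n = min(len(liberal), len(conservative))  # size of the smaller class
    balanced = [(u, "liberal") for u in rng.sample(liberal, n)]
    balanced += [(u, "conservative") for u in rng.sample(conservative, n)]
    rng.shuffle(balanced)
    n_test = math.ceil(test_ratio * len(balanced))  # round the test size up
    return balanced[n_test:], balanced[:n_test]


# Placeholder IDs standing in for the 4717 liberal and 1023 conservative users.
liberal_ids = list(range(4717))
conservative_ids = list(range(4717, 5740))
train_set, test_set = undersample_and_split(liberal_ids, conservative_ids)
```

Because the split is not stratified here, the two classes are balanced overall but only approximately balanced within each part.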

Fig. 2 Comparison of f1-scores for the nine classification algorithms

Fig. 3 The Average Precision (AP) curve (left) and the Receiver Operating Characteristic (ROC) curve (right) for the best performing Random Forest model

Table 8 Top-50 important features for the best performing political orientation classifier using the Random Forest algorithm

Rank	Feature	Importance	Rank	Feature	Importance
1	#trump2020	0.042	26	#fbrparty	0.008
2	#fjb	0.038	27	#trumpshutdown	0.008
3	#moscowmitch	0.034	28	#impeachtrump	0.008
4	#traitortrump	0.029	29	#neverforgetjanuary6th	0.008
5	#oann	0.026	30	#deathsantis	0.007
6	#resist	0.021	31	#expeljoshhawley	0.007
7	#bidenharris2020	0.020	32	#daytona500	0.007
8	#americafirst	0.019	33	#fbi	0.006
9	#voteblue	0.017	34	#prolife	0.006
10	#2a	0.015	35	#wearamask	0.006
11	#bidenharris	0.015	36	#trump2024	0.006
12	#istandwithbiden	0.015	37	#covid19	0.006
13	#demvoice1	0.014	38	#proudboys	0.006
14	#mitchplease	0.014	39	#laurenboebertissodumb	0.005
15	#getvaccinated	0.012	40	#resisters	0.005
16	#buildbackbetter	0.012	41	#trumpvirus	0.005
17	#forthepeople	0.011	42	#votebluetosaveamerica	0.005
18	#theresistance	0.011	43	#morningjoe	0.005
19	#godblessamerica	0.011	44	#strongertogether	0.005
20	#walkaway	0.011	45	#lockhimup	0.005
21	#trumpisnotwell	0.010	46	#americasgreatestmistake	0.005
22	#antifa	0.010	47	#trumpcare	0.005
23	#maddow	0.010	48	#holocaustremembranceday	0.005
24	#arresttrumpnow	0.010	49	#trumprussia	0.005
25	#backtheblue	0.009	50	#maga2020	0.005

For model evaluation and selection, we compare the f1-scores, i.e., the harmonic means of precision and recall. As shown in Fig. 2, the Random Forest model yields the best performance with an f1-score of 0.91, which can be considered very high predictive performance. Figure 3 presents the Average Precision (AP) curve (left) and the Receiver Operating Characteristic (ROC) curve (right) for the best performing Random Forest model. The Average Precision and the Area Under the Curve (AUC) are both 0.96, which confirms the excellent performance of the model. In addition, to identify which features (i.e., hashtags) contribute the most to prediction, we list the feature importance scores provided by the Random Forest algorithm. Table 8 presents the top-50 important features and their importance scores. The ranking shows that the #trump2020 hashtag contributes the most to political orientation prediction, followed by #fjb, #moscowmitch, #traitortrump, #oann (meaning One America News Network), #resist, #bidenharris2020, and so on, all of which are plausible indicators of political orientation.
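As a reminder of how the f1-score combines precision and recall, the following sketch computes it from confusion counts. The counts in the example are invented for illustration and are not results from this study.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if true_pos == 0:
        return 0.0  # no correct positive predictions
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision 0.9 and recall 0.9 give an f1-score of 0.9.
example_f1 = f1_score(true_pos=90, false_pos=10, false_neg=10)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high f1-score by excelling at precision or recall alone.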

As the training data used for political orientation classification are biased toward users who clearly described themselves as proud liberals or conservatives, we further conduct an out-of-sample performance evaluation. To create a new data set for this evaluation, we randomly select 200 users whose bio has "democrat" or "liberal" with no "proud" and, likewise, 200 users whose bio has "republican" or "conservative" with no "proud". Next, for each group of 200 users, we manually check whether each user is actually liberal or conservative by reading their bio, which results in 179 liberal users and 116 conservative users. We then collect up to 3200 of the most recent tweets from their timelines and extract hashtag frequency features from their tweets. We apply our political orientation classifier to those users to predict their political orientations and, finally, compare the predicted orientations with the actual ones. This results in an f1-score of 0.76. While this performance is lower than the within-sample performance of 0.91, which is fully expected, it is still high enough for use in real-world Big Data analysis.

To demonstrate that hashtag features outperform full-text features in political orientation classification, we use BERT, which is known to perform well in text classification, as the baseline for comparison. To clarify, our approach uses the frequencies of the top-1000 popular hashtags as features, whereas BERT uses the full text of the aggregated tweets of users as features for transfer learning. The f1-score we achieve with BERT is 0.61, which is far lower than the 0.91 of the best-performing hashtag-based model. Our guess is that the full text of a user's tweets contains too much noise that does not help in identifying their political orientation, whereas hashtags serve as surprisingly good indicators.

Now that we have our own political orientation classifier fitted to tweet data, we apply the classifier to our 68,116 users who are interested in #metoo. This results in 46,037 users labeled as "conservative" and 22,079 users labeled as "liberal". Unlike the training and test data for modeling the classifier, there are more conservatives than liberals in our Me Too data set.

Table 9 Comparison of the top-50 popular hashtags from the #metoo tweets posted by the users labeled as "liberals" and by "conservatives", respectively

Rank	Hashtag (liberals)	Frequency (liberals)	Hashtag (conservatives)	Frequency (conservatives)
1	#metooindia	2497	#timesup	2151
2	#timesup	1909	#metooindia	1579
3	#metoogr	1314	#blm	865
4	#ge	1042	#occupy	741
5	#firstthem	787	#metoogr	712
6	#metooincest	729	#believewomen	706
7	#metooinceste	666	#ibelievetarareade	679
8	#india	529	#daca	662
9	#veterans	498	#demexit	652
10	#rape	455	#union	650
11	#believewomen	432	#oligarchs	650
12	#metoounlessitsbiden	419	#megabanks	650
13	#domesticviolence	368	#corpmedia	650
14	#rapeculture	358	#nodapl	650
15	#tarareade	342	#sdf	650
16	#saraheverard	322	#humanity	649
17	#sexualassault	313	#idiocracy	638
18	#doctorsaredickheads	291	#ibelievetara	605
19	#weasourselves	286	#timesupbiden	478
20	#blacklivesmatter	281	#maketellingsafe	473
21	#mentoo	278	#csa	469
22	#silenceisviolence	275	#dropoutbiden	469
23	#doctorsabusetoo	270	#metoounlessitsbiden	445
24	#blm	265	#firstthem	437
25	#patientchoice	262	#mentoo	407
26	#nursesabusetoo	262	#dropbiden	373
27	#metoocy	259	#feminism	366
28	#anopensecret	252	#tarnishedbadge	363
29	#believeallwomen	246	#auspol	334
30	#justiceforjohnnydepp	242	#whyididntreport	318
31	#ibelievetarareade	232	#blacklivesmatter	282
32	#violenceagainstwomen	229	#women	280
33	#churchtoo	219	#bjp	274
34	#joebiden	214	#kobebryant	266
35	#h1news	193	#koberip	264
36	#women	188	#feminist	259
37	#sexualharassment	187	#believesurvivors	258
38	#feminism	185	#joebidenisarapist	244
39	#ibelievetara	182	#biden	241
40	#metoomovement	173	#feminismiscancer	240
41	#patientdignity	173	#bringbernieback	238
42	#notallmen	165	#endviolenceagainstwomen	235
43	#covid19	164	#justice	233
44	#unstucklife	164	#survivors	227
45	#china	163	#covid19	226
46	#hr	154	#neverbiden	222
47	#awareness	151	#book	205
48	#survivor	149	#survivor	204
49	#biden	144	#london	200
50	#anuragkashyap	143	#brexit	199

We now proceed with the final step, comparing the views on the Me Too movement by political orientation. We compare the most popular hashtags that co-occur with the #metoo hashtag in the same tweet, based on the idea that liberals' interests and conservatives' interests would differ in the same Me Too context. Table 9 presents the top-50 popular hashtag rankings from the tweets posted by liberals and by conservatives, respectively. Note that, while this table only shows the 50 most popular hashtags, many more hashtags follow those top-50 hashtags in each ranking.
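Counting the hashtags that co-occur with #metoo can be sketched as follows. The sketch assumes hashtags are extracted from the tweet text with a regular expression, matched case-insensitively, and counted at most once per tweet; these choices, the function name, and the sample tweets are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable

HASHTAG = re.compile(r"#\w+")


def cooccurring_hashtags(texts: Iterable[str], anchor: str = "#metoo") -> Counter:
    """Count hashtags that appear in the same tweet as the anchor hashtag."""
    counts: Counter = Counter()
    for text in texts:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if anchor in tags:
            counts.update(tags - {anchor})  # each hashtag counted once per tweet
    return counts


# Hypothetical tweets; the last one lacks #metoo and is therefore ignored.
sample_texts = [
    "#MeToo #TimesUp now",
    "#metoo #timesup #believewomen",
    "#timesup only",
]
cooccurrence = cooccurring_hashtags(sample_texts)
```

Applying the function separately to the tweets of each group would give one frequency table per group, comparable to the two halves of Table 9.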

To measure how different the two entire rankings are, we employ two measures: the cosine similarity and the rank correlation. For the cosine similarity, we transform each entire ranking into a vector of hashtag frequencies and then calculate the cosine similarity between the two vectors, which reflects the angle between them. The smaller the angle, the more similar the two vectors are. For non-negative frequency vectors, cosine similarity ranges between 0 and 1, where a value close to 1 means very similar and a value close to 0 means dissimilar. From the two hashtag ranking vectors, we get a cosine similarity of 0.65. For the rank correlation, we calculate both the Spearman correlation coefficient and the Kendall correlation coefficient on the two entire rankings. A rank correlation coefficient ranges from −1 to 1, where a value close to 1 indicates a positive correlation, a value close to −1 a negative correlation, and a value close to 0 no correlation. We obtain −0.24 and −0.23, respectively, which are both closer to 0 than to 1 or −1. The cosine similarity and the rank correlation coefficients indicate the dissimilarity of the two rankings, which implies that the two groups' interests are not the same.
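The three similarity measures can be sketched in plain Python. How hashtags that appear in only one ranking were handled is not stated, so this sketch assumes they receive a frequency of 0 in the other ranking; it also uses average ranks for ties in the Spearman coefficient and the tau-b variant of the Kendall coefficient. All names and the toy frequencies are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def aligned(a: Dict[str, int], b: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Align two frequency tables on the union of hashtags (missing = 0)."""
    tags = sorted(a.keys() | b.keys())
    return [a.get(t, 0) for t in tags], [b.get(t, 0) for t in tags]


def cosine_similarity(x: List[float], y: List[float]) -> float:
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y)))


def average_ranks(values: List[float]) -> List[float]:
    """Rank from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Pearson correlation of the rank vectors (undefined for constant input)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_b(x: List[float], y: List[float]) -> float:
    """Kendall tau-b over all pairs; quadratic time, fine for a sketch."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))


# Toy frequency tables, not taken from Table 9.
liberal = {"#timesup": 3, "#rape": 2, "#india": 1}
conservative = {"#timesup": 3, "#blm": 2, "#india": 1}
vec_lib, vec_con = aligned(liberal, conservative)
cos, rho, tau = (cosine_similarity(vec_lib, vec_con),
                 spearman(vec_lib, vec_con), kendall_tau_b(vec_lib, vec_con))
```

The toy example shows how the measures can disagree: the two tables share their most frequent hashtag, so the cosine similarity is fairly high, while the rank correlations stay near zero because the remaining hashtags are ordered differently.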

To show specifically how the two rankings differ, Figs. 4 and 5 present the top-50 popular hashtag clouds for the liberals and the conservatives, respectively, in which larger hashtags represent more popular ones. Noticeably, the two hashtag clouds present somewhat different hashtags, as they have only 16 hashtags in common (Footnote 15). Besides, many of the hashtags do not appear on the other cloud (Footnote 16). Together, these observations confirm that liberals and conservatives do not respond to the same gender-related issue in the same way and show interest in somewhat different topics.

Fig. 4 Top-50 popular hashtags used by the users labeled as "liberal" in #metoo tweets

Fig. 5 Top-50 popular hashtags used by the users labeled as "conservative" in #metoo tweets

We now evaluate potential errors in this case study from a Total Twitter Error perspective. As with the first case study, this study relies on the pool of random Twitter users and tweets to identify people interested in the Me Too movement, and thus the same argument holds: we assume that the set of 68,116 Twitter users found is sufficient for the study. In terms of the query error, we believe that #metoo is the single most representative hashtag for interest in the Me Too movement, although some interested users may not have used it in their tweets. To reach such users, one may consider searching tweet text for other expressions, beyond hashtags, that refer to Me Too. Lastly, given the high performance of our political orientation classifier, we believe that there is not much room for the interpretation error caused by customized profiling.