Interactive Exploratory Data Analysis with Bokeh

Introduction

As you may recall, the last post covered how to collect and store Twitter data (Figure 1). Now it is time to run some fun experiments with all the data available! First up is some fundamental exploratory data analysis that tells the big story behind the data by answering questions such as: Which game is the most popular on Twitter? In which language is each game most prevalent? How often is a game mentioned during a specific time window of the day? How does the community feel about a game when people tweet about it, positively or negatively? All these questions will be answered by the end of this post.

Figure 1. A relational database that stores information of more than 160,000 tweets.

Game Ranking Analysis

The first question is how the games are distributed on Twitter, measured by the number of mentions of each game. Strictly speaking, one tweet can carry text in several fields (text, extended_tweet, retweet_text, retweet_extended_text, quoted_tweet_text, and quoted_extended_tweet_text), so a rigorous count should check whether a game's name appears in any of them. To get a rough feeling for the overall distribution, however, I took a shortcut and looked only at the hashtags field, which is already saved as a separate table in our database. The drawback is that not every tweet has a hashtag, so the number of tweets counted here is somewhat smaller. A slightly smaller sample should not matter much for a rough ranking, though, as long as hashtag use is not heavily skewed toward particular games.

Use the following query to extract game hashtags from the hashtags table in the database:

The query first creates a view, which avoids denormalizing any of the existing tables in the database, and adds a new column game to the view that holds the video-game hashtags of interest. It also adds the columns if_quoted and if_retweeted to indicate whether the tweet is a quoted tweet or a retweet. Recall from the last post that, in our relational database, all tables are connected through multiple foreign keys in the base_tweets table. To extract the information, the tables are recombined by joining the hashtags table back to the base_tweets table and then left joining the result with the retweeted_id and quoted_id tables. Finally, all columns and rows are queried from the view and exported to the Python environment to visualize the distribution. The first ten records the query returns look like this:

Figure 2. Extracted relevant video-game names from the hashtags.
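
A minimal sketch of such a view, in Python with the standard-library sqlite3 module, is given below. Only the table names hashtags, base_tweets, retweeted_id, and quoted_id and the view columns game, if_retweeted, and if_quoted come from the description above; the other column names, the view name game_hashtags, the LIKE patterns, the toy rows, and the choice of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy schema: table names follow the post, column names are assumed.
conn.executescript("""
CREATE TABLE base_tweets (tweet_id INTEGER PRIMARY KEY);
CREATE TABLE hashtags (tweet_id INTEGER, hashtag TEXT);
CREATE TABLE retweeted_id (tweet_id INTEGER, retweeted_id INTEGER);
CREATE TABLE quoted_id (tweet_id INTEGER, quoted_id INTEGER);

INSERT INTO base_tweets VALUES (1), (2), (3);
INSERT INTO hashtags VALUES
    (1, 'Fortnite'), (2, 'ApexLegends'), (3, 'MondayMotivation');
INSERT INTO retweeted_id VALUES (2, 99);
INSERT INTO quoted_id VALUES (1, 77);

-- The view leaves the base tables untouched.
CREATE VIEW game_hashtags AS
SELECT b.tweet_id,
       CASE
           WHEN LOWER(h.hashtag) LIKE '%fortnite%' THEN 'Fortnite'
           WHEN LOWER(h.hashtag) LIKE '%apex%' THEN 'Apex Legends'
           WHEN LOWER(h.hashtag) LIKE '%gta%' THEN 'GTA5'
           ELSE NULL                      -- hashtag names no tracked game
       END AS game,
       (r.tweet_id IS NOT NULL) AS if_retweeted,
       (q.tweet_id IS NOT NULL) AS if_quoted
FROM hashtags AS h
JOIN base_tweets AS b ON b.tweet_id = h.tweet_id
LEFT JOIN retweeted_id AS r ON r.tweet_id = b.tweet_id
LEFT JOIN quoted_id AS q ON q.tweet_id = b.tweet_id;
""")

rows = conn.execute(
    "SELECT tweet_id, game, if_retweeted, if_quoted "
    "FROM game_hashtags ORDER BY tweet_id"
).fetchall()
for row in rows:
    print(row)
```

The left joins keep every hashtag row even when the tweet is neither a retweet nor a quoted tweet, which is why the two flags can be derived from whether the joined side is NULL. SQLite reports the flags as 1 and 0 rather than true and false.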

As shown in Figure 2, the video-game names we care about are extracted from the hashtags, and the value is set to NULL if no game name appears in a hashtag. The true/false columns indicate whether the row comes from a retweet or a quoted tweet. Now, load this temporary table into a Pandas DataFrame to get a nice-looking visualization of the distribution, as shown in Figure 3:

Figure 3. Bar chart for all the games.
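
Once the rows are in Python, the ranking in Figure 3 boils down to counting rows per game while skipping the NULL entries. Here is a minimal sketch using only the standard library (the post itself works with a Pandas DataFrame; the toy rows and the helper name count_mentions are hypothetical):

```python
from collections import Counter


def count_mentions(rows):
    """Count rows per game, ignoring rows whose game is None (NULL in the view)."""
    return Counter(row["game"] for row in rows if row["game"] is not None)


# Toy rows standing in for the exported view.
rows = [
    {"game": "Fortnite"},
    {"game": "Fortnite"},
    {"game": "Fortnite"},
    {"game": "Apex Legends"},
    {"game": "Apex Legends"},
    {"game": None},          # hashtag that names no tracked game
    {"game": "GTA5"},
]

counts = count_mentions(rows)
ranking = counts.most_common()          # most mentioned game first
names = [game for game, _ in ranking]
values = [n for _, n in ranking]
print(ranking)
```

The sorted names and values can then be handed to a plotting library. With Bokeh, for example, a categorical bar chart takes the names as figure(x_range=names) and the counts through vbar(x=names, top=values, width=0.9).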

While the colors nicely differentiate the games, the amount of information conveyed in Figure 3 is limited. It is clear that Fortnite is the most popular game on Twitter, since the majority of tweets mention it. Surprisingly, GTA5 is the game players mentioned least on Twitter, contrary to what is observed on Twitch. On reflection, this is not hard to explain: GTA5 is a console game with a limited online community, but its game design makes it fun to watch on Twitch.

Now we know that, on Twitter, the most popular game is Fortnite and the least mentioned game is GTA5. What else? Figure 4 shows a more advanced pair of bar charts:

Figure 4. Another two bar charts for all the games.

The top chart in Figure 4 shows the number of quote tweets, retweets, and original tweets for each game, and the bottom chart plots the same numbers on a log scale so that the smaller counts are easier to compare. It turns out that, for most of the games, more than half of the total tweets are retweets. Even though Hearthstone has more mentions than GTA5, it has fewer original tweets, and Dota2 has the smallest percentage of original tweets.
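
The numbers behind Figure 4 can be sketched as a per-game breakdown by tweet type. This is only an illustration under assumptions: the toy rows are made up, a row flagged as both retweet and quote is counted as a retweet, and log10(count + 1) is used so that zero counts stay finite (a log axis in the plotting library would work as well).

```python
import math
from collections import defaultdict


def tweet_type(row):
    """Classify a row; retweet takes precedence over quote (an assumption)."""
    if row["if_retweeted"]:
        return "retweet"
    if row["if_quoted"]:
        return "quote"
    return "original"


def breakdown(rows):
    """Return {game: {"original": n, "retweet": n, "quote": n}}."""
    table = defaultdict(lambda: {"original": 0, "retweet": 0, "quote": 0})
    for row in rows:
        if row["game"] is None:
            continue
        table[row["game"]][tweet_type(row)] += 1
    return dict(table)


def log_scaled(counts):
    """log10(n + 1) for each tweet type, so that a count of 0 maps to 0.0."""
    return {kind: math.log10(n + 1) for kind, n in counts.items()}


rows = [
    {"game": "Fortnite", "if_retweeted": True, "if_quoted": False},
    {"game": "Fortnite", "if_retweeted": True, "if_quoted": False},
    {"game": "Fortnite", "if_retweeted": False, "if_quoted": False},
    {"game": "GTA5", "if_retweeted": False, "if_quoted": True},
    {"game": "GTA5", "if_retweeted": False, "if_quoted": False},
]

table = breakdown(rows)
for game, counts in table.items():
    share = counts["retweet"] / sum(counts.values())
    print(game, counts, f"retweet share = {share:.0%}", log_scaled(counts))
```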

Gaming Language Analysis

Besides the number of mentions for each game, we can similarly query the language information from the database and analyze the languages used across the global gaming community. There are 42 languages in the streamed tweets, including undetermined languages marked as und, and some of them are plotted in the figure below:

Figure 5. Left: Pie chart for the top 10 languages used. Right: Log-scaled bar chart for the top 20 languages.

In Figure 5, English dominates the language distribution with almost half of the total, followed by Japanese with a bit more than a quarter. After French and Spanish, which have comparable percentages, the rest is made up of undetermined languages and other less common languages such as Portuguese and German. In the bar chart on the right-hand side, the counts are log-scaled, which reveals the detailed ranking of more languages.
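
The percentages in a pie chart like this come from dividing each language count by the total and lumping everything outside the top N into one remaining slice. A small sketch, with made-up counts and a hypothetical helper named top_shares:

```python
from collections import Counter


def top_shares(lang_counts, n):
    """Percentage share of the n most common languages, plus an 'other' slice."""
    total = sum(lang_counts.values())
    top = Counter(lang_counts).most_common(n)
    shares = {lang: 100 * count / total for lang, count in top}
    rest = total - sum(count for _, count in top)
    if rest:
        shares["other"] = 100 * rest / total
    return shares


# Toy counts keyed by Twitter language code; not the real numbers.
lang_counts = {"en": 50, "ja": 30, "fr": 10, "es": 6, "und": 4}
shares = top_shares(lang_counts, 2)
print(shares)
```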

But what about the language distribution within a specific gaming community? What if we want to know which language is used most often in, say, Fortnite? How does that compare with other gaming communities? Figure 6 answers these questions. Again, you can hover your mouse over the plot to check the value for each game-language combination.

Gaming Community Languages
Figure 6. Advanced pie plot mimicking Burtin’s Antibiotics plot.

Figure 6 shows that even though English is the most frequently used language for most of the games, for Apex Legends Japanese leads English by about 400 tweets. (If you hover the mouse over the plot, you will see 7291 for Japanese and 6885 for English.) Similarly, Hearthstone seems more prevalent in Spanish-speaking regions than in French- or Japanese-speaking ones, since Spanish is the second most used language for Hearthstone.

Time Series Analysis

One of the key benefits of streaming tweets in real time is the ability to do time-series analysis. We can count all the mentions of each game during a time window, say 1 hour, and then shift this window across the 48-hour streaming duration to visualize how the mention frequency varies over time for different games. I picked the top 4 most popular games on Twitter (Figure 3) and plotted their hourly mention frequency in Figure 7:

Figure 7. Mentioning frequency averaged hourly for 48 hours.
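
The windowing behind Figure 7 can be sketched by truncating each tweet's timestamp to the start of its hour and counting per (hour, game) pair. The sketch assumes non-overlapping one-hour bins; the helper name hourly_counts and the toy records are hypothetical, and the year in the timestamps is arbitrary.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(records):
    """Count mentions per (hour_start, game) from (created_at, game) pairs."""
    counts = Counter()
    for created_at, game in records:
        hour_start = created_at.replace(minute=0, second=0, microsecond=0)
        counts[(hour_start, game)] += 1
    return counts


# Toy records: (timestamp of the tweet, game it mentions).
records = [
    (datetime(2019, 5, 9, 8, 5), "Fortnite"),
    (datetime(2019, 5, 9, 8, 40), "Fortnite"),
    (datetime(2019, 5, 9, 8, 59), "Overwatch"),
    (datetime(2019, 5, 9, 9, 1), "Fortnite"),
]

counts = hourly_counts(records)
for (hour_start, game), n in sorted(counts.items()):
    print(hour_start.isoformat(), game, n)
```

Plotting one line per game over the sorted hour_start values gives a curve of the kind shown in Figure 7.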

The stream started at 1 am on May 9th and ended at 1 am on May 11th, as shown in Figure 7. We know that Fortnite is the most popular game (from Figure 4), so it is not surprising that Fortnite has the highest frequency at all times. What is unexpected is the peak for Fortnite around 8 am on May 9th, while the frequencies of the three other games are going down. Based on what we have learned so far, this is not hard to explain: from Figure 6 we know that the second most frequently used language for Fortnite is Japanese, which means there is another massive Fortnite online community on the other side of the earth, in Japan. While it was 8 am in the US, the after-work evening had just begun for Japanese Fortnite lovers. (Further investigation shows that an official Twitter event for the Japanese Fortnite community was going on.) That is why a prominent peak shows up around that timestamp. It also makes sense that there is only one dip at 8 am on May 9th: May 9th was a Friday, and people still needed to work that day. After that the weekend began, so players could play video games all night!

Another interesting experiment with timestamped tweets is sentiment analysis, which can be crucial for commercial marketing applications. I use the Natural Language Toolkit (NLTK) to extract the sentiment information from a tweet. After applying sentiment analysis to all the English tweets, the hourly averaged sentiment scores are plotted as a time series in Figure 8:

Figure 8. Sentiment scores averaged hourly for 48 hours.

A sentiment score above 0 means positive and a score below 0 means negative. Interestingly, while Fortnite was mentioned in most of the tweets, it also received relatively more criticism around Friday night (Japan time). It seems the official event was not going very well. In contrast, Apex Legends was mentioned in more positive tweets later on, around 9 am on May 10th, while Overwatch and League of Legends consistently maintained a slightly positive sentiment score.
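
The hourly averaging behind Figure 8 can be sketched as follows, assuming each English tweet already carries a sentiment score between -1 and 1. Which NLTK analyzer produced the scores is not stated in this post; one option would be the compound value returned by polarity_scores on NLTK's VADER SentimentIntensityAnalyzer, which needs the vader_lexicon resource to be downloaded first. The helper name hourly_mean_scores and the toy scores below are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_mean_scores(scored):
    """Average scores per hour from (created_at, score) pairs, sorted by hour."""
    sums = defaultdict(lambda: [0.0, 0])       # hour -> [score total, tweet count]
    for created_at, score in scored:
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        sums[hour][0] += score
        sums[hour][1] += 1
    return {hour: total / n for hour, (total, n) in sorted(sums.items())}


# Toy scores; the year in the timestamps is arbitrary.
scored = [
    (datetime(2019, 5, 9, 8, 10), 0.5),
    (datetime(2019, 5, 9, 8, 45), -0.5),
    (datetime(2019, 5, 9, 9, 20), 0.25),
]

means = hourly_mean_scores(scored)
for hour, mean in means.items():
    label = "positive" if mean > 0 else "negative" if mean < 0 else "neutral"
    print(hour.isoformat(), mean, label)
```

Running this once per game, on that game's tweets only, gives one line per game as in Figure 8.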

Conclusion

Twitter is an excellent resource not only for sharing our daily lives but also for data gathering and social-event analysis. In this post, I showed how to use the previously stored Twitter data to do some fundamental exploratory data analysis. Specifically:

  1. The ranking of eight hand-picked video games was visualized. Fortnite is the most popular video game on Twitter: 90570 out of 165910 tweets (54.59%) mentioned Fortnite.
  2. The language distribution for these video games was inspected in detail. English (49.17%) and Japanese (27.60%) dominate the Twitter community for these games.
  3. Mention frequency and sentiment scores were plotted over time with an hourly averaging window. Even though Fortnite was mentioned more often, people were happier with Apex Legends and two other games, which had higher average sentiment scores.

Now we have a better understanding of the global video-game community, but only the tip of the iceberg has been revealed. In the next post, I will explain how to use the geographical information contained in the tweets to build an interactive map application that reinforces our data analysis, along with some interesting network analysis.

You can find the code here for plotting all the figures shown in this post. Feel free to reach out to me if you have any questions.

Thanks for reading 🙂