Geographical Visualization and Network Analysis for Twitter Data
Introduction
In the previous post, I did some fundamental Exploratory Data Analysis (EDA) on the streamed Twitter data. While several essential pieces of information were extracted, we were only looking at a tiny portion of the entire database. This post examines two other hugely valuable data sources: geographical data and social-network data. The geographical information can be turned into a figure by visualizing the tweets on a geospatial map (like the demo shown here), while the network analysis can be carried out by inspecting several user-based tables in our database.
Why put Twitter data on a map? First, it allows us to observe how the collected tweets are concentrated spatially. Second, it lets us work with an area of interest (if we have one) more efficiently. And why network analysis? First, it helps identify the user (or users) with the most significant potential to impact a field. Second, it helps cluster the user base according to some preliminary conditions. There are, of course, many other advantages to doing geographical and social-network analysis on Twitter data, and some of them appear later in this post.
Visualize Twitter Data on a Map
First, let us visualize the collected tweets geographically by denoting each one of them as a dot (circle) on a Mercator-projected map. There are a few steps for data preparation:
- Only 1-3% of all tweets have geographical information embedded in their JSON, so we need to combine the base_tweets table with the place table and then filter out the tweets without geographical information, keeping only the tweets that can be displayed on the map.
- When hovering the mouse over a specific point on the map, it is helpful to show some extra information about that tweet, such as which game it is about, which user sent it, and how many followers that user currently has (and thus how big the impact of that tweet is). To do this, combine more tables into the base_tweets table, including tweet_user, hashtags, text, etc.
- Sometimes the keyword for the tweet is not contained in the hashtags or text fields but is mentioned in the quoted tweet or retweet instead. To handle tweets like this, more tables need to be combined into the base_tweets table, such as the retweeted_tweet table.
The following query does all the work I just described, and the query results (10 records) are shown in Table 1.
In the query, seven tables in the relational database are combined through LEFT JOINs on different keys to provide the information needed. Now, export the query results as a CSV file and load them in Python.
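Loading the exported results back into Python might look like the following minimal sketch; the file name geo_tweets.csv is an assumption and should match whatever name you gave the export.

import pandas as pd

# file name is an assumption -- use the name you gave the exported query results
geo_df = pd.read_csv('geo_tweets.csv')

# quick sanity checks: how many geo-tagged tweets were exported,
# and which columns are available for the hover tooltips later on
print(geo_df.shape)
print(geo_df.columns.tolist())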
Note that the geographical information for each tweet is stored as a bounding box on the map: the coordinates column contains the four corners of the bounding box. To get the corresponding x and y coordinates on a Mercator-projected map, some mathematical calculations are necessary. For reference, the functions I wrote for converting a bounding box into Mercator coordinates are shown below:
import math
from ast import literal_eval


def get_lon_lat(df, geo_type):
    """
    Get longitude and latitude from the dataframe
    Append two new columns to the original dataframe

    Parameters
    ----------
    df: pandas dataframe
        input dataframe containing geographical information
    geo_type: string
        the type of the geographical dataframe

    Returns
    -------
    df: pandas dataframe
        with longitude and latitude extracted
    """
    # if the input dataframe is geo
    if geo_type == 'geo':
        # then the longitude is the second value in the list
        df['longitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[1])
        # and the latitude is the first value
        df['latitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[0])
    # if the input dataframe is coordinates
    elif geo_type == 'coordinates':
        # it is the other way around
        # NOTE: geo and coordinates contain the same information
        df['longitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[0])
        df['latitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[1])
    # if the input dataframe is place
    # then the list contains the 4 corners of a bounding box
    # and we need to calculate the centroid of the bounding box
    elif geo_type == 'place':
        # define a function to extract the centroid
        def get_centroid(row):
            # literalize the str to a list
            lst = literal_eval(row)
            # get the minimal and maximal latitude
            lat0 = lst[0][0][1]
            lat1 = lst[0][1][1]
            # calculate the center
            lat = (lat0 + lat1) / 2
            # same for the longitude
            lon0 = lst[0][0][0]
            lon1 = lst[0][2][0]
            lon = (lon0 + lon1) / 2
            # return the centroid
            return (lon, lat)
        # apply the defined function to the whole dataset
        df['longitude'] = df['coordinates'].apply(lambda x: get_centroid(x)[0])
        df['latitude'] = df['coordinates'].apply(lambda x: get_centroid(x)[1])
    # return the new dataset with longitude and latitude as new columns
    return df


def lonlat2merc(row):
    """
    Convert the longitude and latitude of a geographical point to
    x and y coordinates on a Mercator projected map

    Parameters
    ----------
    row: float, one row of a pandas dataframe
        Before applying this function to a dataframe,
        make sure the dataframe contains two columns corresponding
        to the longitudinal and latitudinal numbers

    Returns
    -------
    (x, y): float tuple
        x and y coordinates of the point
    """
    # extract longitudinal and latitudinal info out of the dataframe row
    lon = row.iloc[0]
    lat = row.iloc[1]
    # do the conversion (spherical Mercator, WGS84 semi-major axis in meters)
    r_major = 6378137.000
    x = r_major * math.radians(lon)
    # the scale factor reduces to r_major * pi / 180,
    # which avoids dividing by zero when the longitude is exactly 0
    scale = r_major * math.pi / 180.0
    y = 180.0 / math.pi * math.log(math.tan(math.pi / 4.0 + lat * (math.pi / 180.0) / 2.0)) * scale
    # return the x and y coordinates
    return (x, y)


def add_xy_col(df):
    # simple function that applies the Mercator conversion to each row
    # and saves x and y separately to the dataframe
    df['merc_x'] = df[['longitude', 'latitude']].apply(lambda x: lonlat2merc(x)[0], axis=1)
    df['merc_y'] = df[['longitude', 'latitude']].apply(lambda x: lonlat2merc(x)[1], axis=1)
    return df
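Putting the helpers together, the workflow on the exported dataframe might look like this sketch; it assumes the geo-information came from the place table (so geo_type='place') and that geo_df is the dataframe loaded above.

# the 'coordinates' column holds the bounding-box string from the place table
geo_df = get_lon_lat(geo_df, geo_type='place')   # adds 'longitude' and 'latitude'
geo_df = add_xy_col(geo_df)                      # adds 'merc_x' and 'merc_y'

print(geo_df[['longitude', 'latitude', 'merc_x', 'merc_y']].head())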
Once the x and y coordinates are available, one can put the tweets onto the map. Figure 1 shows the demo I have created:
In this stand-alone Bokeh plot, the circles representing the tweets are shaded in a different color for each game, and the size of each circle is determined by the number of followers of the user who sent that tweet.
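For reference, a minimal sketch of how such a plot could be assembled with Bokeh is shown below. The tile-provider import follows the Bokeh 1.x-era API, and the column names (merc_x, merc_y, game_name, screen_name, followers_count, color, size) are assumptions that should match your own dataframe.

from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.tile_providers import CARTODBPOSITRON

# assumed columns: merc_x / merc_y from the conversion above, plus
# hover information and pre-computed 'color' and 'size' columns per tweet
source = ColumnDataSource(geo_df)

p = figure(x_axis_type='mercator', y_axis_type='mercator',
           title='Geo-tagged tweets by game')
p.add_tile(CARTODBPOSITRON)

# one circle per tweet, colored by game and sized by follower count
p.circle(x='merc_x', y='merc_y', source=source,
         color='color', size='size', fill_alpha=0.6, line_alpha=0)

# hover tooltip with some extra context for each tweet
p.add_tools(HoverTool(tooltips=[('game', '@game_name'),
                                ('user', '@screen_name'),
                                ('followers', '@followers_count')]))

output_file('tweet_map.html')
show(p)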
By looking at Figure 1, one can quickly draw several conclusions:
- Fortnite is extremely popular in North America, Europe, and eastern South America compared with other regions.
- Dota 2 is more popular in Southeast Asia.
- Very few tweets come from northern Asia or Africa, suggesting that most players there do not use Twitter.
- Even though Apex Legends and Fortnite are comparably popular in Japan, few people there attach their geo-location when posting on Twitter, hence the small number of points on that side of the map.
On the other hand, if one zooms in on the U.S. specifically, as shown in Figure 2:
The players of these games are mainly concentrated around Texas, California, and the Bay Area. You might also notice that there are only seven different colors for the circles on the map even though there are eight games in total; the missing one is GTA5. As mentioned before, only a very small portion of all the tweets have geo-information embedded, and unfortunately none of the tweets about GTA5 contains geographical information. In fact, only 709 (out of more than 160k) tweets have geo-location enabled, and the detailed counts for the points on the map are shown in Figure 3:
In Figure 3, Fortnite has the most points on the map, which makes it a little tricky to examine the geographical distribution of the other games. Bokeh provides powerful tools that let us interact with the plot through adjustments the user can make themselves. These interactive plots are called Bokeh applications, and they need to be hosted by connecting the code to a Bokeh server. When the user adjusts the parameters of the plot, the server updates the data accordingly and re-renders the figure on the client side.
Figure 4 is a brief demonstration of the application. You can play with the app through this link, which is hosted on an AWS (Amazon Web Services) EC2 instance.
With these additional controls added to the plot, the geo-information in the Twitter data becomes much easier to interpret (a minimal sketch of how the controls might be wired on the Bokeh server side follows the list):
- In the demo, there is a drop-down menu named Shading, which controls how the circles are colored. It has two options: Game Name, which colors the circles by the name of the game (refer to the legend at the bottom of the plot to distinguish the games), and Sentiment Score, which shades the circles according to the sentiment score of the tweet, so the more negative the tweet, the darker the circle.
- A group of checkboxes below the Shading drop-down menu lets the user pick which games to inspect individually on the map. This makes it easier to examine a specific game or a group of games.
- The last control is a slider called Hour, which selects the time window in which the tweets were posted. Only the tweets posted during that one-hour window are shown on the map, so the data can also be examined in the time domain. Setting Hour to 50 shows the tweets from all hours.
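The sketch below illustrates how these three controls might be wired up in a Bokeh server application. It is only a sketch under stated assumptions: full_df is the pre-processed dataframe with a game_name column and an hour column, the figure itself is built from source exactly as in the static example above, and the color re-mapping for the Shading option is omitted.

from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Select, CheckboxGroup, Slider

# full_df is assumed to hold the pre-processed tweets (Mercator coordinates,
# game names, sentiment scores, and an 'hour' column)
source = ColumnDataSource(full_df)

game_names = sorted(full_df['game_name'].unique().tolist())

shading = Select(title='Shading', value='Game Name',
                 options=['Game Name', 'Sentiment Score'])
games = CheckboxGroup(labels=game_names, active=list(range(len(game_names))))
hour = Slider(title='Hour', start=0, end=50, value=50, step=1)

def update(attr, old, new):
    # re-filter the dataframe whenever a widget changes and push the new
    # subset to the ColumnDataSource, which re-renders the plot on the client
    # (the color re-mapping for the Shading option is omitted in this sketch)
    selected_games = [game_names[i] for i in games.active]
    df = full_df[full_df['game_name'].isin(selected_games)]
    if hour.value < 50:                      # 50 means "show all hours"
        df = df[df['hour'] == hour.value]
    source.data = ColumnDataSource.from_df(df)

for widget, attr in [(shading, 'value'), (games, 'active'), (hour, 'value')]:
    widget.on_change(attr, update)

# the map figure built from `source` would be added to this layout as well;
# the app is then started with: bokeh serve --show app.py
curdoc().add_root(column(shading, games, hour))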
With all these tools available, here are a few more interesting observations about the data:
- CSGO is more prevalent in Europe than in any other region of the world, while ApexLegends is more prevalent in North America.
- Dota2 and Hearthstone have higher sentiment scores across the board.
- Most players’ tweets are neutral or slightly positive about these games.
Network Analysis
The NetworkX toolkit comes in handy for visualizing network data. In this section, I will explain how to extract network information from the massive amount of Twitter data stored in our database and create highly interactive Bokeh plots to visualize the networks.
There are three types of networks in the Twitter data: the reply network, the retweet network, and the quote network. Each of these networks conveys different information about the community, so we need to handle them individually. First, extract the reply network from the base_tweets table, specifically the tweet_user_id and in_reply_to_user_id columns. Then, use NetworkX to generate a network object from these two columns automatically.
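A minimal sketch of that step might look like the following; the dataframe reply_df and its two columns are assumed to come from a query against the base_tweets table.

import networkx as nx

# reply_df is assumed to have one row per reply, with the two columns
# mentioned above; rows without a reply target are dropped first
edges = reply_df.dropna(subset=['in_reply_to_user_id'])

# build a directed graph: an edge points from the replying user
# to the user being replied to
G = nx.from_pandas_edgelist(edges,
                            source='tweet_user_id',
                            target='in_reply_to_user_id',
                            create_using=nx.DiGraph())

print(G.number_of_nodes(), G.number_of_edges())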
For this experiment, I extracted the reply information only for Fortnite and removed the records where a Twitter user was replying to themselves, which results in a network with 1123 nodes and 860 edges. The network is visualized in Figure 5 with two different layouts. Note that you can check the id of a user by hovering your mouse over a node, and you can click-and-drag to select a group of nodes, which also highlights the edges and nodes connected to the selection.
The network visualization is highly informative for highlighting the nodes that have a higher degree (the number of adjacent edges) in a network, which can be crucial for network analysis. In Figure 5, the number of replies for a specific user is represented by the size of the node. In the spring layout (the plot on the left), the denser a cluster of nodes, the more users are connected to that node; the Kamada-Kawai layout (the plot on the right) expresses the same thing by placing the highly connected nodes at the center of the concentric circles. Figure 6 is another example, showing the retweet network for CSGO.
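A hedged sketch of how such a figure could be produced with Bokeh, assuming the reply graph G built above and a simple linear size scaling, is shown below; bokeh.plotting.from_networkx accepts any NetworkX layout function, so swapping in nx.kamada_kawai_layout gives the second panel.

import networkx as nx
from bokeh.plotting import figure, from_networkx, show

# drop self-replies (users replying to themselves), as described above
G.remove_edges_from(list(nx.selfloop_edges(G)))

# node size proportional to the number of adjacent edges (node degree)
degrees = dict(G.degree())
nx.set_node_attributes(G, {n: 3 + 2 * d for n, d in degrees.items()}, 'node_size')

p = figure(title='Reply network (spring layout)',
           x_range=(-1.1, 1.1), y_range=(-1.1, 1.1))

# from_networkx also accepts nx.kamada_kawai_layout for the other panel
graph_renderer = from_networkx(G, nx.spring_layout, scale=1, center=(0, 0))
graph_renderer.node_renderer.glyph.size = 'node_size'
p.renderers.append(graph_renderer)

show(p)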
Alternatively, to reduce the number of irrelevant nodes and make the plot cleaner, it might be better to plot only the nodes that have a high degree-centrality value along with their corresponding neighbors. Figure 7 shows the sub-graph formed from the top-10 nodes (in terms of degree centrality) and their neighbors.
Unlike the previous two figures, the plot on the left-hand side uses a circular layout so we can spot the nodes with the highest degree centrality. On the right-hand side, the sub-communities and sub-reply-networks are clustered more clearly, which makes it much easier to locate the high-impact Twitter users.
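The sub-graph itself could be built along these lines (a sketch assuming the same directed reply graph G as before):

import networkx as nx

# rank nodes by degree centrality and keep the top 10
centrality = nx.degree_centrality(G)
top_nodes = sorted(centrality, key=centrality.get, reverse=True)[:10]

# collect the top nodes together with all of their neighbors
keep = set(top_nodes)
for node in top_nodes:
    keep.update(G.successors(node))
    keep.update(G.predecessors(node))

# build the sub-graph; it can then be passed to from_networkx with
# e.g. nx.circular_layout, as in the left panel of Figure 7
sub_graph = G.subgraph(keep)
print(sub_graph.number_of_nodes(), sub_graph.number_of_edges())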
Conclusion
People often ignore or underestimate the power of visualizing data, yet data interpretation is no less critical than data processing or data modeling. Visualizing data in a more informative and reliable way is key to finding better solutions. In this post, I talked about how to create interactive applications to visualize Twitter data on a geospatial map using Bokeh, as well as how to plot different types of networks using NetworkX. The demos shown here are just illustrations of the concept built with minimal data, especially the map application, so there is still plenty of room for improvement in each of them.
As always, you can find the code for generating all these demos here. Feel free to let me know if you have any questions.
Thanks for reading 🙂