Geographical Visualization and Network Analysis for Twitter Data

Introduction

In the previous post, I did some fundamental Exploratory Data Analysis (EDA) on the streamed Twitter data. While several essential pieces of information were extracted, we were only looking at a tiny portion of the entire database. This post examines two other hugely impactful data sources: geographical data and social-network data. The geographical information can be turned into a figure by visualizing the Twitter data on a geospatial map (like the demo shown here), while the network analysis can be carried out by inspecting several user-based tables in our database.

Why put Twitter data on a map? First, it allows us to observe how the collected tweets are concentrated spatially. Second, it enables us to work with an area of interest (if we have one) more efficiently. And why network analysis? First, it helps to identify the user (or users) with the most significant potential to impact a field. Second, it helps to cluster the user base based on some preliminary conditions. Of course, there are many other advantages of doing geographical and social-network analysis on Twitter data, and some of them are demonstrated in this post.

Visualize Twitter Data on a Map

First, let us visualize the collected tweets geographically by denoting each one of them as a dot (circle) on a Mercator-projected map. There are a few steps of data preparation:

  1. Only 1-3% of all tweets have geographical information embedded in their JSON payload, hence we need to combine the base_tweets table with the place table and then filter out the tweets without geographical information, keeping only those that can be displayed on our map.
  2. When hovering the mouse over a specific point on the map, it would be better to show some extra information about that tweet, such as which game it is about, which user sent it, and how many followers that user currently has (and thus how big the impact of that tweet is). To do this, join more tables into the base_tweets table, including tweet_user, hashtags, text, etc.
  3. Sometimes the keyword for a tweet is not contained in the hashtags or text field; it might be mentioned in the quoted tweet or retweet instead. To handle tweets like this, more tables need to be combined into the base_tweets table, such as the retweeted_tweet table.

The following query does all the work I just mentioned, and the query results (10 records) are shown in Table 1.

Table 1. Queried data for visualizing tweets geographically.

In the query, seven tables in the relational database are combined through LEFT JOINs on different keys to provide the information needed. Now, export the query results as a CSV file and load them in Python.
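For a sense of what such a query could look like, here is a minimal sketch. The table names come from the description above, but the join keys, aliases, and selected columns are assumptions for illustration only, not the exact query or schema used in this project; geo_tweets.csv is a hypothetical name for the exported results.

import pandas as pd

# Rough sketch of the kind of multi-table LEFT JOIN described above.
# Join keys and selected columns are assumptions and will differ from
# the actual schema.
query = """
SELECT b.id AS tweet_id,
       u.user_id, u.followers_count,
       h.hashtags, t.text, r.text AS retweeted_text,
       p.place_type, p.coordinates
FROM base_tweets AS b
LEFT JOIN tweet_user      AS u ON b.user_id  = u.user_id
LEFT JOIN hashtags        AS h ON b.id       = h.tweet_id
LEFT JOIN text            AS t ON b.id       = t.tweet_id
LEFT JOIN retweeted_tweet AS r ON b.id       = r.tweet_id
LEFT JOIN place           AS p ON b.place_id = p.place_id
WHERE p.coordinates IS NOT NULL;
"""

# The query is run against the database and the results are exported to a
# CSV file, which is then loaded in Python:
geo_df = pd.read_csv('geo_tweets.csv')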

Note that the geographical information for each tweet is stored as a bounding box on the map, and the coordinates column contains the four corners of that bounding box. To get the corresponding x and y coordinates on a Mercator-projected map, some mathematical calculations are necessary. For your reference, the functions I wrote for converting a bounding box into Mercator coordinates are shown below:

import math
# literal_eval safely parses the string representation of a coordinate list
from ast import literal_eval


def get_lon_lat(df, geo_type):
    """
    Get longitude and latitude from the dataframe
    Append two new columns into the original dataframe
    Parameters
    ----------
        df: pandas dataframe
            input dataframe containing geographical information
        geo_type: string
            the type of the geographical dataframe
    Returns
    -------
        df: pandas dataframe
            with longitude and latitude extracted
    """
    # if the input dataframe is geo
    if geo_type == 'geo':
        # then the longitude is the second value in the list
        df['longitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[1])
        # and the latitude is the first value
        df['latitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[0])
    # if the input dataframe is coordinates
    elif geo_type == 'coordinates':
        # it is the other way around
        # NOTE: geo and coordinates contain the same information
        df['longitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[0])
        df['latitude'] = df['coordinates'].apply(lambda x: literal_eval(x)[1])
    # if the input dataframe is place
    # then list contains 4 points of a bounding box
    # we need to calculate the centroid of the bounding box
    elif geo_type == 'place':
        # define a function to extract the centroid
        def get_centroid(row):
            # literalize the str to a list
            lst = literal_eval(row)
            # get the minimal and maximal latitude
            lat0 = lst[0][0][1]
            lat1 = lst[0][1][1]
            # calculate the center
            lat = (lat0 + lat1)/2
            # same for the longitude
            lon0 = lst[0][0][0]
            lon1 = lst[0][2][0]
            lon = (lon0 + lon1)/2
            # return the centroid
            return (lon, lat)
        # apply the defined function to the whole dataset
        df['longitude'] = df['coordinates'].apply(lambda x: get_centroid(x)[0])
        df['latitude'] = df['coordinates'].apply(lambda x: get_centroid(x)[1])
    # return the new dataset with longitude and latitude as new columns
    return df

def lonlat2merc(row):
    """
    Convert the longitude and latitude of a geographical point to
    x and y coordinates on a Mercator projected map
    Parameters
    ----------
        row: pandas Series, one row of a dataframe
            Before applying this function to a dataframe,
            make sure its first two columns correspond
            to longitude and latitude (in that order)
    Returns
    -------
        (x, y): float tuple
            x and y coordinates of the points
    """
    # extract longitudinal and latitudinal info out of the dataframe row
    lon = row.iloc[0]
    lat = row.iloc[1]
    # do the conversion (Web Mercator, WGS84 major radius in meters)
    r_major = 6378137.000
    x = r_major * math.radians(lon)
    # constant scale factor (r_major * pi / 180); computing it directly
    # avoids the division by zero that x/lon causes when lon == 0
    scale = r_major * math.pi / 180.0
    y = 180.0/math.pi * math.log(math.tan(math.pi/4.0 + lat * (math.pi/180.0)/2.0)) * scale
    # return the X and Y coordinates
    return (x, y)

def add_xy_col(df):
    # simple function that applies the Mercator conversion to each row
    # and saves x and y separately to the dataframe
    df['merc_x'] = df[['longitude', 'latitude']].apply(lambda x: lonlat2merc(x)[0], axis=1)
    df['merc_y'] = df[['longitude', 'latitude']].apply(lambda x: lonlat2merc(x)[1], axis=1)
    return df
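As a quick usage sketch, assuming the exported query results were saved to the hypothetical geo_tweets.csv and that they carry place-based bounding boxes, the helpers above can be chained like this:

import pandas as pd

geo_df = pd.read_csv('geo_tweets.csv')           # hypothetical file name
geo_df = get_lon_lat(geo_df, geo_type='place')   # bounding boxes -> lon/lat centroids
geo_df = add_xy_col(geo_df)                      # lon/lat -> Web Mercator x/y
print(geo_df[['longitude', 'latitude', 'merc_x', 'merc_y']].head())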

 

Once the x and y coordinates are available, one can put the tweets onto the map. Figure 1 shows the demo I have created:

Figure 1. Geographical map for Twitter data of 8 video games.

In this stand-alone Bokeh plot, the circles for the tweets are shaded in different colors for different games, and the size of each circle is determined by the number of followers of the user who sent that tweet.
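For reference, below is a minimal sketch of how such a plot can be assembled. It assumes a recent Bokeh version (3.x, where add_tile accepts a tile-provider name) and the hypothetical geo_df from above with game_name, user_screen_name, and followers_count columns; the actual demo has more styling and richer tooltips.

from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Category10
from bokeh.plotting import figure, show
from bokeh.transform import factor_cmap

games = sorted(geo_df['game_name'].unique().tolist())
# crude circle-size scaling by follower count
geo_df['size'] = 5 + geo_df['followers_count'] ** 0.25
source = ColumnDataSource(geo_df)

p = figure(x_axis_type='mercator', y_axis_type='mercator',
           title='Geo-tagged tweets for 8 video games')
p.add_tile('CartoDB Positron')    # background tile map (Bokeh 3.x string API)
p.scatter(x='merc_x', y='merc_y', size='size', alpha=0.7, source=source,
          color=factor_cmap('game_name', palette=Category10[len(games)], factors=games),
          legend_field='game_name')
p.add_tools(HoverTool(tooltips=[('game', '@game_name'),
                                ('user', '@user_screen_name'),
                                ('followers', '@followers_count')]))
show(p)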

By looking at Figure 1, one can quickly draw several conclusions:

  1. Fortnite is hugely popular in North America, Europe, and eastern South America compared to other regions.
  2. Dota 2 is more popular in Southeast Asia.
  3. Very few geo-tagged tweets come from North Asia or Africa, suggesting most players there do not use Twitter.
  4. Even though Apex Legends and Fortnite are comparably popular in Japan as well, not many people there attach their geo-location when they post on Twitter, hence the very few points on that side of the map.

On the other hand, one can zoom the map in on the U.S. specifically, as shown in Figure 2:

Figure 2. Geographical map zoomed in on the U.S.

The players of these games are mainly concentrated around Texas and California, particularly the Bay Area. You might have noticed that there are only seven different colors for the circles on the map while there are eight games in total; the missing one is GTA5. As mentioned before, only a very small portion of all the tweets have geo-information embedded, and unfortunately none of the tweets about GTA5 contain geographical information. In fact, only 709 (out of more than 160k) tweets have their geo-location enabled. The detailed numbers for all the points on the map are shown in Figure 3:

Figure 3. Bar chart for the count of each game shown on the map.
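A bar chart like Figure 3 can be produced from the same dataframe; here is a rough sketch, again assuming the hypothetical geo_df and its game_name column:

from bokeh.plotting import figure, show

counts = geo_df['game_name'].value_counts()
bar = figure(x_range=list(counts.index), title='Geo-tagged tweets per game')
bar.vbar(x=list(counts.index), top=counts.values, width=0.8)
bar.xaxis.major_label_orientation = 0.8   # tilt the game names
show(bar)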

In Figure 3, Fortnite has the most points on the map, which makes it a little tricky to examine the geographical distribution of the other games. Bokeh provides powerful tools that let us interact with the plot through additional controls the user can define. These interactive plots are called Bokeh applications, and they need to be hosted by connecting the code to a Bokeh server. When the user adjusts the controls, the server updates the data accordingly and replots the figure on the client side.

Figure 4 is a brief demonstration of the application. You can play with the app through this link, which is hosted on an AWS (Amazon Web Services) EC2 instance.

Figure 4. Interactive Bokeh application for video-game geographical visualization. The demo is recorded by Bandicam (free version), so you see its watermark on top of the image.

Now that more controls have been added to the plot, the geographical information in the Twitter data is much easier to interpret (a rough sketch of how such controls can be wired up follows the list):

  1. In the demo, there is a drop-down menu named Shading, which allows us to choose how the circles are shaded. It has two options: Game Name, which colors the circles by the name of the game (refer to the legend at the bottom of the plot to distinguish the games), and Sentiment Score, which shades the circles depending on the sentiment scores of the tweets; the more negative the tweet, the darker the circle.
  2. A group of checkboxes below the Shading drop-down menu allows the user to pick which games they want to inspect individually on the map. This additional feature makes it easier to examine a specific game or a group of games.
  3. The last control is a slider called Hour, which indicates the time window in which the tweets were posted. Only the tweets posted during that one-hour window are shown on the map, so one can also examine the data in the time domain. Setting Hour to 50 shows the tweets from all hours.
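The sketch below shows how controls like these might be wired to a Bokeh server application. It is a simplified, assumption-laden outline rather than the actual app: the hour column, the update logic, and the figure p (the map built from source in the earlier sketch) are all placeholders.

from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import CheckboxGroup, ColumnDataSource, Select, Slider

games = sorted(geo_df['game_name'].unique().tolist())
source = ColumnDataSource(geo_df)

shading = Select(title='Shading', value='Game Name',
                 options=['Game Name', 'Sentiment Score'])
game_boxes = CheckboxGroup(labels=games, active=list(range(len(games))))
hour = Slider(title='Hour', start=0, end=50, value=50, step=1)

def update(attr, old, new):
    # filter by the checked games and the selected hour, then push the
    # subset to the ColumnDataSource so the map redraws on the client
    picked = [games[i] for i in game_boxes.active]
    subset = geo_df[geo_df['game_name'].isin(picked)]
    if hour.value < 50:                      # 50 stands for "all hours"
        subset = subset[subset['hour'] == hour.value]
    source.data = ColumnDataSource.from_df(subset)
    # re-shading by game name vs. sentiment score is omitted in this sketch

shading.on_change('value', update)
game_boxes.on_change('active', update)
hour.on_change('value', update)

curdoc().add_root(column(shading, game_boxes, hour, p))
# run with:  bokeh serve --show app.py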

With all these tools available, here are a few more interesting conclusions about the data:

  1. CSGO is more prevalent in Europe than in any other region of the world, while Apex Legends is more prevalent in North America.
  2. Dota 2 and Hearthstone have higher sentiment scores across the board.
  3. Most players’ tweets about these games are neutral or slightly positive.

Network Analysis

A toolkit called NetworkX comes in handy when visualizing network data. In this section, I will explain how to extract the network information from the massive amount of Twitter data stored in our database and create highly interactive Bokeh plots to visualize the networks.

There are three types of networks in Twitter data: the reply network, the retweet network, and the quote network. Each of these networks conveys different information about the community, so we need to handle them individually. First, extract the reply network from the base_tweets table, specifically the tweet_user_id and in_reply_to_user_id columns. Then, use NetworkX to generate a network object from these two columns automatically, as sketched below.
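A minimal sketch of this step, assuming the two columns have already been pulled into a pandas dataframe named reply_df (the variable name and the exact query are assumptions):

import networkx as nx

# keep only tweets that are actually replies, and drop self-replies
reply_df = reply_df.dropna(subset=['in_reply_to_user_id'])
reply_df = reply_df[reply_df['tweet_user_id'] != reply_df['in_reply_to_user_id']]

# build a directed reply network: an edge points from the replier to the repliee
G = nx.from_pandas_edgelist(reply_df,
                            source='tweet_user_id',
                            target='in_reply_to_user_id',
                            create_using=nx.DiGraph())
print(G.number_of_nodes(), G.number_of_edges())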

For this experiment, I extracted the reply information only for Fortnite and removed the records where a Twitter user was replying to him- or herself, which results in a network with 1123 nodes and 860 edges. The network is visualized in Figure 5 with two different layouts. Note that you can check the id of a user by hovering your mouse over a node, and you can click and drag to select a group of nodes; the nodes and edges connected to the selection are highlighted as well.

Figure 5. Reply-network for Fortnite in Spring layout (left) and Kamada-Kawai layout (right).
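As a reference, the sketch below shows how a graph like the G built above can be handed to Bokeh through from_networkx (assuming a recent Bokeh version); the node sizing, hover tooltips, and selection highlighting used in the real figures are configured separately and are omitted here.

import networkx as nx
from bokeh.plotting import figure, from_networkx, show

# node degree, used in the post to size the nodes (not wired into the glyph here)
degrees = dict(G.degree())

plot = figure(title='Reply network for Fortnite (spring layout)',
              tools='hover,box_select,pan,wheel_zoom,reset')
# from_networkx runs the layout function and builds a graph renderer
graph = from_networkx(G, nx.spring_layout, scale=2, center=(0, 0))
plot.renderers.append(graph)
show(plot)

# the Kamada-Kawai version only swaps the layout function:
# graph = from_networkx(G, nx.kamada_kawai_layout, scale=2, center=(0, 0))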

The network visualization is highly informative in terms of highlighting the nodes that have a higher degree (number of adjacent edges), which can be crucial for network analysis. In Figure 5, the number of replies for a specific user is characterized by the size of the node. In the spring layout (left plot), the denser a cluster of nodes is, the more users are connected to the central node; the Kamada-Kawai layout (right plot) expresses the same thing by putting the highly connected nodes at the center of the concentric circles. Figure 6 is another example, the retweet network for CSGO.

Figure 6. Retweet-network for CSGO in Spring layout (left) and Kamada-Kawai layout (right).

Alternatively, to reduce the number of irrelevant nodes and make the plot cleaner, it might be a better idea to plot only the nodes that have a high degree-centrality value, together with their neighbors. Figure 7 shows the sub-graph formed from the top-10 nodes (in terms of degree centrality) and their neighbors.

Figure 7. Sub-retweet-network for CSGO in circular layout (left) and Kamada-Kawai layout (right).
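A sub-graph like this can be built with a few lines of NetworkX; here is a minimal sketch, assuming the graph G from the earlier sketch (the cutoff of ten nodes comes from the post):

import networkx as nx

# rank the nodes by degree centrality and keep the top 10
centrality = nx.degree_centrality(G)
top_nodes = sorted(centrality, key=centrality.get, reverse=True)[:10]

# collect the top nodes together with all of their neighbors
keep = set(top_nodes)
for node in top_nodes:
    keep.update(nx.all_neighbors(G, node))   # predecessors and successors

# induced sub-graph, ready for the circular or Kamada-Kawai layout
sub = G.subgraph(keep)
print(sub.number_of_nodes(), sub.number_of_edges())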

Unlike the previous two figures, the plot on the left-hand side uses a circular layout, so we can easily spot the nodes that have the highest degree centrality. On the right-hand side, the sub-communities and sub-networks are better clustered, which makes it much easier to locate the high-impact Twitter users.

Conclusion

People often ignore or underestimate the power of visualizing data, yet data interpretation is no less critical than data processing or data modeling. Visualizing data in an informative and reliable way is key to finding better solutions. In this post, I talked about how to create interactive applications for visualizing Twitter data on a geospatial map using Bokeh, as well as how to plot different types of networks using NetworkX. The demos shown here are just illustrations of the concept with minimal data, especially the map application, so there is still a lot of room for improvement in each of them.

As always, you can find the code for generating all these demos here. Feel free to let me know if you have any questions.

Thanks for reading 🙂