Building a Relational Database from Twitter Scraping

Introduction

Twitter is one of the largest social media platforms today. An enormous amount of information hides in the endless stream of tweets, and a great deal of useful insight can be extracted if it is approached correctly. Knowing how to search and manage data efficiently is especially crucial for a data scientist. Before doing any analysis, the data should be stored in a database that is easy to work with. In this post, I will talk about how to scrape millions of tweets posted within 48 hours using the Twitter API and the Python package Tweepy, how to automatically save the data into a PostgreSQL or SQLite database for better management, and finally how to add constraints to the relational database to make it well structured. Once the data is acquired and stored in good shape, the downstream work can be handled much more efficiently.

As a video-game enthusiast, I will take advantage of these data science tools to investigate the current trends of the most popular video games through scraping Twitter data: figuring out how the population of each video-game community is distributed globally, creating a geospatial map that shows which region has the highest live activity for each game, plotting the sentiment score of tweets towards each game over time, and so on. All this information hidden behind the massive Twitter data can feed multiple potential applications that predict the future of various video games.

Scrape Twitter Data

Unlike scraping other websites, where hard-coding is needed to extract information from specific sections or divs of a webpage, the Twitter API makes data extraction much easier by summarizing all the information related to a particular tweet in a JSON file. Before exploring Twitter data, let's first take a look at how the Twitter API works.

The Twitter APIs include the Search API, the Ads API, and the Streaming API. The Search API allows us to search for specific tweets from the past week. The Ads API provides information about advertisements on Twitter. And the Streaming API lets us collect tweets in real time with:

  1. Specific keywords, user IDs, or locations, called filtering.
  2. A random sample of 1% of all tweets, called sampling.

In this project I only use the Streaming API, which can be set up easily with the Python package Tweepy. The code shown below sets up the stream connection and can be executed through a command-line tool:

# -*- coding: utf-8 -*-
"""
This script sets up tweepy and hears twitter streams
@author: Shihao Ran
         STIM Laboratory
"""
from tweepy import OAuthHandler
from tweepy import API
from tweepy import Stream
from slistener import SListener
from urllib3.exceptions import ProtocolError
# consumer key authentication
consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
auth = OAuthHandler(consumer_key, consumer_secret)
# access key authentication
access_token = 'YOUR ACCESS TOKEN'
access_token_secret = 'YOUR TOKEN SECRET'
auth.set_access_token(access_token, access_token_secret)
# set up the API with the authentication handler
api = API(auth)
# set up words to hear
keywords_to_hear = ['#GrandTheftAutoV',
                    '#LeagueOfLegends',
                    '#Fortnite',
                    '#dota2',
                    '#CSGO',
                    '#ApexLegends',
                    '#Hearthstone',
                    '#overwatch']
# instantiate the SListener object
listen = SListener(api)
# instantiate the stream object
stream = Stream(auth, listen)
# begin collecting data
while True:
    # maintain the connection unless interrupted
    try:
        stream.filter(track=keywords_to_hear)
    # reconnect automatically if an error arises
    # due to an unstable network connection
    except (ProtocolError, AttributeError):
        continue

 

As shown in the code, we first import the tweepy package, then authenticate with the Twitter API using the consumer key-secret pair as well as the access token-secret pair of the account that will be used to stream the tweets. Next, we specify the keywords we want to capture in the tweets, in this case the hashtags of eight video games. Finally, we keep the stream.filter() command inside a while True loop so the stream stays connected unless interrupted manually. The except clause handles the occasional poor network connection, which may disconnect our workstation from the Twitter API, and reconnects the stream automatically if an error pops up. Before launching the stream, we instantiate a SListener object, which tells tweepy how to handle the incoming tweets. The code for SListener is shown below:

# -*- coding: utf-8 -*-
"""
this script defines the streaming behavior passed to tweepy
@author: Shihao Ran
         STIM Laboratory
"""
# import packages
from tweepy import API
from tweepy.streaming import StreamListener
import json
import time
import sys
# inherit from StreamListener class
class SListener(StreamListener):
    def __init__(self, api = None, fprefix = 'streamer'):
        # define the filename with time as prefix
        self.api = api or API()
        self.counter = 0
        self.fprefix = fprefix
        self.output  = open('%s_%s.json' % (self.fprefix, time.strftime('%Y%m%d-%H%M%S')), 'w')
    def on_data(self, data):
        if  'in_reply_to_status' in data:
            self.on_status(data)
        elif 'delete' in data:
            delete = json.loads(data)['delete']['status']
            if self.on_delete(delete['id'], delete['user_id']) is False:
                return False
        elif 'limit' in data:
            if self.on_limit(json.loads(data)['limit']['track']) is False:
                return False
        elif 'warning' in data:
            warning = json.loads(data)['warning']
            print("WARNING: %s" % warning['message'])
            return
    def on_status(self, status):
        # write the raw tweet, and start a new output file
        # once 20,000 tweets have been written
        self.output.write(status)
        self.counter += 1
        if self.counter >= 20000:
            self.output.close()
            self.output  = open('%s_%s.json' % (self.fprefix, time.strftime('%Y%m%d-%H%M%S')), 'w')
            self.counter = 0
        return
    def on_delete(self, status_id, user_id):
        print("Delete notice")
        return
    

Twitter JSON File

Now let's take a look at how the Twitter API passes information about a specific tweet into our streamed files. Once the streaming pipeline is established, the code provided above automatically captures every single tweet that contains any of the specified keywords. Take one of my tweets as an example:

Since this tweet mentions @LeagueOfLegends, and I was running the program before I posted it, it was successfully captured by our program and saved as a JSON record on my computer.

If we open the output file in a text editor, the tweet above looks like this:

{'created_at': 'Sun May 12 16:03:53 +0000 2019',
 'id': 1127605322581987329,
 'id_str': '1127605322581987329',
 'text': 'I LOVE LEAGUE OF LEGENDS! @LeagueOfLegends',
 'source': 'Twitter Web Client',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 4790635572,
          'id_str': '4790635572',
          'name': 'Shihao Ran',
          'screen_name': 'shihao1007',
          'location': 'Houston, TX',
          'url': 'https://shihaojran.com',
          'description': 'PhD :D',
          'translator_type': 'none',
          'protected': False,
          'verified': False,
          'followers_count': 4,
          'friends_count': 43,
          'listed_count': 0,
          'favourites_count': 126,
          'statuses_count': 3,
          'created_at': 'Wed Jan 20 22:54:19 +0000 2016',
          'utc_offset': None,
          'time_zone': None,
          'geo_enabled': False,
          'lang': 'en',
          'contributors_enabled': False,
          'is_translator': False,
          'profile_background_color': 'F5F8FA',
          'profile_background_image_url': '',
          'profile_background_image_url_https': '',
          'profile_background_tile': False,
          'profile_link_color': '1DA1F2',
          'profile_sidebar_border_color': 'C0DEED',
          'profile_sidebar_fill_color': 'DDEEF6',
          'profile_text_color': '333333',
          'profile_use_background_image': True,
          'profile_image_url': 'http://pbs.twimg.com/profile_images/1115448619878973440/Chg2Uabn_normal.jpg',
          'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1115448619878973440/Chg2Uabn_normal.jpg',
          'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4790635572/1554778651',
          'default_profile': True,
          'default_profile_image': False,
          'following': None,
          'follow_request_sent': None,
          'notifications': None},
 'geo': None,
 'coordinates': None,
 'place': None,
 'contributors': None,
 'is_quote_status': False,
 'quote_count': 0,
 'reply_count': 0,
 'retweet_count': 0,
 'favorite_count': 0,
 'entities': {'hashtags': [],
              'urls': [],
              'user_mentions': [{'screen_name': 'LeagueOfLegends',
                                 'name': 'League of Legends',
                                 'id': 577401044,
                                 'id_str': '577401044',
                                 'indices': [26, 42]}],
              'symbols': []},
 'favorited': False,
 'retweeted': False,
 'filter_level': 'low',
 'lang': 'en',
 'timestamp_ms': '1557677033454'}

 

The JSON file is essentially a collection of key-value pairs that describe every field of that single tweet, such as the time the tweet was created, its text, the user who sent it, and so on. In Python, these key-value pairs can be conveniently stored in dictionaries. From there, we can use pandas to convert a dictionary into a DataFrame without manually specifying the name of each column. Then, we can use SQLAlchemy to create our database and convert all the records in the DataFrame into a table inside that database. The complete workflow is shown in Figure 1.

Figure 1. Workflow for creating a relational database from scraping Twitter data.

As you may have noticed, the JSON file can be considered a dictionary of dictionaries, since some values in the top-level dictionary are themselves dictionaries, 'user' for example. In fact, if a tweet contains geographical information such as the place where it was created, or if it is a retweet or a quoted tweet, the JSON file contains additional nested fields with the corresponding information. While a pandas DataFrame can hold a dictionary as a row element, a database table cannot. Therefore, we need a small tweak when converting the raw JSON into DataFrames: save these child dictionaries as individual tables in the database, and connect them to the parent table through constraints such as the common field tweet_id (more on this later).
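To make this concrete, here is a minimal sketch of how a streamed file might be split into a parent DataFrame and a child DataFrame. This is not the exact code linked below: the filename is hypothetical, the selected columns are only illustrative, and it assumes one JSON object per line.

import json
import pandas as pd

tweet_rows = []   # flat, top-level fields -> base_tweets table
user_rows = []    # nested 'user' dictionaries -> their own table

# hypothetical filename; assumes one JSON-encoded tweet per line
with open('streamer_20190512-160353.json') as infile:
    for line in infile:
        if not line.strip():
            continue
        tweet = json.loads(line)
        # skip delete/limit/warning notices
        if 'created_at' not in tweet:
            continue
        tweet_rows.append({'tweet_id': tweet['id'],
                           'created_at': tweet['created_at'],
                           'text': tweet['text'],
                           'lang': tweet['lang']})
        # keep the child dictionary as its own record, linked
        # back to the parent through the common tweet_id field
        user = dict(tweet['user'])
        user['tweet_id'] = tweet['id']
        user_rows.append(user)

base_tweets_df = pd.DataFrame(tweet_rows)
tweet_user_df = pd.DataFrame(user_rows)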

Examining and extracting useful information from a large JSON file can be tedious and varies from application to application. To create a new database and store all these DataFrames as individual tables, we can use SQLAlchemy in a Python environment. SQLAlchemy also allows creating a table from a pandas DataFrame without manually specifying column names and column types, which can be a life-saver when there are lots of tables and lots of columns to insert into a database.

Once the database server is installed on your workstation, you can create a new database using SQLAlchemy. I will use PostgreSQL as an example here. As shown in the following code, the first step is to create an engine that connects to the database server by specifying the dialect, username, password, host, and port:

from sqlalchemy import create_engine

# create an engine that connects to the database server
engine = create_engine('postgresql://username:password@host:port')
# connect to the default database
con = engine.connect()
# end the open transaction
con.execute('commit')
# create the new database my_database
con.execute('create database my_database')
# close the connection
con.close()

Then, make a connection to the default database. Since the connection opens a transaction with the default database (SQLAlchemy always runs queries inside a transaction), end the open transaction by calling con.execute('commit'), then create the new database and close the connection. Once the database is created, the last step is creating tables from the DataFrames, which is easily handled by the pandas DataFrame.to_sql() function.
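As a rough sketch of that last step, reusing the hypothetical DataFrames from the earlier snippet (the connection string is a placeholder and the if_exists policy is my own choice):

from sqlalchemy import create_engine

# connect to the newly created database this time
engine = create_engine('postgresql://username:password@host:port/my_database')

# pandas infers the column names and types from each DataFrame
base_tweets_df.to_sql('base_tweets', engine, if_exists='append', index=False)
tweet_user_df.to_sql('tweet_user', engine, if_exists='append', index=False)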

You can refer to my code for saving JSON files of raw tweets into tables in a PostgreSQL database here.

Building a Relational Database

For my video-game geospatial map application, I streamed more than 160,000 tweets within 48 hours using 8 keywords corresponding to the 8 most popular video games (rank information provided by Twitch.com). By running the code linked above, the tweets are loaded into a database.

Cool, now all the streamed tweets are stored in a database that has multiple tables corresponding to different fields of the tweets, which looks like this:

Figure 2. Existing tables in our database.

In Figure 2, all the tables are presented as individual entities: they do not have any explicit relationships between them, nor do they have primary keys or foreign keys specified. While this database can be used as it is, there is A LOT of redundant information hiding in most of the tables.

Take the tweet_user table as an example: what happens if we run the following query through pgAdmin 4?
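The query looks something like this, assuming (as described below) that the tweet id column is named tweet_id in base_tweets and the user id column is named id in tweet_user:

SELECT
    (SELECT COUNT(DISTINCT tweet_id) FROM base_tweets) AS num_tweets,
    (SELECT COUNT(DISTINCT id) FROM tweet_user) AS num_users;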

This SQL query asks for the number of distinct tweet_id values in the base_tweets table, as well as the number of distinct id values in the tweet_user table. It essentially counts the number of unique tweets and unique Twitter users in our database. Note that there is a column called tweet_id in every table, so the base_tweets table and the tweet_user table have the same number of records. However, a Twitter user might tweet more than once during the streaming window, so the number of distinct users should be less than the total number of tweets. The query results are shown in Figure 3.

Figure 3. The query results.

Well, that is quite interesting. The number of tweets streamed is almost twice the number of users, from which we can draw a rough conclusion that, on average, people in the gaming community tweeted about twice over the 48-hour streaming window. While this side information is fun to know, it also tells us that nearly half of the records in the tweet_user table are redundant. Before deleting the duplicate rows, however, we need to save the mapping between each tweet and its user into the base_tweets table so the two tables can still be joined later. We can do that with a query that adds a new column to one table and fills it with values from another table:
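A sketch of such a query, using the column names described in this post (the bigint type for the new column is an assumption):

ALTER TABLE base_tweets
    ADD COLUMN tweet_user_id BIGINT;

UPDATE base_tweets
SET tweet_user_id = tweet_user.id
FROM tweet_user
WHERE base_tweets.tweet_id = tweet_user.tweet_id;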

Now there is a new column called tweet_user_id in the base_tweets table, which can be used later as a foreign key linking it to the tweet_user table. To delete the duplicates in the tweet_user table, we can use the following query:
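Roughly, a self-join DELETE like this (a sketch; the exact statement I ran may differ slightly):

DELETE FROM tweet_user a
USING tweet_user b
WHERE a.id = b.id
  AND a.tweet_id < b.tweet_id;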

In the query above, by specifying a.tweet_id < b.tweet_id, the statement keeps only the record with the highest tweet_id for each user and removes all records with lower tweet_id values.

Now all the duplicated ids in the tweet_user table are removed, so the id column contains only unique values. We can use it as the primary key of this table and reference it from the base_tweets table as a foreign key to establish the connection. To set it as the primary key of tweet_user and as a foreign key of base_tweets, execute the following query:
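In PostgreSQL this boils down to two ALTER TABLE statements (a sketch; the constraint name here is arbitrary):

ALTER TABLE tweet_user
    ADD PRIMARY KEY (id);

ALTER TABLE base_tweets
    ADD CONSTRAINT base_tweets_tweet_user_fk
    FOREIGN KEY (tweet_user_id)
    REFERENCES tweet_user (id);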

Now the “relationship” between these two tables is established: extra information about the user who sent a tweet in the base_tweets table can be looked up in the tweet_user table through the tweet_user_id column. Likewise, repeating this process for the other tables reduces data redundancy. For tables that do not have a distinct primary key of their own, just use the tweet_id column.

After removing duplicate records, dropping unnecessary columns, and adding constraints to all the tables, our database looks like this:

Figure 4. Relational database diagram showing all columns.

It can also be presented in a cleaner form, as shown here:

Figure 5. Relational database diagram showing only primary keys.
Figure 6. Relational database diagram showing only relationships.

Conclusion

Twitter is an excellent resource for data scientists with all kinds of data mining needs. Meanwhile, it can also be overwhelming to deal with all the complex JSON fields returned by the Twitter API. Knowing how to collect and manage the data from tons of tweets efficiently is crucial.

To summarize, in this post I talked about:

  1. Using tweepy to set up a streaming pipeline with the Twitter Streaming API to collect tweets with specific keywords, user IDs, or locations.
  2. Using pandas and SQLAlchemy to load JSON files of raw tweets and save them into a PostgreSQL database.
  3. Using SQL queries to build a relational database and minimize data redundancy.

So far, all of this work is just data preprocessing. In the next post, I will talk about some more exciting work that uses all the gathered information to do some fundamental exploratory data analysis.

Again, the code used for this post can be found here.

Thanks for reading 🙂