Building a Relational Database from Twitter Scraping
Introduction
Twitter is one of the largest social media platforms today. The amount of information hidden in millions of tweets is enormous, and a tremendous amount of useful insight can be extracted if it is approached correctly. Knowing how to search and manage data efficiently is especially crucial for a data scientist, and before doing any analysis, the data should be stored in a database that is easy to work with. In this post, I will talk about how to scrape data from the millions of tweets posted within a 48-hour window using the Twitter API and the Python package Tweepy, how to save the data into a PostgreSQL or SQLite database automatically for better management, and finally how to add constraints to the relational database to make it well structured. Once the data is acquired and stored in good shape, the downstream work can be handled much more efficiently.
As a video-game enthusiast, I will take advantage of these data science tools to investigate the current trends of the most popular video games by scraping Twitter data: figuring out how the community of each video game is distributed globally, creating a geospatial map that displays which region has the highest live activity for each game, plotting the sentiment of tweets towards each video game over time, and so on. All of this information hidden behind the massive volume of Twitter data can feed multiple potential applications for predicting the future of various video games.
Scrape Twitter Data
Unlike scraping other websites, where hard-coding is necessary to extract information from specific sections or divs of a webpage, the Twitter API makes data extraction a lot easier by summarizing all the information related to a particular tweet into a JSON document. Before exploring Twitter data, let’s first take a look at how the Twitter API works.
The Twitter APIs include the Search API, the Ads API, and the Streaming API. The Search API allows us to search for specific tweets from the past week. The Ads API provides information about advertisements on Twitter. And the Streaming API lets us collect tweets in real time with:
- Specific keywords, user ID, or locations, called filtering.
- 1% of the total tweets randomly sampled, called sampling.
I only use the Streaming API, which can be set up easily with the Python package Tweepy. The code shown below sets up the stream connection and can be executed from a command-line tool:
```python
# -*- coding: utf-8 -*-
"""
This script sets up tweepy and hears twitter streams

@author: Shihao Ran
         STIM Laboratory
"""

from tweepy import OAuthHandler
from tweepy import API
from tweepy import Stream
from slistener import SListener
from urllib3.exceptions import ProtocolError

# consumer key authentication
consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
auth = OAuthHandler(consumer_key, consumer_secret)

# access key authentication
access_token = 'YOUR ACCESS TOKEN'
access_token_secret = 'YOUR TOKEN SECRET'
auth.set_access_token(access_token, access_token_secret)

# set up the API with the authentication handler
api = API(auth)

# set up the keywords to hear
keywords_to_hear = ['#GrandTheftAutoV',
                    '#LeagueOfLegends',
                    '#Fortnite',
                    '#dota2',
                    '#CSGO',
                    '#ApexLegends',
                    '#Hearthstone',
                    '#overwatch']

# instantiate the SListener object
listen = SListener(api)

# instantiate the stream object
stream = Stream(auth, listen)

# begin collecting data
while True:
    # maintain the connection unless interrupted
    try:
        stream.filter(track=keywords_to_hear)
    # reconnect automatically if an error arises
    # due to an unstable network connection
    except (ProtocolError, AttributeError):
        continue
```
As shown in the code, import the `tweepy` package first, then authenticate with the Twitter API using the consumer key-secret pair as well as the access token-secret pair of the account that will be used to stream the tweets. Next, specify the keywords that you want to capture in the tweets; in this case, the hashtags of eight video games. Finally, keep the `stream.filter()` call in a `while True` loop to keep the stream connected unless it is interrupted manually. The `except` clause handles the poor network connections that occur from time to time, which may cause disconnections between our workstation and the Twitter API, and reconnects the stream automatically if an error pops up. Before launching the stream, we also need to specify a `SListener` object, which tells tweepy how to handle the incoming tweets. The code for `SListener` is shown below:
```python
# -*- coding: utf-8 -*-
"""
this script defines the streaming behavior passed to tweepy

@author: Shihao Ran
         STIM Laboratory
"""

# import packages
from tweepy import API
from tweepy.streaming import StreamListener
import json
import time
import sys


# inherit from the StreamListener class
class SListener(StreamListener):

    def __init__(self, api=None, fprefix='streamer'):
        # name the output file with the prefix and the current time
        self.api = api or API()
        self.counter = 0
        self.fprefix = fprefix
        self.output = open('%s_%s.json'
                           % (self.fprefix, time.strftime('%Y%m%d-%H%M%S')), 'w')

    def on_data(self, data):
        # dispatch the raw data to the proper handler
        if 'in_reply_to_status' in data:
            self.on_status(data)
        elif 'delete' in data:
            delete = json.loads(data)['delete']['status']
            if self.on_delete(delete['id'], delete['user_id']) is False:
                return False
        elif 'limit' in data:
            if self.on_limit(json.loads(data)['limit']['track']) is False:
                return False
        elif 'warning' in data:
            warning = json.loads(data)['warnings']
            print("WARNING: %s" % warning['message'])
        return

    def on_status(self, status):
        # write the raw tweet to the output file; once the number of
        # tweets reaches 20000, roll over to a new file
        self.output.write(status)
        self.counter += 1
        if self.counter >= 20000:
            self.output.close()
            self.output = open('%s_%s.json'
                               % (self.fprefix, time.strftime('%Y%m%d-%H%M%S')), 'w')
            self.counter = 0
        return

    def on_delete(self, status_id, user_id):
        print("Delete notice")
        return
```
Twitter JSON File
Now let’s take a look at how the Twitter API passes information about a specific tweet into our streamed files. Once the streaming pipeline is established, the code provided above will automatically capture every single tweet that contains any of the specified keywords. Take one of my tweets as an example:
Since this tweet has a user mention of @LeagueOfLegends, and I ran the program before I posted it, it was successfully captured by our program and saved to a JSON file on my computer.
If we open the output file in a text editor, the tweet above looks something like this:
```python
{'created_at': 'Sun May 12 16:03:53 +0000 2019',
 'id': 1127605322581987329,
 'id_str': '1127605322581987329',
 'text': 'I LOVE LEAGUE OF LEGENDS! @LeagueOfLegends',
 'source': 'Twitter Web Client',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 4790635572,
          'id_str': '4790635572',
          'name': 'Shihao Ran',
          'screen_name': 'shihao1007',
          'location': 'Houston, TX',
          'url': 'https://shihaojran.com',
          'description': 'PhD :D',
          'translator_type': 'none',
          'protected': False,
          'verified': False,
          'followers_count': 4,
          'friends_count': 43,
          'listed_count': 0,
          'favourites_count': 126,
          'statuses_count': 3,
          'created_at': 'Wed Jan 20 22:54:19 +0000 2016',
          'utc_offset': None,
          'time_zone': None,
          'geo_enabled': False,
          'lang': 'en',
          'contributors_enabled': False,
          'is_translator': False,
          'profile_background_color': 'F5F8FA',
          'profile_background_image_url': '',
          'profile_background_image_url_https': '',
          'profile_background_tile': False,
          'profile_link_color': '1DA1F2',
          'profile_sidebar_border_color': 'C0DEED',
          'profile_sidebar_fill_color': 'DDEEF6',
          'profile_text_color': '333333',
          'profile_use_background_image': True,
          'profile_image_url': 'http://pbs.twimg.com/profile_images/1115448619878973440/Chg2Uabn_normal.jpg',
          'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1115448619878973440/Chg2Uabn_normal.jpg',
          'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4790635572/1554778651',
          'default_profile': True,
          'default_profile_image': False,
          'following': None,
          'follow_request_sent': None,
          'notifications': None},
 'geo': None,
 'coordinates': None,
 'place': None,
 'contributors': None,
 'is_quote_status': False,
 'quote_count': 0,
 'reply_count': 0,
 'retweet_count': 0,
 'favorite_count': 0,
 'entities': {'hashtags': [],
              'urls': [],
              'user_mentions': [{'screen_name': 'LeagueOfLegends',
                                 'name': 'League of Legends',
                                 'id': 577401044,
                                 'id_str': '577401044',
                                 'indices': [26, 42]}],
              'symbols': []},
 'favorited': False,
 'retweeted': False,
 'filter_level': 'low',
 'lang': 'en',
 'timestamp_ms': '1557677033454'}
```
The JSON file is essentially a collection of `key: value` pairs that describe every field of that single tweet, such as the time the tweet was created, the text of the tweet, the user who sent it, and so on. In Python, these `key: value` pairs can be conveniently stored in dictionaries, and from there we can use pandas to convert a dictionary into a DataFrame without manually specifying the name of each column. Then we can use SQLAlchemy to create our database and convert all the records in the DataFrame into a table inside that database. The complete workflow is shown in Figure 1.
If you noticed, the JSON file could be considered a dictionary of dictionaries, since some values in the top-level dictionary are themselves dictionaries, `'user'` for example. In fact, if a tweet contains geographical information such as the place where it was created, or if it is a retweet or a quoted tweet, the JSON file contains even more nested fields with the corresponding information. While a pandas DataFrame can hold a dictionary as a row element, a database table cannot. Therefore, we need to add a tweak when converting the raw JSON into DataFrames: save these child dictionaries as individual tables in the database, and connect them to the parent table through constraints such as the common field `tweet_id` (more on this later).
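As a minimal sketch of that flattening step, assuming the streamed file ends up with one JSON document per line (the file name and the handful of columns kept here are just for illustration), each tweet can be split into a parent record and a child record like this:

```python
import json
import pandas as pd

base_rows, user_rows = [], []

# a streamed file written by SListener, assumed to hold one JSON document per line
with open('streamer_20190512-120000.json', 'r') as f:
    for line in f:
        if not line.strip():
            continue                       # skip keep-alive blank lines
        tweet = json.loads(line)
        # parent table: one row per tweet
        base_rows.append({'tweet_id': tweet['id'],
                          'created_at': tweet['created_at'],
                          'text': tweet['text']})
        # child table: the nested 'user' dictionary, tagged with the tweet id
        user = tweet['user']
        user_rows.append({'tweet_id': tweet['id'],
                          'id': user['id'],
                          'screen_name': user['screen_name'],
                          'followers_count': user['followers_count']})

base_tweets = pd.DataFrame(base_rows)
tweet_user = pd.DataFrame(user_rows)
```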
Examining and extracting useful information from a large JSON file can be tedious, and the details vary from application to application. To create a new database and store all these DataFrames as individual tables, we can use SQLAlchemy from a Python environment. SQLAlchemy also allows us to create tables without manually specifying column names and column types, as long as each table is created from a pandas DataFrame, which can be a life-saver when there are lots of tables and lots of columns to insert into a database.
Once the database server is installed on your workstation, you can create a new database using SQLAlchemy. Here I will use PostgreSQL as an example. As shown in the following code, the first step is to create an engine that connects to the database by specifying the dialect, username, password, host, and port of the database.
```python
from sqlalchemy import create_engine

# create an engine pointing at the default database
engine = create_engine('postgresql://username:password@host:port')

# connect to the database
con = engine.connect()

# end the open transaction
con.execute('commit')

# create the new database my_database
con.execute('create database my_database')

# close the connection
con.close()
```
Then, make a connection to the default database. Since the connection is inside a transaction with the default database at this point (SQLAlchemy always runs queries within a transaction), and PostgreSQL does not allow `CREATE DATABASE` inside a transaction, end the open transaction by calling `con.execute('commit')`. Then create the new database and close the connection. Once the database is created, the last step is to create tables from the DataFrames, which is easily handled by the `pandas.DataFrame.to_sql()` function.
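As a minimal sketch, saving the two DataFrames from the earlier snippet boils down to a couple of `to_sql()` calls (the connection string follows the same placeholder style as above, with the new database name appended):

```python
from sqlalchemy import create_engine

# connect to the newly created database
engine = create_engine('postgresql://username:password@host:port/my_database')

# pandas creates each table and infers the column types automatically
base_tweets.to_sql('base_tweets', engine, if_exists='append', index=False)
tweet_user.to_sql('tweet_user', engine, if_exists='append', index=False)
```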
You can refer to my code for saving the JSON files of raw tweets into tables in a PostgreSQL database here.
Building a relational database
For my video-game geospatial map application, I streamed more than 160,000 tweets within 48 hours using the 8 keywords corresponding to the top 8 most popular video games (rank information provided by Twitch.com). By running the code linked above, the tweets are loaded into a database.
Cool, now all the streamed tweets are stored in a database with multiple tables corresponding to the different fields of the tweets, which looks like this:
In Figure 2, all the tables are presented as individual entities: they do not have any explicit relationships between them, nor do they have primary keys or foreign keys specified. While this database can already be used as-is, there is A LOT of redundant information hiding in most of the tables.
Take the `tweet_user` table as an example: what would happen if we run the following kind of query through pgAdmin 4 (or from Python, as sketched below)?
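A minimal sketch of that query, assuming the table and column names used in this post (the SQL string itself can be pasted straight into pgAdmin 4):

```python
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://username:password@host:port/my_database')

with engine.connect() as con:
    # count distinct tweets and distinct users in one query
    n_tweets, n_users = con.execute(text("""
        SELECT
            (SELECT COUNT(DISTINCT tweet_id) FROM base_tweets) AS n_tweets,
            (SELECT COUNT(DISTINCT id)       FROM tweet_user)  AS n_users;
    """)).fetchone()

print(n_tweets, n_users)
```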
This SQL query asks for the number of distinct `tweet_id` values in the `base_tweets` table, as well as the number of distinct `id` values in the `tweet_user` table. It essentially counts the number of non-duplicate tweets and the number of different Twitter users in our database. Note that there is a column called `tweet_id` in every table, so the `base_tweets` table and the `tweet_user` table have the same number of records. However, a Twitter user might tweet more than once during the streaming window, so the number of distinct users should be less than the total number of tweets. The query results are shown in Figure 3.
Well, that is quite interesting. The total number of tweets streamed is almost twice the number of users, from which we can draw a rough conclusion that, in the gaming community, people who tweeted posted about two tweets each over these 48 hours. While finding out this side information is fun, it also tells us that nearly half of the records in the `tweet_user` table are redundant. But before deleting all the duplicate rows, we need to save the mapping between each tweet and its user to the `base_tweets` table, so that we can still refer between the two tables later. To add a new column to a table with values taken from another table, execute the following kind of query:
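A minimal sketch, assuming the names used above: add the column first, then fill it by joining the two tables on their shared `tweet_id`:

```python
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://username:password@host:port/my_database')

# engine.begin() opens a transaction and commits it on exit
with engine.begin() as con:
    # add the new column (BIGINT, since Twitter user ids do not fit in 32 bits)
    con.execute(text("ALTER TABLE base_tweets ADD COLUMN tweet_user_id BIGINT;"))
    # fill it by matching rows through the shared tweet_id column
    con.execute(text("""
        UPDATE base_tweets
        SET    tweet_user_id = tweet_user.id
        FROM   tweet_user
        WHERE  base_tweets.tweet_id = tweet_user.tweet_id;
    """))
```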
Now there is a new column called `tweet_user_id` in the `base_tweets` table, which can later be used as a foreign key linking it to the `tweet_user` table. To delete the duplicates in the `tweet_user` table, we can use the following kind of query:
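A minimal sketch, where `a` and `b` are two aliases of the same `tweet_user` table:

```python
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://username:password@host:port/my_database')

with engine.begin() as con:
    # for each user id, delete every row except the one with the highest tweet_id
    con.execute(text("""
        DELETE FROM tweet_user a
        USING       tweet_user b
        WHERE  a.id = b.id
          AND  a.tweet_id < b.tweet_id;
    """))
```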
In the query above, by specifying `a.tweet_id < b.tweet_id`, the statement keeps only the record with the highest `tweet_id` for each user, and all the records with lower `tweet_id` values are removed.
Now all the duplicated `id` values in the `tweet_user` table are removed, so the column contains only unique values. We can use it as the primary key of this table and reference it in the `base_tweets` table as a foreign key to establish the connection. To set it as the primary key of the `tweet_user` table and as a foreign key of the `base_tweets` table, execute the following kind of query:
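A minimal sketch of the two constraints, assuming the column names used above (the constraint name `fk_tweet_user` is just a label I made up):

```python
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://username:password@host:port/my_database')

with engine.begin() as con:
    # promote the user id to the primary key of tweet_user
    con.execute(text("ALTER TABLE tweet_user ADD PRIMARY KEY (id);"))
    # reference it from base_tweets.tweet_user_id as a foreign key
    con.execute(text("""
        ALTER TABLE base_tweets
        ADD CONSTRAINT fk_tweet_user
        FOREIGN KEY (tweet_user_id) REFERENCES tweet_user (id);
    """))
```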
Now the “relationship” between these two tables is established: extra information about the user who sent a tweet in the `base_tweets` table can be provided by the `tweet_user` table through the `tweet_user_id` column. Likewise, repeating this process for the other tables reduces data redundancy. For those tables that do not have a distinct primary key of their own, just use the `tweet_id` column.
After removing duplicate records, cutting off unnecessary columns, and adding constraints to all the tables, our database should look like this:
And it can also be presented in a cleaner form, as shown here:
Conclusion
Twitter is an excellent resource for us data scientists for all kinds of data mining needs. At the same time, it can be overwhelming to deal with all the complex JSON fields returned by the Twitter API. Knowing how to collect and manage the data from tons of tweets efficiently is crucial and indispensable.
To summarize, in this post I talked about:
- Using `tweepy` to set up a streaming pipeline with the Twitter Streaming API to collect tweets with specific keywords, user IDs, or locations.
- Using `pandas` and `sqlalchemy` to load JSON files of raw tweets and save them into a Postgres database.
- Using SQL queries to build a relational database and minimize data redundancy.
So far, all of this work is just data preprocessing. In the next post, I will talk about some exciting work that utilizes all the information gathered and do some fundamental exploratory data analysis on these data.
Again, the code used for this post can be found here.
Thanks for reading 🙂