Automated Twitter User Tagging Using Natural Language Processing

Introduction

In previous posts, I applied some EDA and interactive data visualization to Twitter data. In this post, I will talk about how to use unstructured text data from Twitter to build an automated user-tagging application. Since Twitter data is text-rich, the primary approach to handling it is Natural Language Processing (NLP). NLP teaches computers to understand natural languages such as English. Combined with machine learning, it is possible to train a model that makes decisions based on text input. In other words, on top of NLP, we can train classifiers to do various tasks such as user clustering. For example, in this post, I will walk you through the process of building a binary classifier that predicts whether a Twitter user is a programmer or a gamer, using that user’s profile as the input.

Data Ingestion

Following the steps in my previous post about streaming and storing tweets in a PostgreSQL database, I launched another round of Twitter streaming using the keywords [python, programming, java, C++, fortnite]. The first four keywords capture users that are more likely to be programmers, while the last keyword, fortnite, captures users that are more likely to be gamers. The stream ran for around 72 hours and resulted in 299,741 non-duplicate tweets from 195,660 Twitter users. To ensure data quality, two additional conditions are applied to clean the raw samples:

  1. The user’s profile description is not empty and is written in English.
  2. The user’s tweets contain label information; in other words, the tweets from this user must include at least one of the five specified keywords so that the user can be correctly labeled.

After filtering, the total number of users is reduced to 104,303. Figure 1 shows the first ten users and their profile descriptions.

Figure 1. First ten records of the user dataset.

Data Preprocessing

As mentioned in my last post, computers can only handle numerical data, which is eventually converted into binary bytes. We cannot simply feed the description column in Figure 1 to the machine and tell it to learn from it. The ultimate goal of text recognition is to convert the text data into numerical data and train the computer to understand the underlying information. Natural Language Processing (NLP) does precisely this. NLP is a broad field of study that covers many cutting-edge challenges, such as sentiment analysis, topic recognition, and high-precision translation. The model I discuss here belongs to topic recognition, and there are several crucial steps to follow before building the machine learning model.

Tokenization

Tokenization means splitting the document into individual words. In our case, not only does the whole description need to be divided into a list of words, but it also requires a fair amount of cleaning, so that symbols, emojis, and web links are removed. For example, we can define a cleaning function that takes the raw text as input and generates a cleaned list of words as output, as shown below:
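
Here is a minimal sketch of such a function, built with NLTK; the function name and the exact cleaning rules are my own assumptions, and the punkt, stopwords, and wordnet corpora are assumed to be downloaded:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# assumes nltk.download('punkt'), nltk.download('stopwords'),
# and nltk.download('wordnet') have been run beforehand
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_description(text):
    """Turn a raw profile description into a list of cleaned tokens."""
    # strip web links, then lowercase everything
    text = re.sub(r'http\S+', ' ', text.lower())
    # split the string into individual tokens
    tokens = word_tokenize(text)
    # keep only purely alphabetic tokens (drops emojis, symbols, numbers)
    tokens = [t for t in tokens if t.isalpha()]
    # remove English stopwords such as "and" and "the"
    tokens = [t for t in tokens if t not in STOPWORDS]
    # lemmatize each remaining token to its base form
    return [lemmatizer.lemmatize(t) for t in tokens]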

The steps followed to generate the processed data are the following:

  1. Lowercase all the letters in the sentence.
  2. Tokenize the sentence (split the whole string into words).
  3. Keep only alphabetic words (remove other characters, e.g. emojis, Japanese symbols).
  4. Remove all stopwords (such as and, the).
  5. Lemmatize all tokens.

Lemmatization is the process of reducing a word to its base form, which is called the lemma of the word. For example, the lemma of the words doggy and dogs is dog.

Data-Balancing

For any classification problem, if the dataset is skewed or imbalanced, the model would be biased towards the class that has more samples. The Twitter-user dataset is pre-labeled in the database by pattern matching their tweets. The distribution of the two classes is shown in Figure 2: 

Figure 2. The number of labels for all Twitter users.

Even though I used four keywords to capture tweets from programmers, the number of programming users is still less than half the number of gamers. There are two common approaches to balancing a dataset: undersampling and oversampling. While either of them can eliminate the impact of an imbalanced dataset, they have different drawbacks and should be chosen wisely for different situations. In our case, since we do not have a massive amount of data, undersampling the gamer class would mean dropping half of its samples. For data analytics, throwing away data is usually the last resort, so here we choose to oversample the programmer class instead. The drawback of this approach, though, is that the additional samples are just duplicates of randomly chosen existing samples.

To oversample the dataset, I use the imbalanced-learn library. Imblearn provides multiple strategies for re-sampling a dataset, such as Naive Random Oversampling, the Synthetic Minority Oversampling Technique (SMOTE), and Adaptive Synthetic sampling for oversampling, as well as Prototype Generation and Prototype Selection for undersampling. After applying Naive Random Oversampling to the original dataset, the number of samples is balanced to 67,289 for each class. This number is smaller than the original class counts because some profiles become empty after preprocessing (once all non-alphabetic tokens are removed) and are dropped.
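
A minimal sketch of the oversampling step with imbalanced-learn; the names cleaned_descriptions and labels are placeholders for the preprocessed profiles and their 0/1 classes:

import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# cleaned_descriptions: one cleaned profile string per user (placeholder name)
# labels: the matching 0/1 class labels (placeholder name)
X = np.array(cleaned_descriptions).reshape(-1, 1)
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, labels)
# both classes now contain the same number of samples
print(Counter(y_resampled))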

Model Implementation

There are numerous machine learning models for binary classification, such as Logistic Regression, Random Forest, and Naive Bayes. To pick the best model for our application, we can write a Python script that loops through a model pool and plots each model's performance to decide which is the best fit. With default parameters for all the models and a small amount of training data, one can get a general idea of how the models would perform on a specific classification problem. From there, based on the initial performance, pick the best model and fine-tune its hyperparameters to achieve the best score.

To roughly train and evaluate all the models, two essential libraries are needed: scikit-learn for building the machine learning pipeline and yellowbrick for model evaluation and report visualization.

Implementing a machine learning pipeline is a great starting point. With the model as the last section of the pipeline, one can easily switch between different machine learning models while keeping the same data configuration. The first section of the pipeline is the CountVectorizer.

CountVectorizer

The preprocessing step only tokenizes the descriptions, so to enable the computer to understand the text content, the next step is to transform all the text information into numbers. First, similar to a hash table, we give each word that appears in the users’ descriptions a unique index, where the index is decided by how often that word appears across all the descriptions. For instance, as shown below, we first count all the words in our dataset and then assign a number to each word. The indexing follows the descending order of the counts, which means 0 stands for the most frequently used word (http), 1 stands for the second most used word (fortnite), and so on.
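
A small sketch of this counting and indexing, assuming cleaned_descriptions is the list of token lists produced by the cleaning function above:

from collections import Counter

# count every token across all cleaned descriptions
counts = Counter(token for tokens in cleaned_descriptions for token in tokens)
# index the vocabulary in descending order of frequency
# (index 0 = most common token, index 1 = second most common, ...)
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}
print(counts.most_common(5))   # the five most frequent tokens and their counts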

It turns out that there are 89,548 unique tokens in the entire dataset, which means the indices are in the range [0, 89547]. Now we can represent a specific user description with a vector of 89,548 elements by putting 1s at the corresponding indices. For example, if fortnite is in the description and http is not, the first element of the vector would be 0, the second element would be 1, and so on. However, storing as many of these 89,548-element vectors as we want is not possible due to memory limits. In fact, the majority of each vector is composed of zeros and only a small portion is ones, which is called a sparse vector. Large sparse vectors are usually stored in sparse representations where only the indices and the corresponding non-zero values are kept. A simple example is shown below:
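
A minimal scipy illustration of this idea, using the small dense vector discussed next:

import numpy as np
from scipy.sparse import csr_matrix

# a dense row vector with only two non-zero entries
dense = np.array([[0, 0, 2, 1, 0]])
sparse = csr_matrix(dense)
print(repr(sparse))   # reports how many elements are actually stored
print(sparse)         # lists the (row, column) indices and their values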

Here the dense vector [0, 0, 2, 1, 0] is transformed into a sparse matrix by scipy. The output says there are only 2 stored elements, which are the non-zero values 2 and 1. The printed representation shows that they are stored with the indices (0, 2) and (0, 3) (row 0, columns 2 and 3) and the values 2 and 1.

Fortunately, the CountVectorizer() module in scikit-learn does all the heavy lifting mentioned above, including counting, indexing, and saving the results as a sparse matrix. 
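
A quick sketch of CountVectorizer on a hypothetical mini-corpus of cleaned descriptions:

from sklearn.feature_extraction.text import CountVectorizer

# a hypothetical mini-corpus of already-cleaned descriptions
docs = ["python programmer", "fortnite gamer", "python java developer"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse matrix of term counts
print(vectorizer.vocabulary_)             # token -> column index mapping
print(counts.shape)                       # (number of documents, vocabulary size)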

TfidfTransformer

CountVectorizer handles all the work of converting text information into equivalent numerical values, which might work fine in many cases. However, the results are usually even better if a TfidfTransformer is attached after the CountVectorizer. Tf stands for term frequency, which is a normalization of the output from the CountVectorizer, and tf-idf means term frequency times inverse document frequency.

The formula for calculating the tf-idf of a term t in a document d is

tfidf(t, d) = tf(t, d) * idf(t)

where tf(t, d) is the term frequency of t in document d, and

idf(t) = log[n / df(t)] + 1

computes the inverse document frequency of term t. Here n is the total number of documents in the set, and df(t) is the document frequency of t, i.e., the number of documents in the set that contain the term t.

From the formula, when the document frequency of term t approaches the total number of documents n, the logarithm approaches 0 and the idf weight of term t goes towards 1. For example, with n = 100 documents, a term that appears in all 100 has idf = log(100/100) + 1 = 1, while a term that appears in only 5 documents has idf = log(100/5) + 1 ≈ 4 (using the natural logarithm). A term with a lower document frequency therefore gets a higher idf weight, which magnifies its term frequency tf(t, d).

From the math behind tf-idf, one can tell that its goal is to scale down the impact of tokens that occur in the majority of the document set, and hence are empirically less informative than the tokens that occur in only a small portion of the document set.

Similarly, the TfidfTransformer module in scikit-learn receives the sparse matrix from the CountVectorizer and transforms it into an inverse-document-frequency weighted sparse matrix.
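
A self-contained sketch combining the two steps on the same hypothetical mini-corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# a hypothetical mini-corpus of cleaned descriptions
docs = ["python programmer", "fortnite gamer streamer",
        "python java developer gamer"]

counts = CountVectorizer().fit_transform(docs)        # raw term counts
weighted = TfidfTransformer().fit_transform(counts)   # idf-weighted counts
# tokens that appear in several documents (e.g. "python", "gamer") receive
# smaller weights than tokens concentrated in a single document
print(weighted.toarray().round(2))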

Model Selection

So far, two sections of the pipeline are implemented. To finish it up, append a machine learning model after the TfidfTransformer. The next step is to pick the specific model that fits our classification problem best. With a pre-defined model pool, a simple for loop does the trick:

# plotting
import matplotlib.pyplot as plt
# pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
# performance visualization tools
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import ConfusionMatrix

# X_train, X_test, y_train, y_test come from an earlier train/test split
# train and visualize the results of each model individually
for model in [LogisticRegression, RandomForestClassifier, LinearSVC,
              KNeighborsClassifier, MultinomialNB]:
    # instantiate a pipeline ending with the current model
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('model', model()),
    ])
    # print the current model in the pipeline
    print(model)
    # four different metrics and visualizations
    for viz in [ClassificationReport, ClassPredictionError,
                ROCAUC, ConfusionMatrix]:
        # print the current model and visualization combination
        print(model, viz)
        # fit and evaluate the model
        visualizer = viz(pipeline)
        visualizer.fit(X_train, y_train)
        visualizer.score(X_test, y_test)
        plt.figure()
        visualizer.poof()


It took around 5 minutes to execute the for loop over the five models and four visualizations per model on a single CPU core. The results are shown in Figures 3-7.

Figure 3. Visualization of metrics for Logistic Regression Classifier.
Figure 4. Visualization of metrics for Random Forest Classifier.
Figure 5. Visualization of metrics for Linear SVC (Support Vector Classifier).
Figure 6. Visualization of metrics for K-Neighbors (KNN) Classifier.
Figure 7. Visualization of metrics for Naive Bayes Classifier.

As shown in the figures above, five models (from top to bottom: Logistic Regression, Random Forest, Linear SVC, K-Nearest Neighbors, and Multinomial Naive Bayes) are trained and evaluated individually (from left to right: prediction error, confusion matrix, precision/recall/F1-score, and ROC curve). In the prediction-error plots (first column), the blue area denotes the number of predictions for class 0 (programmer), and the green area stands for predictions for class 1 (gamer).

It is clear that the Random Forest, Linear SVC, and Naive Bayes classifiers provide pretty good scores, while the K-Nearest Neighbors (KNN) classifier is not able to handle the task. Since the dimensionality of the dataset is very high (up to 80,000 dimensions), the distances between samples in this space become large and uninformative, which hurts the voting strategy of the KNN algorithm.

Cross-Validation & Model Tuning

Figure 8 shows the benchmark profiling for all five models. Even though the Random Forest Classifier has pretty good performance, its training time is too long to be practical. Therefore, the model pool is narrowed down to two models: Linear SVC and Naive Bayes. The next step is hyperparameter tuning and cross-validation.

Figure 8. Benchmark profiling for all the models.

GridSearchCV

GridSearchCV is a module provided by scikit-learn that combines grid search and cross-validation into one entity. It exhaustively searches over a specified set of parameters, and for each combination of parameters it performs k-fold cross-validation automatically. Execute the following code to find the best hyperparameters for the Naive Bayes classifier:

from pprint import pprint
from time import time
from sklearn.model_selection import GridSearchCV

# instantiate the pipeline with some predefined parameters
pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('nb', MultinomialNB(fit_prior=False)),
])
# specify the parameter grid to search over
parameters = {
    'vect__max_df': (0.25, 0.5),
    'nb__alpha': (0, 0.1, 0.2, 0.3, 0.4)
}
# initialize the GridSearchCV object with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, parameters, cv=5,
                           n_jobs=8, verbose=10)
# print progress to the console
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
# start searching
grid_search.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print()
# print the best parameters found
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))


To reduce the search time, some basic parameters are predefined when initializing the pipeline, such as ngram_range for CountVectorizer and use_idf for TfidfTransformer. ngram_range stands for the range of n-grams to be extracted; here I choose to consider only 1-gram and 2-gram sequences.

Two parameters, one for each of two modules in the pipeline, are explicitly set in the parameter grid: max_df for CountVectorizer and alpha for the Naive Bayes classifier. Specifically, max_df stands for maximum document frequency, which tells the CountVectorizer to build the vocabulary only from tokens whose document frequency is below this threshold. In other words, all tokens with a document frequency higher than the threshold are ignored, so the classifier can focus on the relatively more informative tokens. On the other hand, alpha is the smoothing coefficient that accounts for unseen features, to avoid zero probabilities in downstream calculations.

The output of executing the code is summarized below:

There are 10 different combinations of parameters to fit and a 5-fold cross-validation for each, giving 50 fits in total, as shown in the output. On an 8-core CPU, it took 96 seconds to find the best model with 5-fold CV. The parameters of the best model are alpha=0.1 and max_df=0.25, and its evaluation is shown in Figure 9:

Figure 9. Evaluation for Naive Bayes Classifier after grid-search and cross-validation.

Compared to the default model, the best model after hyperparameter tuning performs better: the precision for class 1 has increased from 0.852 to 0.92, the recall for class 0 has increased from 0.858 to 0.927, and the AUC has increased from 0.92 to 0.96.

Similarly, we can define a parameter grid for Linear SVC to find its best model; the evaluation of the resulting model is shown in Figure 10, and a sketch of such a grid follows:

Figure 10. Evaluation for Linear SVC after grid-search and cross-validation.
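
A sketch of what such a grid might look like; the parameter values below are assumptions rather than the exact grid used for Figure 10:

# a comparable grid search for Linear SVC (parameter values are assumptions)
svc_pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('svc', LinearSVC()),
])
svc_parameters = {
    'vect__max_df': (0.25, 0.5),
    'svc__C': (0.1, 1.0, 10.0),
}
svc_search = GridSearchCV(svc_pipeline, svc_parameters, cv=5, n_jobs=8)
svc_search.fit(X_train, y_train)
print(svc_search.best_params_)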

Conclusion

So far, we have completed the implementation of an automated Twitter user-tagging model using NLP and machine learning. Figure 11 illustrates a small demo of the application using the pre-trained Linear SVC model. I use the descriptions of two Twitter users as the input to generate their tags, i.e., to predict whether each of them is a gamer or a programmer. One of them is Eefje Depoortere (@sjokz), who is my favorite E-sports host, and the other is my friend Ben (@Benben_ben__), who is a programmer.
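
A minimal sketch of what such a demo might look like, assuming best_pipeline is the fitted Linear SVC pipeline and clean_description is the cleaning function sketched earlier (the example descriptions are placeholders, not the real profiles):

# placeholder profile descriptions, not the users' real profiles
raw_descriptions = [
    "Esports host and presenter, streaming Fortnite on the weekends",
    "Software engineer who writes Python and Java all day",
]
# apply the same cleaning used for the training data
cleaned = [" ".join(clean_description(d)) for d in raw_descriptions]
predictions = best_pipeline.predict(cleaned)
# class 0 = programmer, class 1 = gamer
tags = ["programmer" if label == 0 else "gamer" for label in predictions]
print(tags)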

Figure 11. A demo for an automated user-tagging application using pre-trained NLP model.

While the model is able to accurately predict the labels for these two users, there is still room for improvement, such as multi-label tagging and coupling the model with Computer Vision.

I had tons of fun during the process of developing this application. You can find all the code for this project here. As always, feel free to correct me if there are any mistakes or concerns, and let me know if you have any questions.

Thanks for reading 🙂