This is what 10,000 tweets from the Philippine Elections look like.
When social media users are only able to engage with like-minded peers, increasingly extreme worldviews are reinforced inside bubbles, free from critique or dissent. People don’t know how to disagree anymore, and there’s no better example of this than in politics. This is my attempt to grapple with this phenomenon, in the hopes of figuring out how we might meet in the middle again.
The goal of this project is to use statistical and machine learning techniques, including large language models, principal component analysis, and network analysis, to better understand political polarization on social media. Specifically, I aimed to do three things:
You can follow along with the code I used in this GitHub repository.
The first task was to gather tweets from the period of the Philippine elections representing a variety of political views. To do this, I queried:
I used twint to scrape these tweets. Afterwards, I examined the tweets' hashtags in order to label them for a sentiment analysis task.
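As a rough illustration, a twint query looks something like the sketch below; the search term, date range, and output path here are placeholders, not the exact queries I ran.

```python
import twint

# Placeholder search term and date range, not the exact queries used here.
c = twint.Config()
c.Search = "#Halalan2022"   # example election hashtag
c.Since = "2022-01-01"
c.Until = "2022-05-09"
c.Store_csv = True          # save results to CSV for later labeling
c.Output = "tweets.csv"

twint.run.Search(c)
```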
To prepare for sentiment analysis, I constructed a labeled dataset in which each tweet was labeled as either pro-Marcos or pro-opposition/anti-Marcos. These labels were determined by the hashtags in the tweet: if a tweet contained at least one hashtag from one of the two categories, it was placed in that category. (If a tweet contained one hashtag from each, it was excluded from the dataset.)
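Concretely, the labeling rule looks roughly like this; the hashtag sets below are short illustrative stand-ins for the longer lists I actually used.

```python
# Illustrative hashtag sets; the lists actually used were longer.
PRO_MARCOS = {"#bbm2022", "#bbmsara2022"}
PRO_OPPOSITION = {"#letlenilead", "#kulayrosasangbukas"}

def label_tweet(hashtags):
    """Return "pro-marcos", "pro-opposition", or None (excluded)."""
    tags = {tag.lower() for tag in hashtags}
    pro_marcos = bool(tags & PRO_MARCOS)
    pro_opposition = bool(tags & PRO_OPPOSITION)
    if pro_marcos and pro_opposition:
        return None  # one hashtag from each side: excluded
    if pro_marcos:
        return "pro-marcos"
    if pro_opposition:
        return "pro-opposition"
    return None  # no political hashtags: unlabeled
```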
This hashtag list is by no means exhaustive, but it provided me with a dataset of roughly 63,000 tweets for training and evaluation.
A language model is a machine learning model that takes text as input and outputs a numerical representation of that text. Most large language models are designed to process and understand text in a single language, and there is far more support for and research into English language models than into Filipino or multilingual ones, which suggests that English models would perform best on English text. The task at hand, however, included not just tweets in English and tweets in Tagalog: a good majority were written in both at once. This is Taglish, the seamless code-switching between Tagalog and English mid-sentence that is a pervasive artifact of the country's linguistic history.
To address these challenges, I came up with six approaches to dealing with this data.
The first three approaches each use a single model:
The next three approaches use a combination of the Tagalog and English models:
For all models, I removed the hashtags from the tweets before generating predictions. This way, the model could not simply memorize which hashtags meant what; it had to look at the text itself and try to understand it.
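Stripping hashtags amounts to a one-line regex; a minimal sketch:

```python
import re

def remove_hashtags(text: str) -> str:
    """Strip hashtags so the model cannot just read off the labeling signal."""
    return re.sub(r"#\w+", "", text).strip()

remove_hashtags("Proud of my vote today! #Halalan2022")
# -> "Proud of my vote today!"
```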
I trained all of these models in two stages: first training only the classifier head, then fine-tuning the full model weights. During the fine-tuning stage, I introduced a regularization term, weighted by a hyperparameter alpha, which penalized models for deviating too far from the original pretrained weights. I experimented with different levels of regularization and with different learning rates as well.
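This kind of penalty is an L2 distance between the current weights and a snapshot of the weights taken before fine-tuning (in the spirit of L2-SP regularization). Here is a minimal sketch of the loss computation, assuming a Hugging Face style classifier; the checkpoint name and alpha value are placeholders.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # placeholder checkpoint
)

# Snapshot the weights before fine-tuning to regularize against.
anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
alpha = 0.01  # placeholder regularization strength

def training_loss(batch):
    # batch is a dict of input tensors including a "labels" key,
    # so the model returns a cross-entropy loss directly.
    ce_loss = model(**batch).loss
    # Penalize squared distance from the snapshotted weights.
    reg = sum(((p - anchor[n]) ** 2).sum() for n, p in model.named_parameters())
    return ce_loss + alpha * reg
```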
Since these models were large and there were many hyperparameters to tune, I trained them on a Google Cloud virtual machine with an NVIDIA T4 GPU. Training took about three days in total.
Here are the results. For each model, I took the training epoch with the best out-of-sample loss at the given learning rate and alpha (the regularization hyperparameter).
This writeup is in progress. Come back soon!