This analysis proposes a theoretical solution to a problem for which the complete dataset is not yet available. It can serve as a template for similar theoretical proposals, which are useful in the early stages of data consulting or research projects, before confidentiality has been established.
Introduction and Data Analysis
Problem Description
The task was to build binary classification models that separate tweets into 'real' and 'fake'. Five complete approaches are discussed and critically analyzed; their strengths and weaknesses are compared, and the approaches are ranked on that basis. The dataset provided was the MediaEval 2015 "Verifying Multimedia Use" dataset, which consists of social media posts for which the social media identifiers were shared along with the tweet text and some additional characteristics of the post.
Dataset Characteristics
- Data Format: The data was provided in the form of a text file with the fields separated by tabs. Two separate text files were provided, one for training and the other for testing. The data was converted into a CSV file using Microsoft Excel.
- Data Size and Volume: The dataset consisted of 14,483 tuples described by 7 attributes. The training and test files together totalled 3.28 MB.
Data Features and Quality
The raw dataset contains three classes: Fake, Humor and Real. The Fake class consists of reposts of real multimedia (such as real photos from the past re-posted as being associated with a current event), digitally manipulated multimedia, and synthetic multimedia (such as artworks or snapshots presented as real imagery). Tweets labelled Humor could be treated as fake, but it is better to discard them so as not to worsen the existing class imbalance: there are 9464 fake tweets with Humor included and 6833 without, compared with 5004 real tweets.
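The effect of discarding the Humor tweets on the class balance can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the function and variable names are illustrative.

```python
def fake_share(fake: int, real: int) -> float:
    """Return the fraction of tweets that are labelled fake."""
    return fake / (fake + real)


REAL = 5004
FAKE_WITH_HUMOR = 9464
FAKE_WITHOUT_HUMOR = 6833

share_with = fake_share(FAKE_WITH_HUMOR, REAL)
share_without = fake_share(FAKE_WITHOUT_HUMOR, REAL)

print(f"fake share with Humor:    {share_with:.1%}")
print(f"fake share without Humor: {share_without:.1%}")
```

Dropping the Humor tweets moves the fake share from roughly 65% to roughly 58%, so the imbalance is reduced but not removed.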
77% of the tweets were in English, followed by Spanish at 8%. Each of the remaining 30 languages accounted for less than 2% of the tweets. The language of each tweet was detected using the 'langdetect' library, and non-English tweets were translated to English using the 'googletrans' library.
The data quality was assessed against six parameters: Completeness, Accuracy, Consistency, Uniqueness, Validity and Timeliness. The data had a very low null-value rate (Completeness), outstanding Accuracy and Consistency (it was extracted via the Twitter API), high Uniqueness (86% unique tweets) and strong Validity; only Timeliness needed improvement, since the data dates from 2015. With four parameters rated very high, one high and one needing improvement, the overall data quality was judged to be high.
Algorithm Design
Approach 1 — Conventional NLP Pipeline
Preprocessing
Stopwords removed using the 'nltk' stopwords list. Tokenization performed using the nltk tokenizer. Lemmatization performed after POS tagging — preferred over stemming because it performs morphological analysis rather than merely cutting off the suffix.
Feature Extraction
"Bag of n-grams" representation used via 'CountVectorizer'. Only unigrams considered, to keep the computational burden minimal. Count matrix normalized using TF-IDF via 'TfidfTransformer' to reduce the influence of tokens that occur very frequently.
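A minimal sketch of this step with scikit-learn is shown here. The three tweets are invented toy data, and default settings are assumed for everything other than the unigram range.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy, already-lemmatized tweets (invented for illustration).
tweets = [
    "flood water rise in the city",
    "shark swim in flood water on the highway",
    "city official confirm flood warning",
]

# Bag of unigrams: one column per distinct token.
counter = CountVectorizer(ngram_range=(1, 1))
counts = counter.fit_transform(tweets)

# Re-weight the raw counts so that tokens found in many tweets count for less.
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

flood_col = counter.vocabulary_["flood"]
print("matrix shape:", tfidf.shape)
print("idf of 'flood':", transformer.idf_[flood_col])
```

Because "flood" appears in every toy tweet, it receives the lowest inverse document frequency in the vocabulary, which is the damping effect described above.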
Feature Selection & Dimensionality Reduction
Chi-squared test performed to determine whether a particular feature and the class label were independent. Features scoring over the threshold (those most dependent on the label) were kept; the rest discarded.
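The selection step could look roughly like the following sketch. The tweets, the labels and the value of `THRESHOLD` are all hypothetical; a real threshold would have to be chosen on the actual data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# Toy tweets and labels (1 = fake, 0 = real), invented for illustration.
tweets = [
    "shark swims on flooded highway",
    "shark photo from flooded street",
    "officials confirm flood warning",
    "officials open flood shelter",
]
labels = np.array([1, 1, 0, 0])

X = TfidfVectorizer().fit_transform(tweets)

# One chi-squared statistic per feature; larger means stronger dependence on the label.
scores, p_values = chi2(X, labels)

THRESHOLD = 0.5  # hypothetical cut-off, not taken from the original analysis
keep = scores > THRESHOLD
X_selected = X[:, np.flatnonzero(keep)]

print(f"kept {X_selected.shape[1]} of {X.shape[1]} features")
```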
Modelling
Multinomial NB (designed for classification with discrete features such as word counts) and SVM (a strong performer on high-dimensional, sparse text features). Both trained, and the model with the higher F1-score used.
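The model comparison can be sketched as below. The data is a toy stand-in, and `LinearSVC` is an assumed choice of SVM implementation, since the kernel is not specified in this proposal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data (1 = fake, 0 = real), invented for illustration.
train_texts = [
    "shark swims on flooded highway",
    "shark photo from flooded street",
    "fake soldier photo goes viral",
    "officials confirm flood warning",
    "officials open flood shelter",
    "mayor confirms shelter location",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["viral shark photo", "officials confirm shelter"]
test_labels = [1, 0]

candidates = {"multinomial_nb": MultinomialNB(), "linear_svm": LinearSVC()}

# Train each candidate on identical features and record its F1-score.
scores = {}
for name, classifier in candidates.items():
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    scores[name] = f1_score(test_labels, model.predict(test_texts))

best = max(scores, key=scores.get)
print(scores, "->", best)
```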
Approach 2 — Two-Level Classification
Based on a two-level classification model that classifies first at the topic level and then at the tweet level. It works as follows:
Preprocessing
Stop words removed, tweets tokenized, POS tagged and lemmatized similar to the first approach.
Feature Extraction
Tokens converted to feature vectors using 'TfidfVectorizer'. Timestamps converted to Unix timestamps using 'datetime' library (making them ordinal). ImageIDs one-hot encoded for equal weightage.
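A small sketch of the two non-text encodings follows. The timestamp format string is an assumption (the classic Twitter API layout), the image IDs are made up, and `to_unix` and `one_hot` are illustrative helper names.

```python
from datetime import datetime, timezone

# Assumed timestamp layout; adjust to whatever the dataset actually uses.
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_unix(timestamp: str) -> int:
    """Convert a timestamp string into seconds since the Unix epoch."""
    return int(datetime.strptime(timestamp, TIMESTAMP_FORMAT).timestamp())


def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return categories, rows


unix_time = to_unix("Mon Oct 29 22:34:01 +0000 2012")
categories, encoded = one_hot(["sandy_01", "boston_02", "sandy_01"])
print(unix_time, categories, encoded)
```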
Modelling
TF-IDF vectors clustered with KMeans using Jaccard distance, and the resulting clusters treated as topics. An SVM classifies tweets as real or fake at the topic level, and the topic-level result is added to each tweet as an additional feature. A Random Forest classifier is then used for the final tweet-level classification; it was chosen because it reduces overfitting while improving accuracy and handles high-dimensional data well.
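For reference, the Jaccard distance used for clustering can be sketched on token sets as follows. Note that scikit-learn's `KMeans` minimizes Euclidean distance only, so a Jaccard-based variant would need a custom or alternative clustering implementation; this sketch covers the distance alone, and the example tokens are invented.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


tweet_a = {"shark", "flood", "street"}
tweet_b = {"shark", "flood", "highway"}
tweet_c = {"official", "shelter"}

print(jaccard_distance(tweet_a, tweet_b))  # two shared tokens out of four distinct ones
print(jaccard_distance(tweet_a, tweet_c))  # no shared tokens
```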
Approach 3 — Deep Neural Network with Feature Engineering
Preprocessing
Tweets normalized using regular expressions to map syntactic, lexical and Twitter-specific forms. Hashtags and mentions stored as separate attributes. Tweets tokenized and stop words removed.
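One possible shape for this normalization step is sketched below. The regular expressions and the `normalize_tweet` helper are assumptions for illustration, and stop-word removal is left out for brevity.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> dict:
    """Strip URLs, pull out hashtags and mentions, and lowercase the remaining text."""
    without_urls = URL_RE.sub(" ", text)
    hashtags = HASHTAG_RE.findall(without_urls)
    mentions = MENTION_RE.findall(without_urls)
    cleaned = MENTION_RE.sub(" ", without_urls)
    cleaned = HASHTAG_RE.sub(r"\1", cleaned)  # keep the hashtag word, drop the '#'
    cleaned = WHITESPACE_RE.sub(" ", cleaned).strip().lower()
    return {"text": cleaned, "hashtags": hashtags, "mentions": mentions}


result = normalize_tweet("RT @NYCNews: Shark in the subway! #Sandy #NYC http://t.co/abc")
print(result)
```

Hashtags and mentions are returned as separate fields so that they can be one-hot encoded independently of the tweet text.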
Feature Extraction & Selection
Text tokens transformed into TF-IDF representations. ImageId, hashtags and mentions one-hot encoded. PCA used to reduce the dimensionality of the TF-IDF representations only, which lowers redundancy by projecting onto orthogonal, uncorrelated components that retain most of the variance.
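The sketch below applies PCA to the TF-IDF block only and then joins the result with a one-hot block. The tweets, the one-hot matrix and the choice of two components are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy tweets, invented for illustration.
tweets = [
    "shark swims on flooded highway",
    "shark photo from flooded street",
    "officials confirm flood warning",
    "officials open flood shelter",
]

# Densify the TF-IDF matrix before PCA; fine for toy data, costly at scale.
tfidf = TfidfVectorizer().fit_transform(tweets).toarray()

pca = PCA(n_components=2)
reduced_text = pca.fit_transform(tfidf)

# Hypothetical one-hot block (for example, two distinct image IDs).
image_one_hot = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# Only the text block was reduced; the one-hot columns are appended unchanged.
features = np.hstack([reduced_text, image_one_hot])
print(features.shape)
```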
Modelling
Deep neural network with embedding layer, flatten layer, and two dense layers (ReLU and Sigmoid). Dropout regularization to avoid overfitting. Adam optimizer and binary cross-entropy loss function.
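For reference, the sigmoid output and the binary cross-entropy loss named above can be written as follows. The symbols are notation introduced here, not taken from the original analysis: $z_i$ is the pre-activation of the output unit for tweet $i$, $\hat{y}_i$ the predicted probability of the positive class, $y_i \in \{0, 1\}$ the true label, and $N$ the number of tweets in a batch.

```latex
\hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}},
\qquad
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N}
  \Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log\bigl(1 - \hat{y}_i\bigr) \Bigr]
```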
Approach 4 — Bi-directional LSTM
Preprocessing
Special characters removed, words tokenized by converting each sequence into integer encoded representation and normalizing sequence length. No stopword removal or TF-IDF — the LSTM accounts for word order.
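Integer encoding and length normalization can be sketched in plain Python as below. The reserved indices, the helper names and the choice to pad at the end of the sequence are assumptions; deep learning libraries ship their own utilities for this step.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved indices: padding and out-of-vocabulary


def build_vocab(tokenized_tweets, max_words=None):
    """Map each word to an integer, most frequent first, starting at 2."""
    counts = Counter(token for tokens in tokenized_tweets for token in tokens)
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(tokens, vocab, max_len):
    """Integer-encode a tweet, then truncate or right-pad it to max_len."""
    ids = [vocab.get(token, OOV) for token in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab([["shark", "in", "street"], ["shark", "photo"]])
encoded = encode(["shark", "on", "street"], vocab, max_len=5)
print(vocab, encoded)
```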
Feature Extraction
Transfer Learning used: GloVe embeddings and Twitter embeddings stacked together using numpy.hstack — demonstrated better performance than using any single embedding or training from scratch.
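The stacking itself amounts to placing the two embedding matrices side by side, so that each word's row holds both vectors. In the sketch below, random vectors stand in for the pretrained GloVe and Twitter embeddings, and the dimensions and the `embedding_matrix` helper are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"shark": 2, "flood": 3, "street": 4}  # indices 0 and 1 reserved
GLOVE_DIM, TWITTER_DIM = 4, 3  # toy sizes; real embeddings are much wider

# Random stand-ins for pretrained vectors; "street" is missing from the second set.
glove = {word: rng.normal(size=GLOVE_DIM) for word in vocab}
twitter = {word: rng.normal(size=TWITTER_DIM) for word in ("shark", "flood")}


def embedding_matrix(vocab, vectors, dim):
    """Build a (vocab size x dim) matrix; words without a vector keep a zero row."""
    matrix = np.zeros((max(vocab.values()) + 1, dim))
    for word, index in vocab.items():
        if word in vectors:
            matrix[index] = vectors[word]
    return matrix


stacked = np.hstack([
    embedding_matrix(vocab, glove, GLOVE_DIM),
    embedding_matrix(vocab, twitter, TWITTER_DIM),
])
print(stacked.shape)
```

A matrix like `stacked` would then typically be loaded as the initial weights of the embedding layer.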
Modelling
Bi-directional LSTM with embedding layer, dropout, convolutions (64,4) stacked on top of each state vector, max pooling, Dense 64 (ReLU), and output Dense layer (Sigmoid). Binary cross-entropy with Adam optimizer. Hyperparameter tuning of batch size, epochs, dropout rate, and LSTM size performed.
Approach 5 — Ensemble Learning
Uses algorithms from approaches 1–4 as sub-models in an ensemble voting scenario. Due to diversity of approaches covering different aspects (tweet text classification, inter-tweet relations, hidden patterns via DNN, and sequential patterns via LSTM), they complement each other.
Predictions from all 4 models evaluated, and the class with the most votes taken as the outcome. In case of a tie, the votes of approaches 1 and 2 are multiplied by 0.5, so that the deep learning models in approaches 3 and 4 act as tie breakers.
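The voting rule can be sketched as follows. One case is not covered by the rule as stated: if approaches 3 and 4 disagree with each other, the down-weighted tally is still tied at 1.5 each. The `fallback` argument used for that case is an assumption added here, as are the function names.

```python
from collections import defaultdict

LABELS = ("real", "fake")
TIE_WEIGHTS = (0.5, 0.5, 1.0, 1.0)  # approaches 1 and 2 count half in a tie


def tally(votes, weights):
    """Sum the weight behind each label."""
    totals = defaultdict(float)
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return totals


def ensemble_vote(votes, fallback="fake"):
    """votes: predictions of approaches 1 to 4, in that order."""
    if len(votes) != 4 or any(vote not in LABELS for vote in votes):
        raise ValueError("expected four votes, each 'real' or 'fake'")
    plain = tally(votes, (1.0, 1.0, 1.0, 1.0))
    if plain["real"] != plain["fake"]:
        return max(LABELS, key=lambda label: plain[label])
    weighted = tally(votes, TIE_WEIGHTS)
    if weighted["real"] != weighted["fake"]:
        return max(LABELS, key=lambda label: weighted[label])
    return fallback  # the deep learning models disagree: still tied


print(ensemble_vote(["fake", "fake", "real", "real"]))
```

In the printed example the plain vote is tied 2 to 2, and the down-weighted tally gives 1.0 for "fake" against 2.0 for "real", so the deep learning models decide the outcome.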
Evaluation
Strengths and Weaknesses Analysis
Approach 1
Strengths: Lightweight processing; lemmatization provides uniform word vectors; Multinomial NB and SVM known to be excellent for text classification.
Weaknesses: Does not preserve word order; TF-IDF doesn't capture semantics or co-occurrences; could benefit from ensembling the two models.
Approach 2
Strengths: Two-level approach accounts for inter-tweet relations; considers multiple attributes beyond just tweet text; topic clustering uncovers unique patterns.
Weaknesses: Determining optimal cluster count is difficult; computationally demanding without dimensionality reduction; second model performance depends on first level quality.
Approach 3
Strengths: Hashtags and mentions as separate attributes; hyperparameter tuning performed; deep neural network likely achieves higher accuracy than ML methods.
Weaknesses: Susceptible to curse of dimensionality; data size may be insufficient for neural network training; does not consider word order.
Approach 4
Strengths: Word context considered by LSTM; hyperparameter tuning should improve performance; dropout regularization reduces overfitting.
Weaknesses: Computationally intensive; dataset size may be insufficient; tweets may be too short for LSTM to draw meaningful sequential patterns; convolutions on state vectors could hurt performance.
Approach 5
Strengths: Best performance highly likely as an ensemble; sub-models focus on different aspects; equal voting gives balanced weightage.
Weaknesses: Very heavy to train; computational cost may not justify performance boost; deep learning models acting as tiebreakers could fail with insufficient data.
Final Ranking
The approaches are ranked considering speed and correctness:
Approach 3 > Approach 4 > Approach 5 > Approach 2 > Approach 1
Approach 3 strikes the best balance between accuracy and speed. Approach 4 has excellent potential but uses only the tweet text. Approach 5 is likely the most accurate but is computationally expensive. Approach 2 has reliability concerns because its second level depends on the quality of the first. Approach 1 is the fastest but the most limited.
Conclusion
Five alternative approaches were discussed. Key takeaways: deep learning approaches are likely to outperform classical ML algorithms given sufficient data and computation. Regularization, feature selection and dimensionality reduction must be carefully balanced. Hyperparameter tuning is essential. Word order in tweets can uncover important information despite their short length.
Future work should incorporate image classification and cross-validation to experimentally determine F1 scores and timing for a more robust ranking.
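As a rough illustration of how such an experiment might be set up, the sketch below cross-validates one candidate pipeline and reports its mean F1-score and fit time. The data is an invented toy set, and the pipeline, fold count and random seed are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (1 = fake, 0 = real), invented for illustration.
texts = [
    "shark swims on flooded highway",
    "shark photo from flooded street",
    "fake soldier photo goes viral",
    "viral shark photo in subway",
    "officials confirm flood warning",
    "officials open flood shelter",
    "mayor confirms shelter location",
    "mayor issues flood warning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# cross_validate returns per-fold test scores together with fit and score times.
results = cross_validate(model, texts, labels, cv=folds, scoring="f1")

mean_f1 = results["test_score"].mean()
mean_fit_seconds = results["fit_time"].mean()
print(f"mean F1: {mean_f1:.2f}, mean fit time: {mean_fit_seconds:.4f}s")
```

Running the same loop over each approach would give the F1 and timing figures needed to replace the theoretical ranking with a measured one.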
References
- IT Pro team: 'How to measure data quality', 2 Mar 2020
- C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265.
- Zhiwei Jin, Juan Cao, Yazi Zhang, and Zhang Yongdong. 2015. MCG-ICT at MediaEval 2015: Verifying Multimedia Use with a Two-Level Classification Model.
- Tripathy, R.M. et al. (2014), "Theme based clustering of tweets", Proceedings of the 1st IKDD Conference on Data Sciences.
- 'Jaccard index', Wikipedia.
- Nicolas Foucault and Antoine Courtin. 2016. Automatic Classification of Tweets for Analyzing Communication Behavior of Museums. LREC 2016.
- Faizan Shaikh, 'Deep Learning vs. Machine Learning', Analytics Vidhya.
- Deep Learning (LSTM) for Tweet Classification, Kaggle.
- Sahoo A.K., Pradhan C., Das H. (2020) Performance Evaluation of Different Machine Learning Methods and Deep-Learning Based CNN. Springer.
- Shubham Singh, "NLP Essentials: Removing Stopwords", Analytics Vidhya.
- Hafsa Jabeen, "Stemming and Lemmatization in Python", DataCamp.
- Jason Brownlee, "A Gentle Introduction to the Bag-of-Words Model", Machine Learning Mastery.
- Nikolai Janakiev, "Practical Text Classification With Python and Keras", Real Python.
- "Principal Component Analysis", International Encyclopedia of Education (Third Edition), 2010.
- "Choosing what kind of classifier to use", Stanford NLP.
- Monkey Learn, "Text Classification".
- Great Learning Team, "Random Forest Algorithm — An Overview".
- Julia Kho, "Why Random Forest is My Favorite Machine Learning Model", Towards Data Science.
- Sanket Doshi, "Various Optimization Algorithms For Training Neural Network", Towards Data Science.
- Karsten Eckhardt, "Choosing the right Hyperparameters for a simple LSTM using Keras", Towards Data Science.
- Prasoon Singh, "Fundamentals of Bag Of Words and TF-IDF", Analytics Vidhya.
- Christina Boididou et al. 2015. Verifying Multimedia Use at MediaEval 2015. In MediaEval.