In the era of digital information overload, automated categorization of news articles is a fundamental task in Natural Language Processing (NLP). This project explores the efficacy of classical machine learning algorithms in distinguishing between two highly distinct semantic domains: Sports and Politics.
The objective was to build an end-to-end pipeline that ingests raw text, processes it into numerical feature vectors, and trains predictive models. We hypothesized that while these topics share some vocabulary (e.g., "win", "loss", "race", "campaign"), the contextual usage would be distinct enough for linear models to achieve high accuracy without deep learning.
Rather than relying on a pre-cleaned academic dataset, we constructed a custom dataset to reflect real-world noise. Data was harvested via a Python scraper targeting RSS feeds from major news outlets (ESPN, BBC, NY Times).
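The report does not list the exact feed endpoints, so the harvesting step can only be sketched. A minimal stdlib-only version (the feed URLs below are illustrative placeholders, not the project's actual sources) pulls each RSS feed, strips embedded HTML from the summaries, and labels every article with its topic:

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical feed URLs -- placeholders for the ESPN/BBC/NY Times feeds.
FEEDS = {
    "sports": "https://www.espn.com/espn/rss/news",
    "politics": "https://feeds.bbci.co.uk/news/politics/rss.xml",
}

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Drop embedded HTML tags that RSS descriptions often carry."""
    return TAG_RE.sub(" ", text)

def fetch_articles(feed_url, label):
    """Download one RSS feed and return (title + summary, label) pairs."""
    with urllib.request.urlopen(feed_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    rows = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = strip_html(item.findtext("description", default=""))
        rows.append((f"{title} {summary}".strip(), label))
    return rows
```

Keeping the topic label attached at fetch time avoids any manual annotation later: the feed an article came from is its ground-truth class.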
The dataset exhibits a slight class imbalance, with a bias toward sports content. This was accounted for during the evaluation phase using weighted metrics.
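Weighted metrics average per-class scores in proportion to class support, so the larger sports class counts for more than the smaller politics class. A small sketch on toy labels (the 6/4 split below is illustrative, not the project's actual ratio) shows the difference from an unweighted macro average:

```python
from sklearn.metrics import f1_score

# Toy imbalanced labels: 6 sports, 4 politics; one politics article
# is misclassified as sports.
y_true = ["sports"] * 6 + ["politics"] * 4
y_pred = ["sports"] * 6 + ["politics"] * 3 + ["sports"]

# Macro F1 averages the two class F1 scores equally; weighted F1
# scales each class F1 by its share of the true labels.
macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
```

Because the majority class here scores higher, the weighted average lands above the macro average; with a sports-heavy test set, weighted metrics keep the headline numbers honest about that imbalance.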
Data Source Note: Real-time acquisition ensures the model is tested on current terminology (e.g., "election", "Super Bowl") rather than historical data.
Raw HTML text is noisy. Our cleaning pipeline, built with NLTK and Regex, performs the following transformations to standardize the input:
1. Lowercasing: "Senate" -> "senate"
2. Regex Cleaning: Removing URLs, HTML tags, and non-alpha chars
3. Stopword Removal: Stripping common words ("the", "is", "at")
4. Tokenization: Splitting strings into discrete tokens
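The four steps above can be sketched as a single function. This version is stdlib-only for brevity: the tiny inline stopword set stands in for NLTK's English list (`nltk.corpus.stopwords.words("english")`), and whitespace splitting stands in for NLTK's tokenizer. Note that in code, tokenization necessarily happens before stopword removal, since stopwords are filtered token by token:

```python
import re

# Tiny stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "in"}

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

def clean(text):
    text = text.lower()                 # 1. lowercasing: "Senate" -> "senate"
    text = URL_RE.sub(" ", text)        # 2. regex cleaning: URLs...
    text = TAG_RE.sub(" ", text)        #    ...HTML tags...
    text = NON_ALPHA_RE.sub(" ", text)  #    ...and non-alpha chars
    tokens = text.split()               # 4. tokenization (whitespace split)
    return [t for t in tokens if t not in STOPWORDS]  # 3. stopword removal
```

For example, `clean("The Senate is at <b>work</b>: https://x.com today")` reduces to `["senate", "work", "today"]`.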
We compared two distinct approaches to vectorization:
Bag of Words (raw term counts) and TF-IDF. The Bag-of-Words vectorizer produced a training matrix of shape (297, 16635); the TF-IDF vectorizer produced one of shape (297, 50000). We then trained three supervised learning algorithms, selected for their effectiveness on high-dimensional sparse data: Logistic Regression, Support Vector Machine (SVM), and Naive Bayes.
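The vectorize-then-train grid can be sketched with scikit-learn. The toy corpus below stands in for the cleaned articles, and the estimator settings are illustrative defaults rather than the project's exact hyperparameters:

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus standing in for the cleaned, tokenized articles.
train_texts = [
    "quarterback touchdown season win",
    "senate vote campaign election",
    "goal match championship team",
    "policy congress bill debate",
]
train_labels = ["sports", "politics", "sports", "politics"]

vectorizers = {"bow": CountVectorizer(), "tfidf": TfidfVectorizer()}
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "nb": MultinomialNB(),
}

# Fit every (vectorizer, model) combination; clone() gives each pair
# fresh, independent estimators.
fitted = {}
for v_name, vec_proto in vectorizers.items():
    vec = clone(vec_proto)
    X = vec.fit_transform(train_texts)  # sparse (n_docs, n_terms) matrix
    for m_name, model_proto in models.items():
        model = clone(model_proto).fit(X, train_labels)
        fitted[(v_name, m_name)] = (vec, model)
```

At prediction time, new text must pass through the *same fitted* vectorizer (`vec.transform`, not `fit_transform`) so the feature columns line up with training.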
The models were evaluated on a held-out test set of 75 articles. Below is the comparative performance matrix. Surprisingly, the simpler Bag-of-Words representation marginally outperformed TF-IDF for linear models.
| Algorithm | Feature Set | Accuracy | Analysis |
|---|---|---|---|
| Logistic Regression | Bag of Words | 98.67% | Top performer (Tied). Excellent separation. |
| SVM | Bag of Words | 98.67% | Top performer (Tied). Highly robust. |
| Logistic Regression | TF-IDF | 97.33% | Slight drop, likely due to feature sparsity. |
| Naive Bayes | Bag of Words | 97.33% | Strong baseline performance. |
| Naive Bayes | TF-IDF | 96.00% | Lowest relative performance. |
The Logistic Regression (BoW) model achieved near-perfect classification. Below is the confusion matrix, which reveals exactly where the model succeeded and failed.
| | Pred: Sports | Pred: Politics |
|---|---|---|
| Actual: Sports | 43 | 0 |
| Actual: Politics | 1 | 31 |
Interpretation: Out of 75 test articles, the model made a single error: it misclassified one Politics article as Sports. It correctly identified all 43 Sports articles (100% recall for Sports).
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| politics | 1.00 | 0.97 | 0.98 | 32 |
| sports | 0.98 | 1.00 | 0.99 | 43 |
| accuracy | | | 0.99 | 75 |
| macro avg | 0.99 | 0.98 | 0.99 | 75 |
| weighted avg | 0.99 | 0.99 | 0.99 | 75 |
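Both the confusion matrix and the classification report come straight from scikit-learn's metrics utilities. A minimal sketch, using synthetic label vectors that reproduce the reported counts (43 sports all correct, 31 of 32 politics correct):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Label vectors matching the reported test-set outcome.
y_true = ["sports"] * 43 + ["politics"] * 32
y_pred = ["sports"] * 43 + ["politics"] * 31 + ["sports"]

# Rows are actual classes, columns are predicted, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=["sports", "politics"])
acc = accuracy_score(y_true, y_pred)  # 74 / 75
print(classification_report(y_true, y_pred, digits=2))
```

The single off-diagonal entry in `cm` is the one Politics article predicted as Sports, and `acc` works out to 74/75 ≈ 0.9867.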
The results demonstrate that distinguishing "Sports" articles from "Politics" articles is a tractable problem for standard machine learning, achieving ~98.7% accuracy with minimal tuning.