Binary Text Classification

Distinguishing Sports from Politics using NLP
Prasangeet Dongre (B23CH1033)
Assignment Report

Introduction

In the era of digital information overload, automated categorization of news articles is a fundamental task in Natural Language Processing (NLP). This project explores the efficacy of classical machine learning algorithms in distinguishing between two highly distinct semantic domains: Sports and Politics.

The objective was to build an end-to-end pipeline that ingests raw text, processes it into numerical feature vectors, and trains predictive models. We hypothesized that while these topics share some vocabulary (e.g., "win", "loss", "race", "campaign"), the contextual usage would be distinct enough for linear models to achieve high accuracy without deep learning.

Dataset Description & Analysis

Rather than relying on a pre-cleaned academic dataset, we constructed a custom dataset to reflect real-world noise. Data was harvested via a Python scraper targeting RSS feeds from major news outlets (ESPN, BBC, NY Times).
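The harvesting step can be sketched as follows. This is a minimal, self-contained illustration: the inline RSS snippet and field choices are stand-ins for the real feeds, and the report's actual scraper (which fetches live URLs) is not reproduced here.

```python
# Sketch of RSS parsing using only the standard library; the sample feed
# below is illustrative, not real ESPN/BBC/NYT content.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><title>Team clinches title</title><description>A late goal sealed the win.</description></item>
  <item><title>Senate passes bill</title><description>Lawmakers voted on the measure.</description></item>
</channel></rss>"""

def parse_feed(xml_text):
    """Extract 'title description' strings from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        desc = item.findtext("description", default="")
        articles.append((title + " " + desc).strip())
    return articles

articles = parse_feed(SAMPLE_RSS)
```

In the live scraper, `SAMPLE_RSS` would be replaced by the response body of each outlet's feed URL.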

- Total articles: 372
- Average words per document: 448
- Unique vocabulary: 16.6k terms

Class Distribution

The dataset exhibits a slight class imbalance, with a bias toward sports content. This was accounted for during the evaluation phase using weighted metrics.
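"Weighted" metrics average per-class scores in proportion to class support, so the imbalance is reflected in the final number. A minimal sketch of that calculation, using the test-set class supports reported later (43 sports, 32 politics) and the per-class F1 values from the classification report:

```python
# Support-weighted averaging: each class's score contributes in proportion
# to how many test articles belong to that class. Values taken from the
# report's classification table.
supports = {"sports": 43, "politics": 32}
f1 = {"sports": 0.99, "politics": 0.98}

total = sum(supports.values())
weighted_f1 = sum(supports[c] * f1[c] for c in supports) / total
print(round(weighted_f1, 3))
```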

- Sports: 213 articles (57%)
- Politics: 159 articles (43%)

Data Source Note: Real-time acquisition ensures the model is tested on current terminology (e.g., "election", "Super Bowl") rather than historical data.

Methodology

Preprocessing Pipeline

Raw HTML text is noisy. Our cleaning pipeline, built with NLTK and Regex, performs the following transformations to standardize the input:

1. Lowercasing: "Senate" -> "senate"
2. Regex Cleaning: Removing URLs, HTML tags, and non-alpha chars
3. Stopword Removal: Stripping common words ("the", "is", "at")
4. Tokenization: Splitting strings into discrete tokens
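The four steps above can be sketched as a single function. The report uses NLTK's stopword list; a small inline set is substituted here so the example runs without downloads, and the exact regex patterns are illustrative assumptions.

```python
# Minimal sketch of the cleaning pipeline: lowercase, regex cleaning,
# tokenization, stopword removal. Stopword set is a tiny stand-in for
# NLTK's full English list.
import re

STOPWORDS = {"the", "is", "at", "a", "an", "of", "to", "and", "in"}

def preprocess(text):
    text = text.lower()                               # 1. lowercasing
    text = re.sub(r"https?://\S+", " ", text)         # 2. strip URLs
    text = re.sub(r"<[^>]+>", " ", text)              # 2. strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)             # 2. strip non-alpha chars
    tokens = text.split()                             # 4. tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 3. stopword removal

print(preprocess("The Senate voted <b>today</b> at https://example.com!"))
```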

Feature Extraction

We compared two distinct approaches to vectorization: Bag of Words (raw term counts) and TF-IDF (term counts reweighted by inverse document frequency, which downweights words common across the corpus).
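The two vectorizers (Bag-of-Words counts vs. TF-IDF weights) can be sketched with scikit-learn; the two-document corpus below is illustrative only.

```python
# Bag of Words produces raw term counts; TF-IDF reweights those counts so
# corpus-wide common terms (like "the") matter less.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the team won the match",
    "the senate passed the bill",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)      # sparse matrix of raw counts

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)  # sparse matrix of TF-IDF weights

print(sorted(bow.vocabulary_))
```

Both yield a document-by-term sparse matrix; only the cell values differ.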

Models Evaluated

We trained three supervised learning algorithms, selected for their effectiveness in high-dimensional sparse data:

  1. Multinomial Naive Bayes: A probabilistic classifier based on Bayes' theorem.
  2. Logistic Regression: A linear model that estimates probabilities using a sigmoid function.
  3. Support Vector Machine (SVM): A classifier that finds the optimal hyperplane to separate classes.
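Fitting the three classifiers on BoW features takes only a few lines with scikit-learn. The four texts and labels below are toy stand-ins for the scraped corpus, and `LinearSVC` is assumed as the SVM variant (a common choice for sparse text; the report does not specify the kernel).

```python
# Sketch: train the three models discussed in the report on Bag-of-Words
# features. Data is illustrative, not the actual dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = [
    "the team won the final match",
    "voters backed the new bill",
    "coach praised the striker",
    "the senate debated the budget",
]
labels = ["sports", "politics", "sports", "politics"]

X = CountVectorizer().fit_transform(texts)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
}
for name, model in models.items():
    model.fit(X, labels)  # in practice, fit on a train split only
```

In the real pipeline, a train/test split precedes fitting so accuracy is measured on held-out articles.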

Quantitative Results

The models were evaluated on a held-out test set of 75 articles. Below is the comparative performance matrix. Surprisingly, the simpler Bag-of-Words representation marginally outperformed TF-IDF for linear models.

Algorithm           | Feature Set  | Accuracy | Analysis
--------------------|--------------|----------|------------------------------------------
Logistic Regression | Bag of Words | 98.67%   | Top performer (tied). Excellent separation.
SVM                 | Bag of Words | 98.67%   | Top performer (tied). Highly robust.
Logistic Regression | TF-IDF       | 97.33%   | Slight drop, likely due to feature sparsity.
Naive Bayes         | Bag of Words | 97.33%   | Strong baseline performance.
Naive Bayes         | TF-IDF       | 96.00%   | Lowest relative performance.

Deep Dive: The Best Model

The Logistic Regression (BoW) model achieved near-perfect classification. Below is the confusion matrix, which reveals exactly where the model succeeded and failed.

                 | Pred: Sports | Pred: Politics
Actual: Sports   |      43      |       0
Actual: Politics |       1      |      31

Interpretation: Out of 75 test articles, the model made only one error: it misclassified a single Politics article as Sports. All 43 Sports articles were identified correctly (100% recall for Sports).

Classification Report Details

              precision    recall  f1-score   support

    politics       1.00      0.97      0.98        32
      sports       0.98      1.00      0.99        43

    accuracy                           0.99        75
   macro avg       0.99      0.98      0.99        75
weighted avg       0.99      0.99      0.99        75
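The confusion matrix and accuracy above can be reproduced directly from the published counts (43/0/1/31) by reconstructing the 75 test-label vectors, a useful sanity check on the reported figures:

```python
# Rebuild the test-set labels from the report's confusion-matrix counts
# and verify the matrix and accuracy with scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["sports"] * 43 + ["politics"] * 32
y_pred = ["sports"] * 43 + ["sports"] * 1 + ["politics"] * 31

cm = confusion_matrix(y_true, y_pred, labels=["sports", "politics"])
acc = accuracy_score(y_true, y_pred)
print(cm)              # [[43  0]
                       #  [ 1 31]]
print(round(acc, 4))   # 74/75 -> 0.9867
```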

Limitations & Conclusion

The results demonstrate that distinguishing Sports articles from Politics articles is a tractable problem for standard machine learning, achieving ~98.7% accuracy with minimal tuning.

Observations

Limitations