In the era of digital information overload, automated categorization of news articles is a fundamental task in Natural Language Processing (NLP). This project explores the efficacy of classical machine learning algorithms in distinguishing between two highly distinct semantic domains: Sports and Politics.
The objective was to build an end-to-end pipeline that ingests raw text, processes it into numerical feature vectors, and trains predictive models. We hypothesized that while these topics share some vocabulary (e.g., "win", "loss", "race", "campaign"), the contextual usage would be distinct enough for linear models to achieve high accuracy without deep learning.
Rather than relying on a pre-cleaned academic dataset, we constructed a custom dataset to reflect real-world noise. Data was harvested via a Python scraper targeting RSS feeds from major news outlets (ESPN, BBC, NY Times).
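The report does not list the exact feed endpoints, so the harvesting step can only be sketched. A minimal stdlib-only version (the feed URLs below are illustrative placeholders, not the project's actual sources) pulls each RSS feed, strips embedded HTML from the summaries, and labels every article with its topic:

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical feed URLs -- placeholders for the ESPN/BBC/NY Times feeds.
FEEDS = {
    "sports": "https://www.espn.com/espn/rss/news",
    "politics": "https://feeds.bbci.co.uk/news/politics/rss.xml",
}

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Drop embedded HTML tags that RSS descriptions often carry."""
    return TAG_RE.sub(" ", text)

def fetch_articles(feed_url, label):
    """Download one RSS feed and return (title + summary, label) pairs."""
    with urllib.request.urlopen(feed_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    rows = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = strip_html(item.findtext("description", default=""))
        rows.append((f"{title} {summary}".strip(), label))
    return rows
```

Keeping the topic label attached at fetch time avoids any manual annotation later: the feed an article came from is its ground-truth class.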
The dataset exhibits a slight class imbalance, with a bias toward sports content. This was accounted for during the evaluation phase using weighted metrics.
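Weighted metrics average per-class scores in proportion to class support, so the larger sports class counts for more than the smaller politics class. A small sketch on toy labels (the 6/4 split below is illustrative, not the project's actual ratio) shows the difference from an unweighted macro average:

```python
from sklearn.metrics import f1_score

# Toy imbalanced labels: 6 sports, 4 politics; one politics article
# is misclassified as sports.
y_true = ["sports"] * 6 + ["politics"] * 4
y_pred = ["sports"] * 6 + ["politics"] * 3 + ["sports"]

# Macro F1 averages the two class F1 scores equally; weighted F1
# scales each class F1 by its share of the true labels.
macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
```

Because the majority class here scores higher, the weighted average lands above the macro average; with a sports-heavy test set, weighted metrics keep the headline numbers honest about that imbalance.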
Data Source Note: Real-time acquisition ensures the model is tested on current terminology (e.g., "election", "Super Bowl") rather than historical data.
Raw HTML text is noisy. Our cleaning pipeline, built with NLTK and Regex, performs the following transformations to standardize the input:
1. Lowercasing: "Senate" -> "senate"
2. Regex Cleaning: Removing URLs, HTML tags, and non-alpha chars
3. Stopword Removal: Stripping common words ("the", "is", "at")
4. Tokenization: Splitting strings into discrete tokens
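The four steps above can be sketched as a single function. This version is stdlib-only for brevity: the tiny inline stopword set stands in for NLTK's English list (`nltk.corpus.stopwords.words("english")`), and whitespace splitting stands in for NLTK's tokenizer. Note that in code, tokenization necessarily happens before stopword removal, since stopwords are filtered token by token:

```python
import re

# Tiny stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "in"}

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

def clean(text):
    text = text.lower()                 # 1. lowercasing: "Senate" -> "senate"
    text = URL_RE.sub(" ", text)        # 2. regex cleaning: URLs...
    text = TAG_RE.sub(" ", text)        #    ...HTML tags...
    text = NON_ALPHA_RE.sub(" ", text)  #    ...and non-alpha chars
    tokens = text.split()               # 4. tokenization (whitespace split)
    return [t for t in tokens if t not in STOPWORDS]  # 3. stopword removal
```

For example, `clean("The Senate is at <b>work</b>: https://x.com today")` reduces to `["senate", "work", "today"]`.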
We compared two distinct approaches to vectorization:
Bag of Words (raw term counts) and TF-IDF. The Bag-of-Words vectorizer produced a training matrix of shape (297, 16635); the TF-IDF vectorizer produced one of shape (297, 50000). We then trained three supervised learning algorithms, selected for their effectiveness on high-dimensional sparse data: Logistic Regression, Support Vector Machine (SVM), and Naive Bayes.
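The vectorize-then-train grid can be sketched with scikit-learn. The toy corpus below stands in for the cleaned articles, and the estimator settings are illustrative defaults rather than the project's exact hyperparameters:

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus standing in for the cleaned, tokenized articles.
train_texts = [
    "quarterback touchdown season win",
    "senate vote campaign election",
    "goal match championship team",
    "policy congress bill debate",
]
train_labels = ["sports", "politics", "sports", "politics"]

vectorizers = {"bow": CountVectorizer(), "tfidf": TfidfVectorizer()}
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "nb": MultinomialNB(),
}

# Fit every (vectorizer, model) combination; clone() gives each pair
# fresh, independent estimators.
fitted = {}
for v_name, vec_proto in vectorizers.items():
    vec = clone(vec_proto)
    X = vec.fit_transform(train_texts)  # sparse (n_docs, n_terms) matrix
    for m_name, model_proto in models.items():
        model = clone(model_proto).fit(X, train_labels)
        fitted[(v_name, m_name)] = (vec, model)
```

At prediction time, new text must pass through the *same fitted* vectorizer (`vec.transform`, not `fit_transform`) so the feature columns line up with training.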
The models were evaluated on a held-out test set of 75 articles. Below is the comparative performance matrix. Surprisingly, the simpler Bag-of-Words representation marginally outperformed TF-IDF for linear models.
| Algorithm | Feature Set | Accuracy | Analysis |
|---|---|---|---|
| Logistic Regression | Bag of Words | 98.67% | Top performer (Tied). Excellent separation. |
| SVM | Bag of Words | 98.67% | Top performer (Tied). Highly robust. |
| Logistic Regression | TF-IDF | 97.33% | Slight drop, likely due to feature sparsity. |
| Naive Bayes | Bag of Words | 97.33% | Strong baseline performance. |
| Naive Bayes | TF-IDF | 96.00% | Lowest relative performance. |
The Logistic Regression (BoW) model achieved near-perfect classification. Below is the confusion matrix, which reveals exactly where the model succeeded and failed.
| | Pred: Sports | Pred: Politics |
|---|---|---|
| Actual: Sports | 43 | 0 |
| Actual: Politics | 1 | 31 |
Interpretation: Out of 75 test articles, the model made a single error: it misclassified one Politics article as Sports. It correctly identified all 43 Sports articles (100% recall for Sports).
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| politics | 1.00 | 0.97 | 0.98 | 32 |
| sports | 0.98 | 1.00 | 0.99 | 43 |
| accuracy | | | 0.99 | 75 |
| macro avg | 0.99 | 0.98 | 0.99 | 75 |
| weighted avg | 0.99 | 0.99 | 0.99 | 75 |
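Both the confusion matrix and the classification report come straight from scikit-learn's metrics utilities. A minimal sketch, using synthetic label vectors that reproduce the reported counts (43 sports all correct, 31 of 32 politics correct):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Label vectors matching the reported test-set outcome.
y_true = ["sports"] * 43 + ["politics"] * 32
y_pred = ["sports"] * 43 + ["politics"] * 31 + ["sports"]

# Rows are actual classes, columns are predicted, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=["sports", "politics"])
acc = accuracy_score(y_true, y_pred)  # 74 / 75
print(classification_report(y_true, y_pred, digits=2))
```

The single off-diagonal entry in `cm` is the one Politics article predicted as Sports, and `acc` works out to 74/75 ≈ 0.9867.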
The results demonstrate that distinguishing "Sports" articles from "Politics" articles is a tractable problem for standard machine learning, achieving ~98.7% accuracy with minimal tuning.