Logistic Regression for Nepali News Classification

Hello everyone, and welcome back to my news categorization blog. In this blog, I’ll be looking into Logistic Regression for news in Nepali, which is our native language. I started this project almost a year ago but never finished it, because I had no idea what I was doing and only knew BeautifulSoup from DataCamp. With that knowledge, I was able to scrape a Nepali news website.

I enrolled in Coursera’s Natural Language Processing Specialization because I was interested in applying machine learning to news data, and it helped me grasp some of the core ideas of classification algorithms such as Naive Bayes, Logistic Regression, and Decision Tree. Since I’ve already written a blog about how I scraped the news, in this blog I’ll describe how I applied the Logistic Regression classification method to my data. The primary purpose of applying classification algorithms to news data is to categorize it. Logistic regression is a machine learning classification method that predicts the probability of each value of a categorical dependent variable (here, the news category). I followed the steps outlined below to accomplish this.
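To make "predicting a probability per category" a bit more concrete, here is a minimal sketch in plain NumPy. The category names and raw scores are made up for illustration and do not come from my data. With more than two categories, logistic regression typically turns one raw score per class into probabilities with the softmax function.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores for three categories (illustrative numbers only).
categories = ["news", "business", "sports"]
scores = np.array([2.0, 1.0, 0.1])

probs = softmax(scores)
predicted = categories[int(np.argmax(probs))]
print(dict(zip(categories, probs.round(3))), predicted)
```

The predicted category is simply the one with the highest probability.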

Import Necessary Modules

I loaded the modules that I needed for my news classification work here.

  • os: The OS module in Python provides functions for interacting with the operating system

  • pandas: Working with DataFrame

  • numpy: For mathematical calculations

  • matplotlib: For visualization

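  • matplotlib.font_manager: A module for finding, managing, and using fonts across platforms

  • seaborn: For statistical visualization built on top of matplotlib

  • warnings: For controlling warning messages, which alert the developer to conditions that aren’t necessarily exceptions

  • pprint: For pretty-printing Python data structures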

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from matplotlib.font_manager import FontProperties
import seaborn as sns
import warnings
import pprint


Open the CSV file

To combine the separate csv files, I did the following. I scraped the news on a daily basis and saved it in a separate csv file for each day. For this task, I concatenated all of the csv files.

all_filenames = [i for i in os.listdir("/content/drive/MyDrive/News Scraping") if  ".csv" in i and "combined" not in i]
['2021-03-25.csv', '2021-03-26.csv', '2021-03-27.csv', '2021-03-28.csv', '2021-03-29.csv', '2021-03-30.csv', '2021-03-31.csv', '2021-04-01.csv', '2021-04-03.csv', '2021-04-04.csv', '2021-04-05.csv', '2021-04-06.csv', '2021-04-07.csv', '2021-04-08.csv', '2021-04-12.csv', '2021-04-13.csv', '2021-04-14.csv', '2021-04-15.csv', '2021-04-16.csv', '2021-04-17.csv', '2021-04-18.csv', '2021-04-19.csv', '2021-10-09.csv', '2021-10-10.csv', '2021-10-20.csv', '2021-10-21.csv', '2021-10-22.csv', '2021-10-23.csv', '2021-10-24.csv', '2021-10-25.csv', '2021-10-26.csv', '2021-10-27.csv', '2021-11-21.csv', '2021-11-22.csv', '2021-11-25.csv', '2021-11-28.csv', '2021-11-29.csv', '2021-11-30.csv', '2021-12-05.csv', '2021-12-25.csv', '2022-01-02.csv', '2022-01-04.csv', '2022-01-11.csv', '2022-01-13.csv', '2022-01-18.csv', '2022-03-26.csv', '2022-03-27.csv', '2022-04-04.csv']
root = "/content/drive/MyDrive/News Scraping/"

There are 9 columns and 9242 rows in the merged data. The columns are Unnamed: 0 (apparently a leftover index column from when each file was saved), Title, URL, Date, Author, Author URL, Content, Category, and Description.

combined_csv = pd.concat([pd.read_csv(root+f) for f in all_filenames])
index | Unnamed: 0 | Title | URL | Date | Author | Author URL | Content | Category | Description
0 | 0 | स्पष्टीकरण बुझाउँदै रावलले ओलीलाई नै सोधे- पार... | https://ekantipur.com/news/2021/03/25/16166859... | चैत्र १२, २०७७ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — नेकपा (एमाले) का उपाध्यक्ष भीम रावल... | news | नेकपा (एमाले) का उपाध्यक्ष भीम रावलले पार्टी अ...
1 | 1 | अध्ययन क्षेत्रमा नेपाल–ओमान बीच सहकार्य | https://ekantipur.com/news/2021/03/25/16166848... | चैत्र १२, २०७७ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — नेपाल र ओमानका विद्यार्थीहरु बीच ‘क... | news | नेपाल र ओमानका विद्यार्थीहरु बीच ‘कृषि र पर्यट...
2 | 2 | ओलीलाई नेपालको प्रतिप्रश्‍न- तपाईंमा पार्टी वि... | https://ekantipur.com/news/2021/03/25/16166833... | चैत्र १२, २०७७ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — नेकपा (एमाले)का वरिष्ठ नेता माधवकुम... | news | नेकपा (एमाले)का वरिष्ठ नेता माधवकुमार नेपालले ...
3 | 3 | कांग्रेस केन्द्रीय कार्यसमिति बैठक चैत २० गतेल... | https://ekantipur.com/news/2021/03/25/16166770... | चैत्र १२, २०७७ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — प्रमुख प्रतिपक्ष दल कांग्रेसको केन्... | news | प्रमुख प्रतिपक्ष दल कांग्रेसको केन्द्रीय कार्य...
4 | 4 | कोरोना जोखिम कम गर्न सीसीएमसीलाई स्वास्थ्य मन्... | https://ekantipur.com/news/2021/03/25/16166738... | चैत्र १२, २०७७ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — स्वास्थ्य तथा जनसंख्या मन्त्रालयले ... | news | स्वास्थ्य तथा जनसंख्या मन्त्रालयले आवश्यक तयार...
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
75 | 9 | प्रधानमन्त्रीका छोरासहित श्रीलङ्काका २६ जना मन... | https://gorkhapatraonline.com/international/20... | चैत्र २१, २०७८ सोमबार | गोरखापत्र अनलाइन | https://gorkhapatraonline.com/author_news/3 | गोरखापत्र अनलाइनकाठमाडौं, चैत २१ गते । श्रीलङ्... | international | श्रीलङ्कामा सबैभन्दा चरम आर्थिक सङ्कट व्यवस्था...
76 | 10 | खुम्बुबाट गर्भवति महिलाको उद्धार | https://gorkhapatraonline.com/health/2022-04-0... | चैत्र २१, २०७८ सोमबार | सन्तोष राउत | https://gorkhapatraonline.com/author_news/3 | सोलुखुम्बुको खुम्बु पासाङल्हामु गाउँपालिका ३ ब... | province | NaN
77 | 11 | गाउँठाउँमै सेवा पाउने भएपछि गौशालाबासी खुशी | https://gorkhapatraonline.com/national/2022-04... | चैत्र २१, २०७८ सोमबार | नागेन्द्रकुमार कर्ण | https://gorkhapatraonline.com/author_news/3 | महोत्तरी जिल्लाको पश्चिमवर्ती स्थानीय तहका वास... | province | NaN
78 | 12 | साकेला पार्कमा बाबु पोखरीको शिलान्यास | https://gorkhapatraonline.com/open/2022-04-04-... | चैत्र २१, २०७८ सोमबार | निशा राई | https://gorkhapatraonline.com/author_news/3 | धरान ११ स्थित साकेला सांस्कृतिक पार्कमा बाबु प... | province | NaN
79 | 13 | एकलाख बालबालिकालाई टाइफाइडविरुद्धको खोप | https://gorkhapatraonline.com/health/2022-04-0... | चैत्र २१, २०७८ सोमबार | मुकुन्द सुवेदी | https://gorkhapatraonline.com/author_news/3 | स्वास्थ्य कार्यालय बर्दियाले जिल्लाका एक लाख ब... | province | NaN

9242 rows × 9 columns
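One thing the output above shows is that the last row label is 79 even though there are 9242 rows, so the row labels restart in every daily file and repeat after concatenation. The sketch below is one possible way to get unique row labels, using pd.concat with ignore_index=True. The two tiny files and their contents are made up for illustration and stand in for the real daily CSVs.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build two tiny daily files in a temporary folder (stand-ins for the real CSVs).
tmp = Path(tempfile.mkdtemp())
pd.DataFrame({"Title": ["a", "b"], "Category": ["news", "sports"]}).to_csv(
    tmp / "2021-03-25.csv", index=False
)
pd.DataFrame({"Title": ["c"], "Category": ["business"]}).to_csv(
    tmp / "2021-03-26.csv", index=False
)

# Sort the file names so rows stay in date order, then concatenate.
files = sorted(p for p in tmp.glob("*.csv") if "combined" not in p.name)
merged = pd.concat((pd.read_csv(p) for p in files), ignore_index=True)

# With ignore_index=True the row labels run 0..n-1 without repeats.
print(merged.index.tolist())
```

Unique row labels matter later, because label-based assignments such as .loc[i, ...] touch every row that shares the label i.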

I made a copy of the original dataframe here. Without a copy, a second variable would point to the same data, so updating a value through it would change the original as well. Working on a copy keeps my original data unchanged.

df = combined_csv.copy()
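Here is a small sketch of the difference between a plain assignment and .copy(), using a made-up two-row dataframe rather than my news data.

```python
import pandas as pd

original = pd.DataFrame({"Title": ["a", "b"]})

alias = original          # same object, just a second name
clone = original.copy()   # independent copy (deep by default)

alias.loc[0, "Title"] = "changed via alias"
clone.loc[1, "Title"] = "changed via clone"

# The edit made through `alias` shows up in `original`; the edit made
# through `clone` does not.
print(original["Title"].tolist())
print(clone["Title"].tolist())
```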

Open the stopwords.txt file.

Stop words are terms that occur very frequently in a language. In English they include words like “the,” “is,” and “and.” NLP and text mining applications remove them so that the model can focus on the informative words. Because stop words carry little information about a news category, we should eliminate them during preprocessing. The following is how I loaded the stop words file.

Stopwords File

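stop_file = "/content/drive/MyDrive/News Scraping/News classification/nepali_stopwords.txt"
# Read one stop word per line, trimming surrounding whitespace.
# The file holds Devanagari text, so the encoding is given explicitly.
with open(stop_file, encoding="utf-8") as fp:
  stop_words = [line.strip() for line in fp]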

Punctuation file

Open the Punctuation file.

The code below loads a punctuation file. Punctuation marks separate sentences, phrases, and clauses so that the intended meaning is clear to a reader. They provide no useful information for categorization, so they should be removed before we train our model.

punctuation_file = "/content/drive/MyDrive/News Scraping/News classification/nepali_punctuation (1).txt"
# Read one punctuation entry per line, trimming surrounding whitespace
with open(punctuation_file, encoding="utf-8") as fp:
  punctuation_words = [line.strip() for line in fp]
[':', '?', '|', '!', '.', ',', '" "', '( )', '—', '-', "?'"]

Pre-processing of text

I’m only going to use the Title column in this blog. The Content column holds a far larger quantity of words, so I’ll cover it in a separate blog post later. In this blog, I’ll show you how to use Logistic Regression on the title data to classify news by category.

First, I created a function named preprocess_text that accepts the data, the stop words, and the punctuation words as parameters. I made a list called new_cat to hold each row once it is processed. I also initialized noise, a list of ASCII and Devanagari digits, as you can see in the code. Then I loop over cat_data: I split each row on white space, drop stop words, punctuation, words that contain a digit, and single-character words, and join the remaining words back into one string.

def preprocess_text(cat_data, stop_words, punctuation_words):
  new_cat = []
  # ASCII and Devanagari digits; any word containing one is dropped
  noise = "1,2,3,4,5,6,7,8,9,0,०,१,२,३,४,५,६,७,८,९".split(",")
  for row in cat_data:
    words = row.strip().split(" ")
    nwords = ""
    for word in words:
      if word not in punctuation_words and word not in stop_words:
        is_noise = any(n in word for n in noise)
        if not is_noise:
          word = word.replace("(", "").replace(")", "")
          # Keep only words longer than one character
          if len(word) > 1:
            nwords += word + " "
    # Store the cleaned row without the trailing space
    new_cat.append(nwords.strip())
  return new_cat

title_clean = preprocess_text(["शिक्षण संस्थामा ज जनस्वास्थ्य 50 मापदण्ड पालना शिक्षा मन्त्रालयको निर्देशन"], stop_words, punctuation_words)

['शिक्षण संस्थामा जनस्वास्थ्य मापदण्ड पालना शिक्षा मन्त्रालयको निर्देशन']
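The function above compares whole words against the punctuation list, so a mark that is attached to a word (for example a comma right after it) would pass through. As a possible complement, here is a minimal sketch that replaces punctuation characters with spaces before splitting. The PUNCT string and the strip_punctuation helper are my own illustration, not the contents of the punctuation file.

```python
# Hypothetical punctuation characters for illustration only; the real list
# lives in the punctuation file. "\u2014" is the long dash, "।" is the danda.
PUNCT = ":?|!.,\"'()-\u2014।"
to_space = str.maketrans({ch: " " for ch in PUNCT})

def strip_punctuation(text):
    # Replace each punctuation character with a space, then collapse
    # repeated whitespace into single spaces.
    return " ".join(text.translate(to_space).split())

print(strip_punctuation("नेपाल, ओमान!"))
```

Replacing with a space instead of deleting keeps hyphenated parts as separate words rather than gluing them together.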

Here I take only the titles from our data and remove the stop words and punctuation.

ndf = df.copy()
# Clean the whole Title column in one call. Assigning the returned list by
# position avoids trouble with the repeated row labels left by pd.concat.
ndf["Title"] = preprocess_text(ndf["Title"], stop_words, punctuation_words)


Importing Necessary Modules for Logistic Regression

  • LogisticRegression: The logistic regression classifier, imported from sklearn.linear_model.

  • CountVectorizer: Converts a collection of text documents into a matrix of token counts, stored as a SciPy sparse matrix.

  • train_test_split: A function in sklearn.model_selection that splits arrays into two random subsets: training data and testing data.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

Data is split into two sets: train and test.

At this point, the data is separated into train and test sets. Normally, data is partitioned in proportions of 70% to 30% or 80% to 20%. The model is fitted on the train data, and its predictions are evaluated on the test data.
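As a small illustration of such a split, here is a sketch on toy data. The test_size, stratify, and random_state values are my own choices for the example and are not necessarily the settings used for the results in this blog.

```python
from sklearn.model_selection import train_test_split

# Toy data: ten titles with two balanced labels (illustrative only).
texts = [f"title {i}" for i in range(10)]
labels = [0, 1] * 5

# test_size=0.2 keeps 80% for training; stratify keeps the label ratio
# the same in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_tr), len(X_te), sorted(y_te))
```

Stratifying can help when some news categories are much rarer than others, because each category then appears in both parts in roughly the same proportion.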

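data = pd.DataFrame()
# Assumption: the text is the cleaned title and the label is the Category column
data["text"] = ndf["Title"].to_numpy()
data["label"] = ndf["Category"].to_numpy()

# Merge overlapping categories, then encode each category name as an integer id
data["target"] = data["label"].apply(lambda x: "news" if x=="prabhas-news" else "national" if x=="province" else x)
classes = {c:i for i,c in enumerate(data.target.unique())}
data["target"] = data.target.apply(lambda x: classes[x])

# The call is cut off after its first argument in this post; passing the encoded
# target and leaving the split settings at their defaults is an assumption
X_train, X_test, Y_train, Y_test = train_test_split(data['text'], data['target'])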

vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)

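vectorizer.transform(X) takes an iterable of raw text documents and returns a sparse 2-D document-term matrix, with one row per document and one column per vocabulary term (single words and word pairs here, because of ngram_range=(1, 2)). scikit-learn classifiers expect their input in this 2-D shape of (n_samples, n_features).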

t = vectorizer.transform([
        'कैलाली, कञ्चनपुर र सुनसरीमा ३० प्रतिशत धानबाली नष्ट',
        'जलविद्युत् आयोजना : सर्वेक्षण इजाजत रोक्नबाट पछि हट्यो सरकार',
        'राष्ट्र बैंकको हस्तक्षेपपछि निक्षेपको ब्याजदर १० प्रतिशतभन्दा कम',
        'गणेशमान सिंह राष्ट्रिय टेनिस कात्तिक ९ गतेदेखि',
        'निर्माणमैत्री कानून बनाउने मन्त्री झाँक्रीको प्रतिवद्धता',
        'टी-२० विश्वकप : पीएनजीलाई ८४ रनले हराउँदै बंगलादेश सुपर १२ मा'
])
<6x7383 sparse matrix of type '<class 'numpy.int64'>'
	with 41 stored elements in Compressed Sparse Row format>
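To see what the vectorizer produces on a small scale, here is a sketch with two shortened example titles. One caveat, as far as I understand it: the default token_pattern of CountVectorizer is built on the regex class \w, which in Python does not match Devanagari vowel signs or the halant, so Nepali words can be cut in unexpected places. Passing token_pattern=r"\S+" (my own choice here, not what was used for the results in this blog) keeps every whitespace-separated word whole.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two shortened example titles (illustrative input only).
docs = [
    "गणेशमान सिंह राष्ट्रिय टेनिस",
    "निर्माणमैत्री कानून बनाउने मन्त्री",
]

# token_pattern=r"\S+" treats every whitespace-separated chunk as one token,
# so Devanagari vowel signs stay attached to their consonants.
vec = CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 2)).fit(docs)

# transform expects a list of documents, even for a single title.
X = vec.transform(["राष्ट्रिय टेनिस कानून"])

print(X.shape)               # (number of documents, vocabulary size)
print("टेनिस" in vec.vocabulary_)
```

Each column of X corresponds to one word or word pair learned during fit; words that were never seen during fit are simply ignored at transform time.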

Model Fitting

We fit Logistic Regression on the training set. For a classifier, clf.score returns the mean accuracy rather than a coefficient of determination. On the training data it is 92.94%, which means the model classifies 92.94% of the training titles correctly.

clf = LogisticRegression(random_state=0).fit(X_train_vectorized, Y_train)
predictions = clf.predict(vectorizer.transform(X_test))
clf.score(X_train_vectorized, Y_train)
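To show what score reports for a classifier, here is a sketch on toy numeric data (one made-up feature and two classes, not my news data). It compares score with a manual accuracy calculation and also prints class probabilities from predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, well-separated data: one feature, two classes (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(random_state=0).fit(X, y)

# For classifiers, score() is the fraction of correct predictions.
manual_accuracy = np.mean(clf.predict(X) == y)
print(clf.score(X, y), manual_accuracy)

# predict_proba returns one probability per class; each row sums to 1.
proba = clf.predict_proba([[5.0]])
print(proba.round(3))
```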


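Prediction with the trained model

prd = clf.predict(t)  # assumed: class ids for the six example titles above
nprd = []
for k,v in classes.items():
  for p in prd:
    if v==p:
      nprd.append(k)  # names come out grouped by class, not in input order
print(nprd)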
['news', 'news', 'news', 'news', 'business', 'business']
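The nested loop walks through the classes one by one, so the names it collects are grouped by class rather than matched to each title. If the category of each title in input order is wanted, one option is to invert the mapping once and look up every prediction. The mapping and the predicted ids below are made up for illustration.

```python
# Hypothetical label mapping in the same shape as `classes` above.
classes = {"news": 0, "business": 1, "sports": 2}
prd = [1, 0, 2, 0]  # made-up predicted ids, in input order

# Invert the dictionary once, then look up each prediction in order.
id_to_name = {idx: name for name, idx in classes.items()}
names_in_order = [id_to_name[p] for p in prd]

print(names_in_order)  # ['business', 'news', 'sports', 'news']
```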

Accuracy of Model

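Our model accuracy on the test data is only 77.41% with the Logistic Regression algorithm. That is well below the 92.94% on the training data, which suggests the model overfits the training titles to some degree.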

print("Accuracy:", 100 * sum(predictions == Y_test) / len(predictions), '%')
Accuracy: 77.41211667913238 %
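The same number can also be computed with sklearn.metrics, and a confusion matrix shows which categories get mixed up. This is a sketch on made-up labels, not the results from my data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted class ids (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)   # fraction of matching labels
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f"Accuracy: {100 * acc:.2f} %")
print(cm)
```

Off-diagonal entries in the matrix count the titles of one category that were predicted as another.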

Fit Decision Trees (Random Forest)

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# clf = tree.DecisionTreeClassifier()  # single-tree alternative
clf = RandomForestClassifier()  # an ensemble of decision trees; this is the model fitted below
clf.fit(X_train_vectorized, Y_train)

predictions = clf.predict(vectorizer.transform(X_test))
print("Accuracy:", 100 * sum(predictions == Y_test) / len(predictions), '%')
Accuracy: 75.39267015706807 %
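prd = clf.predict(t)  # assumed: class ids for the six example titles above
nprd = []
for k,v in classes.items():
  for p in prd:
    if v==p:
      nprd.append(k)  # names come out grouped by class, not in input order
print(nprd)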
['news', 'news', 'news', 'news', 'news', 'entertainment']
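Since the same vectorize, fit, and predict steps repeat for every classifier, one possible way to compare models is to wrap each one in a scikit-learn pipeline. The tiny English corpus and the model names below are made up for illustration and are not my scraped data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus (illustrative only).
train_texts = ["goal match team", "team win match", "bank loan rate", "rate market bank"]
train_labels = ["sports", "sports", "business", "business"]
test_texts = ["match team", "bank rate"]

models = {
    "logistic_regression": LogisticRegression(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # Each pipeline owns its vectorizer, so raw text goes in directly.
    pipe = make_pipeline(CountVectorizer(), model)
    pipe.fit(train_texts, train_labels)
    results[name] = list(pipe.predict(test_texts))

print(results)
```

A pipeline also guarantees that the vectorizer is fitted only on the training texts, which keeps test data out of the vocabulary.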