
Sentiment Analysis on Most Popular Indonesian Media

What is Sentiment Analysis?

Sentiment analysis determines the attitude a writer expresses toward a subject, for example, the stance of a news article toward a presidential candidate. It can classify whether the writer holds a positive, negative, or neutral view of the news subject.

Sentiment analysis falls under text mining: it analyzes opinions and evaluates the attitudes, judgments, and emotions the writer expresses in a news text.

Sentiment analysis can be a powerful tool for gathering information and gauging the collective sentiment across news coverage.

In practice, sentiment analysis relies on several components: Newspaper3k and BeautifulSoup for web scraping, and TextBlob for text analysis.

A key output of TextBlob's analysis is "polarity," a score that summarizes the feeling and emotion expressed in the text.

Polarity refers to the overall orientation of sentiment in the text: whether it conveys happiness, disappointment, or neutrality toward a subject, in other words, whether the sentiment is positive, negative, or neutral.

How polarity scores are mapped to categories can vary with the use case. For this news analysis, we inspected several sample results to check that the assigned labels align with the underlying polarity scores.

In the sentiment analysis of the news that we conducted, positive sentiment covers polarity scores from 0.33 to 1, negative sentiment covers -1 to -0.33, and neutral sentiment covers -0.32 to 0.32.
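These thresholds can be expressed as a small helper function. This is a sketch; the cut-offs are the ones chosen for this analysis, not TextBlob defaults:

```python
def classify_sentiment(polarity):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label,
    using the thresholds chosen for this analysis."""
    if polarity >= 0.33:
        return "positive"
    if polarity <= -0.33:
        return "negative"
    return "neutral"

print(classify_sentiment(0.5))   # positive
print(classify_sentiment(-0.7))  # negative
print(classify_sentiment(0.1))   # neutral
```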

RSS Feeds and Google News in Data Retrieval

The initial step for sentiment analysis starts with data retrieval from websites. Data retrieval is done using RSS (Really Simple Syndication) and Google News to easily fetch current news data.

In practice, we search Google News and can receive the results as an RSS feed by taking the result URL and replacing 'news.google.com/' with 'news.google.com/rss/'.

We formulate the Google News RSS feed URL for top news based on topic, geographical location, and language. Retrieving data using RSS feeds and Google News simplifies obtaining comprehensive data.

Unfortunately, each feed returns at most 100 URLs per domain. However, we can work around this by splitting news retrieval into smaller time windows.

RSS feeds have the advantage of organizing article links in a way that's easy to find and extract compared to regular websites.

Another advantage is that all RSS feeds have the same standard format. Therefore, the same code can often be used to extract article links from more than one RSS feed.
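Because the format is uniform, extracting article links takes only a few lines of BeautifulSoup against any feed. A minimal sketch on an inline RSS fragment (the "xml" parser requires lxml to be installed):

```python
from bs4 import BeautifulSoup

# A minimal RSS fragment; Google News feeds follow the same <item> layout.
rss_xml = """
<rss version="2.0"><channel>
  <item><title>First story</title><link>https://example.com/a</link></item>
  <item><title>Second story</title><link>https://example.com/b</link></item>
</channel></rss>
"""

soup = BeautifulSoup(rss_xml, "xml")  # the "xml" parser needs lxml installed
links = [item.link.text for item in soup.find_all("item")]
print(links)  # ['https://example.com/a', 'https://example.com/b']
```

The same two lines of extraction code work unchanged on a live Google News feed, which is exactly the advantage of the standard format.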

In the process, we set start and end dates for the news to be retrieved, which works around the RSS limitation of 100 URLs per domain.

Additionally, if we want to retrieve news related to the presidential candidate "Ganjar Pranowo," we first set the RSS feed with a URL containing the keyword "Ganjar Pranowo."
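As a sketch, such a feed URL can be assembled from the keyword, target site, and date window. The `site:`, `after:`, and `before:` query operators and the `hl`/`gl`/`ceid` parameters follow Google News search conventions; the site and dates below are illustrative:

```python
from urllib.parse import quote

def google_news_rss_url(keyword, site, date_start, date_end,
                        lang="id", country="ID"):
    """Build a Google News RSS search URL for one site and date window."""
    query = f"{keyword} site:{site} after:{date_start} before:{date_end}"
    return ("https://news.google.com/rss/search?q=" + quote(query)
            + f"&hl={lang}&gl={country}&ceid={country}:{lang}")

url = google_news_rss_url("Ganjar Pranowo", "kompas.com",
                          "2023-07-01", "2023-07-15")
print(url)
```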

Web Scraping with Newspaper3k and BeautifulSoup

Web scraping is the process of automatically extracting data or information from web pages in a structured manner. This technique is used to gather data from various online sources such as websites, forums, blogs, or social media platforms.

Web scraping involves extracting text, images, links, tables, and other elements from web pages. Web scraping is done using Python libraries such as Newspaper3k and BeautifulSoup.

Newspaper3k is used to perform web scraping on news articles. Under the hood, it uses the requests library for downloads and parses pages with lxml, with BeautifulSoup available as a dependency.

Newspaper3k not only extracts text data from articles but can also retrieve other data such as publication date, author, URL, images, and videos. Another reason we use Newspaper3k is that it can summarize an article, giving a sense of its content without reading it in full.

Newspaper3k can also perform more advanced functions like finding RSS feeds and fetching article URLs from primary news sources using the requests library.

In the process, we import the Article class from the Newspaper3k library and extract information from each article. We also call the nlp() function to extract keywords from the article using Natural Language Processing (NLP) and to generate a summary.

We also wrap the scraping code in a try/except block to handle bad URLs or other issues that might interrupt the program.

The 'ArticleException' error raised by Newspaper3k is caught in the 'except' block; its import is included at the top of the script.

Sentiment Analysis with TextBlob and NLTK

After performing web scraping using Newspaper3k and BeautifulSoup, we use the TextBlob and NLTK libraries to process and analyze the text in the collected news articles.

Sentiment analysis with TextBlob is a part of the Natural Language Processing (NLP) process used to understand, manipulate, and analyze human language by computers.

The NLP process involves a combination of linguistic techniques, statistics, and machine learning to achieve an accurate understanding and analysis of text and human language.

Traditionally, each NLP step is performed separately: tokenization, text cleaning, stopword removal, stemming and lemmatization, named entity recognition, part-of-speech (POS) tagging, and so on.

However, the TextBlob library provides various text processing features, such as language detection, word tokenization, phrase modeling, sentiment analysis, part-of-speech analysis, and more, all within a single library, making it easy to automate these processes.

As a result, TextBlob can be used to analyze the sentiment of a text, calculate word frequencies, perform part-of-speech analysis to identify words in a sentence, and conduct spell checks.

TextBlob performs sentiment analysis by calculating average scores for various types of words in the text and then assigning a polarity score to the text.

Sentiment Analysis Process on News Articles in Sequence Stats

Several steps are required to obtain the final sentiment analysis results along with the categorization of positive, negative, and neutral sentiments. This is accomplished through the following steps:

  1. Install the Newspaper3k, BeautifulSoup, and TextBlob libraries if they are not already available in the Python environment.
  2. Import the installed libraries, such as requests, Article, ArticleException from Newspaper3k, TextBlob, BeautifulSoup, and NLTK. NLTK will be automatically installed if TextBlob is successfully installed.
  3. Download the 'punkt' package to enable tokenization in TextBlob.
  4. Set the time range for retrieving news articles that will be processed using RSS feeds. We use the dateutil.rrule library to set appropriate time intervals.
  5. Use the zip() function to pair the start and end dates into tuples, and place these pairs into a large list with 'datetime' format.
  6. Create a list of news sites to be scraped. Multiple news websites can be specified for sentiment analysis.
  7. Use a loop to iterate over each news site and the specified date ranges.
  8. Formulate Google News URLs with RSS and search keywords based on the presidential candidate's name for analysis. The URL is constructed with the previously recorded date range (date1 and date2) and news site.
  9. Send an HTTP request to the URL and fetch the Google News RSS feed results.
  10. Parse the RSS feed using BeautifulSoup to extract all 'item' elements representing articles.
  11. Initiate a loop through the articles to retrieve the URL of each article and store it in an article list.
  12. For each article link, the code performs the following steps:
  • Utilize the Newspaper3k library to download, parse, and analyze the article text.
  • Check if the presidential candidate's keyword provided at the beginning is present in the article text.
  • Perform sentiment analysis on the article text using TextBlob.

  13. Save the sentiment analysis results, including polarity and subjectivity, as a tuple and add it to the data list.
  14. Categorize the sentiment: polarity scores from 0.33 to 1 are positive, from -1 to -0.33 are negative, and from -0.32 to 0.32 are neutral.

The final sentiment analysis results and sentiment categories are saved in a CSV file.

This process aims to gather news articles from designated news sites, perform sentiment analysis on relevant articles related to the "Presidential Candidate" keyword, and store the results in a CSV file that includes various sentiment-related information.

Article Source

  1. Newspaper3k, https://pypi.org/project/newspaper3k/
  2. TextBlob: Simplified Text Processing, https://textblob.readthedocs.io/en/dev/
  3. Scraping websites with Newspaper3k in Python, https://www.geeksforgeeks.org/scraping-websites-with-newspaper3k-in-python/

Last updated on August 10, 2023
by Siti Nuradilla