Taking control of your own news through a Python script

DISCLAIMER: To understand most of this article, you need to be familiar with the Python programming language. But you could still read on just for fun!

RELEVANCE

This article is relevant to data science as Python plays an important role in the field. Among other technologies, data scientists use Python extensively to design data pipelines and machine learning models. Moreover, this project includes data in the form of articles from websites.

THE WHY

One of the most important things you’ll do on your journey to becoming a data scientist is read news related to data science. I love reading articles about data science on a daily basis. My major source of such news was Google Feed (previously called Google Now). After I added data science as one of the topics I am interested in, Google provided me with a good and, more importantly, well-sourced list of articles on the topic. But last year, the company changed its policies and decided to show content based on the user’s location. Consequently, I lost my one and only reliable source of data science news. I was obviously quite frustrated, so I set about building a solution of my own. The next section describes that solution.

THE HOW


I knew that there were Python libraries that let you parse RSS feeds from different websites and obtain information from them. So my working principle was to obtain articles from certain websites through their RSS feeds and compile those articles in a format that suits my interests. I used six blogs: Planet Python, Python on Reddit, Machine Learning Mastery, MIT News – Artificial Intelligence, Machine Learning Weekly and Machine Learning on Medium. For parsing the RSS feeds, I made use of the feedparser module, which made going through the articles on the websites very easy: it returns the entire content of a site’s feed as a dictionary. Initially, I chose to compile the articles into a text file. But, having done that, I saw how inconvenient a text file was. So, instead of a text file, I settled on creating an HTML file in which each blog’s name is displayed, followed by an unordered list of all the articles from that blog as hyperlinks (see image later). For creating the HTML file, I made use of the BeautifulSoup module. Finally, I mailed the file to myself using the built-in smtplib module.

THE CODE

So, we begin by importing all the necessary modules.


I am importing the feedparser module to parse the RSS feeds, followed by the smtplib module to send the email, the BeautifulSoup module to create the HTML file and some other modules that I will need to send the HTML file as an attachment in the email. 

Next, I define the following list variable.

This list contains the URLs of the websites I am obtaining my articles from.
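The list was shown as an image; a sketch of what it might contain. Only the Medium feed URL appears verbatim elsewhere in the post — the others are plausible guesses at each blog’s feed address, and Machine Learning Weekly’s is omitted because I could not infer it:

```python
# Hypothetical reconstruction: only the Medium URL is confirmed by the post;
# the rest are plausible feed addresses for the blogs named above.
URLs = ['https://planetpython.org/rss20.xml',
        'https://www.reddit.com/r/Python/.rss',
        'https://machinelearningmastery.com/feed/',
        'https://news.mit.edu/rss/topic/artificial-intelligence2',
        'https://medium.com/feed/tag/machine-learning']
```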

Next, I define the following function.

It is a fairly simple function. I loop through all the URLs in the list variable defined above and store each parsed feed in the feed variable. This variable is a dictionary which looks something like this:

{'feed': {'title': 'Machine Learning on Medium', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://medium.com/feed/tag/machine-learning', …}

So, I create a dictionary which has the following structure:

{'blog-name': {'article-number': {'title': <headline>, 'link': <link>}}}

The ‘blog-name’ key is set to the website’s title and its value is set to another dictionary. I found that all the articles are under the ‘entries’ key of the feed dictionary in the form of another dictionary. In this dictionary, each article is stored in the format:

{'title': <headline>, 'link': <link>, …}

So, I assign an 'article-number' to each article through the variable i, and set its value to another dictionary that imitates the above structure.

Thus, I get something like this in the end:

{'Planet Python': {1: {'title': 'Django Weblog: Django bugfix releases: 2.0.1 and 1.11.9', 'link': 'https://www.djangoproject.com/weblog/2018/jan/01/bugfix-releases/'}, 2: {'title': 'Python Data: Forecasting Time Series data with Prophet – Part 4', 'link': 'http://pythondata.com/forecasting-time-series-data-prophet-part-4/'}, …}

Having obtained all the necessary information from the websites, I pass the created dictionary to this function:


The function does the following:

  • Opens an HTML file named ‘links.html’.
  • Parses the file into a BeautifulSoup object.
  • Adds a title to the HTML file.
  • For every blog in the argument dictionary, creates an H3 tag whose text is the blog’s name.
  • For every blog in the argument dictionary, creates a UL (unordered list) tag to display the articles from that blog.
  • For every article in each blog, creates a list item containing a hyperlink whose text is the article’s headline.
  • Writes the modified BeautifulSoup object to the HTML file.
  • Returns the name of the HTML file.
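The steps above can be sketched with BeautifulSoup roughly like this; the tag ids and file name follow the generated HTML the post shows, but the exact implementation details are assumptions:

```python
from bs4 import BeautifulSoup

def create_html(articles, filename='links.html'):
    """Write the article dictionary out as a simple HTML page of link lists."""
    soup = BeautifulSoup('<html><head></head><body></body></html>', 'html.parser')
    title = soup.new_tag('title')
    title.string = 'Your Python News For Today!'
    soup.head.append(title)
    for blog, posts in articles.items():
        heading = soup.new_tag('h3', id=blog)      # H3 with the blog's name
        heading.string = blog
        soup.body.append(heading)
        ul = soup.new_tag('ul', id='links from ' + blog)
        for number, post in posts.items():
            li = soup.new_tag('li', id='{}-{}'.format(number, blog))
            a = soup.new_tag('a', href=post['link'])  # headline as a hyperlink
            a.string = post['title']
            li.append(a)
            ul.append(li)
        soup.body.append(ul)
    with open(filename, 'w') as f:
        f.write(soup.prettify())
    return filename
```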

In the end, the HTML file generated looks something like this:

<html>
<head>
<title>
Your Python News For Today!
</title>
</head>
<body>
<h3 id="Planet Python">
Planet Python
</h3>
<ul id="links from Planet Python">
<li id="1-Planet Python">
<a href="https://www.djangoproject.com/weblog/2018/jan/01/bugfix-releases/">
Django Weblog: Django bugfix releases: 2.0.1 and 1.11.9
</a>
</li>
<li id="2-Planet Python">
<a href="http://pythondata.com/forecasting-time-series-data-prophet-part-4/">
Python Data: Forecasting Time Series data with Prophet – Part 4
</a>
</li>
…
</ul>
</body>
</html>

Having done this, I pass the returned file name to this function:


It does the following:

  • Defines variables which store information such as the sender email, recipient email, etc.
  • Creates a MIMEMultipart message.
  • Creates an attachment of the file that was passed to it as the argument.
  • Attaches this attachment to the message created.
  • Mails the complete message to my email address.
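The steps above can be sketched with smtplib and the email package like this; the credentials are placeholders, and the Gmail-style SMTP details (STARTTLS on port 587) are assumptions. Building the message is split from sending it so the construction can be tested without a mail server:

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

def build_message(filename, sender, recipient):
    """Build a multipart email with the HTML file attached."""
    msg = MIMEMultipart()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = 'Your Python News For Today!'
    with open(filename, 'rb') as f:
        part = MIMEBase('text', 'html')
        part.set_payload(f.read())
    encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment', filename=filename)
    msg.attach(part)
    return msg

def send_mail(filename):
    sender = 'sender@example.com'      # placeholder: your own address
    password = 'your-app-password'     # placeholder credentials
    recipient = 'recipient@example.com'
    msg = build_message(filename, sender, recipient)
    # Assumes an SMTP server reachable over STARTTLS on port 587 (e.g. Gmail).
    with smtplib.SMTP('smtp.gmail.com', 587) as server:
        server.starttls()
        server.login(sender, password)
        server.send_message(msg)
```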

The main() function looks like this:

It just calls all the functions one by one.

In the end, the HTML file looks like this when opened in a browser:

(Screenshot omitted: the page rendered in a browser, showing only one blog.)

There you have it! That’s all of the code!

METRICS

I tested the script on my slow BSNL (a major telecom company in my country) broadband connection, which averages 150–200 KB/s; it took approximately 18 seconds to run. It should be faster on a better Internet connection.

CONCLUSION

Using this script, I am able to get my news in the format I want, covering my favorite topics. It is also highly customizable: to include another website as a source, you only need to add a valid feed URL to the list of URLs declared at the beginning. And it can easily be extended to mail the HTML file to any email address of the user’s choice.

You can find the complete code, the README and the HTML file on my GitHub repository here.
In case you decide to run the script on your own computer, please change the sender email and password, and the recipient email.
