Predictive Hacks

Web Article NLP Analysis in Python


As a Data Scientist or NLP specialist, you may need to extract web articles and analyze them. An easy way to get the clean text of a web page is the newspaper3k library. For this tutorial, we will work with the Predictive Hacks Data Science blog, starting with the “How to get a Data Science Job without Experience” article.

You can install the newspaper3k library by running:

conda install -c conda-forge newspaper3k

Or

pip3 install newspaper3k

How to Extract Text from a Web Article

Let’s see how easily we can get the text of the Predictive Hacks article.

from newspaper import Article

# enter the required URL
url = 'https://predictivehacks.com/how-to-get-a-data-science-job-without-experience/'

article = Article(url)
article.download()
article.parse()

# Get the article text:
print(article.text)
What is “Working Experience” in practice?

Let’s assume that you are a candidate who has recently graduated from the University holding a Bsc/Msc degree in Mathematics, Computer Science, Engineering, or other related fields, and you would like to start a career as Data Scientist. The grades and the University degree indicate your theoretical background and your ability to learn new things. However, in the real world, things are different from the University, not necessarily more difficult, and somehow the hiring managers would like to know more about you and especially the way that you work.

A related “working experience” proves that you can do the required job. That you are reliable you have enthusiasm for your science and that you are good at your job. In other words, for the company, a candidate with related working experience implies a lower risk.

But as we said above you lack working experience, let’s see how you can prove to them that you are reliable and good at your job so that they can trust you, i.e. hire you!

The Era of Ratings and Reviews

Definitely, we live in an era of ratings and reviews. Don’t you look at the reviews on Booking.com and Airbnb when you plan where to stay, don’t you look at TripAdvisor and Google Reviews when you choose a restaurant to eat, don’t you look at the reviews of a movie before you watch it and so on and so forth.

The same applies to companies. They want to have a review about you. Bear in mind that is common to ask for reference before they hire someone and also they prefer to hire someone who is a referral of an existing employee. This occurs because they want to learn more about their potential colleague before they proceed to the hiring.

Let us see how you can get “reviews” from the market and in parallel to show your work.

Stack Overflow

Nobody can ignore a user with a high reputation in Stack Overflow. Clearly, it indicates that you are really passionate about your science, you can be a good team player by providing solutions to your team, you are competent in your field and you contribute to your community. In addition, you give the opportunity to everybody to look at your profile and your work.

Blogging

You will grab the hiring manager’s attention if you write in a blog. Let’s assume that you write in medium and you have many followers and claps. Having many followers implies that some people have found your article useful. It is like a positive review obtained from the market and the community. This is something that you can use as a working experience and also can be a show-case of what you can do. Also, your articles reflect somehow your working style and your field of interest and expertise.

Freelancer in a Web Platform

There are many platforms such as Upwork, Fiverr, PeoplePerHour where you can undertake data science projects. This is a good start and can be used as a related working experience. Moreover, in these platforms, you receive ratings and reviews from your clients. Definitely, a strong profile in such platforms proves your ability to deliver in high quality. It indicates also your ability to communicate with the clients, to be reliable and to have self-awareness.

GitHub Account

I would recommend having a GitHub account which is like the portfolio of the Data Scientist. By sharing your profile you give them the possibility to see the way that you code, the projects that you have done and so on.

Kaggle Account

Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals. If you participate in many competitions and you have a relatively good score it indicates that you are good in Data Science, that you have worked with real-life problems and you have dealt with many different problems.

Summing Up

If you lack working experience and you see that in most job openings ask for it, do not get disappointed. If you love your work and you are good at it is a matter of time when you will enter the industry. The most dangerous part is to have a gap in your career. For that reason, during the time that you do not work and you are looking for a job try to invest in the things that we mentioned above. By doing this, you will improve your skills such as writing, coding etc, you will get hands-on experience and at the same time, you can use them as a related working experience.

Thus, a candidate with 1000 points reputation in Stack Overflow, labeled as Top Rated freelancer in the Web Platforms with 500 followers in Medium and with an active Kaggle profile, his/her job application will stand out and will be asked for an interview where at this point they assess also other things like the cultural fit, the interpersonal skills etc.

We can extract other features like the “top image”, the “authors” and the “publish date”.
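
As a minimal sketch of how these extra features could be collected, the helper below reads the title, authors, publish_date and top_image attributes of a parsed Article. The function name article_metadata and the stand-in object with made-up values are our own assumptions, used so that the snippet runs without a network connection; with the real library you would pass the article object from above after calling article.parse().

```python
from datetime import datetime
from types import SimpleNamespace


def article_metadata(article):
    """Collect a few metadata fields from a parsed newspaper3k Article.

    Only attribute access is used, so any object exposing the same
    attribute names works (handy for trying the function offline).
    """
    publish_date = article.publish_date
    return {
        "title": article.title,
        "authors": list(article.authors),
        # publish_date is a datetime when the page exposes one, otherwise None
        "publish_date": publish_date.isoformat() if publish_date else None,
        "top_image": article.top_image,
    }


# Offline stand-in with made-up values, used only to show the output shape.
# With the real library, pass the `article` object after article.parse().
fake_article = SimpleNamespace(
    title="Example title",
    authors=["Jane Doe"],
    publish_date=datetime(2020, 5, 1),
    top_image="https://example.com/image.png",
)
print(article_metadata(fake_article))
```

Note that publish_date can be None when the page does not expose a date, which is why the helper checks it before formatting.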

NLP Analysis

The newspaper3k library provides some basic NLP features like “keywords extraction” and “text summarization”. Let’s have a look at the examples below.

Keywords Extraction

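# nlp() must run after download() and parse(); it fills in keywords and summary.
# It relies on NLTK tokenizer data, so you may need nltk.download('punkt') first.
article.nlp()
article.keywords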
['hacks',
 'data',
 'experience',
 'working',
 'things',
 'reviews',
 'profile',
 'related',
 'science',
 'good',
 'job',
 'predictive']
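
To build some intuition for what a keyword extractor of this kind does, here is a simplified frequency-based sketch. It is an illustration under our own assumptions (a tiny hand-made stopword list and a hypothetical top_keywords helper), not newspaper3k's actual implementation, so its results will differ from the list above.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "for", "in", "is", "of",
             "that", "the", "to", "you", "your", "with", "can", "do"}


def top_keywords(text, n=5):
    """Return the n most frequent non-stopword words in text, lowercased."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


sample = (
    "A related working experience proves that you can do the required job. "
    "A candidate with related working experience implies a lower risk."
)
print(top_keywords(sample, n=3))
```

The idea is simply that words repeated often, once common filler words are removed, tend to describe what the text is about.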

Text Summarization

print(article.summary)
A related “working experience” proves that you can do the required job.
In other words, for the company, a candidate with related working experience implies a lower risk.
Freelancer in a Web PlatformThere are many platforms such as Upwork, Fiverr, PeoplePerHour where you can undertake data science projects.
Kaggle AccountKaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.
By doing this, you will improve your skills such as writing, coding etc, you will get hands-on experience and at the same time, you can use them as a related working experience.

Extract the Article URLs

We can extract the URLs of the posts that appear on the home page. For example:

import newspaper

predictive_hacks = newspaper.build('https://predictivehacks.com')

for article in predictive_hacks.articles:
    print(article.url)
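
The list of URLs may contain duplicates or links that point to other sites, so a small post-processing step can help. The sketch below uses a hypothetical helper, same_site_urls, and made-up example URLs; in practice you would pass [a.url for a in predictive_hacks.articles]. Also keep in mind that, as far as we know from the library's documentation, newspaper caches articles it has already seen, so a second run may return fewer URLs unless you call newspaper.build with memoize_articles=False.

```python
from urllib.parse import urlparse


def same_site_urls(urls, domain):
    """Keep unique URLs hosted on `domain` or its subdomains, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host != domain and not host.endswith("." + domain):
            continue
        # Ignore a trailing slash so the same page is not listed twice
        key = url.rstrip("/")
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


# Made-up URLs for illustration only.
example_urls = [
    "https://predictivehacks.com/post-a/",
    "https://predictivehacks.com/post-a",
    "https://example.org/elsewhere/",
    "https://predictivehacks.com/post-b/",
]
print(same_site_urls(example_urls, "predictivehacks.com"))
```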

The Takeaway

I have found newspaper3k to be really fast, and it is excellent for extracting the text from a web article. However, for text summarization I would prefer Transformers.
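
As a rough sketch of what the Transformers route could look like, the snippet below splits the article text into chunks and summarizes each one with the Hugging Face summarization pipeline. The helper names chunk_text and summarize_with_transformers and the 2000-character limit are our own assumptions, and the sketch assumes a transformers version that ships the "summarization" pipeline task. The pipeline downloads a default model on first use, so only the chunking helper is exercised here.

```python
def chunk_text(text, max_chars=2000):
    """Group paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and len(current) + len(paragraph) + 1 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current} {paragraph}".strip()
    if current:
        chunks.append(current)
    return chunks


def summarize_with_transformers(text, max_chars=2000):
    """Summarize each chunk with a summarization pipeline and join the results."""
    # Imported lazily so chunk_text can be used without transformers installed
    from transformers import pipeline

    summarizer = pipeline("summarization")  # default model, downloaded on first use
    summaries = [
        # truncation=True guards against chunks longer than the model's input limit
        summarizer(chunk, truncation=True)[0]["summary_text"]
        for chunk in chunk_text(text, max_chars)
    ]
    return " ".join(summaries)


print(chunk_text("first paragraph\nsecond paragraph", max_chars=20))
```

With the article from above, the call would be summarize_with_transformers(article.text).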
