As a data scientist or NLP specialist, you may need to extract web articles and analyze them. An easy way to extract clean text from web pages is the newspaper3k library. For this tutorial, we will work with the Predictive Hacks Data Science blog, starting with the “How to get a Data Science Job without Experience” article.
You can install the newspaper3k library with either conda or pip:
conda install -c conda-forge newspaper3k
pip3 install newspaper3k
How to Extract Text from a Web Article
Let’s see how easily we can get the text of the Predictive Hacks article.
from newspaper import Article

# enter the required URL
url = 'https://predictivehacks.com/how-to-get-a-data-science-job-without-experience/'

article = Article(url)
article.download()
article.parse()

# Get the article text:
print(article.text)
What is “Working Experience” in practice? Let’s assume that you are a candidate who has recently graduated from the University holding a Bsc/Msc degree in Mathematics, Computer Science, Engineering, or other related fields, and you would like to start a career as Data Scientist. The grades and the University degree indicate your theoretical background and your ability to learn new things. However, in the real world, things are different from the University, not necessarily more difficult, and somehow the hiring managers would like to know more about you and especially the way that you work. A related “working experience” proves that you can do the required job. That you are reliable you have enthusiasm for your science and that you are good at your job. In other words, for the company, a candidate with related working experience implies a lower risk. But as we said above you lack working experience, let’s see how you can prove to them that you are reliable and good at your job so that they can trust you, i.e. hire you! The Era of Ratings and Reviews Definitely, we live in an era of ratings and reviews. Don’t you look at the reviews on Booking.com and Airbnb when you plan where to stay, don’t you look at TripAdvisor and Google Reviews when you choose a restaurant to eat, don’t you look at the reviews of a movie before you watch it and so on and so forth. The same applies to companies. They want to have a review about you. Bear in mind that is common to ask for reference before they hire someone and also they prefer to hire someone who is a referral of an existing employee. This occurs because they want to learn more about their potential colleague before they proceed to the hiring. Let us see how you can get “reviews” from the market and in parallel to show your work. Stack Overflow Nobody can ignore a user with a high reputation in Stack Overflow. 
Clearly, it indicates that you are really passionate about your science, you can be a good team player by providing solutions to your team, you are competent in your field and you contribute to your community. In addition, you give the opportunity to everybody to look at your profile and your work. Blogging You will grab the hiring manager’s attention if you write in a blog. Let’s assume that you write in medium and you have many followers and claps. Having many followers implies that some people have found your article useful. It is like a positive review obtained from the market and the community. This is something that you can use as a working experience and also can be a show-case of what you can do. Also, your articles reflect somehow your working style and your field of interest and expertise. Freelancer in a Web Platform There are many platforms such as Upwork, Fiverr, PeoplePerHour where you can undertake data science projects. This is a good start and can be used as a related working experience. Moreover, in these platforms, you receive ratings and reviews from your clients. Definitely, a strong profile in such platforms proves your ability to deliver in high quality. It indicates also your ability to communicate with the clients, to be reliable and to have self-awareness. GitHub Account I would recommend having a GitHub account which is like the portfolio of the Data Scientist. By sharing your profile you give them the possibility to see the way that you code, the projects that you have done and so on. Kaggle Account Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals. If you participate in many competitions and you have a relatively good score it indicates that you are good in Data Science, that you have worked with real-life problems and you have dealt with many different problems. 
Summing Up If you lack working experience and you see that in most job openings ask for it, do not get disappointed. If you love your work and you are good at it is a matter of time when you will enter the industry. The most dangerous part is to have a gap in your career. For that reason, during the time that you do not work and you are looking for a job try to invest in the things that we mentioned above. By doing this, you will improve your skills such as writing, coding etc, you will get hands-on experience and at the same time, you can use them as a related working experience. Thus, a candidate with 1000 points reputation in Stack Overflow, labeled as Top Rated freelancer in the Web Platforms with 500 followers in Medium and with an active Kaggle profile, his/her job application will stand out and will be asked for an interview where at this point they assess also other things like the cultural fit, the interpersonal skills etc.
We can extract other features like the “top image”, the “authors” and the “publish date”.
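The snippet for these fields is not shown in the post, so here is a minimal sketch of how they can be read. It assumes the `authors`, `publish_date` and `top_image` attributes that newspaper3k fills in after `parse()`; the helper names `fetch_article` and `article_metadata` are made up for illustration.

```python
def fetch_article(url):
    """Download and parse one article (needs network access)."""
    # Imported here so the helper below also works on an already parsed article.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return article


def article_metadata(article):
    """Return the metadata fields of a parsed newspaper3k Article as a dict."""
    return {
        "authors": list(article.authors),      # list of author names, may be empty
        "publish_date": article.publish_date,  # datetime, or None if not detected
        "top_image": article.top_image,        # URL of the main image, as a string
    }


# Example usage (hypothetical helper names, same URL as in the text example):
# article = fetch_article('https://predictivehacks.com/how-to-get-a-data-science-job-without-experience/')
# print(article_metadata(article))
```

Not every site exposes this metadata, so an empty author list or a missing publish date is a normal result and not necessarily an error.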
The newspaper3k library provides some basic NLP features like “keywords extraction” and “text summarization”. Let’s have a look at the examples below.
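The code that produces the keywords and the summary printed below is not shown in the post. A minimal sketch, assuming an `article` object that has already been downloaded and parsed: newspaper3k fills in `keywords` and `summary` after a call to `nlp()`, which relies on NLTK's `punkt` tokenizer data being installed (for instance via `nltk.download('punkt')`). The helper name `keywords_and_summary` is an illustrative choice.

```python
def keywords_and_summary(article):
    """Run newspaper3k's built-in NLP step and return (keywords, summary).

    `article` must already be downloaded and parsed.
    """
    article.nlp()  # fills in article.keywords and article.summary
    return article.keywords, article.summary


# Example usage, with an `article` that was downloaded and parsed beforehand:
# keywords, summary = keywords_and_summary(article)
# print(keywords)
# print(summary)
```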
['hacks', 'data', 'experience', 'working', 'things', 'reviews', 'profile', 'related', 'science', 'good', 'job', 'predictive']
A related “working experience” proves that you can do the required job. In other words, for the company, a candidate with related working experience implies a lower risk. Freelancer in a Web PlatformThere are many platforms such as Upwork, Fiverr, PeoplePerHour where you can undertake data science projects. Kaggle AccountKaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals. By doing this, you will improve your skills such as writing, coding etc, you will get hands-on experience and at the same time, you can use them as a related working experience.
Extract the URLs of the Articles
We can extract the URLs of the posts that appear on the home page. For example:
import newspaper

predictive_hacks = newspaper.build('https://predictivehacks.com')

for article in predictive_hacks.articles:
    print(article.url)
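One thing to be aware of: by default `newspaper.build` remembers the articles it has already seen, so a second run against the same site may list fewer URLs. A hedged sketch, assuming you want the full list on every run and only the first few posts; `memoize_articles=False` is the newspaper3k option that turns that caching off, while `list_article_urls` and its `limit` parameter are names introduced here for illustration.

```python
def list_article_urls(source, limit=5):
    """Return the URLs of the first `limit` articles found in a built source."""
    return [article.url for article in source.articles[:limit]]


# Example usage (needs network access):
# import newspaper
# predictive_hacks = newspaper.build('https://predictivehacks.com', memoize_articles=False)
# for url in list_article_urls(predictive_hacks):
#     print(url)
```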
I have found newspaper3k to be really fast, and it is excellent for extracting the text from a web article. For text summarization, however, I would prefer Transformers.