We live in the age of digital marketing, and words matter more than ever. One of the most effective techniques in digital marketing is competitor analysis combined with keyword research: in other words, finding out what our competitors talk about. This is mostly useful for Search Engine Optimization (SEO), but also for generating blog post ideas and more.
Step 1: Get the text from a website
In this step, we will create a function that extracts the clean text from a URL so we can use it later for our analysis.
```python
import re
import urllib
import pandas as pd
import requests
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

stopWords = list(set(stopwords.words('english')))

def get_text(url):
    """Download a page and return its visible text, or False on failure."""
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        webpage = urlopen(req, timeout=5).read()
        soup = BeautifulSoup(webpage, "html.parser")
        texts = soup.find_all(string=True)  # use text=True on older BeautifulSoup versions
        # Keep only strings whose parent tag renders visible content
        res = u" ".join(
            t.strip() for t in texts
            if t.parent.name not in ['style', 'script', 'head', 'title', 'meta', '[document]']
        )
        return res
    except Exception:
        return False
```
Let’s look at an example.
```python
get_text('https://en.wikipedia.org/wiki/Machine_learning')[0:500]  # show only the first 500 characters
```
```
'CentralNotice Machine learning From Wikipedia, the free encyclopedia Jump to navigation Jump to search For the journal, see Machine Learning (journal) . "Statistical learning" redirects here. For statistical learning in linguistics, see statistical learning in language acquisition . Scientific study of algorithms and statistical models that computer systems use to perform tasks without explicit instructions Part of a series on Machine learning and data mining Problems Class'
```
Success! Now we can get the clean text from a website. But how exactly can we use it? Let’s jump into the next step.
Step 2: Get the URLs from competitors
A good way to find our strongest competitors is to take the top Google search results for a keyword of interest. We will reuse the code from a previous post, How To Scrape Google Results For Free Using Python.
```python
def google_results(keyword, n_results):
    """Return the URLs of the top n_results Google results for a keyword."""
    query = urllib.parse.quote_plus(keyword)  # format the keyword into URL encoding
    ua = UserAgent()
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(n_results)
    # The user agent must be sent as a header, not as a query parameter
    response = requests.get(google_url, headers={"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    result_div = soup.find_all('div', attrs={'class': 'ZINbbc'})
    # Each result links through /url?q=<target>&sa=...; extract the target URL
    results = [re.search(r'\/url\?q\=(.*)\&sa', str(i.find('a', href=True)['href']))
               for i in result_div if "url" in str(i)]
    links = [i.group(1) for i in results if i is not None]
    return links
```
Let’s say that we want to see our “competitors” for the keyword “machine learning blog”. Let’s get the top URLs using the google_results function, where the first argument is the keyword and the second is the number of results.
```python
google_results('machine learning blog', 10)
```

```
['https://towardsai.net/p/machine-learning/best-machine-learning-blogs-6730ea2df3bd',
 'https://machinelearningmastery.com/blog/',
 'https://towardsdatascience.com/how-to-start-a-machine-learning-blog-in-a-month-7eaf84692df9',
 'http://ai.googleblog.com/',
 'https://www.springboard.com/blog/machine-learning-blog/',
 'https://blog.ml.cmu.edu/',
 'https://blog.feedspot.com/machine_learning_blogs/',
 'https://aws.amazon.com/blogs/machine-learning/',
 'https://neptune.ai/blog/the-best-regularly-updated-machine-learning-blogs-or-resources',
 'https://www.stxnext.com/blog/best-machine-learning-blogs-resources/']
```
Step 3: Analyse the text and get the most important words
Let’s think: what makes a word important? For our analysis we will use three metrics: the average TF-IDF, the max TF-IDF, and the frequency. The pipeline is as follows. We get the text of every website (in our case, the top 12 results) and use those documents as the corpus for a TF-IDF vectorizer. From the resulting matrix we take the average and the maximum TF-IDF score of every word. Finally, we derive the frequency from the same TF-IDF matrix: a word counts as appearing in a URL if its score in that row is non-zero, and we report the percentage of URLs that contain it.
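Before the complete function, here is a minimal sketch of how these three metrics fall out of the TF-IDF matrix, using a made-up three-document corpus for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus standing in for the scraped websites (made up for illustration)
docs = ["machine learning blog",
        "deep learning blog posts",
        "data science and machine learning"]

v = TfidfVectorizer()
f = pd.DataFrame(v.fit_transform(docs).toarray(), columns=v.get_feature_names_out())

print(f.mean(axis=0))       # average TF-IDF of every word across documents
print(f.max(axis=0))        # max TF-IDF of every word
print((f > 0).sum(axis=0))  # number of documents containing the word
```

The complete function is the following.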
```python
def tf_idf_analysis(keyword):
    """Scrape the top Google results for a keyword and score every word."""
    links = google_results(keyword, 12)
    text = []
    for i in links:
        t = get_text(i)
        if t:  # skip pages that failed to download
            text.append(t)
    v = TfidfVectorizer(min_df=2, analyzer='word', ngram_range=(1, 2), stop_words=stopWords)
    x = v.fit_transform(text)
    f = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
    # Average and max TF-IDF score of every word across the URLs
    d = pd.concat([pd.DataFrame(f.mean(axis=0)), pd.DataFrame(f.max(axis=0))], axis=1)
    # A word occurs in a URL if its TF-IDF score in that row is non-zero
    tf = pd.DataFrame((f > 0).sum(axis=0))
    d = d.reset_index().merge(tf.reset_index(), on='index', how='left')
    d.columns = ['word', 'average_tfidf', 'max_tfidf', 'frequency']
    # Comment out the following line if you want the raw number of URLs instead;
    # the percentage makes more sense when we have a lot of URLs to check
    d['frequency'] = round((d['frequency'] / len(text)) * 100)
    return d
```
Now that our final function is ready, let’s have a look at our competitors in machine learning, using the keyword “machine learning blog”.
```python
x = tf_idf_analysis('machine learning blog')

# Remove the numbers, sort by max TF-IDF, and show the top-20 words
x[x['word'].str.isalpha()].sort_values('max_tfidf', ascending=False).head(20)
```
```
      word       average_tfidf  max_tfidf  frequency
929   google     0.098790       0.626160   67.0
254   aws        0.052512       0.550785   25.0
171   amazon     0.060131       0.537993   33.0
1472  model      0.058276       0.521179   33.0
307   blog       0.131429       0.385008   100.0
133   ai         0.109516       0.358522   83.0
1222  learning   0.191090       0.352528   100.0
717   end        0.036682       0.304649   58.0
1332  machine    0.158022       0.295191   100.0
525   cookies    0.023013       0.263509   17.0
1980  see        0.030134       0.255031   58.0
439   cmu        0.028235       0.253162   17.0
2242  towards    0.035054       0.245614   42.0
862   followers  0.022837       0.245576   17.0
1949  science    0.057179       0.240060   58.0
670   domain     0.021410       0.236214   25.0
739   entry      0.022586       0.233097   17.0
944   gradient   0.019838       0.233097   17.0
377   brownlee   0.021105       0.226498   25.0
537   courses    0.024935       0.218224   25.0
```
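Note that `str.isalpha()` above keeps only single words, since bigrams contain a space. Because the vectorizer is built with `ngram_range=(1, 2)`, the same dataframe already contains two-word phrases; here is a minimal sketch of pulling those out instead (assuming `x` is the dataframe from above):

```python
# Bigrams contain a space, so keep only rows whose 'word' has one
bigrams = x[x['word'].str.contains(' ')]
bigrams.sort_values('max_tfidf', ascending=False).head(20)
```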
As we can see, two of the top words are Amazon and AWS. Hmm, maybe we should write something about that too 😉. This can be a powerful tool for digital marketing, and there are many paid services that do exactly this. So start experimenting with it!