According to Indeed.com, Indeed is the #1 job site in the world, with over 250 million unique visitors every month. Indeed strives to put job seekers first, giving them free access to search for jobs, post resumes, and research companies. Every day, it connects millions of people to new opportunities.
In my last article on Medium, I explained how to load and store job positions from Indeed with Python. In this article, I would like to drill down into each job post, look at the word frequency in each job ad, and visualize it.
Getting The Data
In order to analyze the job posts, we have them as a DataFrame from my last article (we will use the same DataFrame in this article as df):
We use the links from the DataFrame to download the job descriptions.
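For reference, here is a minimal sketch of what df is assumed to look like. The 'Links' column name is the one used in the code later in this article; the sample rows and URLs are made up for illustration:

```python
import pandas as pd

# Hypothetical sample with the same structure as the DataFrame from the
# previous article: one row per job post, with a 'Links' column holding
# the job-page URLs.
df = pd.DataFrame({
    'Title': ['Data Scientist', 'Data Analyst'],
    'Links': ['https://www.indeed.com/viewjob?jk=abc123',
              'https://www.indeed.com/viewjob?jk=def456'],
})

print(df.loc[0, 'Links'])
```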
This is the list of libraries we need to import at the beginning:
from urllib.request import urlopen, Request
import pandas as pd
import bs4
import re
import string
This function takes the URL of a job description, scrapes the web page, and stores each non-empty line of text as a value in a Python list. The output of the "texts_list" function is a Python list containing the job description.
def texts_list(link):
    """Read the page, clean the body text, and return it as a list of non-empty lines."""
    html = urlopen(link)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    texts = soup.get_text()
    text_list = []
    for text in texts.split('\n'):
        if text != '':
            text_list.append(text)
    return text_list
I will put all job descriptions in a list called "transcripts"; at the same time, each one goes into the DataFrame:
transcripts = []
df['transcript'] = ''
for i in range(len(df)):
    text = texts_list(df.loc[i, 'Links'])
    df.at[i, 'transcript'] = text  # .at allows storing a list in a single cell
    transcripts.append(text)
At this point, I need to decide whether to analyze each job description one by one, or to combine them and analyze all job descriptions for a job title in one shot. I will do it all in one shot, so all of these lists are combined into another list called "Text":
Text = []
for i in range(len(transcripts)):
    for j in range(len(transcripts[i])):
        Text.append(transcripts[i][j])
Cleaning The Data
We have raw text data (in linguistics this is called a text corpus) which needs cleaning. There are some common data-cleaning techniques; we are going to take these steps to clean the text:
- Make text all lower case
- Remove punctuation
- Remove numerical values
- Remove common non-sensical text (\n)
- Tokenize text
- Remove stop words
import re
import string

def clean_text(text):
    """Make text lowercase and remove bracketed text, punctuation, numbers, and special characters."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[‘’“”…©]', '', text)
    text = re.sub('\n', '', text)
    return text

clean_Text = []
for i in range(len(Text)):
    clean_Text.append(clean_text(Text[i]))
Now, we can pickle the DataFrame for later use.
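A minimal sketch of the pickling step; the filename 'job_posts.pkl' and the sample data are hypothetical:

```python
import pandas as pd

# A tiny stand-in for the scraped DataFrame (made-up data).
df = pd.DataFrame({'Links': ['https://example.com/job1'],
                   'transcript': [['Line one', 'Line two']]})

# Persist the scraped transcripts so we do not re-download them next time.
df.to_pickle('job_posts.pkl')

# Later, reload the DataFrame with:
df2 = pd.read_pickle('job_posts.pkl')
print(df2.loc[0, 'transcript'])
```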
There are some words like "the", "a", etc. which are called stop words. We do not need these in the corpus, so we are going to remove them:
from nltk.corpus import stopwords

stop_words = set(stopwords.words(['english', 'french']))
add_list = ['cookies', 'etc', 'indeed', 'work',
            'browse', 'care', 'jobs', 'lab',
            'indeedcom', 'jobsave', 'new',
            'contentindeed', 'reviewsfind']
for i in add_list:
    stop_words.add(i)
As you see, we load stop words in English, and in my case I added French; you can add any other language to your list. In the third line, based on my experience, I found these words repeated in the text unnecessarily, so I made a list to add them to the stop words. You can add more words as you wish.
Now that we have fairly clean data, I am going to show the word frequency. I prefer to visualize it as a word cloud.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(stopwords=stop_words, background_color="white",
               random_state=42)
wc.generate(' '.join(clean_Text))

plt.figure(figsize=(15, 7))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
This is the output of the word cloud. You can judge whether the words make sense to you or whether some need to be removed. To remove words that do not make sense, it is enough to add them to "add_list".
You can also analyze each job post independently of the others. That will give you insight into each individual job post.
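As a sketch of that per-post approach, here is a plain word count with collections.Counter instead of a word cloud; the sample transcript and the small stop-word set are made up:

```python
from collections import Counter

# Hypothetical cleaned transcript for a single job post.
transcript = ['python developer', 'experience with python and sql',
              'python scripting required']

# Flatten the lines into words and drop a few stop words.
words = ' '.join(transcript).split()
stop_words = {'with', 'and'}
counts = Counter(w for w in words if w not in stop_words)

# Show the most frequent words for this single post.
print(counts.most_common(3))
```

Running this on each entry of "transcripts" (after cleaning) gives a per-post frequency table instead of one combined word cloud.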