Analyzing Job Posts on Indeed.com

According to Indeed.com, Indeed is the #1 job site in the world, with over 250 million unique visitors every month. Indeed strives to put job seekers first, giving them free access to search for jobs, post resumes, and research companies. Every day, it connects millions of people to new opportunities.

In my last article on Medium, I explained how to load and store job postings from Indeed with Python. In this article, I would like to drill down into each job post, look at the word frequencies in the job ads, and visualize them.

Getting The Data

We use the links stored in the DataFrame from the previous article to download the job descriptions.

This is the list of libraries we need to import at the beginning:

import requests
import bs4
from urllib.request import urlopen, Request
import pandas as pd
import re

This function takes the URL of a job description, scrapes the web page, and stores each non-empty line of text as an item in a Python list. The output of the “texts_list” function is a Python list that contains the lines of the job description.

def texts_list(link):
    """Read the page at `link` and return its non-empty lines of text as a list."""
    html = urlopen(link)
    # Name the parser explicitly so bs4 does not have to guess (and warn)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    texts = soup.get_text()
    text_list = []
    for text in texts.split('\n'):
        if text != '':
            text_list.append(text)
    return text_list
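
One thing to watch for: the text of a scraped page often contains lines made up only of spaces or tabs, and the `text != ''` check keeps those. Here is a minimal sketch of a stricter filter. The function name `non_empty_lines` and the sample string are made up for illustration; the string stands in for the output of `soup.get_text()`.

```python
def non_empty_lines(raw_text):
    """Split raw page text into lines, dropping blank and whitespace-only ones."""
    lines = []
    for line in raw_text.split('\n'):
        stripped = line.strip()
        if stripped:  # skips '' as well as lines of only spaces or tabs
            lines.append(stripped)
    return lines


# Made-up sample standing in for the text of a scraped page
sample = "Data Analyst\n   \n\nToronto, ON\n\tFull-time  \n"
print(non_empty_lines(sample))
```

Whether you want this depends on the pages you scrape; if the extra whitespace does not bother you, the simpler check is fine.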

I will collect all job descriptions in a list called “transcripts” and, at the same time, store them in the DataFrame:

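transcripts = []
for link in df['Links']:
    # Each job description comes back as a list of text lines
    transcripts.append(texts_list(link))
# One list per row; an object Series keeps each list in a single cell
df['transcript'] = pd.Series(transcripts, index=df.index, dtype=object)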

At this point, I need to decide whether to analyze each job description one by one or to combine them and analyze all the job descriptions for a job title in one shot. I will do it all in one shot, so all these lists are combined into another list called “Text”:

Text = []
for transcript in transcripts:
    for line in transcript:
        Text.append(line)
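
If you prefer a one-liner for this flattening step, the standard library can do it. This is only a sketch on made-up data; the names `sample_transcripts` and `flat_lines` are hypothetical and stand in for “transcripts” and “Text”.

```python
from itertools import chain

# Made-up sample: two job posts, each already split into lines
sample_transcripts = [
    ["Data Analyst", "SQL and Python required"],
    ["Data Engineer", "Experience with Spark"],
]

# chain.from_iterable walks each inner list in order and yields its items
flat_lines = list(chain.from_iterable(sample_transcripts))

print(flat_lines)
```

The result is the same flat list the nested loop builds, with the lines kept in their original order.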

Cleaning The Data

  • Make text all lower case
  • Remove punctuation
  • Remove numerical values
  • Remove common nonsensical text (\n)
  • Tokenize text
  • Remove stop words

import re
import string

def clean_text(text):
    """Lowercase the text and strip punctuation, digits, and stray symbols."""
    text = text.lower()
    text = re.sub(r'\[.*?\ ©] ', '', text)  # bracketed fragments ending in ' ©] '
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # words that contain digits
    text = re.sub('[‘’“”…]', '', text)  # curly quotes and ellipses
    text = re.sub('\n', '', text)  # leftover newlines
    return text

clean_Text = [clean_text(line) for line in Text]
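
To see what the cleaning steps do to a single line, it helps to try them on a small example. The sketch below uses a standalone helper with a made-up name, `clean_line`, that repeats only the lower-case, punctuation, and digit steps, and then tokenizes the result with `split()`. The sample sentence is invented.

```python
import re
import string


def clean_line(text):
    """Standalone copy of three cleaning steps, for experimenting on one string."""
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # any word containing a digit
    return text


# Made-up sample line
sample = "Senior Analyst, 3+ years of Python3 & SQL!"
tokens = clean_line(sample).split()  # split() also collapses repeated spaces
print(tokens)
```

Note that the digit rule removes whole words, so a term like “Python3” disappears together with the “3”. If such terms matter for your job title, you may want to relax that step.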

Now we can pickle the DataFrame for later use.

df.to_pickle('Indeed_df.pkl')

Some words, such as “the” and “a”, are called stop words. We do not need them in the corpus, so we are going to remove them:

from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

stop_words = set(stopwords.words(['english', 'french']))
add_list = ['cookies', 'etc', 'indeed', 'work',
            'browse', 'care', 'jobs', 'lab',
            'indeedcom', 'jobsave', 'new',
            'contentindeed', 'reviewsfind']
for i in add_list:
    stop_words.add(i)

As you can see, we load the English stop words, and I also added French; you can add any other language to your list. In my experience, the words in “add_list” are repeated throughout the text and are unnecessary, so I made a list to add them to the stop words. You can add more words as you wish.
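
Filtering against the stop words is a plain set-membership test. Here is a minimal sketch on made-up data; `demo_stop_words` is a hypothetical mini set that plays the role of “stop_words”, so the example runs without downloading anything.

```python
# Hypothetical mini stop-word set; in the article this role is played by stop_words
demo_stop_words = {'the', 'a', 'and', 'of', 'indeed', 'jobs'}

# Made-up line that has already been cleaned
cleaned = "the analyst role of a lifetime and more jobs on indeed"

# Keep only the tokens that are not stop words
kept = [word for word in cleaned.split() if word not in demo_stop_words]
print(kept)
```

Because the lookup is against a set, the check stays fast even when the stop-word list grows to a few thousand entries.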

Visualizing The Data

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# WordCloud.generate expects one string, so join the cleaned lines
TEXT = ' '.join(clean_Text)

wc = WordCloud(stopwords=stop_words, background_color="white",
               colormap="Dark2", max_font_size=150,
               random_state=42)
wc.generate(TEXT)
f = plt.figure(figsize=(15, 7))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

This is the output of the word cloud. You can check whether the words make sense to you or whether some of them need to be removed. To remove words that do not make sense, it is enough to add them to “add_list”.

You can also analyze each job post independently of the others. That will give you insight into each individual job post.
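
For a per-post view, one option is to count words with `collections.Counter` instead of drawing a cloud. The sketch below is an assumption about how that could look, not part of the code above: the helper `top_words`, the two sample posts, and `demo_stop_words` are all made up for illustration.

```python
from collections import Counter


def top_words(lines, stop_words, n=3):
    """Return the n most frequent non-stop words across one post's cleaned lines."""
    counts = Counter(
        word
        for line in lines
        for word in line.split()
        if word not in stop_words
    )
    return counts.most_common(n)


# Made-up cleaned posts and a hypothetical mini stop-word set
demo_stop_words = {'and', 'with', 'in'}
post_a = ["python and sql", "sql reporting with python", "sql dashboards"]
post_b = ["spark pipelines", "pipelines in python"]

print(top_words(post_a, demo_stop_words))
print(top_words(post_b, demo_stop_words))
```

With the real data, you could call such a helper on the cleaned lines of each entry in “transcripts” and compare the top words from one post to the next.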