Indeed.com Web Scraping With Python
According to Indeed.com, "Indeed is the #1 job site in the world with over 250 million unique visitors every month. Indeed strives to put job seekers first, giving them free access to search for jobs, post resumes, and research companies. Every day, we connect millions of people to new opportunities."
During my job search, Indeed was one of my main sources for applying to jobs, so I decided to search it more intelligently. I told myself: I know web scraping, and I use Indeed almost every day for job searching, so why not keep my search data? Going further, I want to know which words or skills appear most frequently in job posts.
In Python, Beautiful Soup is the common library for web scraping, so I will use it here. I will also import urllib.request to open and read URLs, and pandas to store the downloaded data in a DataFrame.
I am looking for "Data Analyst" job titles in the Toronto area, so after entering this search on the Indeed website, I try to find the pattern of the URLs. This is the URL of the first page:
There are more pages; for the second page the URL looks like this:
https://ca.indeed.com/jobs?q=data+analyst&l=Toronto%2C+ON&radius=25&start=10
For the following pages, the start value at the end increases to 20, 30, and so on. I therefore use a URLS list to store the links of the search pages, like this:
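The pagination described above can be sketched as follows; the base URL matches the second-page URL shown earlier, and the choice of ten result pages is an assumption for illustration:

```python
# Indeed paginates search results with the "start" query parameter
# in steps of 10 (page 2 has start=10, page 3 has start=20, ...).
base_url = "https://ca.indeed.com/jobs?q=data+analyst&l=Toronto%2C+ON&radius=25"

URLS = []
for start in range(0, 100, 10):  # first 10 result pages (an assumption)
    if start == 0:
        URLS.append(base_url)  # the first page has no "start" parameter
    else:
        URLS.append(base_url + "&start=" + str(start))
```

Each entry in URLS is one search-results page that we will open and parse in the next step.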
The next step is to open each page, read its content, and store the link to each job posting. urlopen reads the HTML source, which we wrap in a soup object with BeautifulSoup; all links on the page go into a list named "all_links", and then a second for loop filters out the links that belong to job ads and stores them in the Links list.
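A minimal sketch of that filtering step is below. Because the live markup changes, the demo parses a small sample HTML string instead of a live page, and the "jk=" filter is an assumption about how Indeed marks job-ad links; on the real site you would pass the result of urlopen to the same function:

```python
from bs4 import BeautifulSoup

def collect_job_links(html):
    """Collect hrefs of job-ad anchors from one search-results page.

    Assumption: job-ad links contain a "jk=" (job key) parameter;
    adjust the filter to whatever pattern the live markup uses.
    """
    soup = BeautifulSoup(html, "html.parser")
    all_links = [a.get("href") for a in soup.find_all("a", href=True)]
    return [href for href in all_links if "jk=" in href]

# Offline demo on a sample page instead of a live request:
sample = """
<html><body>
  <a href="/rc/clk?jk=abc123">Data Analyst</a>
  <a href="/about">About</a>
  <a href="/rc/clk?jk=def456">Senior Data Analyst</a>
</body></html>
"""
Links = collect_job_links(sample)
```

On the live site, the outer loop would call `collect_job_links(urlopen(url).read())` for every url in URLS and extend the Links list with the result.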
Some companies have a "review" link on Indeed.com, and I would like to collect the links to those review pages. This function does that for me:
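A hedged sketch of such a function, again demonstrated on a sample HTML string; the "/cmp/…/reviews" href pattern is an assumption about how Indeed structures company-review URLs:

```python
from bs4 import BeautifulSoup

def review_link(html):
    """Return the company-review URL found on a job page, or None.

    Assumption: review links point at paths containing "/cmp/" and
    "reviews"; adapt the test to the live markup if it differs.
    """
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        if "/cmp/" in a["href"] and "reviews" in a["href"]:
            return a["href"]
    return None  # no review link on this page

# Offline demo:
sample = '<a href="/cmp/Acme-Corp/reviews">reviews</a><a href="/jobs">jobs</a>'
found = review_link(sample)
```

Returning None when no review link exists lets the caller store an empty value in the "Review" column instead of crashing on companies without reviews.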
I would also like to extract the position name and the job description. The code below does this task:
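The extraction can be sketched like this. It is a minimal, offline version: the title living in the `<title>` tag and the description sitting in a div whose id contains "jobDescription" are assumptions about Indeed's markup, and the demo parses a sample string rather than a live page:

```python
import re
from bs4 import BeautifulSoup

def parse_job_page(html):
    """Extract the job title and the description text from a job-ad page.

    Assumptions: the title is in the page's <title> tag, and the
    description is in a <div> whose id contains "jobDescription".
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    desc_div = soup.find("div", id=re.compile("jobDescription"))
    text = desc_div.get_text(" ", strip=True) if desc_div else ""
    return title, text

# Offline demo:
sample = """
<html><head><title>Data Analyst - Acme</title></head>
<body><div id="jobDescriptionText">Analyze data with SQL and Python.</div></body></html>
"""
title, text = parse_job_page(sample)

# Save the description to a ".txt" file named after the position,
# stripping characters that are unsafe in file names:
safe_name = re.sub(r"[^\w\- ]", "", title)
with open(safe_name + ".txt", "w", encoding="utf-8") as f:
    f.write(text)
```

In the full script, this runs inside the loop over the stored job links, filling the "Position" and "Review" columns as described next.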
This code iterates over the links we stored in the DataFrame. After reading each HTML page and storing it in a soup object, it extracts the job title into the "title" variable and stores it in the "Position" column of the DataFrame; the "Review" column holds the review links. The "text" variable contains the job description, which is saved on your machine in a ".txt" file named after the job position.
You can save the DataFrame with the job links, position names, and review links in a ".csv" file like this:
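For example, with a small hypothetical DataFrame in the shape described above (the column names and file name are assumptions):

```python
import pandas as pd

# Hypothetical DataFrame with the columns built in the previous steps:
df = pd.DataFrame({
    "Link": ["/rc/clk?jk=abc123"],
    "Position": ["Data Analyst"],
    "Review": ["/cmp/Acme-Corp/reviews"],
})

# index=False keeps the row index out of the file
df.to_csv("indeed_jobs.csv", index=False)
```

Reloading the file later with `pd.read_csv("indeed_jobs.csv")` gives the same three columns back for the analysis step.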
As I wrote at the beginning, I want to analyze the job descriptions next, for example to find word frequencies. That will be my next post on medium.com.