-
Notifications
You must be signed in to change notification settings - Fork 0
/
Lab book on scraping iworkforsa.Rmd
81 lines (56 loc) · 4.38 KB
/
Lab book on scraping iworkforsa.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
---
title: "Lab book on scraping iworkfor.sa.gov.au"
output:
github_document
always_allow_html: yes
---
#Purpose
This is a labbook detailing my efforts at scraping the jobs from [iworkfor.sa.gov.au](https://iworkfor.sa.gov.au/)
#Caveats
Its worth noting that this website only inlcudes *advertisments*. This is different to *jobs*!
It is different in that someone could (and often is) seconded into a position without an advertisment and is only employed in an advertised position after several years of already doing the job. Also, position titles change for all sorts of reasons as well as being added and removed so it would not be possible to infer tenure of job held from advertisments.
#Investigation
The site [iworkfor.sa.gov.au](https://iworkfor.sa.gov.au/) returns results wehn the `Search` button is clicked even if no values are entered in any of the fields. On 20210514 that was 678 jobs. I take that to be all the jobs currently listed.
However, only the first 20 jobs are listed on this page with a `Next` button allowing access to the next 20 jobs. Unfortunatly the URL does not change when the `Next` button is clicked. It would have been easier to scrape each page if each page could be accessed with a discreet URL. The unchanging URL suggests that the databse of jobs is being accessed using JavaScript and jQuery. Hence, to scrape all the jobs, it would be necessary to use `RSelenium` to trigger each page then to use `rvest` to scrape the HTML from each page.
This would return a dataframe such as:
```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
library(kableExtra)
tribble(
~`JOB TITLE`, ~`REFERENCE NO`, ~`POSTING DATE`, ~ AGENCY,
"SCHOOL AND PRESCHOOL JOBS", "295359", "30/10/2017", "DEPARTMENT FOR EDUCATION",
"ASO2 EMPLOYEE SERVICES OFFICER - TEMPORARY POOL", "394370", "06/08/2020", "DEPARTMENT FOR EDUCATION",
"ASO3 COMMUNICATIONS AND BUSINESS SUPPORT OFFICER - TEMPORARY POOL", "399845", "19/09/2020", "DEPARTMENT FOR EDUCATION"
) %>%
kable() %>%
kable_styling()
```
It is worth noting that:
* The factor `JOB TITLE` sometimes includes a standardised aspect such as the ASO level but also includes a free text aspect that is not standardised. This will mean that questions related to the `JOB TITLE` will have to be operationalised carefully because they are not neccesarily concistent, countable units.
* Some rows actually indicate multiple jobs through the use of terms such as "TEMPORARY POOL". It might make more sense to talk about each row as an *advertisment* ralther than a *job*.
Despite these limitations...
If collected as a one off event, this alone would be enough to answer questions such as:
* Which agencies are hiring the most?
* What are the differences in job titles between agencies (this could include both the standardised aspects such as ASO level and the free text of the position title)?
If this was collected systematically it would be enough to answer questions such as:
* When do different agencies recruit different job titles?
* What are the most frequently recruited for job titles?
But there is more information about each advertisment on the page for the advertisment accessed through the URL associated with the `JOB TITLE`. Reading the page:https://iworkfor.sa.gov.au/page.php?pageID=842&windowUID=0#report, it appears that the URL for each advertisment on each page can be collected with some regex in the form of ???.
The advertisment specific pages appear to follow a similar sturcture...but not always (see: https://jobs.education.sa.gov.au/page.php?pageID=786). If the page follows the most common structure it will be possible to pull out:
* Length of employment
* Salary
* Some notes similar to a position description
* Attachments which comtain the full position description
So, here are the stages
1. Step though the pages of all advertisments collecting
i. JOB TITL
ii. REFERENCE NO
iii. POSTING DATE
iv. AGENCY
v. URL of advertisment page
2. Details from the advertisment page including:
i. All the source so that useful bits can be extracted from it later
ii. The attachments (if they exist) at this point because they might not be accesible once the advertisment is no longer there.
# Break it down
## Get the details from the first page
I've followed the notes [here](http://joshuamccrain.com/tutorials/web_scraping_R_selenium.html). But, it seems that I might need to use Docker. This is probably a good thing to learn anyway.