Chapter 3 Data
3.1 Sources
The primary data set we have chosen to use is the NYC Jobs from data.gov. The data-set contains data from jobs listings around NYC. The data is collected from the City of New York’s official jobs website here.
The data was first published January 8th 2020, and was most recently updated October 25th, 2022. The NYC OpenData project maintains the data. The data is available for download via csv. Thus, we intend to download the csv file and upload it directly into R.
There are two reasons that we chose this dataset: Firstly, the data was easily downloadable in a form that could be imported directly into R with little pre-processing. Secondly, it had several variables that we could explore the relationship between.There are many interesting things to discover, such as what categories have the most jobs, what kinds of jobs tend to have a higher salary, whether a higher leveljob has a higher salary, which location tends to pay more.
3.2 Cleaning / transformation
In order to clean the data we first made sure that all of the variable names were in a form that were easy to reference, no spaces or capital letters. Then we made some simplified variables, so that analysis was not so complicated. These variables included job category, agency, and job level. We had to transform the dates into a form that R would recognize. Finally, we made a new variable based on if a job posting had required skills or not.
3.3 Missing value analysis
## Recruitment Contact Post Until Hours/Shift
## 5630 3917 3795
## Work Location 1 Additional Information Preferred Skills
## 3305 1172 871
## Full-Time/Part-Time indicator Minimum Qual Requirements Job Category
## 215 64 2
## Career Level To Apply Job ID
## 2 1 0
## Agency Posting Type # Of Positions
## 0 0 0
## Business Title Civil Service Title Title Classification
## 0 0 0
## Title Code No Level Salary Range From
## 0 0 0
## Salary Range To Salary Frequency Work Location
## 0 0 0
## Division/Work Unit Job Description Residency Requirement
## 0 0 0
## Posting Date Posting Updated Process Date
## 0 0 0
It seems that the most common missing variable is recruitment_contact
, with hours_shift
and post_until
also being quite common.