Clone the repository into a new folder, create a new virtual environment, and install the package:
git clone https://github.com/mizzle-toe/find-your-dream-job.git
cd find-your-dream-job
pyenv virtualenv fydjob-local
pyenv activate fydjob-local
pip install .
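To verify that the installation worked, a quick import check (it exits silently on success):
python -c "import fydjob"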
Start the Docker service and run:
docker run -e PORT=8000 -p 4050:8000 vladeng/find-your-dream-job:final
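The container serves on port 8000 internally, mapped to 4050 on the host in the command above. A quick reachability check (this only confirms the port answers; the actual endpoints depend on the backend):
curl http://localhost:4050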
Finally:
streamlit run fydjob/FindYourDreamJob.py
If you can't run Streamlit, try deleting the .streamlit folder in your home directory.
Install the package in development mode:
pip install -e .
The package name is fydjob. Example:
import fydjob.utils as utils
from fydjob.utils import tokenize_text_field
- Pull master and merge with your branch.
- Download jobs.db from Google_Drive/database.
- Place the file in find-your-dream-job/fydjob/database.
- cd into the main folder find-your-dream-job.
- Run:
python short-pipeline-run
To load the data:
from fydjob.NLPFrame import NLPFrame
df = NLPFrame().df
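A quick way to inspect what comes back (assuming df is a pandas DataFrame, which the tokenized columns suggest):
print(df.shape)            # rows x columns
print(df.columns.tolist()) # available fields, including the NLP columns
print(df.head())           # first few job offers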
The long pipeline (which will be supported by our package) works like this:
1. IndeedScraper: scrape jobs from Indeed.
2. IndeedProcessor: load scraped jobs and Kaggle data. Integrate, remove job offers with identical text, and export as a dataframe.
3. Database: populate the SQLite database. Ensure not to add duplicates.
4. Database: do a whole pass through the database, removing duplicates according to our set similarity measure (long process, up to 30 minutes).
5. NLPFrame: export the database to a dataframe (ndf), add NLP processing columns (such as tokenized fields).
6. Apply NLP algorithms to the dataframe, export results.
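A rough sketch of how these stages fit together in code. Only the IndeedProcessor and NLPFrame calls are taken from this README; the Database steps are left as comments because their actual interface is not shown here:
from fydjob.IndeedProcessor import IndeedProcessor
from fydjob.NLPFrame import NLPFrame

# Stage 1: scrape jobs from Indeed (run separately: python -m fydjob.IndeedScraper)

# Stage 2: integrate scraped jobs and Kaggle data, drop offers with identical text
ip = IndeedProcessor()

# Stages 3-4: populate the SQLite database and run the similarity sweep
# (hypothetical calls -- the real Database interface may differ)
# db = Database()
# db.insert(...)
# db.remove_similar_duplicates()

# Stage 5: export the database to a dataframe with NLP columns
df = NLPFrame().df

# Stage 6: apply NLP algorithms to df and export the results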
The short pipeline starts at stage 5: we will deploy our current database to the backend, and stages 5-6 will run on the server.
Upon scraping new job offers, they should be processed, inserted into the database, and a new similarity sweep should be executed.
IndeedScraper scrapes job offers. To use it, download chromedriver from the Google Drive folders and place it in drivers/.
The scraper supports Indeed API parameters. When not specified, the defaults are:
start = 0 #the job offer at which to start
filter = 1 #the API tries to filter out duplicate postings
sort = 'date' #get the newest job offers (alternative is 'relevant')
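If the scraper is used from Python rather than the command line, these parameters could presumably be overridden when it is instantiated. The call below is purely a hypothetical illustration of that idea, not the actual constructor signature:
# hypothetical -- check fydjob/IndeedScraper.py for the real interface
# from fydjob.IndeedScraper import IndeedScraper
# scraper = IndeedScraper(start=0, filter=1, sort='relevant')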
To run the scraper:
pip install -r requirements.txt
python -m fydjob.IndeedScraper
Input the job title, location, and a limit on the number of job offers to extract. Output is saved in fydjob/output/indeed_scrapes/. The filename format is jobtitle_location_date_limit.
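For example, a scrape for data scientist jobs in Berlin with a limit of 100 offers would yield a name along the lines of data_scientist_berlin_<date>_100 (illustrative; the exact separators and date format depend on the scraper).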
IndeedProcessor loads the JSON files from fydjob/output/indeed_scrapes and the Kaggle file from fydjob/output/kaggle. It joins the dataframes and applies basic preprocessing. To run it as a script:
python -m fydjob.IndeedProcessor
To use it as a class:
from fydjob.IndeedProcessor import IndeedProcessor
ip = IndeedProcessor()
Output is saved in fydjob/output/indeed_proc.
The skills dictionary is assembled as follows: the spreadsheet is downloaded as an Excel file and placed in fydjob/data/dicts/skills_dict.xlsx. Then:
from fydjob import utils
utils.save_skills() #extracts skills and saves them in JSON
utils.load_skills() #loads the skills from JSON file
This is just the setup. If you haven't changed the pipeline, just run utils.load_skills to get the skills.
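For instance, a minimal check after the one-time setup (assuming load_skills returns a standard Python collection such as a list or dict):
from fydjob import utils
skills = utils.load_skills()   # loads the skills from the JSON file
print(len(skills))             # how many skill entries were loaded
print(list(skills)[:5])        # peek at the first few entries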