- This repo crawls data about IT jobs from TopCV.vn (IT Jobs category)
- Data can be crawled from a specific webpage or consecutive webpages
- requests
- beautifulsoup4
- In a bash shell, run `python3 crawler.py a b`, where `a` and `b` are page indexes
- This command crawls consecutive pages, from page `a` to page `b`
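
The command-line usage above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `crawler.py`: the function names, the `url_template` parameter, and the assumption that page `b` is included in the range are all hypothetical. The download step is kept behind a small `fetch` callable so the page-range logic can be tried without network access.

```python
import argparse


def parse_args(argv=None):
    """Read the two positional page indexes a and b from the command line."""
    parser = argparse.ArgumentParser(description="Crawl pages a..b.")
    parser.add_argument("a", type=int, help="index of the first page")
    parser.add_argument("b", type=int, help="index of the last page")
    args = parser.parse_args(argv)
    if args.a > args.b:
        parser.error("a must not be greater than b")
    return args.a, args.b


def page_urls(a, b, url_template):
    """Build one URL per page; assumes b is inclusive and the template
    contains a {page} placeholder (hypothetical, not the real site URL)."""
    return [url_template.format(page=p) for p in range(a, b + 1)]


def fetch_html(url):
    """Download one page with requests (imported lazily, so the rest of
    the sketch works even when requests is not installed)."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(a, b, url_template, fetch=fetch_html):
    """Fetch every page from a to b and return a {url: html} mapping."""
    return {url: fetch(url) for url in page_urls(a, b, url_template)}
```

The returned HTML could then be parsed with `beautifulsoup4`, for example with `BeautifulSoup(html, "html.parser")`; which elements to extract depends on the page markup, which is not described here.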
- Use `run.sh` to start crawling
- This bash script runs 14 threads simultaneously
- Each thread crawls a block of about 10 consecutive pages (1-9, 10-19, 20-29, ...) and saves the result to a file named `recruit_a_b.json` (so there are 14 files in total)
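
The page blocks and output file names described above can be illustrated with a short Python sketch. It is not the contents of `run.sh`: the helper names are hypothetical, the blocks beyond 20-29 are extrapolated from the listed pattern, and the launcher uses separate processes via `subprocess` rather than threads.

```python
import subprocess


def page_ranges(n_blocks=14, size=10):
    """Return (a, b) page blocks following the pattern 1-9, 10-19, 20-29, ...
    The first block starts at page 1, so it holds one page fewer."""
    ranges = []
    for i in range(n_blocks):
        start = max(1, i * size)
        end = i * size + size - 1
        ranges.append((start, end))
    return ranges


def output_name(a, b):
    """File name used for the block of pages a..b."""
    return f"recruit_{a}_{b}.json"


def launch_all(ranges, script="crawler.py"):
    """Start one crawler process per block, then wait for all of them.
    Assumes `python3 <script> a b` is the command described in this README."""
    procs = [
        subprocess.Popen(["python3", script, str(a), str(b)])
        for a, b in ranges
    ]
    return [p.wait() for p in procs]
```

Calling `launch_all(page_ranges())` would start all 14 crawls at once and return their exit codes once every one has finished.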
The crawled data is stored in this repo