OkCupid provides self-descriptions, selfies, and large questionnaires that are useful for anyone interested in psychometrics. This project shows how to download the data of thousands of users.
Log in with your OkCupid account and export your cookies with the Get cookies.txt extension. Place the resulting okcupid.com_cookies.txt file in the scraper root folder. Replace the chromedriver file with the one that matches your OS and Chrome version.
Then install the required Python packages:
- python -m pip install -r requirements.txt
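If you want to check that the exported cookies file is readable before running the scraper, the sketch below may help. It assumes the file is in the Netscape cookies.txt format and reads it with the standard library's `MozillaCookieJar`. The helper names `load_cookies` and `cookies_for_domain` are hypothetical and are not part of this repository; the scraper's own scripts may load the file differently.

```python
from http.cookiejar import MozillaCookieJar


def load_cookies(path="okcupid.com_cookies.txt"):
    """Load a Netscape-format cookies.txt file (hypothetical helper)."""
    jar = MozillaCookieJar(str(path))
    # Keep session cookies and cookies whose expiry date has passed,
    # so the result reflects everything that is in the file.
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar


def cookies_for_domain(jar, domain_suffix="okcupid.com"):
    """Return a name -> value dict for cookies whose domain ends with domain_suffix."""
    return {
        cookie.name: cookie.value
        for cookie in jar
        if cookie.domain.lstrip(".").endswith(domain_suffix)
    }


# Example usage (assumes the file sits in the current folder):
# jar = load_cookies()
# print(sorted(cookies_for_domain(jar)))
```

If loading raises `http.cookiejar.LoadError`, the file is probably not in the expected format and should be exported again.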
This scraper has two scripts. The first one downloads the profile data (except the questions) of every user it can find by swiping in the OkCupid web app. The second one goes through the scraped users and downloads their answered questions.
By using this script and changing your profile details, such as gender, sexual orientation, and location, you can scrape nearly all OkCupid users in a given location.
You can run it like this; user data will be downloaded into the users folder:
- python users_by_discover.py
You can also try the users_by_question.py script, which searches for users that answered specific questions. questions.csv contains nearly all OkCupid questions, so I simply searched for every possible question. In practice, users_by_discover.py was more effective at downloading large numbers of users.
You can run it like this; users' answers will be downloaded into the answers folder:
- python users_by_question.py
In the testing.ipynb notebook you can find some examples of how to process the data. User data is downloaded as HTML, so I use BeautifulSoup to parse it and extract the relevant information. Users' questions are in JSON format, so they are easier to process.
This source code was developed by Mathias Gatti (@mathigatti). If you publish something that uses it, remember to mention this project. For scientific publications, you can cite it like this in APA format:
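As a rough starting point for the JSON side, the sketch below reads every file in the answers folder into a dictionary. It assumes one `.json` file per user, which is a guess about the output layout rather than something confirmed here, and it makes no assumption about the fields inside each file. The function name `load_answers` is hypothetical; see testing.ipynb for the actual processing examples.

```python
import json
from pathlib import Path


def load_answers(folder="answers"):
    """Load every *.json file in `folder`, keyed by file name without extension.

    Assumes one JSON file per user (an assumption about the scraper's output).
    """
    answers = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            answers[path.stem] = json.load(handle)
    return answers


# Example usage:
# answers = load_answers()
# print(len(answers), "users loaded")
```

From there, the parsed objects can be inspected interactively to see which fields are worth extracting.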
Gatti, M. (2022). mathigatti/okCupidScraper: v1.0.0 (Version v1.0.0) [Computer software]. https://doi.org/10.5281/zenodo.5889263
So far I have only used it to scrape self-descriptions and train an AI to generate new ones. You can check more about it here.