- Download the program files: Click on '<> Code' and then on 'Download ZIP'
- Place the 'GEO-scraper_main.zip' file on the installation directory of your choice.
- Chang into the installation directory and unzip the file (unzip GEO-scraper_main.zip)
- Install necessary packages: selenium, beutifulsoup4, requests and lxml (pip install package)
- This program is suitable to scrape data from any list of studies found after a Gene Expression Omnibus (GEO) search. The input file must be one or more html files with search results. Prepare the file as following:
- 1 - Perform the GEO search with the proper key words or GEO accessions (https://www.ncbi.nlm.nih.gov/geo/) + apply desired filters.
- 2 - If the search returns more than 20 results, click on '20 per page' and change the items per page as necessary.
- 3 - Right click on the page, select 'Save Page As' and save the page as .html.
- 4 - If the search returns more than 500 results you can save more than one page with another name and use it along the first one.
- 1 - Add the GEO search results html files in the input directory (GEO_scraper-main/input)
- 2 - Run the main.py file (python3 main.py)
- 3 - A .csv file with basic information about these studies will be created in output folder
Currently, there are 3 options you can set when running the script - verbose, quick-mode and output name.
- output name (-o): Use it to define the output name. Example: "-o My_data" will produce the My_data.csv output file. If not specified, the output file will be named 'output.csv'.
- verbose (-v): Use it if you want the terminal to print the information while you are scraping. This doesn't alter the execution time or the final output.
- quick-mode (-q): Use it if you want the program to run on quick mode, i.e., gather only basic information (doesnt collect data on Tissue type, Cell type and Race/Ethnicity/Ancestry from the study samples). This considerably reduces the execution time but excludes relevant study information.