The web app is live at acadsearch.pythonanywhere.com (currently down); it is recommended to open it in desktop mode for a better user experience.
- Motivation
- Proposal
- What we have built
- Working Snapshots
- High Level Design
- Modules of Search Engine
- How to Install and Run
- Work Flow
- Evaluation of Search Engine
- Detailed Report
- Presentation Slides and Video
- Code Directory Structure
- Future Work
- References and Credits
Students often need to search for professors based on criteria such as name, university, research topics, or top cited papers, and to rank them by factors like citations or h-index. A simple Google search may not let you first shortlist professors who do research in "adversarial machine learning" and then rank them by the number of citations they have received in the last 5 years.
Our proposal can be accessed from here
We have developed a search engine that can cater to the needs of students looking for professors to approach for projects, internships or jobs. The search engine allows users to search for professors based on name, university, research areas and paper titles using 3 different retrieval methods. The engine also allows users to sort the search results based on criteria like h-index, citations in the last 5 years etc. We have deployed the search engine publicly as a web application, and also evaluated its performance in terms of time and quality of results.
This module uses the list of Google Scholar IDs of professors (from CSRankings, split across 10 files), scrapes data from their Google Scholar pages, and stores it as CSV.
This module cleans the scraped data from the previous module and stores it as CSV.
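As a rough illustration of such a cleaning pass (the actual logic in cleaning_data.py is not shown here; the column names below are hypothetical), one might normalize whitespace and drop rows missing a mandatory field:

```python
import csv
import io

def clean_rows(reader):
    """Yield normalized rows, skipping any with an empty name field.
    The column names ('name', 'affiliation') are illustrative only."""
    for row in reader:
        # Collapse runs of whitespace and strip leading/trailing spaces.
        row = {k: " ".join(v.split()) for k, v in row.items()}
        if row.get("name"):
            yield row

raw = "name,affiliation\n  Ada   Lovelace ,  Analytical Engine U.\n,Unknown\n"
cleaned = list(clean_rows(csv.DictReader(io.StringIO(raw))))
print(cleaned)  # the row with an empty name is dropped
```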
This module uses the cleaned data to build two inverted indices (first for name and affiliation, second for topics and paper titles) and stores them as JSON.
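As a sketch of the underlying technique (the actual index layout produced by build_index.py may differ), an inverted index maps each token to the IDs of the documents containing it:

```python
import json
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to a sorted list of document IDs.
    `docs` maps a document ID to its text; tokenization here is a
    plain whitespace split, unlike the project's NLTK-based pipeline."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    # Convert sets to sorted lists so the index is JSON-serializable.
    return {tok: sorted(ids) for tok, ids in index.items()}

docs = {"p1": "adversarial machine learning", "p2": "machine translation"}
index = build_inverted_index(docs)
print(json.dumps(index, indent=2))
```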
This module receives query information from the "Backend", processes it and returns an ordered list of professors depending upon the specifications provided by the user.
This module forwards the user's query to the "Query Processing and Ranking" module and acts as an intermediary with the user's browser.
The user types a query, specifies the retrieval method (Boolean, phrase, or TF-IDF) and the context in which they want the results (names and affiliations, or topics and paper titles).
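For illustration, Boolean AND retrieval over an inverted index intersects the posting lists of all query tokens (a sketch of the general technique, not the exact logic of boolean.py):

```python
def boolean_and(index, query):
    """Return the IDs of documents containing every token of the query.
    `index` maps token -> list of document IDs (an inverted index)."""
    postings = [set(index.get(tok, ())) for tok in query.lower().split()]
    if not postings:
        return set()
    # A document matches only if it appears in every token's posting list.
    return set.intersection(*postings)

index = {"adversarial": ["p1"], "machine": ["p1", "p2"], "learning": ["p1", "p3"]}
print(boolean_and(index, "machine learning"))  # only documents with both tokens
```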
This module is not part of the main pipeline of the search engine. It computes and plots various statistics of the cleaned dataset.
It generates its own queries, runs the queries using "Querying and Ranking" module, evaluates search results and plots the evaluation metrics. This module is also not part of the main pipeline.
If the NLTK tokenizer and stopword data are not already downloaded, run nltk.download('punkt')
and nltk.download('stopwords')
in a Python program or IPython session.
This repository contains data files inside the folder data/
whose size exceeds 100 MB; such files are tracked using Git LFS. Hence, install Git LFS before cloning.
Note - Run the commands for the modules in the given order, since each module uses output files from the previous one. All commands should be run inside the directory specified for each module.
cd scraping/
python scrape_prof_data.py
This module scrapes data from Google Scholar pages, taking input from ./data/csrankings-{x}.csv
and outputting scraped data to ./data/professor_data-{x}.csv
, where x
varies from 0 to 9. Scraping can take a very long time, so we recommend not running these commands and instead using the already scraped data directly. In the commands below, python
should be changed to python3
if using Ubuntu/Linux.
cd cleaning/
python cleaning_data.py
This module cleans the scraped data, taking input from ./data/professor_data-{x}.csv
and outputting cleaned data to ./data/professor_data-{x}-cleaned.csv
.
cd indexing/
python build_index.py
This module takes as input the files ./data/professor_data-{x}-cleaned.csv
and builds the indices ./data/name_and_affiliation_index_full.json
and ./data/topic_and_paper_index_full.json
.
cd querying/
python compute_tf_idf.py
This module takes as input the files ./data/topic_and_paper_index_full.json
and ./data/metadata.csv
and computes TF-IDF values for every existing document-term pair. It then outputs these values to the file ./data/tf_idf_scores_topic_and_paper_full.json
, which is used when querying with the TF-IDF retrieval method.
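A common TF-IDF formulation (the exact weighting used by compute_tf_idf.py may differ) scores each existing document-term pair as tf(t, d) * log(N / df(t)):

```python
import math

def tf_idf(index, doc_lengths):
    """Compute TF-IDF scores from an inverted index with term frequencies.
    `index` maps term -> {doc_id: raw count}; `doc_lengths` maps
    doc_id -> total token count. This weighting scheme is an assumption,
    not necessarily the one used in the project."""
    n_docs = len(doc_lengths)
    scores = {}
    for term, postings in index.items():
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, count in postings.items():
            tf = count / doc_lengths[doc_id]    # normalized term frequency
            scores[(term, doc_id)] = tf * idf
    return scores

index = {"learning": {"p1": 2, "p2": 1}, "adversarial": {"p1": 1}}
scores = tf_idf(index, {"p1": 10, "p2": 5})
```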
cd web_server
python -m flask run
This module runs the web app on localhost (127.0.0.1:5000); the user can now interact with the search engine. This completes the main pipeline of the search engine. The next two modules compute statistics from the data and evaluate the search engine, and are not part of the main pipeline.
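For orientation, a minimal Flask endpoint of the kind server.py might expose could look as follows (the route name and query parameters here are hypothetical, not the project's actual API):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search")
def search():
    # The parameters "q" and "method" are illustrative only.
    query = request.args.get("q", "")
    method = request.args.get("method", "boolean")
    # In the real app, this would call the Querying and Ranking module
    # and return the ordered list of matching professors.
    return jsonify({"query": query, "method": method, "results": []})
```

With the FLASK_APP environment variable pointing at this file, `python -m flask run` serves it on 127.0.0.1:5000.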
The live version of the web app at acadsearch.pythonanywhere.com currently runs on 80% of the full professor dataset due to memory limits on the hosting platform. The full version can be run on localhost using the instructions above.
cd evaluation
python evaluate.py
This module queries and evaluates the search engine using the Querying module. It evaluates median rank, recall rate, and average time per query. The plots ./evaluation/average_query_time.png
, ./evaluation/median_rank.png
, and ./evaluation/recall_rate.png
are generated as output.
cd data_statistics
python compute_statistics.py
This module takes as input the data files present in ./data
and computes their statistics. The plots ./data_statistics/plots/1.png
through ./data_statistics/plots/5.png
are generated as output.
Consider a user who has a particular professor in mind. Using some information about that professor, such as name, affiliation, or the title of one of their papers, and choosing an appropriate querying method (described in Querying and Ranking), the user searches and gets results. We define the rank as the position at which the professor the user had in mind appears in the search results. Here, that professor is the ground truth.
We randomly sampled 500 professors and queried the search engine using the search-query and retrieval-method pairs given below. The appropriate index (name and affiliation, or research topics and paper titles) was used for each combination. Since we already knew each professor's unique ID before querying, we look up that ID in the matched professor IDs returned by the Querying and Ranking module.
| Search Query | Retrieval Method | Index Used | Pair Label in Plot |
|---|---|---|---|
| Professor Name | Boolean AND | Name and Affiliation | N, B |
| Professor Name | Phrase Retrieval | Name and Affiliation | N, Ph |
| Affiliation | Boolean AND | Name and Affiliation | A, B |
| Affiliation | Phrase Retrieval | Name and Affiliation | A, Ph |
| Paper Title* | Boolean AND | Research Topics and Paper Title | P, B |
| Paper Title* | Phrase Retrieval | Research Topics and Paper Title | P, Ph |
| Paper Title* | TF-IDF | Research Topics and Paper Title | P, T |
*The paper title for a professor is chosen randomly from their available papers.
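The median-rank and recall computation described above can be sketched as follows (the result format and the cutoff k are assumptions for illustration):

```python
import statistics

def evaluate(results_per_query, truths, k=10):
    """`results_per_query[i]` is the ranked list of professor IDs returned
    for query i; `truths[i]` is that query's ground-truth ID. Recall@k is
    the fraction of queries whose ground truth appears in the top k."""
    ranks, hits = [], 0
    for ranked, truth in zip(results_per_query, truths):
        if truth in ranked:
            rank = ranked.index(truth) + 1  # 1-based position in results
            ranks.append(rank)
            if rank <= k:
                hits += 1
    median_rank = statistics.median(ranks) if ranks else None
    recall_at_k = hits / len(truths)
    return median_rank, recall_at_k

median, recall = evaluate([["a", "b"], ["c", "b"]], ["b", "b"], k=1)
```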
The report can be accessed from here
- The presentation slides can be accessed from here (PDF) and here (PPTX)
- The video can be accessed from here
├── Proposal.pdf
├── README.md
├── Report.pdf
├── Slides.pdf
├── Slides.pptx
├── cleaning
│ └── cleaning_data.py
├── data
│ ├── csrankings-0.csv
│ ├── csrankings-1.csv
│ ├── csrankings-2.csv
│ ├── csrankings-3.csv
│ ├── csrankings-4.csv
│ ├── csrankings-5.csv
│ ├── csrankings-6.csv
│ ├── csrankings-7.csv
│ ├── csrankings-8.csv
│ ├── csrankings-9.csv
│ ├── metadata.csv
│ ├── name_and_affiliation_index_full.json
│ ├── professor_data-0-cleaned.csv
│ ├── professor_data-0.csv
│ ├── professor_data-1-cleaned.csv
│ ├── professor_data-1.csv
│ ├── professor_data-2-cleaned.csv
│ ├── professor_data-2.csv
│ ├── professor_data-3-cleaned.csv
│ ├── professor_data-3.csv
│ ├── professor_data-4-cleaned.csv
│ ├── professor_data-4.csv
│ ├── professor_data-5-cleaned.csv
│ ├── professor_data-5.csv
│ ├── professor_data-6-cleaned.csv
│ ├── professor_data-6.csv
│ ├── professor_data-7-cleaned.csv
│ ├── professor_data-7.csv
│ ├── professor_data-8-cleaned.csv
│ ├── professor_data-8.csv
│ ├── professor_data-9-cleaned.csv
│ ├── professor_data-9.csv
│ ├── tf_idf_scores_topic_and_paper_full.json
│ └── topic_and_paper_index_full.json
├── data_statistics
│ ├── compute_statistics.py
│ └── plots
│ ├── 1.png
│ ├── 2.png
│ ├── 3.png
│ ├── 4.png
│ └── 5.png
├── evaluation
│ ├── average_query_time.png
│ ├── evaluate.py
│ ├── median_rank.png
│ └── recall_rate.png
├── flow-chart.png
├── helper_functions
│ ├── __pycache__
│ │ └── common_functions.cpython-37.pyc
│ └── common_functions.py
├── high-level-architecture.png
├── indexing
│ └── build_index.py
├── querying
│ ├── __pycache__
│ │ ├── boolean.cpython-37.pyc
│ │ ├── default_rankings.cpython-37.pyc
│ │ └── get_tf_idf.cpython-37.pyc
│ ├── boolean.py
│ ├── compute_tf_idf.py
│ ├── default_rankings.py
│ └── get_tf_idf.py
├── scraping
│ └── scrape_prof_data.py
├── snapshot-1.png
├── snapshot-2.png
└── web_server
├── __pycache__
│ ├── read_information.cpython-37.pyc
│ └── server.cpython-37.pyc
├── images
│ ├── placeholder.svg
│ └── search.png
├── read_information.py
├── server.py
└── templates
└── index.html
- Scraping data from homepages of professors and universities, periodically.
- Making a directed graph using citations, e.g. if a professor (in one of their papers) cites another professor's paper, that can be a directed edge. This graph can then be used to implement PageRank.
- Improving user experience by adding search history and providing suggestions based on collaborative filtering.
- Making the default ranking metric (for Phrase and Boolean Retrieval) learnable based on user feedback on search results.
- Evaluating the search engine with real users.
This project was made as part of the project component of the course CS-328: Introduction to Data Science, offered at IIT Gandhinagar in Semester II of AY 2020-21, under the guidance of Prof. Anirban Dasgupta.
- Berger, E. (2017). GitHub Repository. emeryberger/CSRankings.
- Rajaraman, A.; Ullman, J.D. (2011). "Data Mining" (PDF). Mining of Massive Datasets. pp. 1–17.