The project follows the standard sbt directory layout. Source files reside in the `src/main/scala` directory:
- `Indexer.scala` - Contains the indexing logic.
- `utils/Functions.scala` - Contains utility functions such as `vectorize_text()` and `normalize_word()`.
- `Ranker.scala` - Contains the ranking logic, including interactive querying.
- `RelevanceAnalizator.scala` - Contains the ranking functions, namely simple inner product and BM25.
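To make the two ranking functions concrete, here is a minimal sketch of a simple inner product and of BM25 over sparse term-frequency vectors. It is not the code from `RelevanceAnalizator.scala`: the object name `ScoringSketch`, the `TermVector` type, every parameter name, the defaults `k1 = 1.2` and `b = 0.75`, and the non-negative IDF variant are all assumptions chosen for illustration.

```scala
// Illustrative sketch only; names and signatures are hypothetical,
// not taken from the project's RelevanceAnalizator.scala.
object ScoringSketch {
  // Assumed sparse representation: term -> weight (e.g. term frequency).
  type TermVector = Map[String, Double]

  /** Simple inner product of a sparse query vector and a sparse document vector. */
  def innerProduct(query: TermVector, doc: TermVector): Double =
    query.iterator.map { case (term, weight) => weight * doc.getOrElse(term, 0.0) }.sum

  /** Okapi BM25 score of one document for a query.
    * Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).
    * Assumes avgDocLength > 0.
    */
  def bm25(
      queryTerms: Seq[String],
      docTf: TermVector,          // term frequencies of the document
      docLength: Double,          // number of tokens in the document
      avgDocLength: Double,       // average document length in the corpus
      docFreq: Map[String, Long], // number of documents containing each term
      numDocs: Long,              // total number of documents
      k1: Double = 1.2,           // assumed default
      b: Double = 0.75            // assumed default
  ): Double =
    queryTerms.distinct.map { term =>
      val tf = docTf.getOrElse(term, 0.0)
      val df = docFreq.getOrElse(term, 0L).toDouble
      val idf = math.log(1.0 + (numDocs - df + 0.5) / (df + 0.5))
      // Length normalisation: long documents are penalised relative to the average.
      val norm = tf + k1 * (1.0 - b + b * docLength / avgDocLength)
      idf * tf * (k1 + 1.0) / norm
    }.sum
}
```

In this sketch a query term that is absent from the document contributes zero to either score, so only terms shared by the query and the document affect the ranking.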
Simply run `sbt package` in the root directory of the project. The resulting jar file will be `target/scala-2.11/searchengine_2.11-0.1.jar`.
First, run the Indexer application to create the index data and save it to a path. Then run the Ranker on the indexed data.
The Indexer is typically run in this format:
`spark-submit --master yarn --class Indexer <jar-file> <input-path> <output-path>`
Example:
`spark-submit --master yarn --class Indexer searchengine_2.11-0.1.jar /EnWikiMedium IndexDir`
Run with the `-h` argument to see the full help message.
The Ranker is typically run in this format:
`spark-submit --master yarn --class Ranker <jar-file> -i <index-path> <ranker-method> <search-query>`
Here `<ranker-method>` can be either `inner` or `bm25`.
Example:
`spark-submit --master yarn --class Ranker searchengine_2.11-0.1.jar -i IndexDir bm25 Game of Thrones`
Once the results of the first query are shown, the application prompts for the next query.
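The interactive behaviour can be pictured as a simple read-rank-print loop. The following is a rough sketch rather than the actual code in `Ranker.scala`: the object `QueryLoopSketch`, the `rank` callback, the `query> ` prompt, the top-10 cutoff, and stopping on an empty line or end of input are all assumptions made for illustration.

```scala
import scala.io.StdIn

// Illustrative sketch only; not the project's Ranker implementation.
object QueryLoopSketch {
  /** Reads queries until an empty line or end of input (an assumed stop condition),
    * prints the top results for each one, and returns the number of queries served.
    *
    * @param rank     hypothetical function mapping a query to (title, score) pairs,
    *                 already sorted by descending score
    * @param readLine source of input lines; defaults to prompting on stdin
    */
  def run(
      rank: String => Seq[(String, Double)],
      readLine: () => String = () => StdIn.readLine("query> ")
  ): Int = {
    var served = 0
    var line = readLine()
    // StdIn.readLine returns null at end of input.
    while (line != null && line.trim.nonEmpty) {
      rank(line.trim).take(10).foreach { case (title, score) =>
        println(f"$score%.4f  $title")
      }
      served += 1
      line = readLine()
    }
    served
  }
}
```

Passing the line reader as a parameter keeps the loop testable without a terminal: a test can supply an iterator of canned queries in place of standard input.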
Run with the `-h` argument to see the full help message.