This is the group assignment for CDS504. We have implemented five MapReduce programs in this assignment. These programs can be executed on a local system or on a Hadoop MapReduce system.
To run the programs on a local system or on Hadoop, two libraries need to be installed using the commands below. The first library, mrjob, is the MapReduce framework used to run Hadoop Streaming jobs. The second library, python-dateutil, is used to convert a date to its day of the week, for example 2020/05/07 to Thursday.
pip install mrjob
pip install python-dateutil
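As an illustration of the date-to-day conversion mentioned above, here is a minimal sketch. The helper name weekday_name is hypothetical and is not taken from the assignment code; the fallback to the standard library is only there so the snippet still runs if python-dateutil is not installed.

```python
from datetime import datetime

try:
    # python-dateutil, installed with the pip command above
    from dateutil import parser as date_parser
except ImportError:
    date_parser = None  # fall back to the standard library


def weekday_name(date_text):
    """Return the weekday name for a date string such as '2020/05/07'.

    Hypothetical helper for illustration only.
    """
    if date_parser is not None:
        parsed = date_parser.parse(date_text)
    else:
        parsed = datetime.strptime(date_text, "%Y/%m/%d")
    return parsed.strftime("%A")


print(weekday_name("2020/05/07"))  # Thursday
```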
In pivot, we use the "BreadBasket_DMS.csv" dataset, which is pivoted and preprocessed so that it can be used by the other MapReduce programs. From this step, we get three output text files: support.txt, confidence.txt and lift.txt. The pivot MapReduce program can be executed with the following commands.
Running on a local system:
python BreadBasket_DMS.csv -q
Running on a Hadoop system:
python -r hadoop --hadoop-streaming-jar /hadoop-2.10.0/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar BreadBasket_DMS.csv
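One plausible reading of the pivot step is that it groups the rows of the CSV file into one basket of items per transaction. The sketch below shows that idea in plain Python, without mrjob, so that the map and reduce phases are easy to follow. The column order (Date, Time, Transaction, Item), the sample rows and the function names are all assumptions made for illustration; the actual pivot program and its output format may differ.

```python
import csv
from collections import defaultdict

# Hypothetical sample rows; the column order Date,Time,Transaction,Item
# is an assumption about the dataset layout.
SAMPLE_LINES = [
    "2016-10-30,09:58:11,1,Bread",
    "2016-10-30,10:05:34,2,Coffee",
    "2016-10-30,10:05:34,2,Coffee",
    "2016-10-30,10:07:57,3,Tea",
    "2016-10-30,10:07:57,3,Bread",
]


def map_line(line):
    """Map phase: emit (transaction_id, item) for one CSV row."""
    _date, _time, transaction, item = next(csv.reader([line]))
    yield transaction, item


def reduce_items(transaction, items):
    """Reduce phase: turn one transaction's items into a sorted basket
    with duplicates removed."""
    return transaction, sorted(set(items))


def pivot(lines):
    grouped = defaultdict(list)  # stands in for the shuffle step
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_items(key, values) for key, values in grouped.items())


baskets = pivot(SAMPLE_LINES)
print(baskets)
```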
In support, we use the "support.txt" dataset file as the input to our MapReduce program. For testing purposes, we have used a small dataset file called "support_test.txt". The support MapReduce program can be executed with the following commands.
Running on a local system:
python support.txt -q
Running on a Hadoop system:
python -r hadoop --hadoop-streaming-jar /hadoop-2.10.0/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar support.txt
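For reference, the support of an item is usually defined as the fraction of transactions that contain it. The sketch below computes it in plain Python in a map/reduce style. The basket data and function names are hypothetical, and the assignment's own program may read a different input format.

```python
from collections import Counter

# Hypothetical baskets, one list of items per transaction.
BASKETS = [
    ["Bread", "Coffee"],
    ["Coffee"],
    ["Bread", "Coffee", "Tea"],
    ["Tea"],
]


def map_basket(basket):
    """Map phase: emit (item, 1) once per distinct item in a basket."""
    for item in set(basket):
        yield item, 1


def support(baskets):
    """Reduce phase: sum the counts per item and divide by the number
    of transactions."""
    counts = Counter()
    for basket in baskets:
        for item, one in map_basket(basket):
            counts[item] += one
    total = len(baskets)
    return {item: count / total for item, count in counts.items()}


item_support = support(BASKETS)
print(item_support)
```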
In confidence, we use the "confidence.txt" dataset file as the input to our MapReduce program. For testing purposes, we have used a small dataset file called "confidence_test.txt". The confidence MapReduce program can be executed with the following commands.
Running on a local system:
python confidence.txt -q
Running on a Hadoop system:
python -r hadoop --hadoop-streaming-jar /hadoop-2.10.0/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar confidence.txt
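The confidence of a rule A -> B is commonly defined as the number of transactions containing both A and B divided by the number of transactions containing A. The following plain-Python sketch computes it for every ordered pair of items. The basket data and the function name are hypothetical and are not taken from the assignment code.

```python
from collections import Counter
from itertools import permutations

# Hypothetical baskets, one list of items per transaction.
BASKETS = [
    ["Bread", "Coffee"],
    ["Coffee"],
    ["Bread", "Coffee", "Tea"],
    ["Tea"],
]


def confidence(baskets):
    """Return {(a, b): confidence of the rule a -> b} for all ordered
    pairs of items that occur together at least once."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        # every ordered pair (a, b) with a != b found in this basket
        pair_counts.update(permutations(sorted(items), 2))
    return {
        (a, b): count / item_counts[a]
        for (a, b), count in pair_counts.items()
    }


rules = confidence(BASKETS)
print(rules[("Bread", "Coffee")])  # 1.0
print(rules[("Tea", "Coffee")])    # 0.5
```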
In lift, we use the "lift.txt" dataset file as the input to our MapReduce program. For testing purposes, we have used a small dataset file called "lift_test.txt". The lift MapReduce program can be executed with the following commands.
Running on a local system:
python lift.txt -q
Running on a Hadoop system:
python -r hadoop --hadoop-streaming-jar /hadoop-2.10.0/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar lift.txt
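The lift of a rule A -> B is commonly defined as support(A and B) / (support(A) * support(B)); a value above 1 suggests the two items appear together more often than expected if they were independent. A minimal plain-Python sketch follows. The basket data and the function name are hypothetical, and the assignment's program may organise the computation differently.

```python
# Hypothetical baskets, one list of items per transaction.
BASKETS = [
    ["Bread", "Coffee"],
    ["Coffee"],
    ["Bread", "Coffee", "Tea"],
    ["Tea"],
]


def lift(baskets, a, b):
    """Return the lift of the rule a -> b over the given baskets."""
    total = len(baskets)
    n_a = sum(1 for basket in baskets if a in basket)
    n_b = sum(1 for basket in baskets if b in basket)
    n_ab = sum(1 for basket in baskets if a in basket and b in basket)
    if n_a == 0 or n_b == 0:
        return 0.0  # lift is undefined here; 0.0 is an arbitrary choice
    return (n_ab / total) / ((n_a / total) * (n_b / total))


print(lift(BASKETS, "Bread", "Coffee"))
print(lift(BASKETS, "Bread", "Tea"))  # 1.0
```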
In the sales comparison part, we use the "BreadBasket_DMS.csv" dataset file as the input to our MapReduce program. The sales_comparison MapReduce program can be executed with the following commands.
Running on a local system:
python BreadBasket_DMS.csv -q
Running on a Hadoop system:
python -r hadoop --hadoop-streaming-jar /hadoop-2.10.0/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar BreadBasket_DMS.csv
Note: The path and file name of the Hadoop Streaming JAR file (hadoop-streaming-2.10.0.jar) may vary depending on the Hadoop version and installation.