Note: These scripts are experimental. We would appreciate users testing them and providing feedback or fixes.
- An Amazon AWS account with properly set up security groups and policies
- An s3 bucket
- AWS Command Line Interface
If you don't have these prerequisites, they will be covered in more detail below.
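Before going further, it can help to confirm that the AWS CLI is installed and that your credentials can reach the bucket. The following is a minimal sketch and not part of the install-emr repository: `check_prereqs` is a hypothetical helper name, and the check does not verify security groups or IAM policies.

```shell
# Hypothetical preflight helper (an assumption, not shipped with these scripts).
# Succeeds only if the AWS CLI is on PATH and the bucket can be listed
# with the currently configured credentials.
check_prereqs() {
  local bucket="$1"
  if ! command -v aws >/dev/null 2>&1; then
    echo "AWS CLI not found on PATH" >&2
    return 1
  fi
  # A failed listing usually means a wrong bucket name or missing permissions.
  if ! aws s3 ls "s3://${bucket}" >/dev/null 2>&1; then
    echo "Cannot list s3://${bucket}" >&2
    return 1
  fi
  echo "Prerequisites look OK for s3://${bucket}"
}

# Example: check_prereqs <s3 bucket>
```

If the check fails, the sections on the AWS CLI and the S3 bucket further down describe how to set each piece up.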
You can install the scripts by cloning the GitHub repository.
git clone https://github.com/tesseradata/install-emr
cd install-emr
If you have the prerequisites and a bash shell, you can call tessera-emr.sh as follows:
./tessera-emr.sh -s <s3 bucket>
To see more options (number of workers, instance types, etc.):
./tessera-emr.sh -h
This script does the following:
- Syncs the custom Tessera bootstrap scripts to a "scripts" folder in your s3 bucket
- Creates a security group to allow RStudio Server to be served over port 80 (by default open to just your IP address)
- Launches the EMR cluster and installs and configures all Tessera components
Once your cluster is up and running, if you need to install additional R packages on the nodes, there are some helper scripts for this:
# CRAN package
./install-package.sh <cluster id> <s3 bucket> rvest
# github package
./install-package-gh.sh <cluster id> <s3 bucket> bokeh/rbokeh
If you want finer control over things, take a look at tessera-emr.sh and modify the aws emr create-cluster command for your needs.
Please note that you are responsible for making sure that instances you have started are terminated when you are done. Familiarize yourself with the following resources for monitoring usage, and check them frequently.
- AWS Console -> EMR (direct link) - you can view running EMR clusters and terminate them here
- AWS Console -> EC2 -> Instances (direct link) - you can view running instances and terminate them here
- AWS Console -> Menu Bar -> (username dropdown) -> Billing and Cost Management: you can view your account balance here
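The same checks can also be made from the command line. The sketch below wraps two AWS CLI calls, `aws emr list-clusters --active` and `aws emr terminate-clusters`. The function names are hypothetical helpers and are not part of this repository.

```shell
# Hypothetical helpers (assumptions, not shipped with these scripts).

# Print ID, name, and state of clusters that are starting, running, or waiting.
list_active_clusters() {
  aws emr list-clusters --active \
    --query 'Clusters[].[Id,Name,Status.State]' --output text
}

# Terminate one cluster by its ID (IDs look like j-XXXXXXXXXXXXX).
terminate_cluster() {
  local cluster_id="$1"
  if [ -z "$cluster_id" ]; then
    echo "usage: terminate_cluster <cluster id>" >&2
    return 2
  fi
  aws emr terminate-clusters --cluster-ids "$cluster_id"
}

# Example:
#   list_active_clusters
#   terminate_cluster <cluster id>
```

Running `list_active_clusters` again after a termination request is a reasonable way to confirm that the cluster is shutting down.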
If you don't already have an AWS account, go to http://aws.amazon.com and click the button that says "Create a Free Account". If you have logged in to the system before, the button will say something like "Sign in to the Console".
You can sign in if you have an existing amazon.com account or create a new account.
- Sign in to the AWS management console
- Click on "Identity and Access Management"
- Click on "Users" and then click the "Create New Users" button and create your user
- After you have created the user, click the "Download Credentials" button - this will give you a file, credentials.csv, with your user's key and secret key that will be used when we configure the AWS Command Line Interface
- Click on "Groups" and click the "Create New Group" button
- Call the group what you'd like, e.g. "tessera"
- Attach the following two policies to the group: AmazonDynamoDBFullAccess, AmazonElasticMapReduceFullAccess (DynamoDB access is only required if you are going to use EMRFS with the -e option)
- Now click "Groups" and click on the entry of the group you just created
- Click the "Add Users to Group" button and select your user
- Sign in to the AWS management console
- Click on "EC2"
- Click on "Key Pairs" under "Network & Security"
- Click the "Create Key Pair" button
- Name it what you'd like, e.g. "tessera-emr"
- Keep track of the name of this file, as it will be the -k argument to tessera-emr.sh.
- A file with that name and a .pem extension will be downloaded
- You can put this file where you'd like but treat it with care (don't share with anyone or put it anywhere where others can get it)
- You can put it in the emr-3.2.1 directory of this repo if you'd like (but don't check it in to git)
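One way to treat the key file with care is to make it readable only by you, since ssh typically refuses private keys that other users can read. This is a minimal sketch; `protect_key` is a hypothetical helper name and the file name in the example is only the one suggested above.

```shell
# Hypothetical helper (an assumption, not shipped with these scripts).
# Restrict a downloaded .pem file to owner read-only.
protect_key() {
  local pem="$1"
  if [ ! -f "$pem" ]; then
    echo "no such key file: $pem" >&2
    return 1
  fi
  chmod 400 "$pem"
}

# Example: protect_key tessera-emr.pem
```

If you keep the key inside the repository directory, adding a `*.pem` line to a local `.gitignore` is one way to avoid checking it in by accident.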
We will use the S3 bucket to store the EMR startup scripts, and you can also use it to store your HDFS data.
- Sign in to the AWS management console
- Click "S3"
- Click the "Create Bucket" button and go through the steps
- Enable logging for the bucket with the default prefix "logs/"
- Make sure you make note of the Region you choose
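If the AWS CLI is already installed and configured, the bucket can also be created from the command line with `aws s3 mb`. The sketch below is an alternative to the console steps, with `make_bucket` as a hypothetical helper name. It does not enable logging, so that step would still need to be done in the console.

```shell
# Hypothetical helper (an assumption, not shipped with these scripts).
# Create a bucket in an explicit region so the region is known afterwards.
make_bucket() {
  local bucket="$1"
  local region="$2"
  if [ -z "$bucket" ] || [ -z "$region" ]; then
    echo "usage: make_bucket <s3 bucket> <region>" >&2
    return 2
  fi
  aws s3 mb "s3://${bucket}" --region "$region"
}

# Example: make_bucket <s3 bucket> us-west-2
```

Passing the region explicitly keeps a record of which one was chosen, which is useful when configuring the CLI's default region.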
The AWS CLI uses Python so make sure you have that installed.
Instructions for how to install the AWS CLI can be found here.
Follow the instructions here to configure the AWS CLI.
Some notes:
- Use the credentials.csv file you downloaded when you created the user to get your key and secret key
- If you don't have this file, follow this guide.
- To see the possibilities for "region", look at the codes here - it is a good idea to choose the same one as your s3 bucket
- You can choose the default value for "output" - it doesn't matter which you choose
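As a rough illustration, configuration is normally done interactively with `aws configure`, which prompts for the access key, secret key, default region, and default output format. The sketch below only inspects the result afterwards; `region_matches` is a hypothetical helper name for checking that the configured region is the one you chose for your bucket.

```shell
# Interactive setup (run once; answers come from credentials.csv):
#   aws configure
# Show what the CLI currently has configured:
#   aws configure list

# Hypothetical helper (an assumption, not shipped with these scripts).
# Succeeds if the CLI's default region equals the expected region code.
region_matches() {
  local expected="$1"
  local configured
  configured="$(aws configure get region)"
  [ "$configured" = "$expected" ]
}

# Example: region_matches us-west-2 || echo "region differs from bucket region"
```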
You should now be ready to run tessera-emr.sh as outlined at the beginning of this README.
- m1.large or larger instance types must be used. Smaller instance types have caused issues where Hadoop is unable to start.
- Each time a cluster is started a new security group is created with a name TesseraEMR-xxxxxx. Periodically you may want to check your security groups and clean out old groups with this prefix.
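Cleaning out old groups with the TesseraEMR- prefix can be sketched with the AWS CLI as follows. The function names are hypothetical helpers and not part of this repository. The listing uses a JMESPath `starts_with` filter on the group name, and deletion is left as an explicit per-group step so that nothing is removed without review.

```shell
# Hypothetical helpers (assumptions, not shipped with these scripts).

# Print the ID and name of every security group whose name starts with TesseraEMR-.
list_tessera_groups() {
  aws ec2 describe-security-groups \
    --query "SecurityGroups[?starts_with(GroupName, 'TesseraEMR-')].[GroupId,GroupName]" \
    --output text
}

# Delete one security group by ID; AWS rejects the call if the group is still in use.
delete_group() {
  local group_id="$1"
  if [ -z "$group_id" ]; then
    echo "usage: delete_group <group id>" >&2
    return 2
  fi
  aws ec2 delete-security-group --group-id "$group_id"
}

# Example:
#   list_tessera_groups
#   delete_group <group id>
```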