GettingStarted

This document is for those who …

…have tried R but are not used to Hadoop and Map/Reduce.
…have tried SQL but are not proficient with it, nor with working on the command line.
…want to keep their current R-based analysis style, or something close to it, without worrying much about distributed environments and massive data volumes.
…want to gain hands-on experience with distributed or big data analysis quickly.

Contents

Installing what RHive requires, configuring settings, and simple troubleshooting.
RHive examples
R users who want to use RHive for big data analysis can use the user guide to learn the basic instructions and the fundamentals of analysis.

RHive – GettingStarted
This document gives simple, intuitive instructions for users new to RHive.

What is RHive?
RHive = R language + Hive (Hive is software built on Hadoop that lets you use SQL syntax to access files stored in Hadoop and to run Map/Reduce tasks).
RHive is an R package that makes distributed data processing and big data analysis easy by letting you use R syntax to access Hive or to use R within the Map/Reduce framework.
Because RHive combines an R package with Hive, users can use the R language and Hive together, and can apply both R and SQL syntax to big data analysis.

Things to know before using RHive

Basic User
- R language (GNU-R) syntax
  - R is an open source language for statistics and analysis (it can serve as an open source alternative to SAS, SPSS, MATLAB, etc.).
- Basic Concepts of Map/Reduce
  - Map/Reduce is one of the most widely used methods of processing distributed data.
- Basic Concepts of Hadoop
- Basic Concepts of Hive
Advanced User
- Hive SQL syntax, and the concepts and workings of UDFs and UDAFs
- Hadoop Distributed File System (HDFS)
- Advanced R syntax

Introduction to RHive
Understanding RHive

RHive lets you handle big data and Map/Reduce work using only R and SQL syntax, without implementing the Map/Reduce jobs yourself.
With only simple operations in R, aggregation, preprocessing, and basic statistical analysis become easy.
There is no need to strain for a complete understanding of Map/Reduce.
Reference: Hive UDF
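
For readers who would still like a rough picture of what Map/Reduce does behind the scenes, here is a minimal single-machine sketch. It is written in plain Python purely for illustration; it is not RHive, Hive, or Hadoop code, and the function names and the sample (carrier, delay) records are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for carrier, delay in records:
        yield carrier, delay


def shuffle(pairs):
    """Shuffle: group together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into one result (here, the mean)."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical sample data: (carrier, delay in minutes).
records = [("AA", 10), ("UA", 30), ("AA", 20), ("UA", 50), ("DL", 5)]

result = reduce_phase(shuffle(map_phase(records)))
print(result)
```

In a real cluster the map and reduce steps run in parallel on many machines, and the framework performs the grouping between them. With Hive, the same aggregation would normally be expressed as a single SQL statement with GROUP BY, which is why a complete understanding of the mechanics above is not required.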

Pros

Anyone who knows R and SQL syntax can process big data; nothing else is required.

Cons

Requires at least a basic understanding of Map/Reduce and Hive.
- Analysts typically know almost nothing about these.
Requires knowledge of SQL syntax.
- A significant share of analysts do not know SQL.
Debugging is difficult (the root of this problem lies not in RHive but in the distributed environment).
