
GettingStarted

miloveme edited this page Dec 11, 2011 · 9 revisions

This document is for those who …

  • …have tried R but are not used to Hadoop and Map/Reduce.
  • …have tried SQL but are not proficient with it, or with working at the command line.
  • …want to keep their current R-based analysis style, or something close to it, without worrying much about distributed environments and massive data volumes.
  • …want to gain hands-on experience with distributed or big data analysis quickly.

Contents

  • Installing RHive, configuring its settings, and simple troubleshooting.
  • RHive examples
  • A user guide from which R users who want to use RHive for big data analysis can learn the basic instructions and the fundamentals of analysis.

RHive – GettingStarted
This document gives simple, intuitive instructions for users new to RHive.

What is RHive?
R language + Hive (an extension to Hadoop that lets you use SQL syntax to access files stored in Hadoop and to run Map/Reduce tasks)
RHive is an R package that makes distributed data processing and big data analysis easier by letting you use R syntax to access Hive and to run R within the Map/Reduce framework.
Because it combines an R package with Hive, users can work with R and Hive at the same time and can use both R and SQL syntax for big data analysis.

Things to know before using RHive

  • Basic User
    • R language (GNU-R) syntax
      • R is an open source language for statistics and analysis (it can serve as an open source alternative to SAS, SPSS, MATLAB, etc.).
    • Basic Concepts of Map/Reduce
      • Map/Reduce is the most widely used method of processing distributed data.
    • Basic Concepts of Hadoop
    • Basic Concepts of Hive
  • Advanced User
    • Hive SQL syntax, and the concepts and workings of UDFs and UDAFs
    • Hadoop file system (HDFS)
    • Advanced R syntax
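
For readers new to the Map/Reduce idea listed under Basic User, the three phases can be sketched in a few lines. This is a minimal illustration in plain Python, not RHive or Hadoop code; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample records are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for city, amount in records:
        yield city, amount


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single result (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample data: (city, amount) records.
records = [("Seoul", 10), ("Busan", 5), ("Seoul", 7)]
totals = reduce_phase(shuffle(map_phase(records)))
print(totals)
```

In Hive, the same aggregation is expressed as a SQL query of the form `SELECT city, SUM(amount) FROM ... GROUP BY city`, which Hive turns into Map/Reduce work on the cluster.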

Introduction to RHive
Understanding RHive

  • Lets you work with big data and Map/Reduce jobs using only R and SQL syntax, without implementing the jobs yourself.
  • With only simple tasks in R, aggregation, preprocessing, and basic statistical analysis become easy.
  • There is no need to strain for a complete understanding of Map/Reduce.
  • Reference: Hive UDF

Pros

  • Anyone who knows R and SQL syntax can process big data; nothing else is required.

Cons

  • Requires at least a basic understanding of Map/Reduce and Hive.
    • Analysts often know virtually nothing about these.
  • Requires knowledge of SQL syntax.
    • A significant share of analysts do not know SQL.
  • Debugging is difficult (the root of this problem lies not in RHive but in the distributed environment).