Azure HNN project content repo

This repository built for HNN hackfest HDInsight and node.js storage code

Azure storage SDK code

As part of the process HNN Project to AWS S3 to Azure Blob, we provide a sampe code - how to access Azure Blob Storage using a node.js. In case of Java code, guide is provided with support of @zzz2613

Azure Storage sample code ofr node.js and etc.

HDInsight - Preprocessing log data using Hadoop big data

For simple data type conversion, the JSON type log data loaded in basic S3 can be easily converted into CSV type after deserializing in Azure function app. However, the log size is too large to preprocess and to become difficult log data type conversion, and it may be necessary to obtain insight from the log data for management and monitoring. To do this, we briefly performed the process of analyzing data using HDInsight – Hadoop during the Hackfest.

Deploying HDInsight and understand how Hadoop services work
Performing a Hive query on the management portal by using Ambari
Assuming that the entire workflow is done via the Azure Data Factory (Did not review on hackfest)
CSV file data is about 10,000 rows
Assuming that there are about 30 real estate areas, it is assumed that the analysis of how many users stay for the longest

HDInsight's HDFS is compatible with Azure Blob Storage.
Copy the blob using Storage Explorer. Copy the blobs to the data subdirectory of the container specified at installation. These blobs are mapped to "wasbs:///example/data/"

Structure of loaded csv file

PartitionKey	RowKey	Timestamp	Date	Level	Logger	Message
BECA21F0-8B5E-4877-A18A-FD2A4B04322D	4DC0C4A6-893C-4246-AF80-B990EDD10C54	0.011269676	2016-09-10T04:29:50.828Z	stage5	CBComInsMemberGameInfoStages	9
6EDC9E0F-5E7E-4A04-B420-5F3091052CE7	819F7AC7-61E9-40C8-9904-B53B1008311E	0.011269676	2016-06-13T17:40:10.131Z	stage21	CBComInsMemberGameInfoStages	14
9F86454F-4E96-4CAD-A0AE-05DED005774F	766461D9-5FE6-4135-8FB3-15AACE24C4A8	0.011269676	2016-10-18T08:06:05.589Z	stage15	CBComInsMemberGameInfoStages	7
920E72B5-377C-4590-82D5-387ED86EF757	3AB215E0-AAE7-4E69-AF8B-825FADBAF756	0.011269676	2016-06-13T17:38:02.779Z	stage14	CBComInsMemberGameInfoStages	1
...

Daily, log file include about 10,000 pieces of data, the real estate area from the client and the time information of the stay
Processes taking time (minutes) of message for items
Perform HDInsigt Cluster Manager to perform Hive queries
Prepare Hive query execution

Load csv files as External Table from Hive

set hive.execution.engine=tez;
DROP TABLE cloudbreadlog;
CREATE EXTERNAL TABLE cloudbreadlog (
    PartitionKey string, 
    RowKey string, 
    `Timestamp` string, 
    `Date` string, 
    Level string, 
    Logger string, 
    Message string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasbs:///example/data/';
SELECT 
    * 
FROM 
    cloudbreadlog 
WHERE 
    INPUT__FILE__NAME LIKE '%.csv' 
--	where Logger like 'CBComInsMemberGameInfoStages'
--	AND INPUT__FILE__NAME LIKE '%.csv' 
--order by level asc;

Execure Hive query

Check whether the external table is loaded by Hive query
```
SELECT count(*) FROM cloudbreadlog;
```

Load into Internal table ORC(Optimized Row Columnar) for fast processing

set hive.execution.engine=tez;
CREATE TABLE IF NOT EXISTS CloudBreadStageTime(
    PartitionKey string, 
    RowKey string, 
    `Timestamp` string, 
    `Date` string, 
    Level string, 
    Logger string, 
    Message string
)
STORED AS ORC;
INSERT OVERWRITE TABLE CloudBreadStageTime
SELECT 
    PartitionKey,
    RowKey, 
    `Timestamp`, 
    `Date`, 
    Level, 
    Logger, 
    Message
FROM 
    cloudbreadlog;

Execute final select query on ORC table

select 
    level, 
    avg(cast(Message as int)) as stage_play_avg_min, 
    sum(cast(Message as int)) as stage_play_sum_min 
from CloudBreadStageTime
where Logger = 'CBComInsMemberGameInfoStages'
group by level
order by level asc;

Average stay time and total stay time performed by region as a result of execution
You can get Hive query results like this.

level	stage_play_avg_min	stage_play_sum_min
stage0	7.032447466	22757
stage1	6.923216995	22812
stage10	6.990762125	24216
stage11	6.962710861	23527
stage12	6.99791294	23471
stage13	7.04329525	23426
stage14	7.060240964	24026
stage15	7.005102041	23341
stage16	6.991793313	23003
stage17	6.954984894	23021
stage18	7.089904421	23737
stage19	7.13037037	24065
stage2	7.110673135	23451
...

Easily visualize and publish this Hadoop analysis result to the Web using Power BI.

Various other Hadoop analysis scenarios can be derived

Changes in the volume of inquiries of major real estate properties
Grouped property information by feature

Scenarios that utilize various logs can be used to store and analyze by Hadoop - Big data

hdinsight example data location

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
data		data
demo		demo
images		images
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Azure HNN project content repo

Azure storage SDK code

HDInsight - Preprocessing log data using Hadoop big data

About

Releases

Packages

Contributors 3

Languages

License

hnn-project/azure-content

Folders and files

Latest commit

History

Repository files navigation

Azure HNN project content repo

Azure storage SDK code

HDInsight - Preprocessing log data using Hadoop big data

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages