
CDH on Azure

Recently I worked on a unique Hadoop project in which we deployed CDH (Cloudera Distribution for Hadoop) on Azure. The platform provides "Big Data as a Service" to the data scientists of a large organization, and this kind of deployment has rarely been implemented before. I am sharing some of the knowledge and first-hand experience acquired while working on the project. The intention of this article is to give you a gist of the steps required to install CDH on Azure.

Cloudera does publish a reference architecture document for Azure deployments online, but it appears to be obsolete.

Reference Architecture

We used DS14-size CentOS 6.6 machines for all nodes in the cluster, with dedicated Premium Storage attached to each machine. The reason for using Premium Storage is that it provides high throughput (5,000+ IOPS per disk) for various data-science-related jobs. One major limitation of Premium Storage is that its disks cannot be added to a Backup vault.

For more details about various machine types in Azure, please refer to Machine Types

For more details about Premium storage, please refer to Premium Storage Details.

The architecture of the platform looks as below:

Architecture

Provisioning of Machines

The very first step is to set up a Virtual Network (VNet) in Azure and configure it based on aspects like access to the Internet, connectivity with other trusted networks, access to other Azure services, etc.
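The VNet itself can be created either from the portal or from the CLI. A minimal sketch, assuming a placeholder region and address space (exact option names vary slightly across azure-cli versions):

azure network vnet create NETWORK1 --location "<your-region>" --address-space 10.0.0.0 --cidr 16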

After that, we used the Azure command line interface (azure-cli) to provision instances. The same can be done via the Azure Management Portal, but in my personal experience the portal is not very robust: it throws generic errors even for common operations, and it does not allow operations to run in parallel. Before provisioning the machines, create an SSH key pair with which you can log into the instances.
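One common way to generate the key material for classic Azure Linux VMs is with openssl, since the --ssh-cert option used later expects an X.509 certificate; the certificate file is what gets passed to --ssh-cert, and the private key is what you use later with ssh -i. A sketch with placeholder file names:

openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout myPrivateKey.key -out myCert.pem
chmod 600 myPrivateKey.key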

https://www.npmjs.com/package/azure-cli

Install the Azure CLI on Mac OS X Yosemite:

brew install node
npm install -g azure-cli

Connect Account:

Download the publish settings file (which contains the management certificate) from the portal by logging in, then import it:

azure account download
azure account import <path-to-publishsettings-file>

Set the right account:

azure account list

There might be more than one account.

azure account set "<account name>"
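Depending on the azure-cli version, you may also need to make sure the CLI is in classic (service management) mode, since the commands below use the classic deployment model:

azure config mode asm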

Define Command line Variables:

export vmStorageAccountName=clouderaw1store
export vmStaticIP=XX.XX.XX.XX
export vmName=CLOUDERAW1

STEP 1: Create a storage account for each machine (through the Azure Portal)
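Alternatively, the premium storage account can also be created from the CLI. A sketch, assuming a placeholder region; the --type option and the PLRS value (Premium locally redundant storage) depend on the azure-cli version:

azure storage account create ${vmStorageAccountName} --location "<your-region>" --type PLRS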

STEP 2: Create a container

Find the Connection String in Azure Portal: Browse > Storage Account (classic) > clouderaw1store > Settings > Keys > Primary Connection String

azure storage container create --container "vhds" \
--connection-string "<paste-connection-string-here>"

STEP 3: Create CentOS node:

azure vm create --vm-name ${vmName} \
--virtual-network-name NETWORK1 \
--blob-url <blob-url-for-os-disk-vhd> \
--static-ip ${vmStaticIP} \
--userName clouderaadmin \
--ssh <port-number> \
--ssh-cert key.pub \
--no-ssh-password \
--vm-size Standard_DS14 \
--availability-set WORKER_AVS \
--connect \
cloudera-hadoop "5112500ae3b842c8b9c604889f8753c3__OpenLogic-CentOS-66-20150706"

STEP 4: Attach multiple data disks

azure vm disk attach-new --host-caching ReadOnly ${vmName} 512 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-application.vhd

azure vm disk attach-new --host-caching ReadOnly ${vmName} 1023 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-hadoop.vhd

azure vm disk attach-new --host-caching ReadOnly ${vmName} 1023 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-hadoop1.vhd

azure vm disk attach-new --host-caching ReadOnly ${vmName} 1023 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-hadoop2.vhd

STEP 5: Validate

Validate the VM in the Azure Portal and SSH into it:

ssh -i <path-to-private-key> clouderaadmin@<ip-address> -p <port-number>

Each Azure instance has an OS disk (/dev/sda) whose purpose is to provide fast boot times; it should not be used for anything else. The second disk (/dev/sdb) is a temporary disk used for the Linux swap file. We can attach additional disks (up to 32 x 1 TB disks on DS14 machines) to each VM for storing log files, application data, Hadoop data, etc.; these show up as /dev/sdc, /dev/sdd, and so on. We can check the read speed of the disks with hdparm, a Linux utility that quickly measures the read speed of a drive.

sudo yum install hdparm

hdparm -t /dev/sdc

Gateway machine or Stepping Stone Machine (2 Extra Disks) –
1. 512 GB for Applications
2. 1 TB for Hadoop Data

Worker machines (4 Extra Disks) –
1. 512 GB for Applications
2. 3 x 1 TB for Hadoop

Master machines (2 Extra Disks) –
1. 512 GB for Applications
2. 1 TB for NN Dir
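Once the disks are attached, they should show up on each node alongside the OS disk (/dev/sda) and the temporary disk (/dev/sdb). A quick way to confirm before partitioning:

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT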

Ansible and Cloudera Manager Installation

Once the machines were provisioned, we used an Ansible playbook to perform the following tasks in an automated way, preparing the machines and setting up the environment (a sketch of the equivalent manual commands follows the list):

• Disable SELinux
• Enable Swap on Azure CentOS
• Set Swappiness to 1
• Disable IPv6
• Partition Disks
• Format Disks
• Mount Disks
• Set up NTP, etc.
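The playbook itself is environment specific, but a minimal sketch of the equivalent manual commands on CentOS 6 looks roughly like this (/dev/sdc, the /data01 mount point, and the swap size are placeholders, and the waagent lines assume the default /etc/waagent.conf shipped with the Azure Linux agent):

# Disable SELinux (fully effective after a reboot)
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

# Enable swap on the Azure temporary disk via the Azure Linux agent
sudo sed -i 's/^ResourceDisk.EnableSwap=n/ResourceDisk.EnableSwap=y/' /etc/waagent.conf
sudo sed -i 's/^ResourceDisk.SwapSizeMB=0/ResourceDisk.SwapSizeMB=8192/' /etc/waagent.conf

# Set swappiness to 1 and disable IPv6
echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv6.conf.all.disable_ipv6 = 1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Partition, format and mount one data disk (repeat per attached disk)
sudo parted -s /dev/sdc mklabel gpt mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/sdc1
sudo mkdir -p /data01
sudo mount /dev/sdc1 /data01

# Install and enable NTP
sudo yum install -y ntp
sudo service ntpd start
sudo chkconfig ntpd on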

For more details about installing Cloudera Manager via Ansible, please refer to CM via Ansible

Cloudera Manager 5 can be installed using the link below:
http://archive.cloudera.com/cm5/installer/latest/
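A typical sequence on the Cloudera Manager host is to download the installer binary from that archive path and run it (the file name below is the one published there):

wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin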

CDH can be installed by following the steps in the guide below:
http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/CDH5-Installation-Guide.html

Conclusion

It is not easy to find Azure, Linux, and Hadoop experts in the market, and Azure IaaS and PaaS functionality is still evolving alongside the Cloudera Distribution for Hadoop. Another, easier way to install CDH on Azure could be to use Azure templates. The environment is not yet production ready, but Microsoft and Cloudera are working hard to make this unique deployment combination a success. Hopefully, we will see more such Hadoop deployments in the near future.
