I know, nowadays everyone builds Big Data systems in less than 30 minutes. With Cloudera software it's even faster. I tried to do it, but I did not finish with such spectacular speed. To be honest, I spent 3 days, 3 hours, and 33 minutes. This article describes how to install a Cloudera-based Hadoop cluster in a universal way, so that it can run in any type of environment, including an enterprise one. With these instructions the installation will take less than one day.
The main challenge in installing Hadoop is its massive nature; the main challenge in doing it in an enterprise environment is access to the internet. Typically it is possible to reach the public network via proxy servers. In restricted environments, it's not possible.
To install a Cloudera-based Hadoop system you need an internet connection to access the repositories with binary files. One possibility is to clone the repositories and expose them locally. Another is to install the system outside the enterprise, using virtualization technology, and copy the configured machines to the protected environment.
The environment with access to the public network may be just your computer and a mobile phone. The system in this article was built on a 16GB, 4-core notebook with an external USB3 disk drive and a mobile connection with an unlimited LTE data plan.
1. Target architecture
This manual will lead you to a private Hadoop cluster with access to the public network via a gateway node. The gateway hosts DNS, NTP, and HTTP proxy services. The gateway's public interface is configured by DHCP, which in most situations simplifies moving the system.
2. Credits
This article was prepared using the great instructions presented by masterschema and Christian Javet.
3. Other solutions
As an extension to this article, take a look at Ofir Manor's article on using Linux Containers to host Hadoop. This technique looks really smart. Instead of putting a whole new OS on top of the existing OS controlling the bare metal, Ofir uses lxc technology to sandbox each Hadoop node on the existing Linux. Sounds much better.
There is also an interesting way of deployment based on Vagrant, which can handle provisioning to various platforms such as VBox, lxc, Docker, and more.
To proceed you need to install VirtualBox on your computer. The computer must be equipped with at least 16GB of RAM, 4 cores, a fast internet connection, and 50GB of available disk space.
During my first installation, I made a number of mistakes. First of all, make sure that the manager machine is configured with 8GB of RAM. It's critical. I had a number of issues with DNS. It looks like using a public domain such as example.com may cause problems. Of course, cloning machines without generating new MAC addresses is just wrong. I do not know why VBox does not change the MAC by default. Selecting "Single User Mode" in the Cloudera installer is wrong, do not do it. It's a perfect option, I'm sure, but it makes everything more complex. Finally, I learned that the HTTP proxy can't be fully utilized in the system because Cloudera uses HTTPS connections. The majority of updates go through the proxy with no possibility of being cached. You should understand networking in VirtualBox. Connecting the Bridged network to the proper host interface is not rocket science, but it may be overlooked when you are trying to connect the gateway host to the internet via a USB-connected mobile phone.
1. Prepare Golden Image
The golden image is a VirtualBox image which will be used as the base for all other node types. It contains all required packages and the configuration that is generic for the system. The golden image is based on CentOS.
a) download CentOS minimal
Download the ISO image from the CentOS repository. This manual uses the file CentOS-6.7-x86_64-minimal.iso.
b) create VirtualBox machine
Create the machine used for the golden image. The machine should have the following parameters:
- name golden image
- type Linux, version RedHat (64bit)
- 2048 MB RAM,
- 1 CPU,
- 1 network adapter connected to "Bridged" network,
- 40GB disk VDI type, dynamically allocated.
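The same machine can probably be created from the command line instead of the GUI. The sketch below only prints the VBoxManage commands so you can review them first; the VM name, the bridged interface name, and the "SATA" controller name are my assumptions, and `createhd` is the older spelling that newer VirtualBox releases call `createmedium`.

```shell
#!/bin/sh
# Print (do not run) VBoxManage commands matching the parameter list above.
# $1 = VM name, $2 = host interface to bridge to (both are placeholders).
golden_vm_commands() {
    vm_name=$1
    bridge_if=$2
    cat <<EOF
VBoxManage createvm --name "$vm_name" --ostype RedHat_64 --register
VBoxManage modifyvm "$vm_name" --memory 2048 --cpus 1 --nic1 bridged --bridgeadapter1 "$bridge_if"
VBoxManage createhd --filename "$vm_name.vdi" --size 40960 --format VDI --variant Standard
VBoxManage storagectl "$vm_name" --name "SATA" --add sata
VBoxManage storageattach "$vm_name" --storagectl "SATA" --port 0 --device 0 --type hdd --medium "$vm_name.vdi"
EOF
}
```

Calling `golden_vm_commands GoldenImage en0` prints five commands; pipe them to `sh` once they look right for your host.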
c) install CentOS
Connect the downloaded "CentOS-6.7-x86_64-minimal.iso" file to the Golden Image virtual machine's CD-ROM, start the virtual machine, and select "Install or upgrade an existing system". Skip the media test, accept all default options, and initialize the attached disk. Installation will take a few minutes. Configure the root account with an easy to remember password. Why not just use "welcome" this time?
d) update CentOS
It's assumed that your host is connected to the Internet, so the VirtualBox bridged network will forward all IP requests to your Internet-connected network. This is needed at this stage to update and install Linux packages. To save space we installed the minimal release of CentOS, so some mandatory utilities are missing. Before this step you need to ensure that the golden image host has a connection to the internet. Depending on your host computer and your network configuration, you may need to configure the yum package manager with a proxy server.
After installation the eth0 network interface is configured for DHCP, but is not started during Linux boot. To make things easier, we will bring it up manually.
ifup eth0
To make configuration easier we will use the host computer's ssh client. With it you will be able to copy and paste commands from this manual into the ssh terminal. To do so, check the assigned IP address; it will be required in the next steps for the ssh client connection.
ifconfig | grep Mask
Now you can connect to the golden image system using your ssh client.
ssh root@ADDRESS
Discover your HTTP proxy server address and update the yum.conf file.
echo "
" >> /etc/yum.conf
With proper connectivity and yum configuration, execute the yum commands below:
yum update -y
yum install traceroute wget telnet lsof bind-utils perl ntp openssh-clients -y
This finishes the first step of generic configuration. Now we will configure elements more specific to the Hadoop cluster being built.
e) configure CentOS
As the Hadoop cluster needs internet connectivity, all hosts have to be configured with a proper gateway. To simplify the overall system configuration, one of the nodes is configured as an internet gateway with all required services, including HTTP proxy, DNS, and NTP.
Internet gateway node will use address; lab domain name is "bigdatalab.orcl"
cat <<EOF >/etc/resolv.conf
search bigdatalab.orcl
The golden image is configured with the name CHANGE, which will be changed during deployment of each node. The gateway setting points to the cluster gateway host. To make things easier to debug, IPv6 is disabled.
Note that the hostname is specified without the domain name. It's only a short name, not an FQDN. It's important. Note that the name of "localhost" will be changed for each host. It's interesting that this default host name may be overwritten by the DHCP protocol. I will not use this technique now, but it's good to have such a possibility.
cat <<EOF >/etc/sysconfig/network
Cloudera recommends some adjustments that increase the speed of the system. Note that the lab Hadoop cluster is not configured for security; it's configured for simplicity.
perl -pi -e 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
Disable defragmentation of transparent huge pages, and minimize impact of paging memory blocks to swap area.
echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >> /etc/rc.local
echo "# Minimize swap activity
vm.swappiness = 10
" >> /etc/sysctl.conf
Disable firewall.
chkconfig iptables off
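The rc.local and sysctl.conf entries above only take effect at boot, so it may be worth checking them after the next restart. A minimal sketch, assuming the same sysfs path as used above: the kernel marks the active transparent huge page setting with square brackets, and the helper just extracts it.

```shell
#!/bin/sh
# Extract the active value from a sysfs-style line such as
# "always madvise [never]": the bracketed word is the active one.
thp_active_mode() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage on a node, after a reboot:
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/defrag
#   sysctl -n vm.swappiness
```

If the first command prints anything other than never, or the second something other than 10, the settings were not applied.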
Configure trusted authentication between cluster hosts. It will make further administration easier.
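# Generate a passwordless RSA key pair first (assumed step; skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys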
perl -pi -e 's/# StrictHostKeyChecking ask/StrictHostKeyChecking no/g' /etc/ssh/ssh_config
Finally, let's configure name resolution. It's based on a local static file. I've decided to prepare the name space for more than 4 nodes, so when preparing a bigger cluster you will not need to distribute the hosts file to all boxes again. Not a big thing, but it will make life easier.
cat <<EOF >/etc/hosts
localhost.localhost localhost
gateway.bigdatalab.orcl gateway
hadoop1.bigdatalab.orcl hadoop1
hadoop2.bigdatalab.orcl hadoop2
hadoop3.bigdatalab.orcl hadoop3
hadoop4.bigdatalab.orcl hadoop4
hadoop5.bigdatalab.orcl hadoop5
hadoop6.bigdatalab.orcl hadoop6
hadoop7.bigdatalab.orcl hadoop7
hadoop8.bigdatalab.orcl hadoop8
hadoop9.bigdatalab.orcl hadoop9
hadoop10.bigdatalab.orcl hadoop10
hadoop11.bigdatalab.orcl hadoop11
hadoop12.bigdatalab.orcl hadoop12
hadoop13.bigdatalab.orcl hadoop13
hadoop14.bigdatalab.orcl hadoop14
hadoop15.bigdatalab.orcl hadoop15
hadoop16.bigdatalab.orcl hadoop16
hadoop17.bigdatalab.orcl hadoop17
hadoop18.bigdatalab.orcl hadoop18
hadoop19.bigdatalab.orcl hadoop19
hadoop20.bigdatalab.orcl hadoop20
EOF
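Typing twenty entries by hand is error prone, so the list can also be generated. This is only a sketch: each hosts line needs an address in front of the names, and the scheme used here (gateway on .1, hadoopN on .10+N, the 192.0.2 documentation prefix) is purely an assumption for illustration. Use the addresses of your own lab network.

```shell
#!/bin/sh
# Print /etc/hosts entries for the gateway and N hadoop nodes.
# $1 = first three octets of the lab network (placeholder, e.g. 192.0.2)
# $2 = number of hadoopN entries to generate
gen_hosts() {
    prefix=$1
    count=$2
    domain=bigdatalab.orcl
    printf '127.0.0.1 localhost\n'
    # Hypothetical addressing: gateway gets .1, hadoopN gets .(10+N)
    printf '%s.1 gateway.%s gateway\n' "$prefix" "$domain"
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s.%d hadoop%d.%s hadoop%d\n' "$prefix" "$((10 + i))" "$i" "$domain" "$i"
        i=$((i + 1))
    done
}
```

For example, `gen_hosts 192.0.2 20 > /etc/hosts` would write the full file in one go.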
At this stage the network adapter is configured with DHCP. It was good until now, but it's time to switch it to static mode, which will be used in the cluster. During cloning of the golden image, which is done with a new MAC address, Linux will detect a new network card. That will be the best time to configure the network.
cat <<EOF >/etc/sysconfig/network-scripts/ifcfg-eth0
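For orientation, a static interface file on CentOS 6 usually contains keys like the ones below. This is a sketch, not necessarily the exact file used in this lab: the address and netmask are placeholders, and the default gateway is kept in /etc/sysconfig/network as described earlier.

```shell
#!/bin/sh
# Render a minimal static ifcfg file.
# $1 = device name, $2 = IP address, $3 = netmask (all values are placeholders)
render_ifcfg() {
    cat <<EOF
DEVICE=$1
BOOTPROTO=none
ONBOOT=yes
IPADDR=$2
NETMASK=$3
EOF
}

# Usage (as root):
#   render_ifcfg eth0 192.0.2.10 255.255.255.0 > /etc/sysconfig/network-scripts/ifcfg-eth0
```

Leaving out HWADDR is intentional in this sketch, because the MAC address changes with every clone.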
Finally remove information about currently discovered network card. The card will be rediscovered after clone.
rm -f /etc/udev/rules.d/70-persistent-net.rules
Golden image is configured now. Let's shut it down.
init 0
2. Prepare Golden Image for cloning
Golden image configuration is finished after changing network adapter type.
- go to Settings, Network and change Bridged to "Host-only Adapter" connected to "vboxnet0".
- go to VirtualBox Preferences, Network, Host-only Networks, select "vboxnet0" and press screwdriver icon.
- Enter to IPv4 Address field
- press DHCP Server
- disable Enable Server checkbox
- press OK twice
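The GUI steps above should also be possible with VBoxManage. The sketch below prints the commands instead of running them; the VM name, the host-side address, and the netmask are assumptions, and the exact option names of `dhcpserver` differ between VirtualBox versions, so check `VBoxManage dhcpserver` on your installation.

```shell
#!/bin/sh
# Print (do not run) commands that move a VM to the host-only network
# and switch off the VirtualBox DHCP server on vboxnet0.
# $1 = VM name, $2 = host-side IPv4 address of vboxnet0 (placeholder)
hostonly_commands() {
    vm_name=$1
    host_ip=$2
    cat <<EOF
VBoxManage modifyvm "$vm_name" --nic1 hostonly --hostonlyadapter1 vboxnet0
VBoxManage hostonlyif ipconfig vboxnet0 --ip $host_ip --netmask 255.255.255.0
VBoxManage dhcpserver modify --ifname vboxnet0 --disable
EOF
}
```

Review the output of `hostonly_commands GoldenImage 192.0.2.254` and run the lines one by one while the VM is powered off.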
This is a temporary step which makes it easier to work with multiple virtual machines on a single host. With the above you will be able to connect to the virtual machines from your host machine. You will have connectivity between all boxes connected to "vboxnet0" as well. In a real lab system running on multiple machines you will use the Bridged configuration, making sure that all machines are connected to the same network.
3. Clone golden image to prepare gateway host.
- Use VirtualBox clone operation (right click on golden image)
- enter name "InternetGateway"
- mark checkbox "Reinitialize the MAC..."
- press Continue
- select "Full clone"
- press Clone
4. Configure public network card
- go to InternetGateway's Settings
- navigate to System
- change Base memory to 512 MB
- navigate to Network
- select Adapter 2
- enable the adapter
- attach to Bridged Adapter
- attach to your public network interface
5. Start Internet Gateway
Start machine, and connect from your host computer via ssh
ssh root@
6. Start and prepare InternetGateway machine for configuration
a) To progress you will need to collect following information:
- gateway of your external network,
- DNS server,
- NTP server,
- HTTP PROXY server.
To collect the first two, use the traceroute and nslookup utilities, executing them from your host computer.
nslookup oracle.com
Non-authoritative answer:
Name: oracle.com
traceroute to (, 64 hops max, 52 byte packets
1 ( 991.404 ms 6.355 ms 6.096 ms
To find out the HTTP proxy and NTP server you should use other techniques. Note that NTP is not critical, as we will use the local clock of the gateway machine.
b) configure the discovered addresses by executing the lines below in the ssh session
export GTW=
export NTP=none
export DNS1=
export DNS2=
export PROXY=none
export PROXY_PORT=none
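All later configuration files are generated from these variables, so an empty one silently produces a broken file. A small guard like the following may help; `require_vars` is a helper name I made up, not part of any tool.

```shell
#!/bin/sh
# Return non-zero and name the culprits if any listed variable is unset or empty.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage in the ssh session, after the exports above:
#   require_vars GTW NTP DNS1 DNS2 PROXY PROXY_PORT || echo "fix the exports first"
```

Values set to none pass the check, which matches how this manual marks an unused NTP server or proxy.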
7. Configure IP services
a) configure secondary network adapter in VirtualBox. This interface is connected to external network by using "Bridged" mode.
cat <<EOF >/etc/sysconfig/network-scripts/ifcfg-eth1
b) configure IP gateway
cat <<EOF >/etc/sysconfig/network
c) configure static internal IP address
cat <<EOF >/etc/sysconfig/network-scripts/ifcfg-eth0
d) reboot system
e) after a while connect again with ssh. This time to
ssh root@
f) install packages required on gateway host
yum install bind squid -y
8. Configure other services
a) configure addresses again
export GTW=
export NTP=none
export DNS1=
export DNS2=
export PROXY=none
export PROXY_PORT=none
b) NTP server will use local clock to synchronize time in the cluster.
cat <<EOF > /etc/ntp.conf
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict -6 ::1
# Hosts on local network are less restricted.
restrict mask nomodify notrap
# Use local clock
server # local clock
fudge stratum 10
# Use your NTP source clock
#server $NTP
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
#broadcast autokey # broadcast server
#broadcastclient # broadcast client
#broadcast autokey # multicast server
#multicastclient # multicast client
#manycastserver # manycast server
#manycastclient autokey # manycast client
# Enable public key cryptography.
includefile /etc/ntp/crypto/pw
# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
# Specify the key identifiers which are trusted.
#trustedkey 4 8 42
# Specify the key identifier to use with the ntpdc utility.
#requestkey 8
# Specify the key identifier to use with the ntpq utility.
#controlkey 8
# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
EOF
Start NTP, and configure automatic start during boot.
/etc/init.d/ntpd start
chkconfig ntpd on
b) Configure DNS forwarder
Nodes from local network need to have access to public resources including DNS.
cat <<EOF >/etc/named.conf
options {
listen-on port 53 {;; };
directory "/var/named";
allow-query { localhost;; };
recursion yes;
forwarders { $DNS1; $DNS2; };
forward only;
};
EOF
Start DNS, and configure automatic start during boot.
service named start
chkconfig named on
c) Configure HTTP proxy
Cluster nodes require HTTP connectivity to install Cloudera packages. Squid is configured to store cached data on disk. The concept is taken from http://ma.ttwagner.com/lazy-distro-mirrors-with-squid. Looks and works quite well.
if [ "$PROXY" != "none" ]; then
cat <<EOF >>/etc/squid/squid.conf
cache_peer $PROXY parent $PROXY_PORT 0 no-digest
never_direct allow all
EOF
fi
cat <<EOF >>/etc/squid/squid.conf
# Squid configuration for yum
# source: http://ma.ttwagner.com/lazy-distro-mirrors-with-squid/
maximum_object_size 4096 MB
cache_dir ufs /var/spool/squid 10000 16 256
cache_replacement_policy heap LFUDA
refresh_pattern -i .rpm$ 129600 100% 129600 refresh-ims override-expire
refresh_pattern -i .iso$ 129600 100% 129600 refresh-ims override-expire
EOF
Start Squid, and configure automatic start during boot.
service squid start
chkconfig --levels 235 squid on
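To see whether the cache is actually doing anything, you can count cache hits in the Squid access log. The sketch assumes Squid's default native log format, where the fourth field looks like TCP_MISS/200 or TCP_HIT/200, and the usual log location /var/log/squid/access.log; both are assumptions about your setup.

```shell
#!/bin/sh
# Read Squid access log lines on stdin and report how many requests
# were answered from the cache (any result code containing HIT).
squid_hit_ratio() {
    awk '{ total++; if ($4 ~ /HIT/) hits++ }
         END { if (total) printf "%d/%d requests served from cache\n", hits, total }'
}

# Usage on the gateway:
#   squid_hit_ratio < /var/log/squid/access.log
```

Expect a low ratio on the first node install and a better one on the following nodes, at least for plain HTTP repositories.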
d) set yum proxy
perl -pi -e 's/^proxy=/#proxy=/g' /etc/yum.conf
echo "proxy=" >>/etc/yum.conf
echo "" >>/etc/yum.conf
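For reference, yum expects the proxy setting as a full URL in the form proxy=http://host:port. A minimal sketch of a helper that builds such a line; the host name and port in the usage line are placeholders (3128 is merely Squid's default port), not necessarily the values used in this lab.

```shell
#!/bin/sh
# Build a yum.conf proxy line.
# $1 = proxy host, $2 = proxy port (both are placeholders here)
yum_proxy_line() {
    printf 'proxy=http://%s:%s\n' "$1" "$2"
}

# Usage (as root):
#   yum_proxy_line gateway.bigdatalab.orcl 3128 >> /etc/yum.conf
```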
e) The gateway is ready.
All internal hosts will use this node as a communication gateway with HTTP proxy, NTP source, and DNS server.
Keep gateway up and running.
9. Clone golden image to prepare compute node host.
- Use VirtualBox clone operation (right click on golden image)
- enter name "ComputeNode#1"
- mark checkbox "Reinitialize the MAC..."
- press Continue
- select "Full clone"
- press Clone
Note that you can also clone the machine from the command line:
VBoxManage clonevm "GoldenImage" --name "ComputeNode#1" --register
g) start newly created machine
The newly started machine is running with a name and an IP address that need to be changed.
You can also start the machine from the command line:
VBoxManage startvm "ComputeNode#1"
h) connect via ssh to newly started machine
ssh root@
h) specify hostname and address for this node.
export HOST=hadoop1
export ADDRESS=
i) set static IP address
cp /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0.old
perl -pi -e 's/^IPADDR=.*/IPADDR=$ENV{ADDRESS}/' /etc/sysconfig/network-scripts/ifcfg-eth0
c) set host name
perl -pi -e 's/HOSTNAME=CHANGE/HOSTNAME=$ENV{HOST}/g' /etc/sysconfig/network
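Since the address and host name steps are repeated on every node, they can be wrapped in one function. This is a sketch using sed instead of perl; `set_node_identity` is a made-up name, and the file paths are passed in so that you can try it on copies first.

```shell
#!/bin/sh
# Set host name and static address in copies of the two config files.
# $1 = path to the "network" file, $2 = path to the ifcfg file,
# $3 = short host name, $4 = IP address
set_node_identity() {
    network_file=$1
    ifcfg_file=$2
    host=$3
    address=$4
    # -i.old keeps a backup next to each edited file
    sed -i.old "s/^HOSTNAME=.*/HOSTNAME=$host/" "$network_file"
    sed -i.old "s/^IPADDR=.*/IPADDR=$address/" "$ifcfg_file"
}

# Usage on a freshly cloned node (as root):
#   set_node_identity /etc/sysconfig/network \
#       /etc/sysconfig/network-scripts/ifcfg-eth0 "$HOST" "$ADDRESS"
```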
d) update DNS
cat <<EOF >/etc/resolv.conf
search bigdatalab.orcl
d) set yum proxy
perl -pi -e 's/^proxy=/#proxy=/g' /etc/yum.conf
echo "proxy=" >>/etc/yum.conf
echo "" >>/etc/yum.conf
e) disable fastest mirrors plugin
perl -pi -e 's/^enabled=1/enabled=0/g' /etc/yum/pluginconf.d/fastestmirror.conf
f) set wget proxy
cat <<EOF > ~/.wgetrc
g) set time sync
cat <<EOF > /etc/ntp.conf
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict -6 ::1
# Hosts on local network are less restricted.
#restrict mask nomodify notrap
# Use local clock
#server # local clock
#fudge stratum 10
# Use your NTP source clock
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
#broadcast autokey # broadcast server
#broadcastclient # broadcast client
#broadcast autokey # multicast server
#multicastclient # multicast client
#manycastserver # manycast server
#manycastclient autokey # manycast client
# Enable public key cryptography.
includefile /etc/ntp/crypto/pw
# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
# Specify the key identifiers which are trusted.
#trustedkey 4 8 42
# Specify the key identifier to use with the ntpdc utility.
#requestkey 8
# Specify the key identifier to use with the ntpq utility.
#controlkey 8
# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
EOF
Start NTP, configure automatic start during boot, and verify it works.
service ntpd start
chkconfig ntpd on
ntpq -p
h) Shut down machine
init 0
i) repeat the above steps for the other three machines
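The cloning part of the repetition can be scripted. The sketch below prints one clonevm command per remaining node, using the same VM names as in this manual; the per-node host name and address still have to be set inside each clone as described above.

```shell
#!/bin/sh
# Print clone commands for ComputeNode#<first> .. ComputeNode#<last>.
# $1 = first node number, $2 = last node number
clone_commands() {
    n=$1
    last=$2
    while [ "$n" -le "$last" ]; do
        printf 'VBoxManage clonevm "GoldenImage" --name "ComputeNode#%d" --register\n' "$n"
        n=$((n + 1))
    done
}
```

`clone_commands 2 4` prints the three commands for the remaining nodes; pipe the output to `sh` on the host to execute them.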
10. Prepare management node
The management node needs more memory because it hosts the management services. Compute node #1 will be used as the management node.
- set system memory to 8192 MB
- start node
Note that above may be done from command line:
VBoxManage modifyvm "ComputeNode#1" --memory 8192
VBoxManage startvm "ComputeNode#1"
11. Start all machines
Start all machines manually or use below command line:
VBoxManage startvm "InternetGateway"
VBoxManage startvm "ComputeNode#1"
VBoxManage startvm "ComputeNode#2"
VBoxManage startvm "ComputeNode#3"
VBoxManage startvm "ComputeNode#4"
12. Install Cloudera
Cloudera installation will
a) connect to node#1 - it will be a management console
ssh root@
b) download Cloudera installer
curl -x -O http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
c) execute installer
Execute installer using below command:
chmod +x cloudera-manager-installer.bin
./cloudera-manager-installer.bin
In the installer accept all licenses and proceed through the install screens. After the initial steps, the installer will start downloading binaries.
d) Monitor install progress
Installer writes progress information in /var/log/cloudera-manager-installer
tail -f /var/log/cloudera-manager-installer/*
You may monitor the squid disk cache filling up by executing df -h on the gateway host
ssh gateway "df -h"
e) wait for manager initialization
After the installation is completed, wait a few minutes for the system initialization to finish. You may check the log for readiness information.
cat /var/log/cloudera-scm-server/cloudera-scm-server.log | grep "Started SelectChannelConnector@"
tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
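Instead of watching the log by hand, you can poll it until the readiness line shows up. A minimal sketch; `wait_for_line` is a made-up helper, and the 600 second timeout in the usage line is an arbitrary choice.

```shell
#!/bin/sh
# Wait until a pattern appears in a file, checking once per second.
# $1 = file, $2 = grep pattern, $3 = timeout in seconds
# Returns 0 when the pattern is found, 1 on timeout.
wait_for_line() {
    file=$1
    pattern=$2
    timeout=$3
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if [ -f "$file" ] && grep -q "$pattern" "$file"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Usage on the management node:
#   wait_for_line /var/log/cloudera-scm-server/cloudera-scm-server.log \
#       "Started SelectChannelConnector@" 600 && echo "manager is up"
```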
13. Configure Cloudera
- Connect to with admin/admin
- accept user license
- choose Cloudera Express edition, click Continue twice
- Select nodes to install packages. Put hadoop[1-4] in input box
- All hadoop boxes should be found by installer. Press Continue.
- Select "Use Packages". Press Continue
- select "Install Oracle Java SE Development Kit (JDK)", do not select "Install Java Unlimited Strength Encryption Policy Files", press Continue
- do not select "Single User Mode"
- leave "Login to all hosts as root", enter password for root user twice. Our password is "welcome"
- Press Continue to start installation
Note that a fast Internet connection is mandatory. Each host downloads approx. 2GB of data during installation. The Squid cache reduces traffic a little, but only a little. Unfortunately yum may go to different repos (even with the fastest mirror plugin disabled), and Cloudera repos cannot be cached because they use the HTTPS protocol.
Installation may fail due to network conditions. In case of problems, the installation process is restarted automatically. You may even restart the management host to continue the install process after some time. Thanks to the gateway host, it's possible to change the public network interface (office, mobile). To do so, reconnect the Bridged interface to the proper host interface and reboot the gateway host.
13.1 Error?
Note that after this step you may see error "Cannot have empty version string segment". In such situation:
a) restart scm server on hadoop1
ssh root@
service cloudera-scm-server restart
tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
b) open web page
Authenticate admin/admin to see install screen, and continue installation:
- go to tab "Currently Managed Hosts"
- select all hosts
- Continue
- select Packages
After some additional steps you should see that installer recognized all hadoop[1-4] hosts.
13.2 Inspector
The Inspector should report that all hosts are OK. Press Finish to continue.
13.3 Cluster Setup
Choose Cloudera services which should be deployed into cluster nodes. Select e.g. "Core with Impala" and press Continue
13.4 Customize Role Assignments
Review the service layout proposed by the installer for the cluster. To continue press Continue
13.5 Database Setup
Press "Test Connection", and Continue
13.6 Review configuration changes
You may adjust configuration changes. To continue press Continue
14. Starting your cluster services.
Installer will now deploy configuration files and start services on all nodes. This step takes approx. 10 minutes. Press Finish.
15. Done!
You are done.
Open Cloudera Manager application and look around
From inside the management console you may start Hue to start playing with the cluster. You may also go directly to Hue:
Post Scriptum
The above installation is done on a single computer. To make it really work you should distribute the virtual machines among physical servers. Why use virtual machines in such a situation? They are easier to manage. You may keep virtual machines on your hard disk and put them on bare metal when needed. In the second decade of virtualization, and in the time of Vagrant, Linux Containers, and Docker, my words sound a little naive; however, you may use this poor man's virtualization technique.