I’ve tried setting up Hadoop from scratch a couple of times, but in the end there were too many subtleties for me to handle. I quickly realised that adding components to the Hadoop ecosystem takes time (configuration, compatibility issues). For example, this is a great post on how to install Hadoop on Ubuntu, but clearly there are many detailed steps to go through. Even the single-node setup described on the project page is quite involved. Time to reach for a ready-to-use platform.

Cloudera and Hortonworks are arguably the most popular Hadoop providers today and are frequently compared to each other. Since I’ve tried both, here is my take on their features and why you might prefer one over the other. Please note that this article only looks at the first steps with Hadoop using a sandboxed environment.

Initial Setup

Both Cloudera and Hortonworks offer a sandbox environment completely free of charge. Apart from a downloadable VM that works with your favourite virtualisation software, Cloudera offers a Docker image, whereas Hortonworks gives you the option to run in the cloud (Azure, 30-day trial).

Installation Effort

Very little is demanded of you as a user, which is obviously the purpose of the whole endeavour. I’d advise you to allow for at least 8 GB of RAM for the virtual machines to run reasonably fast. The Hortonworks cloud service trial is especially simple to set up. After the initial registration with Microsoft (the platform runs on Azure), you just choose one of the prebuilt specs. Either way you end up with a working, fully configured platform accompanied by a decent web-based admin interface as well as terminal access via ssh.

Ease of Administration

There isn’t much of a difference in terms of the actual administration. Services can be managed in groups or individually, and adding a new service is very straightforward. What I like about Cloudera Manager is its clear configuration structure. It came in handy when I wanted to upgrade the stack to Java 8, for example. Ambari, on the other hand, brings a number of key performance metrics directly to the dashboard.

Administration Tools

This section could have been called Ambari vs Cloudera Manager. The main difference is that the former is open source whereas the latter is not. Either of them provides a good start in a sandboxed environment. In terms of scaling up, Cloudera Manager offers additional enterprise features (these require a Cloudera Enterprise license):

  • multi-cluster management
  • rolling upgrades
  • extensible integration with (selected) partner services
  • backup and disaster recovery

First Steps

Hortonworks go out of their way to get you started. Topics in the Hadoop learning trail are neatly linked together with a bunch of tutorials providing useful hints and examples. Cloudera’s documentation does not excel when it comes to the details of what’s under the hood of their platform. Nevertheless, their Hadoop tutorial is also very good and gives you an idea of how to leverage both open source and Cloudera’s proprietary technologies.

Challenges

Depending on your needs, changing the default configuration might be a bit complicated. For instance, both platforms ship with Java 7 and both require Oracle JDK. When upgrading to Java 8 you are slightly better off with Hortonworks, as they allow for a Java reset (a list of JDKs to choose from). However, you are still expected to “restart each component, each host and all services”. With Cloudera you are required to install (Oracle) Java yourself.
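
To check which Java release a host actually runs before and after such an upgrade, a small script can parse the output of `java -version`. The following is only a minimal sketch, not something either vendor provides: the function names are made up for illustration, and it assumes `java` is on the PATH.

```python
import re
import subprocess


def java_major(version_output):
    """Return the major Java release found in `java -version` output.

    Releases up to Java 8 report themselves as 1.x (e.g. "1.7.0_80"),
    later ones report the major number first (e.g. "11.0.2").
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` (assumed to be on the PATH) and parse its output."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    # The JVM prints its version banner to stderr rather than stdout.
    return java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java major release:", installed_java_major())
```

Running the script on every host after the upgrade is a quick way to spot a node that was missed.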

Another thing to bear in mind is the complexity of the whole solution. Make sure you familiarise yourself with the web admin interface, as editing a configuration file from the command line might not have the expected effect.

These are just minor comments and I found nothing particularly disappointing when using either of the platforms.

Hadoop Ecosystem

The table below looks at selected Hadoop components and compares the versions that ship with the latest releases of the Cloudera and Hortonworks data platforms. See the release notes of CDH 5.7.0 and HDP 2.4.0 for full details.

[table]
Component (Apache Project),Cloudera CDH 5.7.0,Hortonworks HDP 2.4.0
Accumulo,1.6,1.7.0
Flume,1.6.0,1.5.2
Hadoop,2.6.0,2.7.1
HBase,1.2.0,1.1.2
Hive,1.1.0,1.2.1
Kafka,based on 0.9,0.9.0
Mahout,0.9+,0.9.0+
Oozie,4.1.0,4.2.0
Pig,0.12.0,0.15.0
Solr,4.10.3,5.2.1
Spark,1.6.0,1.6.0
Sqoop,1.4.6,1.4.6
ZooKeeper,3.4.5,3.4.6
[/table]
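
To see at a glance which platform ships the newer release of each component, the table can be fed into a few lines of Python. This is a rough sketch: the version strings are copied from the table above, the helper names are hypothetical, and only the numeric part of each string is compared (so "based on 0.9" and "0.9.0" count as equal).

```python
import re
from collections import Counter
from itertools import zip_longest

# (CDH 5.7.0, HDP 2.4.0) versions, copied from the table above.
VERSIONS = {
    "Accumulo": ("1.6", "1.7.0"),
    "Flume": ("1.6.0", "1.5.2"),
    "Hadoop": ("2.6.0", "2.7.1"),
    "HBase": ("1.2.0", "1.1.2"),
    "Hive": ("1.1.0", "1.2.1"),
    "Kafka": ("based on 0.9", "0.9.0"),
    "Mahout": ("0.9+", "0.9.0+"),
    "Oozie": ("4.1.0", "4.2.0"),
    "Pig": ("0.12.0", "0.15.0"),
    "Solr": ("4.10.3", "5.2.1"),
    "Spark": ("1.6.0", "1.6.0"),
    "Sqoop": ("1.4.6", "1.4.6"),
    "ZooKeeper": ("3.4.5", "3.4.6"),
}


def parse(version):
    """Extract the first dotted number, e.g. 'based on 0.9' -> (0, 9)."""
    match = re.search(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split("."))


def compare(a, b):
    """Return -1, 0 or 1, treating missing trailing parts as zero."""
    for x, y in zip_longest(parse(a), parse(b), fillvalue=0):
        if x != y:
            return -1 if x < y else 1
    return 0


def newer_on(component):
    """Name the platform shipping the newer version, or 'same'."""
    cdh, hdp = VERSIONS[component]
    result = compare(cdh, hdp)
    if result == 0:
        return "same"
    return "CDH" if result > 0 else "HDP"


def summary():
    """Count how many components are newer on each platform."""
    return dict(Counter(newer_on(name) for name in VERSIONS))


if __name__ == "__main__":
    for name in VERSIONS:
        print(f"{name:10} {newer_on(name)}")
    print(summary())
```

Bear in mind that this compares version numbers only. A higher number does not necessarily mean a more capable build, since a vendor may backport patches into an older upstream release.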

Conclusion

Both platforms provide an excellent start with Hadoop, but they part ways in many regards: Tez + Stinger vs Impala, ORC vs Parquet, Ranger vs Sentry, etc.

Choosing one over the other will largely depend on your project needs, deployment considerations, client requirements and other prerequisites and limitations.

 

 


Tomas Zezula
