The following reasons explain the failure risks associated with Hadoop projects:
Hive: Hive is the data warehouse SQL layer of Hadoop. While it is the most accessible entry point to Hadoop, it still comes with its own complexities. For example, HiveQL is not (yet) fully compatible with SQL standards, and users often have a hard time working around the missing pieces. Advanced Hive usage typically builds on a solid understanding of Hadoop storage formats such as Parquet or ORC, and also requires the ability to program User Defined Functions (UDFs) in Java.
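Java is not the only way to extend Hive: its TRANSFORM clause can stream rows through an external script over stdin/stdout, tab-separated. A minimal sketch of such a Python script (the cleaning rule and script name are illustrative assumptions, not a fixed Hive convention):

```python
import sys


def clean(value: str) -> str:
    """Normalize one field: trim whitespace and lowercase (hypothetical rule)."""
    return value.strip().lower()


def process_line(line: str) -> str:
    # Hive's TRANSFORM feeds each row as tab-separated columns on stdin.
    cols = line.rstrip("\n").split("\t")
    return "\t".join(clean(c) for c in cols)


def main() -> None:
    # Rows written to stdout (again tab-separated) flow back into Hive.
    for line in sys.stdin:
        print(process_line(line))
```

From HiveQL this would be invoked roughly as `SELECT TRANSFORM(col1, col2) USING 'python clean.py' AS (col1, col2) FROM some_table;`, which sidesteps Java at the cost of yet another moving part to maintain.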
Pig: Pig allows distributed execution of program scripts on the MapReduce framework. However, the Pig Latin scripting language comes with a custom syntax that takes time to master. As your Pig development advances, you will also need to understand storage formats and write UDFs.
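Pig UDFs can also be written in Python and executed under Jython. A minimal sketch (the schema string and the normalization rule are illustrative; the `ImportError` fallback lets the function be tested outside of Pig):

```python
try:
    # pig_util ships with Pig and is importable when running under Jython.
    from pig_util import outputSchema
except ImportError:
    # Fallback no-op decorator so the function also works outside of Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("normalized:chararray")
def normalize(value):
    """Trim and lowercase a chararray field; Pig passes None for nulls."""
    if value is None:
        return None
    return value.strip().lower()
```

In a Pig script this would be registered with something like `REGISTER 'udfs.py' USING jython AS udfs;` and then called as `udfs.normalize(name)` inside a `FOREACH ... GENERATE`.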
MapReduce: Although Hive and Pig abstract and simplify MapReduce, Hadoop's original execution framework, you sometimes still need to implement your analytics directly in MapReduce, which presumes a solid grasp of parallel programming. Though you can use many programming languages to write map and reduce functions, effectively translating your analytics into those functions is challenging and requires a lot of experience. You will need to dive deep into data storage formats, compression, serialization, and lots of low-level tricks to achieve your analytics goals.
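To make the map and reduce roles concrete, here is a word-count sketch in plain Python, structured the way Hadoop Streaming expects (a mapper emitting key-value pairs, then aggregation per key); the shuffle that Hadoop performs between the two phases is simulated locally:

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    # Map phase: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> dict:
    # Shuffle + reduce phase: group pairs by key and sum the values.
    # On a real cluster, Hadoop does the grouping across machines.
    counts: dict = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Local simulation on a two-line "dataset":
lines = ["the quick brown fox", "the lazy dog"]
pairs = (pair for line in lines for pair in map_words(line))
result = reduce_counts(pairs)
# result["the"] == 2
```

The logic itself is trivial; the difficulty the text describes lies in everything around it: splitting input, serializing intermediate pairs, compressing spills, and tuning the shuffle.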
Mahout: Machine learning is widely considered the holy grail of Big Data analytics, so the promise of using a simple machine learning library like Mahout on top of MapReduce is very compelling. However, machine learning algorithms tend to be iterative and are not well suited to MapReduce (which is why many well-adopted algorithms are not yet available for Hadoop). Furthermore, to use Mahout algorithms, you need to perform manual data format conversions.
Spark: Apache Spark is the new kid on the block, gaining traction for good reasons – optimized memory and disk usage, support for iterative computations, a clean programming API, and a large community – all of which make it the natural successor to MapReduce. However, Spark requires proficiency in Python, Java, or, even better, Scala. Additionally, you need a good understanding of parallel programming to avoid writing programs with performance bottlenecks. Because Spark is developing so quickly, it is hard to keep track of all the changes.
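To show how compact Spark's API makes the same word count, here is a sketch against a toy in-memory stand-in that mimics a small slice of the RDD interface (`LocalRDD` is not part of Spark – with a real installation you would start from `SparkContext.textFile()` instead, but the chained pipeline in `word_count` is the same shape PySpark code takes):

```python
from collections import defaultdict


class LocalRDD:
    """Toy in-memory stand-in for a slice of Spark's RDD API,
    so the pipeline below can run without a cluster."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, f):
        return LocalRDD(x for item in self.items for x in f(item))

    def map(self, f):
        return LocalRDD(f(x) for x in self.items)

    def reduceByKey(self, f):
        merged = defaultdict(list)
        for key, value in self.items:
            merged[key].append(value)
        reduced = {}
        for key, values in merged.items():
            acc = values[0]
            for v in values[1:]:
                acc = f(acc, v)
            reduced[key] = acc
        return LocalRDD(reduced.items())


def word_count(rdd):
    # The whole MapReduce word count collapses into three chained calls.
    return (rdd
            .flatMap(lambda line: line.split())
            .map(lambda word: (word.lower(), 1))
            # Preferring reduceByKey (which pre-aggregates before shuffling)
            # over groupByKey is exactly the kind of parallel-programming
            # detail that separates fast Spark jobs from bottlenecked ones.
            .reduceByKey(lambda a, b: a + b))


counts = dict(word_count(LocalRDD(["the quick brown fox", "the lazy dog"])).items)
```

The brevity is real, but so is the caveat in the text: each of those chained calls hides distribution decisions that you must understand to get acceptable performance.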
Each of the aforementioned packages, as well as other Hadoop technologies, requires detailed knowledge to use. The problem is then amplified by the need to manually integrate these technologies: passing data from one component to another regularly involves transforming that data and converting it to the correct format.
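A trivial but representative piece of that glue code: converting Hive's default text-table output (Ctrl-A-delimited fields, `\N` for nulls – typical defaults, though both are configurable) into the CSV form another tool expects:

```python
import csv
import io

HIVE_DELIM = "\x01"   # Hive's default field delimiter (Ctrl-A)
HIVE_NULL = r"\N"     # Hive's default null marker


def hive_to_csv(hive_lines, out_stream):
    """Convert Hive text-table rows to CSV, mapping nulls to empty fields."""
    writer = csv.writer(out_stream)
    for line in hive_lines:
        fields = line.rstrip("\n").split(HIVE_DELIM)
        writer.writerow("" if f == HIVE_NULL else f for f in fields)


buf = io.StringIO()
hive_to_csv(["alice\x0130", "bob\x01\\N"], buf)
# buf now holds two CSV rows: "alice,30" and "bob,"
```

Every such conversion is easy in isolation; the cost comes from writing, testing, and maintaining dozens of them across a pipeline.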
As a consequence, mastering analytics on Hadoop through programming not only requires a very broad yet specialized skill set, but also means repeatedly solving tasks that are dictated by the technology rather than by the actual analytics initiative. For both reasons, developing analytics on Hadoop is a complex and costly endeavor, typically resulting in misalignment between stakeholders and unnecessarily increasing the risk that your Big Data project will ultimately fail.