ClickHouse/docs/en/faq/general/mapreduce.md
2022-03-12 00:24:31 -06:00

1.7 KiB
Raw Blame History

title toc_hidden toc_priority
Why not use something like MapReduce? true 110

Why Not Use Something Like MapReduce?

We can refer to systems like MapReduce as distributed computing systems in which the reduce operation is based on distributed sorting. The most common open-source solution in this class is Apache Hadoop. Large IT companies often have proprietary in-house solutions.

These systems arent appropriate for online queries due to their high latency. In other words, they cant be used as the back-end for a web interface. These types of systems arent useful for real-time data updates. Distributed sorting isnt the best way to perform reduce operations if the result of the operation and all the intermediate results (if there are any) are located in the RAM of a single server, which is usually the case for online queries. In such a case, a hash table is an optimal way to perform reduce operations. A common approach to optimizing map-reduce tasks is pre-aggregation (partial reduce) using a hash table in RAM. The user performs this optimization manually. Distributed sorting is one of the main causes of reduced performance when running simple map-reduce tasks.

Most MapReduce implementations allow you to execute arbitrary code on a cluster. But a declarative query language is better suited to OLAP to run experiments quickly. For example, Hadoop has Hive and Pig. Also consider Cloudera Impala or Shark (outdated) for Spark, as well as Spark SQL, Presto, and Apache Drill. Performance when running such tasks is highly sub-optimal compared to specialized systems, but relatively high latency makes it unrealistic to use these systems as the backend for a web interface.