新手区最新随笔(rss)

What is shade jar, and what is its purpose.

     摘要: https://stackoverflow.com/questions/13620281/what-is-the-maven-shade-plugin-used-for-and-why-would-you-want-to-relocate-javaUber JAR, in short, is a JAR containing everything. Normally in Maven, we r...  阅读全文

2017-06-19 11:23 作者: wythern【评论:0】【阅读:6】 

[Collection] Spark partition related things.

Partition:
Understanding:
1. https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
2. http://dev.sortable.com/spark-repartition/ -- example of partition & repartition to avoid data-imbalance.
3. https://acadgild.com/blog/partitioning-in-spark/ -- real case on existing partitioner & self-created partitioner.

Programming guidence.
Avoid using GroupByKey https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

Reference 1 says: Applying transformations that return RDDs with specific partitioners. Some operation on RDDs that hold to and propagate a partitioner are-
  • Join
  • LeftOuterJoin
  • RightOuterJoin
  • groupByKey
  • reduceByKey
  • foldByKey
  • sort
  • partitionBy
  • foldByKey
groupByKey is one of them, My understanding is such operations may cause extra shuffle, but repartition also helps relieve data imbalance if well considered, so use head please! :)

2017-05-18 14:29 作者: wythern【评论:0】【阅读:10】 

技 术 改 变 世 界

网站分类

统计信息

聚合

Blog客户端API

推荐客户端

博客排行榜[前59人]