X presents Y for a better Z

[Collection] Spark partitioning resources.

1. https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
2. http://dev.sortable.com/spark-repartition/ -- example of using partition and repartition to avoid data imbalance.
3. https://acadgild.com/blog/partitioning-in-spark/ -- real case using an existing partitioner and a custom partitioner.

Programming guidance.
Avoid using groupByKey; prefer reduceByKey: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
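
To see why, here is a rough plain-Python model of the difference (not Spark code; the function names and sample data are made up for illustration). It only counts how many records would cross the shuffle, assuming groupByKey ships every pair as-is while reduceByKey combines values per key inside each partition first. In PySpark the two styles would correspond to rdd.groupByKey().mapValues(sum) and rdd.reduceByKey(lambda a, b: a + b).

```python
def shuffle_records_group_by_key(partitions):
    """groupByKey style: every (key, value) pair crosses the shuffle unchanged."""
    return sum(len(part) for part in partitions)


def shuffle_records_reduce_by_key(partitions, combine):
    """reduceByKey style: values are combined per key inside each partition,
    so at most one record per key per partition crosses the shuffle."""
    shuffled = 0
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        shuffled += len(local)
    return shuffled


# Hypothetical word-count input, already split into two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

grouped = shuffle_records_group_by_key(partitions)
reduced = shuffle_records_reduce_by_key(partitions, lambda x, y: x + y)
print(grouped, reduced)
```

With only two distinct keys the gap is small, but it grows with the number of repeated keys per partition.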

Reference 1 says that some transformations return RDDs with a specific partitioner. The RDD operations that hold and propagate a partitioner are:
  • Join
  • LeftOuterJoin
  • RightOuterJoin
  • groupByKey
  • reduceByKey
  • foldByKey
  • sort
  • partitionBy
groupByKey is one of them. My understanding is that such operations may cause an extra shuffle, but a well-considered repartition also helps relieve data imbalance, so use your head, please! :)
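
The trade-off can be sketched in plain Python (again a simplified model, not Spark code; the helper names and data are hypothetical). Key-based placement stands in for a hash partitioner, using integer keys so the hash is just the key itself, and round-robin placement stands in for a key-agnostic redistribution such as repartition.

```python
def sizes_by_key(records, num_partitions):
    """Place each record by its key: all records sharing a key land together."""
    sizes = [0] * num_partitions
    for key, _value in records:
        sizes[key % num_partitions] += 1  # integer keys only, stand-in for hashing
    return sizes


def sizes_round_robin(records, num_partitions):
    """Place records by position, ignoring keys: sizes come out even."""
    sizes = [0] * num_partitions
    for index, _record in enumerate(records):
        sizes[index % num_partitions] += 1
    return sizes


# Hypothetical skewed data set: key 0 is "hot" and appears six times.
records = [(0, "x")] * 6 + [(1, "x"), (2, "x"), (3, "x")]

skewed = sizes_by_key(records, 3)
balanced = sizes_round_robin(records, 3)
print(skewed, balanced)
```

The even layout no longer keeps equal keys together, so a later by-key operation has to shuffle again. That is the extra cost to weigh against the better balance.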

posted on 2017-05-18 14:29 wythern