2. http://dev.sortable.com/spark-repartition/ -- example of partition & repartition to avoid data-imbalance.
3. https://acadgild.com/blog/partitioning-in-spark/ -- real case on existing partitioner & self-created partitioner.
Avoid using GroupByKey https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Reference 1 says: Applying transformations that return RDDs with specific partitioners. Some operation on RDDs that hold to and propagate a partitioner are-
groupByKey is one of them, My understanding is such operations may cause extra shuffle, but repartition also helps relieve data imbalance if well considered, so use head please! :)