Map slices of RDD/DataFrame based on column value in PySpark
I have the DataFrame below:
timestamp    key  value
2016-06-29   a    88
2016-06-28   a    89
2016-06-27   a    90
2016-06-29   b    78
2016-06-28   b    79
2016-06-27   b    80
2016-06-29   c    98
2016-06-27   c    99
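For reference, a minimal sketch to recreate this example DataFrame (assuming an existing SparkSession named spark; the names here are just placeholders):

    df = spark.createDataFrame(
        [("2016-06-29", "a", 88), ("2016-06-28", "a", 89), ("2016-06-27", "a", 90),
         ("2016-06-29", "b", 78), ("2016-06-28", "b", 79), ("2016-06-27", "b", 80),
         ("2016-06-29", "c", 98), ("2016-06-27", "c", 99)],
        ["timestamp", "key", "value"])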
The goal is to convert this into an RDD of (key, pandas.Series) tuples in an efficient way. The desired result:
(a, pandas.Series)
(b, pandas.Series)
(c, pandas.Series)
So I want to operate on each slice of the DataFrame that shares the same key and output a tuple of (key, pandas.Series) for every slice.
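A rough sketch of the kind of transformation I have in mind (I am not sure this is the efficient way, which is exactly what I am asking; df is the DataFrame above):

    import pandas as pd

    def to_series(rows):
        # rows is an iterable of (timestamp, value) pairs for one key
        rows = sorted(rows)  # order by timestamp
        return pd.Series([v for _, v in rows], index=[t for t, _ in rows])

    result = (df.rdd
                .map(lambda r: (r.key, (r.timestamp, r.value)))
                .groupByKey()           # gather all rows that share a key
                .mapValues(to_series))  # RDD of (key, pandas.Series)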
Things I have tried / thoughts:
- spark-ts seems ideal for this use case, but the Python version seems to be broken.
- I tried window functions, but they are not suitable for such cases.
- Instead of reading the rows in bulk, I read them one key at a time, convert each slice to a pandas.Series, and repeat (see the sketch after this list). This is too slow and not viable.
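The per-key approach from the last point looks roughly like this (placeholder names, shown only to illustrate why it is slow: it runs one Spark job per key):

    import pandas as pd

    keys = [r.key for r in df.select("key").distinct().collect()]
    result = []
    for k in keys:
        rows = df.filter(df.key == k).orderBy("timestamp").collect()
        result.append((k, pd.Series([r.value for r in rows],
                                    index=[r.timestamp for r in rows])))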
Any ideas/suggestions on how to achieve this in a fast and efficient manner?