Map slices of RDD/DataFrame based on column value in PySpark
I have the DataFrame below:

    timestamp   key  value
    2016-06-29  a    88
    2016-06-28  a    89
    2016-06-27  a    90
    2016-06-29  b    78
    2016-06-28  b    79
    2016-06-27  b    80
    2016-06-29  c    98
    2016-06-27  c    99

The goal is to convert this into an RDD of pandas.Series in an efficient way. The result I want is:

    (a, pandas.Series)
    (b, pandas.Series)
    (c, pandas.Series)

In other words, I want to operate on each slice of the DataFrame that shares the same key, and get a tuple of (key, pandas.Series) for every slice.
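To make the target concrete, here is a rough sketch of that output being built with groupByKey (assuming the DataFrame is bound to df with the columns shown above); whether something like this is actually the fast/efficient way is exactly what I'm unsure about:

    import pandas as pd

    # Sketch: pair each row by key, group, and build one pandas.Series per
    # key, indexed by timestamp. groupByKey shuffles every value for a key
    # to a single place, so this may well not be the efficient route.
    pairs = df.rdd.map(lambda row: (row['key'], (row['timestamp'], row['value'])))

    def to_series(rows):
        rows = sorted(rows)                 # order the slice by timestamp
        timestamps, values = zip(*rows)
        return pd.Series(values, index=list(timestamps))

    result = pairs.groupByKey().mapValues(to_series)
    # result is an RDD of (key, pandas.Series) tuples, e.g. ('a', <Series>)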
Things I've tried / thoughts:

- spark-ts seems ideal for this use case, but the Python version appears to be broken.
- I tried a window function, but it isn't suitable for cases like this.
- Instead of reading everything in bulk, I read one key at a time, convert that slice to a pandas.Series, and repeat (roughly as sketched below). This is too slow to be viable.
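What that per-key variant looks like, as a sketch (again assuming the DataFrame is bound to df):

    import pandas as pd

    # Slow variant: filter the DataFrame once per key and collect each
    # slice to the driver as a pandas.Series -- one Spark job per key.
    keys = [r['key'] for r in df.select('key').distinct().collect()]

    series_by_key = []
    for k in keys:
        slice_pdf = df.filter(df['key'] == k).toPandas()
        s = pd.Series(slice_pdf['value'].values, index=slice_pdf['timestamp'])
        series_by_key.append((k, s))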
 
Any ideas/suggestions on how to achieve this in a fast and efficient manner?
 
 
  