apache spark - Map slices of RDD/Dataframe based on column value in PySpark


I have the dataframe below:

timestamp   key  value
2016-06-29  a    88
2016-06-28  a    89
2016-06-27  a    90
2016-06-29  b    78
2016-06-28  b    79
2016-06-27  b    80
2016-06-29  c    98
2016-06-27  c    99

The goal is to convert it into an RDD of pandas.Series in an efficient way. I want this result:

(a, pandas.Series)
(b, pandas.Series)
(c, pandas.Series)

So I want to operate on each slice of the dataframe that has the same key, and produce a tuple of (key, pandas.Series) for every slice.
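To make the setup concrete, here is a minimal sketch of the dataframe above and the shape of the output I am after (the SparkSession variable name `spark` is an assumption):

    import pandas as pd

    # Assumed setup: an existing SparkSession named `spark`; columns follow the table above.
    df = spark.createDataFrame(
        [("2016-06-29", "a", 88), ("2016-06-28", "a", 89), ("2016-06-27", "a", 90),
         ("2016-06-29", "b", 78), ("2016-06-28", "b", 79), ("2016-06-27", "b", 80),
         ("2016-06-29", "c", 98), ("2016-06-27", "c", 99)],
        ["timestamp", "key", "value"])

    # Desired result: an RDD whose elements look like
    #   ("a", pd.Series([88, 89, 90], index=["2016-06-29", "2016-06-28", "2016-06-27"]))
    # with one element per distinct key.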

Things I have tried / thoughts:

  1. spark-ts seems ideal for this use case, but its Python version appears to be broken.
  2. I tried a window function, but it is not suitable for cases like this.
  3. Instead of reading the data in bulk, read the rows for one key, convert them to a pandas.Series, and repeat for every key (see the sketch after this list). This is too slow to be viable.
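For reference, this is roughly what the per-key approach in (3) looks like, using the `df` sketched above; collecting every key's slice separately launches one job per key, which is why it is slow:

    import pandas as pd

    keys = [row["key"] for row in df.select("key").distinct().collect()]
    result = []
    for k in keys:
        # One filter + toPandas round trip per key.
        pdf = df.filter(df["key"] == k).orderBy("timestamp").toPandas()
        s = pd.Series(pdf["value"].values, index=pd.to_datetime(pdf["timestamp"]))
        result.append((k, s))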

Any ideas/suggestions on how to achieve this in a fast and efficient manner?

