apache spark - Map slices of RDD/Dataframe based on column value in PySpark


I have the dataframe below:

timestamp   key  value
2016-06-29  a    88
2016-06-28  a    89
2016-06-27  a    90
2016-06-29  b    78
2016-06-28  b    79
2016-06-27  b    80
2016-06-29  c    98
2016-06-27  c    99

The goal is to convert it into an RDD of pandas.Series in an efficient way. I want this result:

(a, pandas.Series)
(b, pandas.Series)
(c, pandas.Series)

So I want to operate on each slice of the dataframe that has the same key, and produce a tuple of (key, pandas.Series) for every slice.
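To make the setup concrete, here is a minimal sketch of the dataframe above and the shape of the output I am after (the SparkSession variable name `spark` is an assumption):

    import pandas as pd

    # Assumed setup: an existing SparkSession named `spark`; columns follow the table above.
    df = spark.createDataFrame(
        [("2016-06-29", "a", 88), ("2016-06-28", "a", 89), ("2016-06-27", "a", 90),
         ("2016-06-29", "b", 78), ("2016-06-28", "b", 79), ("2016-06-27", "b", 80),
         ("2016-06-29", "c", 98), ("2016-06-27", "c", 99)],
        ["timestamp", "key", "value"])

    # Desired result: an RDD whose elements look like
    #   ("a", pd.Series([88, 89, 90], index=["2016-06-29", "2016-06-28", "2016-06-27"]))
    # with one element per distinct key.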

Things I have tried / thoughts:

  1. spark-ts seems ideal for this use case, but its Python version appears to be broken.
  2. I tried a window function, but it is not suitable for cases like this.
  3. Instead of reading the data in bulk, read the rows for one key, convert them to a pandas.Series, and repeat for every key (see the sketch after this list). This is too slow to be viable.
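For reference, this is roughly what the per-key approach in (3) looks like, using the `df` sketched above; collecting every key's slice separately launches one job per key, which is why it is slow:

    import pandas as pd

    keys = [row["key"] for row in df.select("key").distinct().collect()]
    result = []
    for k in keys:
        # One filter + toPandas round trip per key.
        pdf = df.filter(df["key"] == k).orderBy("timestamp").toPandas()
        s = pd.Series(pdf["value"].values, index=pd.to_datetime(pdf["timestamp"]))
        result.append((k, s))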

Any ideas/suggestions on how to achieve this in a fast and efficient manner?

