r - Counting rows based upon conditional grouping with dplyr -


i have dataframe follows:

        position_time telematic_trip_no  lat_dec lon_dec 1 2016-06-05 00:00:01         526132109 -26.6641 27.8733 2 2016-06-05 00:00:01         526028387 -26.6402 27.8059 3 2016-06-05 00:00:01         526081476 -26.5545 28.3263 4 2016-06-05 00:00:04         526140512 -26.5310 27.8704 5 2016-06-05 00:00:05         526140518 -26.5310 27.8704 6 2016-06-05 00:00:19         526006880 -26.5010 27.8490    is_stolen hour_of_day time_of_day day_of_week  lat_min 1         0           0           0      sunday -26.6651 2         0           0           0      sunday -26.6412 3         0           0           0      sunday -26.5555 4         0           0           0      sunday -26.5320 5         0           0           0      sunday -26.5320 6         0           0           0      sunday -26.5020    lat_max lon_max lon_min  1 -26.6631 27.8743 27.8723      2 -26.6392 27.8069 27.8049     3 -26.5535 28.3273 28.3253     4 -26.5300 27.8714 27.8694       5 -26.5300 27.8714 27.8694       6 -26.5000 27.8500 27.8480      

now want count each line is_stolen = 1, number of rows in dataframe fulfill following conditions:

  • the lat_dec , lon_dec between lat_max, lat_min, lon_max , lon_min (i.e. fit within 'box' around gps point)
  • the time_of_day , day_of_week same of row of interest
  • the telematic_trip_no of rows need different of row of interest
  • and is_stolen tag of matching rows needs equal 0

i've written script using for loop ran very , got me thinking if there's efficient way complex row counts many conditions using dplyr or data.table?

ps if you're curious indeed trying calculate how many cars stolen-car passes during typical trip :)

given description of problem, following should work

library(dplyr) library(stats) # df data.frame (see below) df <- cbind(id=seq_len(nrow(df)),df) r.stolen <- which(df$is_stolen == 1) r.not <- which(df$is_stolen != 1) print(df[rep(r.not, times=length(r.stolen)),] %>%   setnames(.,paste0(names(.),"_not")) %>%     bind_cols(df[rep(r.stolen, each=length(r.not)),], .) %>%        mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not & time_of_day == time_of_day_not & day_of_week == day_of_week_not & lat_dec >= lat_min_not & lat_dec <= lat_max_not & lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%         group_by(id) %>%           summarise(count = sum(in_range)) %>%              arrange(desc(count))) 

the first line adds column named id df identifies row row number can later dplyr::group_by make count.

the next 2 lines divides rows stolen , not-stolen cars. key to:

  1. replicate each row of stolen cars n times n number of not-stolen car rows,
  2. replicate rows of not-stolen cars (as block) m times m number of stolen car rows, and
  3. append result of (2) (1) new columns , change names of these new columns can reference them in condition

the result of (3) have rows enumerates pairs of stolen , not-stolen rows original data frame condition can applied in array fashion. dplyr piped r workflow fourth line of code (wrapped in print()) this:

  • the first command replicates not-stolen car rows using times
  • the second command appends _not column names distinguish them stolen car columns when bind columns. this answer gem.
  • the third command replicates stolen car rows using each , appends previous result new columns using dplyr::bind_cols
  • the fourth command uses dplyr::mutate create new column named in_range result of applying condition. boolean result converted {0,1} allow easy accumulation
  • the rest of commands in pipe counting of in_range grouped id , arranging results in decreasing order of count. note id column identifies rows of original data frame is_stolen = 1 whereas id_not column rows is_stolen = 0

this assumes want count each row is_stolen = 1 in original data frame, said in question. if instead want count each telematic_trip_no stolen, can use

group_by(telematic_trip_no) %>% 

in pipe instead.

i've tested using following data snippet

df <- structure(list(position_time = structure(c(1l, 1l, 1l, 2l, 3l,                  4l, 4l, 5l, 6l, 7l, 8l, 9l, 10l), .label = c("2016-06-05 00:00:01",                  "2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19",                  "2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23",                  "2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"                 ), class = "factor"), telematic_trip_no = c(526132109l, 526028387l,                  526081476l, 526140512l, 526140518l, 526006880l, 526017880l, 526027880l,                  526006880l, 526006890l, 526106880l, 526005880l, 526007880l),                  lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531,                  -26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007,                  -26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263,                  27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493,                  27.87, 27.87), is_stolen = c(0l, 0l, 0l, 0l, 0l, 0l, 1l,                  1l, 1l, 1l, 1l, 1l, 1l), hour_of_day = c(0l, 0l, 0l, 0l,                  0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l), time_of_day = c(0l,                  0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 9l, 0l), day_of_week = structure(c(2l,                  2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 1l), .label = c("monday",                  "sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412,                  -26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502,                  -26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631,                  -26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53,                  -26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743,                  27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714,                  27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723,                  27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694,                  27.848, 27.8694, 27.848, 27.8694, 27.8694)), .names = c("position_time",                  "telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day",                  "time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max",                  "lon_min"), class = "data.frame", row.names = c(na, -13l)) 

here, appended 7 new rows is_stolen = 1 original 6 rows is_stolen = 0:

  1. the first added row telematic_trip_no = 526005880 violates longitude condition not-stolen rows, count should 0
  2. the second added row telematic_trip_no = 526006880 violates latitude condition not-stolen rows, count should 0
  3. the third added row telematic_trip_no = 526007880 violates telematic_trip_no condition not-stolen rows, count should 0
  4. the fourth added row telematic_trip_no = 526006890 satisfies condition rows 4 , 5 not-stolen, count should 2
  5. the fifth added row telematic_trip_no = 526106880 satisfies condition row 6 not-stolen, count should 1
  6. the sixth added row telematic_trip_no = 526017880 violates time_of_day condition not-stolen rows, count should 0
  7. the seventh added row telematic_trip_no = 526027880 violates day_of_week condition not-stolen rows, count should 0

running code on data gives:

# tibble: 7 x 2      id count   <int> <dbl> 1    10     2 2    11     1 3     7     0 4     8     0 5     9     0 6    12     0 7    13     0 

which expected recalling appended rows is_stolen = 1 starts @ row 7 id = 7.

if 1 group telematic_trip_no instead, result:

# tibble: 7 x 2   telematic_trip_no count               <int> <dbl> 1         526006890     2 2         526106880     1 3         526005880     0 4         526006880     0 5         526007880     0 6         526017880     0 7         526027880     0 

as caveat, above approach cost memory. worst case number of rows grows n^2/4 n number of rows in original data frame, , number of columns doubles data frame used evaluate condition. array processing techniques, there trade between speed , memory.

hope helps.


Comments

Popular posts from this blog

Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12:test (default-test) on project.Error occurred in starting fork -

windows - Debug iNetMgr.exe unhandle exception System.Management.Automation.CmdletInvocationException -

configurationsection - activeMq-5.13.3 setup configurations for wildfly 10.0.0 -