r - Counting rows based upon conditional grouping with dplyr -
i have dataframe follows:
position_time telematic_trip_no lat_dec lon_dec 1 2016-06-05 00:00:01 526132109 -26.6641 27.8733 2 2016-06-05 00:00:01 526028387 -26.6402 27.8059 3 2016-06-05 00:00:01 526081476 -26.5545 28.3263 4 2016-06-05 00:00:04 526140512 -26.5310 27.8704 5 2016-06-05 00:00:05 526140518 -26.5310 27.8704 6 2016-06-05 00:00:19 526006880 -26.5010 27.8490 is_stolen hour_of_day time_of_day day_of_week lat_min 1 0 0 0 sunday -26.6651 2 0 0 0 sunday -26.6412 3 0 0 0 sunday -26.5555 4 0 0 0 sunday -26.5320 5 0 0 0 sunday -26.5320 6 0 0 0 sunday -26.5020 lat_max lon_max lon_min 1 -26.6631 27.8743 27.8723 2 -26.6392 27.8069 27.8049 3 -26.5535 28.3273 28.3253 4 -26.5300 27.8714 27.8694 5 -26.5300 27.8714 27.8694 6 -26.5000 27.8500 27.8480
now want count each line is_stolen = 1, number of rows in dataframe fulfill following conditions:
- the lat_dec , lon_dec between lat_max, lat_min, lon_max , lon_min (i.e. fit within 'box' around gps point)
- the time_of_day , day_of_week same of row of interest
- the telematic_trip_no of rows need different of row of interest
- and is_stolen tag of matching rows needs equal 0
i've written script using for loop ran very , got me thinking if there's efficient way complex row counts many conditions using dplyr or data.table?
ps if you're curious indeed trying calculate how many cars stolen-car passes during typical trip :)
given description of problem, following should work
library(dplyr) library(stats) # df data.frame (see below) df <- cbind(id=seq_len(nrow(df)),df) r.stolen <- which(df$is_stolen == 1) r.not <- which(df$is_stolen != 1) print(df[rep(r.not, times=length(r.stolen)),] %>% setnames(.,paste0(names(.),"_not")) %>% bind_cols(df[rep(r.stolen, each=length(r.not)),], .) %>% mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not & time_of_day == time_of_day_not & day_of_week == day_of_week_not & lat_dec >= lat_min_not & lat_dec <= lat_max_not & lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>% group_by(id) %>% summarise(count = sum(in_range)) %>% arrange(desc(count)))
the first line adds column named id
df
identifies row row number can later dplyr::group_by
make count.
the next 2 lines divides rows stolen , not-stolen cars. key to:
- replicate each row of stolen cars
n
timesn
number of not-stolen car rows, - replicate rows of not-stolen cars (as block)
m
timesm
number of stolen car rows, and - append result of (2) (1) new columns , change names of these new columns can reference them in condition
the result of (3) have rows enumerates pairs of stolen , not-stolen rows original data frame condition can applied in array fashion. dplyr
piped r workflow fourth line of code (wrapped in print()
) this:
- the first command replicates not-stolen car rows using
times
- the second command appends
_not
column names distinguish them stolen car columns when bind columns. this answer gem. - the third command replicates stolen car rows using
each
, appends previous result new columns usingdplyr::bind_cols
- the fourth command uses
dplyr::mutate
create new column namedin_range
result of applying condition. boolean result converted{0,1}
allow easy accumulation - the rest of commands in pipe counting of
in_range
groupedid
, arranging results in decreasing order of count. noteid
column identifies rows of original data frameis_stolen = 1
whereasid_not
column rowsis_stolen = 0
this assumes want count each row is_stolen = 1
in original data frame, said in question. if instead want count each telematic_trip_no
stolen, can use
group_by(telematic_trip_no) %>%
in pipe instead.
i've tested using following data snippet
df <- structure(list(position_time = structure(c(1l, 1l, 1l, 2l, 3l, 4l, 4l, 5l, 6l, 7l, 8l, 9l, 10l), .label = c("2016-06-05 00:00:01", "2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19", "2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23", "2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06" ), class = "factor"), telematic_trip_no = c(526132109l, 526028387l, 526081476l, 526140512l, 526140518l, 526006880l, 526017880l, 526027880l, 526006880l, 526006890l, 526106880l, 526005880l, 526007880l), lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531, -26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007, -26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263, 27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493, 27.87, 27.87), is_stolen = c(0l, 0l, 0l, 0l, 0l, 0l, 1l, 1l, 1l, 1l, 1l, 1l, 1l), hour_of_day = c(0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l), time_of_day = c(0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 0l, 9l, 0l), day_of_week = structure(c(2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 2l, 1l), .label = c("monday", "sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412, -26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502, -26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631, -26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53, -26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743, 27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714, 27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723, 27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694, 27.848, 27.8694, 27.848, 27.8694, 27.8694)), .names = c("position_time", "telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day", "time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max", "lon_min"), class = "data.frame", row.names = c(na, -13l))
here, appended 7
new rows is_stolen = 1
original 6
rows is_stolen = 0
:
- the first added row
telematic_trip_no = 526005880
violates longitude condition not-stolen rows, count should0
- the second added row
telematic_trip_no = 526006880
violates latitude condition not-stolen rows, count should0
- the third added row
telematic_trip_no = 526007880
violatestelematic_trip_no
condition not-stolen rows, count should0
- the fourth added row
telematic_trip_no = 526006890
satisfies condition rows4
,5
not-stolen, count should2
- the fifth added row
telematic_trip_no = 526106880
satisfies condition row6
not-stolen, count should1
- the sixth added row
telematic_trip_no = 526017880
violatestime_of_day
condition not-stolen rows, count should0
- the seventh added row
telematic_trip_no = 526027880
violatesday_of_week
condition not-stolen rows, count should0
running code on data gives:
# tibble: 7 x 2 id count <int> <dbl> 1 10 2 2 11 1 3 7 0 4 8 0 5 9 0 6 12 0 7 13 0
which expected recalling appended rows is_stolen = 1
starts @ row 7
id = 7
.
if 1 group telematic_trip_no
instead, result:
# tibble: 7 x 2 telematic_trip_no count <int> <dbl> 1 526006890 2 2 526106880 1 3 526005880 0 4 526006880 0 5 526007880 0 6 526017880 0 7 526027880 0
as caveat, above approach cost memory. worst case number of rows grows n^2/4
n
number of rows in original data frame, , number of columns doubles data frame used evaluate condition. array processing techniques, there trade between speed , memory.
hope helps.
Comments
Post a Comment