R: How to optimize multiple pattern counts over multiple strings?
I have (1) a set of sentences, (2) a set of keywords, and (3) scores (real numbers) for each keyword. I need to assign scores to the sentences, where the score of a sentence = sum_over_keywords(keyword count within the sentence * keyword score).
Reproducible example:
library(stringi)

# generate 200 synthetic sentences containing 15 5-character words each
set.seed(7122016)
sentences_splitted = lapply(1:200, function(x) stri_rand_strings(15, 5))

# randomly select words from the sentences as our keywords
set.seed(7122016)
keywords = unlist(lapply(sentences_splitted, function(x) if(sample(c(TRUE,FALSE),size=1,prob=c(0.2,0.8))) x[1]))
len_keywords = length(keywords)

# assign scores to the keywords
set.seed(7122016)
my_scores = round(runif(len_keywords),4)
Now, scoring the sentences:
res = system.time(replicate(100,
  unlist(lapply(sentences_splitted, function (x)
    sum(unlist(lapply(1:len_keywords, function(y)
      length(grep(paste0("\\<",keywords[y],"\\>"),x))*my_scores[y])))))))
I tried to optimize the code as much as I could, but it is still very slow:
   user  system elapsed
  11.81    0.01   11.89
I need to repeat this operation more than 200,000 times. Is there anything faster than length(grep(paste0("\\<",keywords[y],"\\>"),x))? Should I use something else instead of nested lapply's?
Notes:
- I plan to use the 4 cores of my laptop in parallel, but I still need to make the basic chunk shown above faster.
- I am happy to call C/C++/Fortran code from R if someone offers a script (unfortunately I don't know these languages).
We can name the my_scores vector with the keywords. Remember, R allows subsetting by names. So if we can get the matched words, we can get their scores too:
names(my_scores) <- keywords
res <- sapply(sentences_splitted, function(x) sum(my_scores[x[x %in% keywords]]))
That is all that is needed. We can test it out on a smaller testable example:
#create sentences
sentences_splitted <- list(c("abc", "def", "ghi", "abc"), c("xyz", "abc", "mno", "xyz"))
keywords <- c("abc", "xyz")
my_scores <- c(10, 20)

#we should expect
#10 * 2            #first sentence
#10 * 1 + 20 * 2   #second sentence
#expected result
#[1] 20 50

#check that the function works as expected
names(my_scores) <- keywords
sapply(sentences_splitted, function(x) sum(my_scores[x[x %in% keywords]]))
[1] 20 50
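
To check that the name lookup gives the same scores as the regex counting in the question, the two can be run side by side on a toy input. This is only a sketch: the helper names score_regex and score_lookup and the toy data are made up for illustration, and it assumes the keywords are unique and that every token is a single word with no punctuation. With duplicated keywords or tokens such as "abc-def", the two approaches can give different results.

```r
# Hypothetical toy data (not from the original post); the third sentence
# contains no keyword at all, so its score should be 0.
sentences <- list(c("abc", "def", "ghi", "abc"),
                  c("xyz", "abc", "mno", "xyz"),
                  c("def", "ghi"))
keywords <- c("abc", "xyz")
scores   <- c(10, 20)

# Regex counting, as in the question: one grep() per keyword per sentence.
score_regex <- function(sentences, keywords, scores) {
  vapply(sentences, function(x) {
    sum(vapply(seq_along(keywords), function(y) {
      length(grep(paste0("\\<", keywords[y], "\\>"), x)) * scores[y]
    }, numeric(1)))
  }, numeric(1))
}

# Name lookup, as in the answer: keep the tokens that are keywords,
# then index the named score vector with them.
score_lookup <- function(sentences, keywords, scores) {
  names(scores) <- keywords
  vapply(sentences, function(x) sum(scores[x[x %in% keywords]]), numeric(1))
}

slow <- score_regex(sentences, keywords, scores)
fast <- score_lookup(sentences, keywords, scores)

# Both approaches should agree on this input.
stopifnot(isTRUE(all.equal(slow, fast)))
print(fast)  # 20 50 0
```

vapply is used here instead of sapply only so that the result type is fixed to one number per sentence. Wrapping each function in system.time on the full synthetic data from the question would show how large the difference is on a given machine.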