R: How to optimize multiple pattern count over multiple strings?


I have (1) a set of sentences, (2) a set of keywords, and (3) scores (real numbers) for each keyword. I need to assign scores to the sentences, where the score of a sentence = sum_over_keywords(keyword count within sentence * keyword score).

Reproducible example:

    library(stringi)

    # generate 200 synthetic sentences containing 15 5-character words each
    set.seed(7122016)
    sentences_splitted = lapply(1:200, function(x) stri_rand_strings(15, 5))

    # randomly select some words from the sentences as our keywords
    set.seed(7122016)
    keywords = unlist(lapply(sentences_splitted, function(x) if(sample(c(TRUE,FALSE),size=1,prob=c(0.2,0.8))) x[1]))
    len_keywords = length(keywords)

    # assign scores to the keywords
    set.seed(7122016)
    my_scores = round(runif(len_keywords),4)

Now, scoring the sentences:

    res = system.time(replicate(100,
        unlist(lapply(sentences_splitted, function (x)
            sum(unlist(lapply(1:len_keywords, function(y)
                length(grep(paste0("\\<",keywords[y],"\\>"),x))*my_scores[y]
            )))))))

I tried to optimize the code as much as I could, but it is still very slow:

       user  system elapsed
      11.81    0.01   11.89

I need to repeat this operation more than 200,000 times... Is there something faster than length(grep(paste0("\\<",keywords[y],"\\>"),x))? Should I use something else instead of nested lapply's?

Notes:

  • I plan to use the 4 cores of my laptop in parallel, but I still need to make the basic chunk shown above faster.
  • I am happy to call C/C++/Fortran code from R if someone offers a script (unfortunately I don't know these languages).

We can name the my_scores vector with the keywords. Remember, R allows subsetting by names. If we can get the matched words, we can get their scores too:

    names(my_scores) <- keywords
    res <- sapply(sentences_splitted, function(x) sum(my_scores[x[x %in% keywords]]))

That is all that is needed. We can test it out on a smaller testable example:

    # create sentences
    sentences_splitted <- list(c("abc", "def", "ghi", "abc"), c("xyz", "abc", "mno", "xyz"))
    keywords <- c("abc", "xyz")
    my_scores <- c(10,20)

    # we should expect
    #   10 * 2           (first sentence)
    #   10 * 1 + 20 * 2  (second sentence)
    # expected result
    # [1] 20 50

    # check that the function works as expected
    names(my_scores) <- keywords
    sapply(sentences_splitted, function(x) sum(my_scores[x[x %in% keywords]]))
    # [1] 20 50
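As a further sanity check, the name-lookup approach can be compared with the original grep-based scoring on toy data. The sketch below is illustrative only. The function names score_grep, score_lookup and score_flat are hypothetical, and the third sentence is made up to cover the case of no keyword hits. score_flat is a swapped-in variant, not part of the answer above: it flattens all words once and sums per sentence with match() and rowsum(). The three approaches should agree only under these assumptions: each element of a sentence is a single word, keywords are unique, and no sentence is empty.

```r
# Toy data (hypothetical, extends the small example above)
sentences_splitted <- list(
  c("abc", "def", "ghi", "abc"),
  c("xyz", "abc", "mno", "xyz"),
  c("def", "ghi", "mno")          # contains no keyword at all
)
keywords  <- c("abc", "xyz")
my_scores <- c(10, 20)

# Original approach: one regex search per keyword per sentence
score_grep <- function(sentences, keywords, scores) {
  unlist(lapply(sentences, function(x)
    sum(unlist(lapply(seq_along(keywords), function(y)
      length(grep(paste0("\\<", keywords[y], "\\>"), x)) * scores[y]
    )))))
}

# Name-lookup approach from the answer
score_lookup <- function(sentences, keywords, scores) {
  names(scores) <- keywords
  sapply(sentences, function(x) sum(scores[x[x %in% keywords]]))
}

# Assumed variant: flatten once, look up scores with match(),
# then sum per sentence with rowsum()
score_flat <- function(sentences, keywords, scores) {
  words   <- unlist(sentences, use.names = FALSE)
  sent_id <- rep(seq_along(sentences), lengths(sentences))
  w <- scores[match(words, keywords)]  # NA where the word is not a keyword
  w[is.na(w)] <- 0
  sums <- rowsum(w, sent_id)           # one row per sentence id
  out <- numeric(length(sentences))
  out[as.integer(rownames(sums))] <- sums[, 1]
  out
}

a <- score_grep(sentences_splitted, keywords, my_scores)
b <- score_lookup(sentences_splitted, keywords, my_scores)
d <- score_flat(sentences_splitted, keywords, my_scores)
print(a)  # 20 50 0
```

If keywords could repeat, or if a sentence element could hold several words, the grep version would count differently from the lookup versions, so the comparison is worth rerunning on real data before relying on either shortcut.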
