Regex in R -- extracting sub-string based on two start/stop words
I have a character (text) column:
tweets <- c( "drinking bud light @budweiser @ joe's crab shack http://www.joes.com", "drinking sam adams winter ale @samadams @ growler stop http://www.growlerstop.com", "drinking coco loco @nodabrewing @ corner pub http://www.cornerpub.com" )
As you can see, I assume the tweets have a standard structure:
"drinking [name of beer] @[name of brewery] @ [name of bar, notice whitespace] http://"
I want to use regular expressions (and substr()?) to create 3 new columns:
- name of beer
- name of brewery
- name of bar (note that it can contain whitespace and needs to run up to "http:")
One step further: how do I handle tweets that do not have the same structure?
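It's ugly, but it works:
# Capture beer, brewery and bar; the leading article ("a"/"an") is optional.
m <- regmatches(tweets, regexec('^drinking (?:an? )?(.*) @(.*) @ (.*) http://.*$', tweets, perl = TRUE))
# Drop the full-match column, keep the three capture groups, and name them.
setNames(nm = c('beer', 'brewery', 'bar'), as.data.frame(do.call(rbind, m)[, -1L]))
##                   beer     brewery              bar
## 1            bud light   budweiser joe's crab shack
## 2 sam adams winter ale    samadams     growler stop
## 3            coco loco nodabrewing       corner pub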
See regexec() and regmatches().
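The follow-up about tweets that do not share the structure is not covered by the one-liner above. Here is one possible sketch, not the only way to do it: regmatches() returns character(0) for a tweet that does not match, so those rows can be filled with NA instead of breaking rbind(). The function name parse_tweets, the fourth example tweet, and the optional-article pattern with perl = TRUE are my own assumptions for illustration.

```r
# Hypothetical helper: one row per tweet, NA where the tweet does not fit
# the assumed "drinking <beer> @<brewery> @ <bar> http://..." structure.
parse_tweets <- function(tweets) {
  pattern <- "^drinking (?:an? )?(.*) @(.*) @ (.*) http://.*$"
  m <- regmatches(tweets, regexec(pattern, tweets, perl = TRUE))
  rows <- lapply(m, function(x) {
    # A match has 4 elements: the full match plus 3 capture groups.
    if (length(x) == 4L) x[-1L] else rep(NA_character_, 3L)
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  setNames(out, c("beer", "brewery", "bar"))
}

tweets <- c(
  "drinking bud light @budweiser @ joe's crab shack http://www.joes.com",
  "drinking sam adams winter ale @samadams @ growler stop http://www.growlerstop.com",
  "drinking coco loco @nodabrewing @ corner pub http://www.cornerpub.com",
  "just had lunch, no beer today"  # made-up tweet that does not fit
)

res <- parse_tweets(tweets)
res$matched <- !is.na(res$beer)  # flag the rows that followed the structure
print(res)
```

Rows flagged matched == FALSE can then be inspected by hand or run through a second, looser pattern.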