regex - R: How to use grep() to find specific words?
I have a long data frame of words. I want to use several specific words to find each of their part-of-speech forms.
For example:
    df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                              "cleaning composition", "supplying", "supply",
                              "supplying cmp abrasive", "chemical mechanical"))

                       words
    1                  clean
    2    grinding liquid cmp
    3               cleaning
    4   cleaning composition
    5              supplying
    6                 supply
    7 supplying cmp abrasive
    8    chemical mechanical
I want to extract "clean" and "supply" as single words in their different parts of speech. I have tried to use the grep() function to do this:
    specific_word <- c("clean", "supply")
    grep_onto <- df[grepl(paste(specific_word, collapse = "|"), df$word), ] %>%
      data.frame(word = ., row.names = NULL) %>%
      unique()
But the result is not what I want:
                        word
    1                 cleans
    2    grinding liquid cmp
    3               cleaning
    4   cleaning composition
    5              supplying
    6                 supply
    7 supplying cmp abrasive
    8    chemical mechanical
I would prefer to get:
          words
    1     clean
    2  cleaning
    3 supplying
    4    supply
I know a regular expression can probably solve this problem, but I don't know how to define it. Can anyone give me some advice?
There are various ways to do this, but if you want a single word and you're using regex, you need to specify the beginning ^ and the end $ of the line to limit what can come before or after the pattern. You seem to want to be able to expand the match to more letters, so add in \\w* to allow it:
    df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                              "cleaning composition", "supplying", "supply",
                              "supplying cmp abrasive", "chemical mechanical"))
    specific_word <- c("clean", "supply")

    # build one anchored alternative per word, joined with |
    pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')
    pattern
    #> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"

    df[grep(pattern, df$word), , drop = FALSE]    # drop = FALSE stops simplification to a vector
    #>        word
    #> 1     clean
    #> 3  cleaning
    #> 5 supplying
    #> 6    supply
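To see what the anchors actually change, here is a minimal sketch that runs the unanchored pattern from the question next to the anchored one on the same data. The names loose, strict, loose_hits and strict_hits are made up for this illustration, and stringsAsFactors = FALSE is only there so the column is character on older R versions.

```r
# Same example data as above.
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                          "cleaning composition", "supplying", "supply",
                          "supplying cmp abrasive", "chemical mechanical"),
                 stringsAsFactors = FALSE)
specific_word <- c("clean", "supply")

# Unanchored: the stem may appear anywhere, so multi-word entries also match.
loose <- paste(specific_word, collapse = "|")
loose_hits <- df$word[grepl(loose, df$word)]

# Anchored: ^ and $ require the whole string to be a single run of word
# characters containing the stem, so entries with spaces are rejected.
strict <- paste0("^\\w*", specific_word, "\\w*$", collapse = "|")
strict_hits <- df$word[grepl(strict, df$word)]

print(loose_hits)
print(strict_hits)
```

The loose pattern also keeps "cleaning composition" and "supplying cmp abrasive", because \\w does not match a space and nothing forces the match to cover the whole string; the anchored pattern keeps only the four single-word entries.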
Another interpretation of what you're looking for is to split each term into individual words, and then search any of those for a match. tidyr::separate_rows can be used for such a split, which you can then filter with grepl:
    library(tidyverse)

    # tibble() replaces the deprecated data_frame()
    df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
                          "cleaning composition", "supplying", "supply",
                          "supplying cmp abrasive", "chemical mechanical"))
    specific_word <- c("clean", "supply")

    df %>%
      separate_rows(word) %>%    # one row per individual word
      filter(grepl(paste(specific_word, collapse = '|'), word)) %>%
      distinct()
    #> # tibble: 4 x 1
    #>   word
    #>   <chr>
    #> 1 clean
    #> 2 cleaning
    #> 3 supplying
    #> 4 supply
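If you would rather not load the tidyverse, the same split-then-filter idea can be sketched in base R. This is only an illustration under the assumption that words are separated by whitespace; tokens, hits and result are names invented here.

```r
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                          "cleaning composition", "supplying", "supply",
                          "supplying cmp abrasive", "chemical mechanical"),
                 stringsAsFactors = FALSE)
specific_word <- c("clean", "supply")

# Split every entry on whitespace and flatten into one character vector.
tokens <- unlist(strsplit(df$word, "\\s+"))

# Keep the tokens that contain one of the stems, dropping duplicates.
hits <- unique(tokens[grepl(paste(specific_word, collapse = "|"), tokens)])

result <- data.frame(word = hits, stringsAsFactors = FALSE)
print(result)
```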
For more robust word tokenization, try tidytext::unnest_tokens or another actual word tokenizer.
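As a rough sketch of the tidytext route, assuming the dplyr and tidytext packages are installed: unnest_tokens() splits the word column into one token per row (lowercasing by default), and the filter step is the same as before. The output column name term and the variable result are arbitrary choices for this example.

```r
library(dplyr)
library(tidytext)

df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
                      "cleaning composition", "supplying", "supply",
                      "supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")

result <- df %>%
  unnest_tokens(term, word) %>%    # output column `term`, input column `word`
  filter(grepl(paste(specific_word, collapse = "|"), term)) %>%
  distinct()

print(result)
```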