regex - R: How to use grep() to find specific words?


I have a long data frame of words. I want to use several specific words to find the related words in each part of speech.

For example:

    df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                              "cleaning composition", "supplying", "supply",
                              "supplying cmp abrasive", "chemical mechanical"))

                        word
    1                  clean
    2    grinding liquid cmp
    3               cleaning
    4   cleaning composition
    5              supplying
    6                 supply
    7 supplying cmp abrasive
    8    chemical mechanical

I want to extract the single words built on "clean" and "supply" in their different parts of speech. I have tried to use the grep() function to do this.

    specific_word <- c("clean", "supply")

    grep_onto <- df_1[grepl(paste(ontoword_apparatus, collapse = "|"), df_1$word), ] %>%
        data.frame(word = ., row.names = NULL) %>%
        unique()

But the result is not what I want:

                        word
    1                 cleans
    2    grinding liquid cmp
    3               cleaning
    4   cleaning composition
    5              supplying
    6                 supply
    7 supplying cmp abrasive
    8    chemical mechanical

I would prefer to get:

           word
    1     clean
    2  cleaning
    3 supplying
    4    supply

I know a regular expression can probably solve this problem, but I don't know how to define it. Can anyone give me some advice?

There are various ways to do this, but if you want a single word and you're using regex, you need to specify the beginning ^ and end $ of the line to limit what can come before or after your pattern. You seem to want the match to be able to expand with more letters, so add in \\w* to allow it:

    df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                              "cleaning composition", "supplying", "supply",
                              "supplying cmp abrasive", "chemical mechanical"))

    specific_word <- c("clean", "supply")
    pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')

    pattern
    #> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"

    df[grep(pattern, df$word), , drop = FALSE]    # drop = FALSE stops simplification to a vector
    #>        word
    #> 1     clean
    #> 3  cleaning
    #> 5 supplying
    #> 6    supply
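As a rough sketch of what the anchors buy you, the snippet below compares an unanchored pattern with the anchored one, plus a stricter variant that only allows extra letters after the stem. The extra entry "unclean" and the names `loose`, `both`, and `suffix_only` are made up for illustration; they are not part of the original data.

```r
# Example terms from the question, plus one hypothetical entry ("unclean")
# to show the effect of allowing letters before the stem.
words <- c("clean", "grinding liquid cmp", "cleaning", "cleaning composition",
           "supplying", "supply", "supplying cmp abrasive", "chemical mechanical",
           "unclean")

specific_word <- c("clean", "supply")

# Unanchored: matches anywhere in the string, so multi-word terms get through.
loose <- paste(specific_word, collapse = "|")
loose_hits <- words[grepl(loose, words)]

# Anchored, \\w* on both sides: single words only, letters allowed before or after.
both <- paste0("^\\w*", specific_word, "\\w*$", collapse = "|")
both_hits <- words[grepl(both, words)]

# Anchored, \\w* on the right only: single words that start with the stem.
suffix_only <- paste0("^", specific_word, "\\w*$", collapse = "|")
suffix_hits <- words[grepl(suffix_only, words)]

both_hits
suffix_hits
```

If prefixed forms such as "unclean" should be excluded, the `suffix_only` variant is probably closer to what you want; otherwise the pattern from the answer is fine.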

Another interpretation of what you're looking for is to split each term into individual words, and search any of those for a match. tidyr::separate_rows can be used for such a split, which you can then filter with grepl:

    library(tidyverse)

    # tibble() replaces the deprecated data_frame()
    df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
                          "cleaning composition", "supplying", "supply",
                          "supplying cmp abrasive", "chemical mechanical"))

    specific_word <- c("clean", "supply")

    df %>% separate_rows(word) %>%
        filter(grepl(paste(specific_word, collapse = '|'), word)) %>%
        distinct()
    #> # A tibble: 4 x 1
    #>        word
    #>       <chr>
    #> 1     clean
    #> 2  cleaning
    #> 3 supplying
    #> 4    supply
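If you would rather not load the tidyverse, the same split-then-filter idea can be sketched in base R. This assumes the terms are separated by whitespace only; `tokens` and `hits` are illustrative names.

```r
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                          "cleaning composition", "supplying", "supply",
                          "supplying cmp abrasive", "chemical mechanical"),
                 stringsAsFactors = FALSE)

specific_word <- c("clean", "supply")

# Split every term on runs of whitespace and flatten into one character vector.
tokens <- unlist(strsplit(df$word, "\\s+"))

# Keep the tokens that contain any of the stems, dropping duplicates.
hits <- unique(tokens[grepl(paste(specific_word, collapse = "|"), tokens)])

result <- data.frame(word = hits, stringsAsFactors = FALSE)
result
```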

For more robust word tokenization, try tidytext::unnest_tokens or another actual word tokenizer.
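A minimal sketch of the unnest_tokens route, assuming the dplyr and tidytext packages are installed. The output column name `token` is an arbitrary choice, and note that unnest_tokens lowercases the text by default.

```r
library(dplyr)
library(tidytext)

df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                          "cleaning composition", "supplying", "supply",
                          "supplying cmp abrasive", "chemical mechanical"),
                 stringsAsFactors = FALSE)

specific_word <- c("clean", "supply")

result <- df %>%
    unnest_tokens(token, word) %>%    # one row per word token, stored in `token`
    filter(grepl(paste(specific_word, collapse = "|"), token)) %>%
    distinct()

result
```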

