R - Creating Tidy Text


I am using R for text analysis. I used the 'readtext' function to pull in text from a PDF. However, as you can imagine, it is pretty messy. I used 'gsub' to replace text for different purposes. The general goal is to use one type of delimiter, '%%%%%', to split records into rows, and another delimiter, '@', to split them into columns. I accomplished the first but am at a loss as to how to accomplish the latter. A sample of the data found in the data frame is as follows:

895 "the ambulatory case-mix development project\n@published:: june 6, 1994@authors: baker a, honigfeld s, lieberman r, tucker am, weiner jp@country: united states @journal:project final report. baltimore, md, usa: johns hopkins university , aetna health plans. johns hopkins\nuniversity , aetna health plans, usa […"

896 "ambulatory care groups: evaluation military health care use@published:: june 6, 1994@authors: bolling dr, georgoulakis jm, guillen ac@country: united states @journal:fort sam houston, tx, usa: united states army center healthcare education , studies, publication #hr 94-\n004. united states army center healthcare education , […]@url: http://oai.dtic.mil/oai/oai?verb=getrecord&metadataprefix=html&identifier=ada27804"

I want to take this data and split it at @published, @authors, @journal, and @url into columns: c("published", "authors", "journal", "url").

Any suggestions?

Thanks in advance!

This seems to work OK:

    library(magrittr)  # provides the %>% pipe
    # two sample records, as in the question
    dfr <- data.frame(text = c("the ambulatory case-mix development project\n@published:: june 6, 1994@authors: baker a, honigfeld s, lieberman r, tucker am, weiner jp@country: united states @journal:project final report. baltimore, md, usa: johns hopkins university , aetna health plans. johns hopkins\nuniversity , aetna health plans, usa […", "ambulatory care groups: evaluation military health care use@published:: june 6, 1994@authors: bolling dr, georgoulakis jm, guillen ac@country: united states @journal:fort sam houston, tx, usa: united states army center healthcare education , studies, publication #hr 94-\n004. united states army center healthcare education , […]@url: http://oai.dtic.mil/oai/oai?verb=getrecord&metadataprefix=html&identifier=ada27804"), stringsAsFactors = FALSE)
    # split each record at the field markers, then bind the pieces into rows
    do.call(rbind, strsplit(dfr$text, "@published::|@authors:|@country:|@journal:")) %>%
      as.data.frame(stringsAsFactors = FALSE) %>%
      setNames(nm = c("preamble", "published", "authors", "country", "journal"))

Basically, this splits the text at any one of the four field markers (note the double :: after published!), row-binds the result, converts it to a data frame, and assigns the column names.
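The split above leaves any @url text inside the journal column, and it relies on every record having the same fields in the same order. If that does not hold for your data, a per-record parser may be more forgiving. The following is only a sketch in base R: `parse_record` is a hypothetical helper, the two records are shortened, made-up stand-ins for the sample data (including the example.org URL), and it assumes '@' never appears inside a field value.

```r
# Hypothetical helper (not part of the answer above): parse one record,
# assuming '@' only ever marks the start of a "name: value" field.
parse_record <- function(x, fields = c("published", "authors", "country", "journal", "url")) {
  parts <- strsplit(x, "@", fixed = TRUE)[[1]]
  # start with NA for every expected column so missing fields stay NA
  out <- setNames(rep(NA_character_, length(fields) + 1), c("preamble", fields))
  out[["preamble"]] <- trimws(parts[1])
  for (p in parts[-1]) {
    pos <- regexpr(":", p, fixed = TRUE)   # the first colon ends the field name
    if (pos < 1) next                      # no colon: not a "name: value" piece
    key <- trimws(substr(p, 1, pos - 1))
    if (!key %in% fields) next             # ignore unknown field names
    # drop the leading colon(s), e.g. the double "::" after published
    out[[key]] <- trimws(sub("^:+", "", substr(p, pos, nchar(p))))
  }
  out
}

# Shortened, made-up records in the same shape as the sample data
records <- c(
  "first title\n@published:: june 6, 1994@authors: baker a, weiner jp@country: united states @journal:project final report",
  "second title@published:: june 6, 1994@authors: bolling dr@journal:fort sam houston, tx, usa: army@url: http://example.org/x?a=1&b=2"
)

tidy <- as.data.frame(do.call(rbind, lapply(records, parse_record)),
                      stringsAsFactors = FALSE)
```

Each row of `tidy` then has the same seven columns, with NA wherever a record lacks a field (here, url in the first record and country in the second).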

