VB.NET - Find all nodes that contain punctuation marks


I have an extremely large XML file. Within each main node there is a child node:

<term>text, text</term> 

Some of these child nodes contain punctuation marks, as shown above, but which punctuation marks are used is not known. I need a list of all the punctuation marks used in these child nodes so that I can visually inspect them and later replace them with one punctuation mark.

I've tried using the regex /<term>[[:punct:]]<\/term> but it finds no matches in a regex tester.

How can I get a copy of the punctuation marks used in the child nodes into a text file?

How can I replace the punctuation marks in the child node with a semi-colon?

Here is a sample node; there are 2 occurrences of <term> in each node.

<conceptgrp><descripgrp><descrip type="subjectfield">6821</descrip></descripgrp><languagegrp><language lang="de" type="german" /><termgrp><term>betonkanal be;betonkanal breites ei</term><descripgrp><descrip type="termtype">phraseologicalunit</descrip></descripgrp><descripgrp><descrip type="reliabilitycode">2</descrip></descripgrp></termgrp></languagegrp><languagegrp><language lang="en" type="english" /><termgrp><term>flattened egg-shaped concrete sewer</term><descripgrp><descrip type="termtype">phraseologicalunit</descrip></descripgrp><descripgrp><descrip type="reliabilitycode">2</descrip></descripgrp></termgrp></languagegrp></conceptgrp> 

To answer the first question, you can use \p{P} to match punctuation characters (the uppercase P matters: it names the Unicode punctuation category). So, assuming you have a way of iterating over the XML nodes you need to examine...

Option Infer On
Option Strict On

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim x = <root>
                    <term>no punctuation</term>
                    <term>here be... dots</term>
                    <term>no, there isn't semi-colon here.</term>
                </root>

        ' \p{P} matches any single Unicode punctuation character.
        Dim re As New Regex("\p{P}")

        For Each a In x.Descendants
            Dim puncs = re.Matches(a.Value)
            If puncs.Count > 0 Then
                For Each m As Match In puncs
                    'TODO: write to a file instead of the console.
                    Console.Write(m.Groups(0).Value)
                Next

                Console.WriteLine()

            End If
        Next

        Console.ReadLine()

    End Sub

End Module

This outputs

...
,'-.
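The TODO in the loop above could be filled in along these lines. This is only a sketch, not the original answer's code: it assumes you want each distinct mark listed once, that only `term` elements should be searched, and that `punctuation.txt` (a placeholder name) is an acceptable output path.

```vbnet
Option Infer On
Option Strict On

Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Xml.Linq

Module CollectPunctuation

    Sub Main()
        ' Same sample data as above; in practice this would be the loaded document.
        Dim x = <root>
                    <term>no punctuation</term>
                    <term>here be... dots</term>
                    <term>no, there isn't semi-colon here.</term>
                </root>

        Dim re As New Regex("\p{P}")

        ' A set keeps a single copy of each mark, however often it occurs.
        ' Ordinal comparison avoids culture-specific treatment of symbols.
        Dim marks As New SortedSet(Of String)(StringComparer.Ordinal)

        ' Only look at <term> elements rather than every descendant.
        For Each t In x.Descendants("term")
            For Each m As Match In re.Matches(t.Value)
                marks.Add(m.Value)
            Next
        Next

        ' "punctuation.txt" is a placeholder path; one mark is written per line.
        File.WriteAllLines("punctuation.txt", marks)
    End Sub

End Module
```

Collecting the marks into a set first means the file stays short enough to inspect by eye, even if the same mark appears thousands of times in the source.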

For the second part of the question, you can use

For Each a In x.Descendants
    Dim newValue = re.Replace(a.Value, ";")
    'TODO: update the value of the node
    Console.WriteLine(newValue)
Next

which outputs

no punctuation
here be;;; dots
no; there isn;t semi;colon here;
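To actually update the nodes rather than just print the new text, something like the following might work. It is a sketch under a few assumptions: `input.xml` and `output.xml` are placeholder paths, the elements of interest are named `term` as in the sample node, and the file fits in memory (`XDocument.Load` reads the whole document, which may not suit an extremely large file).

```vbnet
Option Infer On
Option Strict On

Imports System.Text.RegularExpressions
Imports System.Xml.Linq

Module ReplacePunctuation

    Sub Main()
        ' "input.xml" and "output.xml" are placeholder paths.
        Dim doc = XDocument.Load("input.xml")

        Dim re As New Regex("\p{P}")

        ' Restrict the change to <term> elements so that other
        ' elements (for example <descrip>) are left untouched.
        For Each t In doc.Descendants("term")
            ' Assigning Value replaces the element's content with plain text,
            ' which is fine for leaf elements such as <term> in the sample.
            t.Value = re.Replace(t.Value, ";")
        Next

        doc.Save("output.xml")
    End Sub

End Module
```

Note that \p{P} also matches hyphens and apostrophes, so a term such as "egg-shaped" would become "egg;shaped". If only some marks should be replaced, a narrower character class built from the list of marks you found (for example "[,/]") may be a better fit.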

