vb.net - Find all nodes that contain punctuation marks
I have an extremely large XML file. Within each main node there is a child node:
<term>text, text</term>
Some of these child nodes contain punctuation marks, as shown above, but which punctuation marks are used is not known. I need a list of all the punctuation marks used in these child nodes so I can visually inspect them and later replace them with one punctuation mark.
I've tried using the regex /<term>[[:punct:]]<\/term> but it finds no matches in a regex tester.
How can I get a copy of the punctuation marks used in the child nodes into a text file?
How can I replace the punctuation marks in the child nodes with a semi-colon?
Here is a sample node; there are 2 occurrences of <term> in each node.
<conceptgrp><descripgrp><descrip type="subjectfield">6821</descrip></descripgrp><languagegrp><language lang="de" type="german" /><termgrp><term>betonkanal be;betonkanal breites ei</term><descripgrp><descrip type="termtype">phraseologicalunit</descrip></descripgrp><descripgrp><descrip type="reliabilitycode">2</descrip></descripgrp></termgrp></languagegrp><languagegrp><language lang="en" type="english" /><termgrp><term>flattened egg-shaped concrete sewer</term><descripgrp><descrip type="termtype">phraseologicalunit</descrip></descripgrp><descripgrp><descrip type="reliabilitycode">2</descrip></descripgrp></termgrp></languagegrp></conceptgrp>
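To answer the first question, you can use \p{P} to match punctuation characters (the category name is case-sensitive in .NET, so it must be an upper-case P). Note that the pattern in the question only matches a <term> whose entire content is a single punctuation character, which is why it finds nothing. So, assuming you have a way of iterating over the XML nodes you need to examine...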
    Option Infer On
    Option Strict On

    Imports System.Text.RegularExpressions

    Module Module1

        Sub Main()
            ' Sample data standing in for the real XML.
            Dim x = <root>
                        <term>no punctuation</term>
                        <term>here be... dots</term>
                        <term>no, there isn't semi-colon here.</term>
                    </root>

            ' \p{P} matches any Unicode punctuation character.
            Dim re As New Regex("\p{P}")

            For Each a In x.Descendants
                Dim puncs = re.Matches(a.Value)
                If puncs.Count > 0 Then
                    For Each m As Match In puncs
                        'TODO: write to a file instead of the console.
                        Console.Write(m.Groups(0).Value)
                    Next
                    Console.WriteLine()
                End If
            Next

            Console.ReadLine()
        End Sub

    End Module
Outputs:
...
,'-.
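The TODO in the loop above (writing to a file rather than the console) could be filled in roughly as follows. This is only a sketch under some assumptions: the output file name punctuation.txt is hypothetical, only <term> elements are examined rather than every descendant, and each distinct mark is written once so the list is short enough to inspect by eye.

```vbnet
Option Infer On
Option Strict On

Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Xml.Linq

Module PunctuationList

    Sub Main()
        ' Sample data standing in for the real XML.
        Dim x = <root>
                    <term>no punctuation</term>
                    <term>here be... dots</term>
                    <term>no, there isn't semi-colon here.</term>
                </root>

        Dim re As New Regex("\p{P}")

        ' A SortedSet keeps a single copy of each mark, in a stable order.
        Dim found As New SortedSet(Of String)(StringComparer.Ordinal)

        ' Only look inside <term> elements.
        For Each t In x.Descendants("term")
            For Each m As Match In re.Matches(t.Value)
                found.Add(m.Value)
            Next
        Next

        ' "punctuation.txt" is a hypothetical output path; one mark per line.
        File.WriteAllLines("punctuation.txt", found)
    End Sub

End Module
```

For a file that is too large to hold in memory comfortably, the same matching logic could be driven from a streaming reader instead of an in-memory XML tree; that variation is not shown here.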
For the second part of the question, you can use

    For Each a In x.Descendants
        Dim newValue = re.Replace(a.Value, ";")
        'TODO: update the value of the node
        Console.WriteLine(newValue)
    Next

Which outputs:
no punctuation
here be;;; dots
no; there isn;t semi;colon here;
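One possible way to handle the remaining TODO (updating the node itself) is to assign the replaced text back to the element and then save the result. The sketch below makes a few assumptions: the output file name terms-updated.xml is hypothetical, only <term> elements are changed, and the elements are copied into a list first so the tree is not modified while it is being enumerated.

```vbnet
Option Infer On
Option Strict On

Imports System
Imports System.Linq
Imports System.Text.RegularExpressions
Imports System.Xml.Linq

Module ReplacePunctuation

    Sub Main()
        ' Sample data standing in for the real XML.
        Dim x = <root>
                    <term>no punctuation</term>
                    <term>here be... dots</term>
                    <term>no, there isn't semi-colon here.</term>
                </root>

        Dim re As New Regex("\p{P}")

        ' ToList() takes a snapshot of the <term> elements before any are changed.
        For Each t In x.Descendants("term").ToList()
            ' Assigning to Value replaces the element's text content.
            t.Value = re.Replace(t.Value, ";")
        Next

        ' "terms-updated.xml" is a hypothetical output path.
        x.Save("terms-updated.xml")
    End Sub

End Module
```

As in the output above, this replaces every punctuation character, including apostrophes and the hyphen in a term such as "egg-shaped". Once the list of marks actually used is known, a narrower pattern listing only the separators to be normalised may be the safer choice.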