regex - Python 2 and 3 're.sub' inconsistency
I am writing a function to split numbers and other things from text in Python. The code looks like this:

    en_extract_regex = '([a-zA-Z]+)'
    num_extract_regex = '([0-9]+)'
    aggr_regex = en_extract_regex + '|' + num_extract_regex
    entry = re.sub(aggr_regex, r' \1\2', entry)

Now, this code works fine in Python 3, but it does not work under Python 2, where it raises an "unmatched group" error.
The problem is that I need to support both versions, and I could not get it to work in Python 2 although I tried various other ways.
I am curious what the root of the problem is, and whether there is a workaround for it.
I think the problem might be that the regex pattern matches one or the other of the subpatterns en_extract_regex and num_extract_regex, but never both in the same match.
When re.sub() matches alphabetic characters with the first subpattern, it attempts to substitute the second group reference \2, which fails because only the first group matched; the second group took no part in the match.
Similarly, when the digit subpattern matches, there is nothing for \1 to substitute, and it fails.
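To make the point about unmatched groups concrete, here is a minimal sketch that inspects the groups of a single match. It assumes the same combined pattern as in the question; the variable names `letters` and `digits` are only illustrative.

```python
import re

# Same combined pattern as in the question: two alternatives, one group each.
aggr_regex = '([a-zA-Z]+)|([0-9]+)'

# Only one alternative can match at a time, so the other group is None.
letters = re.match(aggr_regex, 'abcd')
digits = re.match(aggr_regex, '1234')

print(letters.groups())  # ('abcd', None)
print(digits.groups())   # (None, '1234')
```

The `None` entry is the group that a template such as `r' \1\2'` would try to substitute, which is where the "unmatched group" error comes from on older versions.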
You can see that this is the case with this test in Python 2:

    >>> re.sub(aggr_regex, r' \1', 'abcd')  # reference the first pattern
    ' abcd'
    >>> re.sub(aggr_regex, r' \2', 'abcd')  # reference the second pattern
    Traceback (most recent call last):
      ....
    sre_constants.error: unmatched group

The difference must lie in the different versions of the regex engine used by Python 2 and Python 3. Unfortunately I cannot provide a definitive reason for the difference; however, there is a documented change in version 3.5 for re.sub() regarding unmatched groups:
    Changed in version 3.5: Unmatched groups are replaced with an empty string.

This explains why the code works in Python >= 3.5 but not in earlier versions: an unmatched group is now replaced with an empty string instead of raising an error.
As a workaround, you can change the pattern to handle both alternatives with a single group:

    import re

    en_extract_regex = '[a-zA-Z]+'
    num_extract_regex = '[0-9]+'
    aggr_regex = '(' + en_extract_regex + '|' + num_extract_regex + ')'  # ([a-zA-Z]+|[0-9]+)

    for s in '', '1234', 'abcd', 'a1b2c3', 'aa__bb__1122cdef', '_**_':
        print(re.sub(aggr_regex, r' \1', s))

Output (the first line is empty because the first input is the empty string):

    
     1234
     abcd
     a 1 b 2 c 3
     aa__ bb__ 1122 cdef
    _**_
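Another option that avoids numbered groups altogether is to refer to the whole match. The sketch below shows two ways to do that: the `\g<0>` backreference, and a replacement function. The function names `split_tokens` and `split_tokens_func` are made up for this example, and I have not tested it on every Python 2 release, so treat it as a starting point rather than a guaranteed fix.

```python
import re

# No capture groups are needed when the whole match is referenced.
aggr_regex = '[a-zA-Z]+|[0-9]+'


def split_tokens(entry):
    # \g<0> stands for the entire matched substring.
    return re.sub(aggr_regex, r' \g<0>', entry)


def split_tokens_func(entry):
    # Equivalent approach: build the replacement from the match object.
    return re.sub(aggr_regex, lambda m: ' ' + m.group(0), entry)


for s in '', '1234', 'abcd', 'a1b2c3', 'aa__bb__1122cdef', '_**_':
    print(split_tokens(s))
```

Both variants prefix each run of letters or digits with a space, which is the same result the single-group pattern produces.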