regex simple pattern fails

Question

s1w 0 Newbie Poster

13 Years Ago

The problem looks easy. I haven't had an issue like this in other languages, and I can't find the reason. This is PythonScript for Notepad++:

Non-greedy HTML comment removal for this line:
<!--<input id="file_upload" type="file"/>--> ass -->

Python script (the ?P<begin> and ?P<end> group names are only there for clarity):

import re

line = editor.getCurLine()
p = re.compile('(?P<begin><!--+)(?P<between>.*?)(?P<end>--+>)')
if '<!--' in line or '-->' in line:
    console.write(p.sub(r'\g<between>', line)) 
else:
    console.write('not commented.. ' + line)

results:
<input id="file_upload" type="file"/> ass -->

works satisfactorily.

But if I make <begin> or <end> optional with ? or {0,1}:
1) '(?P<begin><!--+)?(?P<between>.*?)(?P<end>--+>)?'
2) '(<!--+)?(?P<between>.*?)(--+>)?'
the result looks GREEDY, or rather every occurrence gets replaced. The count=x argument doesn't help: it disables the replacements altogether.

Other patterns I tried fail with a similar result:
3) '(<!--+){0,1}(?P<between>.*?)(--+>){0,1}' replacement = r'\g<between>'
4) '(<!--+)?(?:.*?)(--+>)?' replacement = ''
5) '(?:<!--+)?(.*?)(?:--+>)?' replacement = '\\1'

In every case the result is:
<input id="file_upload" type="file"/> ass

(the last '-->' should have stayed). Any clue?


TrustyTony 888 ex-Moderator

13 Years Ago

Maybe you are trying to do HTML/XML processing with the wrong tools. You should use a parser, not regex.

griswolf 304 Veteran Poster

13 Years Ago

Or write your own very simple parser if you prefer, using regex to recognize the interesting bits. You probably need start_of_tag, tag_contents, end_of_tag and some non-regex things to keep track of opening and closing tags with the same name.

To avoid rolling your own, check Beautiful Soup if you are parsing "other people's pages", or try HTMLParser from the standard library if you are pretty sure that the HTML/XML is correct.
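As a rough illustration of the standard-library route, here is a sketch that drops comments by subclassing HTMLParser. It assumes Python 3, where the class lives in html.parser (the thread itself used Python 2); the CommentStripper class and strip_comments helper are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Rebuilds the markup it is fed, leaving out HTML comments."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []      # pieces of markup to keep
        self.comments = []   # comment bodies that were dropped

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>: keep the original text once.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self.comments.append(data)


def strip_comments(html):
    """Return (markup without comments, list of removed comment bodies)."""
    parser = CommentStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts), parser.comments


text, removed = strip_comments('<p>a</p><!-- hidden --><p>b</p>')
print(text)
print(removed)
```

This only makes sense for reasonably well-formed markup; for a one-line editor shortcut it may well be more machinery than needed.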


s1w 0 Newbie Poster · Answer 1 · 2011-06-05T05:29:36+00:00

Thank you for the reply. Anyway, I've solved my problem in the meantime. The purpose of this script is to quickly process a single line without a selection. To check the validity of HTML comments across multiple lines, I am going to do something else.

My solution for now:
The easy way: importing string and using three commands (I really wanted to avoid this):

import re, string

line = editor.getCurLine()
if '<!--' in line or '-->' in line:
    line = re.sub('(?<=<!)-{3,}', '--', line)
    line = re.sub('-{3,}(?=>)', '--', line)
    line = line.replace("<!--", "", 1).replace("-->", "", 1)
else:
    line = re.sub(r'(\s*)(.*)\s*\n', r'\1<!-- \2 -->', line)

Result:
+ treats '<!--' and '-->' as optional and removes them in a NON-GREEDY way
+ shortens over-long tags such as '<!-----' and '----->'
- unfortunately it also removes comment tags that appear in reversed order, "--> . . . <!--"
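The three steps above can be tried outside Notepad++ with a small wrapper. This is only a sketch: the toggle_html_comment name is made up, and editor.getCurLine() is replaced by a plain string argument.

```python
import re


def toggle_html_comment(line):
    """Remove the first '<!--' and first '-->' if present, else comment the line."""
    if '<!--' in line or '-->' in line:
        # Shorten runs of three or more dashes to the standard two.
        line = re.sub(r'(?<=<!)-{3,}', '--', line)
        line = re.sub(r'-{3,}(?=>)', '--', line)
        # Remove at most one opening and one closing marker.
        return line.replace('<!--', '', 1).replace('-->', '', 1)
    # No markers: wrap the line content, keeping the leading whitespace.
    return re.sub(r'(\s*)(.*)\s*\n', r'\1<!-- \2 -->', line)


for sample in ('<!--<input id="file_upload" type="file"/>--> ass -->',
               '<!----- x ----->',
               'a --> b <!-- c',
               '  <p>x</p>\n'):
    print(repr(sample), '->', repr(toggle_html_comment(sample)))
```

The third sample shows the drawback from the list: both markers are removed even though they are in reversed order.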

But of course I've created a magic pattern that does all of this, and I challenge anyone to shorten it:

import re

line = editor.getCurLine()
p1 = re.compile('^(.*?)((?P<lt><!--+)|(?P<rt>--+>))(?P<block>.*?)((?(rt)|(?(lt)--+>|<!--+))|$)')
p2 = re.compile(r'(\s*)(.*)\s*\n')

if '<!--' in line or '-->' in line:
    line = p1.sub(r'\1\g<block>', line)
else:
    line = p2.sub(r'\1<!-- \2 -->', line)

s1w 0 Newbie Poster · Answer 2 · 2011-06-05T06:49:36+00:00

One more correction:

p1 = re.compile('^(.*?)((?P<lt><!--+)|(?P<rt>--+>))(?P<block>.*?)((?(rt)|(?(lt)((?=<!--+)|--+>)))|$)')

+ now it also properly handles "--> . . . <!--" cases; the behavior should be self-explanatory once you use it
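For readers unfamiliar with the (?(name)yes|no) construct used here: it picks a branch depending on whether the named group took part in the match. The sketch below runs the corrected pattern on a few made-up sample lines outside Notepad++; the uncomment helper and the samples are assumptions for illustration only.

```python
import re

# Pattern copied from the correction above. Reading of the tail group:
#   (?(rt) ... )  if a closing marker was found first, match nothing more;
#   (?(lt) ... )  if an opening marker was found first, stop either just
#                 before the next opening marker or after a closing marker;
#   |$            otherwise run to the end of the line.
p1 = re.compile('^(.*?)((?P<lt><!--+)|(?P<rt>--+>))(?P<block>.*?)((?(rt)|(?(lt)((?=<!--+)|--+>)))|$)')


def uncomment(line):
    """Apply the substitution from the post to a plain string."""
    return p1.sub(r'\1\g<block>', line)


samples = [
    '<!--<input id="file_upload" type="file"/>--> ass -->',  # normal pair
    '--> x <!-- y',                                          # reversed order
    'a <!-- b',                                              # opening marker only
]
for s in samples:
    print(repr(s), '->', repr(uncomment(s)))
```

Since the pattern is anchored with ^ and compiled without re.MULTILINE, it matches at most once per line, which is why only the first marker (or pair) is removed.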

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 3 · 2011-06-05T14:02:49+00:00

This is the must-read classic on the subject:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

s1w 0 Newbie Poster · Answer 4 · 2011-06-05T17:39:48+00:00

I am devastatingly overLoled by your link, tonyjv. It made my day. Anyway, I must repeat that I didn't intend to parse a big portion of HTML. I just wanted to make a PythonScript for Notepad++ that I can run from a shortcut to quickly comment a single line without a selection; I don't even know whether parsers can be linked to a shortcut.

Anyway, my code had a little error, and indeed there was an issue, for example, with the whitespace characters that mark the beginning/end of a line (\n, \f, \r). But I think I've handled it very well.

Here is the corrected and tested HTML comment Python script for Notepad++:

import re
line = editor.getCurLine()
p1 = re.compile(r'^(.*?)((?P<lt><!--+\s*)|(?P<rt>\s*--+>))(?P<block>.*?)((?(rt)|(?(lt)((?=<!--+\s*)|\s*--+>)))|$)')
p2 = re.compile(r'\r?(\s*)(.*)\s*\n')

if '<!--' in line or '-->' in line:
    line = p1.sub(r'\1\g<block>', line).rstrip()
else:
    line = p2.sub(r'\1<!-- \2', line).rstrip() + ' -->'

currentLine = editor.lineFromPosition(editor.getCurrentPos())
editor.replaceLine(currentLine, line)

+ treats '<!--' and '-->' as optional and removes them in a NON-GREEDY way
+ shortens over-long tags such as '<!-----' and '----->'
+ correctly handles comment tags in reversed order, i.e. "--> . . . <!--" situations

s1w 0 Newbie Poster · Answer 5 · 2011-06-12T22:26:18+00:00

I've decided to put some more work into this, and now I think the result is fully satisfying:

#Toggle HTML Comment >> PythonScript for Notepad++
# -comments single line without selection/removes comment tags in NON-GREEDY or GREEDY style (see options)
# -comments selection/clears all comments from selection
# -intelligently handles '--> ... <!--' or '<!-- . . <!-- .. -->' in relation to single line when nothing selected
# -recognizes incorrect tags like <!----
# -script can remove also stream comments '/* */' and others own defined, which macro in N++ was not supported
# -all of this under one shortcut. author: s1w_
#note: this is NOT begin-end validity checker, use HTML parsers instead
import re, string

class commentPair:
    """Objects of this class will contain pairs of comment tags;
    you can define your additional bracket pairs below.
    Constructor takes two parameters: <opening> and <closing>
    raw string describing new definition of comment pair.

    If you want use '\' char in parameter, you have to double it,
    because it is recognized as escape sign. '\\**' and '***\\\\'
    in fact is passed as '\**' and '***\\'. If you forget about
    this, you will have syntax error (quantity of '\' should be
    no different than even). Do not escape other characters!

    Order of defining comment pairs is crucial in this script,
    because it will determine comments recognition order.
    First (top) defined pair is used for main script functionality.

    Additional definitions apply only for operation of cleaning.
    It performs if script detects in selected text or in current
    line one of defined comments: step by step and in the same
    order as were defined (from top to bottom).

    To promote additional comment pair to main functionality, you
    will want to create multiple copies of this script under
    different names and binds each one under different shortcut
    key. Each instance of script may define different order of
    comment pairs."""

    global esc; esc = re.escape

    def __init__( self, left='<!--', right='-->' ):
        self.right = right.strip();  self.left  =  left.strip()
        self.rplus = esc(self.right)
        self.rplus = re.sub(r'^([\\]?.)', r'\1+', self.rplus, 1);
        self.lplus = esc(self.left)+'+'
        self.lmore = self.lplus + '[ \t]*'
        self.rmore = '[ \t]*' + self.rplus
        self.combo = esc(self.left)+'+'+esc(self.right)
        self.tail  = self.right[0]


#top-most main comment pair definition:
commentTags  = [ commentPair('<!--', '-->') ]   # always first served and used while commenting line or selection

#additional comment pairs definitions (only for cleaning):
commentTags.append( commentPair('/*', '*/') )   # you can add more statements like this in desired order

#options:
GREEDY_MODE = 0       #greedy(1): widest matches; non-greedy(0): narrowest matches (recommended)
MULTIPLE_CLEANING = 1 #allow cleaning multiple braces from selection
ALIGN_TO_INDENT = 1   #align line to indent if new leading spaces detected (for non-selections)
CONSOLE_DEBUG = 0     #print script actions to console;

#declarations:
position, line = editor.getCurrentPos(), editor.getCurLine()
currentLine  = editor.lineFromPosition(position)
indent = editor.getLineIndentPosition(currentLine)
indstr = re.match('[ \t]*', line).group(0)
selStart, selEnd = editor.getSelectionStart(), editor.getSelectionEnd()
cright, cleft = ' '+commentTags[0].right, commentTags[0].left+' '
alt = position - indent

#functions:
def combine_pattern(comm):
    "checks mode and selection dependences and creates adequate pattern compilation for regex functions"
    if selStart == selEnd:
      if not GREEDY_MODE:
            return re.compile('^[ \t]*(.*?)(?P<left>((?P<combo>'+comm.combo+')|(?P<lt>'+comm.lmore+')|(?P<rt>'+comm.rmore+')))(?P<block>.*?)(?P<right>((?(combo)|(?(rt)|(?(lt)((?='+esc(comm.left)+')|'+comm.rmore+'))))|(\r?\n|$)))')
      else: return re.compile('^[ \t]*(?P<b1>(.*?('+comm.combo+'|(?=('+esc(comm.left)+'|'+comm.rplus+'))))+)(?P<left>((?P<cb1>'+comm.combo+')|'+comm.lmore+'|'+comm.rplus+'))(?P<b>.*)(?P<right>((?P<cb2>'+comm.combo+')|'+comm.lmore+'|[ \t]'+comm.rplus+'|((?<![ \t])(?<!<!)(?<!'+esc(comm.tail)+')'+comm.rplus+')))(?P<b2>.*?)(\r?\n|$)'), \
                   re.compile('^.*?(?P<left>((?P<combo>'+comm.combo+')|'+comm.lmore+'|'+comm.rmore+'))')
    else:   return re.compile('('+comm.combo+')|('+esc(comm.left)+')('+esc(comm.tail)+'*[ ]?)|([ ]?'+esc(comm.tail)+'*)('+esc(comm.right)+')'), \
                   re.compile('('+comm.combo+'|'+comm.lplus+'[ ]?|[ ]?'+comm.rplus+')')

def format_line(comment_pairs):
    global position
    global line; spaces = 0
    if not GREEDY_MODE:
      normal =  combine_pattern(comment_pairs)
      mod = normal.match(line);  mod_exclude = ''
      if CONSOLE_DEBUG:
        console.write('\n\n\n>removing  '+(mod.group("combo") or IfIn(mod.group("left"), comment_pairs.left, comment_pairs.right))+
        '  '+IfIn(mod.group("right"), comment_pairs.right, comment_pairs.left)+'\nfrom line  "'+line.strip()+'"')
      line = normal.sub(r'\1\g<block>', line).rstrip()
    else:
      greed, single = combine_pattern(comment_pairs)
      mod = greed.match(line)
      if CONSOLE_DEBUG: #console output section for debug:
        debug = greed.sub(r'\g<b1>[R1]\g<left>\g<b>\g<right>[R2]\g<b2>', line)
        if mod: console.write('\n\n\nprocessing: '+debug.strip()+'\n'+('='*25))
        chunks = greed.subn(r'\nleft garbage: \g<b1>\nleft tag: \g<left>\nright garbage: \g<b2>\nright tag: \g<right>\n'+('='*25), line)
        if chunks[1]: console.write(chunks[0]+'\n')
      if not mod: #remove single comment without its pair
        mod = single.match(line); mod_exclude = 'right'
        if CONSOLE_DEBUG:
          console.write('\n\n\n>removing single  '+(mod.group("combo") or
          IfIn(mod.group("left"), comment_pairs.left, comment_pairs.right))+'\nfrom line  "'+line.strip()+'"')
        line = re.sub('('+comment_pairs.combo+'|'+comment_pairs.lmore+'|'+comment_pairs.rmore+')', '', line).rstrip()
      else: #determining to remove two comment tags with greedy mode
        flag = 0; mod_exclude = ''; rcombo = False #map of flags was included at the end of script if eventually needed to fathom details
        rcombo = (comment_pairs.left in mod.group("right") and comment_pairs.right in mod.group("right")) #<!---->
        flag = flag|(bool(mod.group("b1").strip())*0x8) #left garbage
        flag = flag|((comment_pairs.right in mod.group("left"))*0x4)  #left tag
        flag = flag|((comment_pairs.right in mod.group("right"))*0x2) #right tag
        flag = flag|(bool(mod.group("b2").strip())*0x1) #right garbage
        if CONSOLE_DEBUG: console.write('flag = '+str("%X" % flag)+', removing: ')
        if flag not in [0x4,0x6,0x8,0xC,0xD,0xE,0xF] and (not rcombo or rcombo and flag not in [0x2,0x3,0xA,0xB]):
          line = re.sub('('+esc(mod.group("b1"))+')'+esc(mod.group("left")), '\\1', line, 1).rstrip() #remove left
          if flag not in [0x2,0x3,0xA,0xB] or rcombo: mod_exclude = 'right'
          if CONSOLE_DEBUG: console.write((mod.group("cb1") or IfIn(mod.group("left"), comment_pairs.left, comment_pairs.right))+'  ')
        if flag not in [0x0,0x1,0x5,0x7,0x9] and (not rcombo or rcombo and flag not in [0x3]):
          line = re.sub('^(.*)'+esc(mod.group("right")), '\\1', line).rstrip() #remove right
          if flag not in [0x2,0x3,0xA,0xB] or rcombo: mod_exclude = 'left'
          if CONSOLE_DEBUG: console.write(mod.group("cb2") or IfIn(mod.group("right"), comment_pairs.right, comment_pairs.left))
    if ALIGN_TO_INDENT:
      spaces = len(re.match('[ \t]*', line).group(0));
      line = line.lstrip();
    if not GREEDY_MODE or ALIGN_TO_INDENT:
      line = indstr+line
    if line.isspace(): line = line.strip()
    checkCursorPosition(mod, spaces, mod_exclude)
    return

def setSelectionArea(begin, end):
    "checks if selection was in rectangle mode, then sets new selection area"
    if editor.selectionIsRectangle():
      editor.setSelectionMode(1)
      editor.setRectangularSelectionAnchor(begin)
      editor.setRectangularSelectionCaret(end)
    else:
      editor.setSel(begin, end)

def checkCursorPosition(mod, sp, exclude='', tag='right'):
    "sets new cursor position after line modifications"
    global position;
    newlength = len(line) - len(indstr)
    if alt > len(line.strip()) or newlength < 0:
      position = indent + newlength
      return
    elif not exclude == tag and mod.group(tag):
      spaces = (tag == 'left') and sp or 0
      beg = mod.start(tag) - len(indstr)
      end = mod.end(tag) + spaces - len(indstr)
      if alt > beg:
        if beg < alt and end > alt: position = beg + indent
        else: position -= mod.end(tag)+spaces - mod.start(tag)
    if tag != 'left': checkCursorPosition(mod, sp, exclude, 'left')
    return

def IfIn(targetStr, *checkargs):
    "function checks targetStr if it contains any of checkargs strings. Returns first match or empty string"
    if targetStr.strip():
      for str in checkargs:
        if str in targetStr: return str
    return ''

formatted = 0
#final script code:
if selStart == selEnd:
    #fast single line modification
    for comments in commentTags:
       if comments.left in line or comments.right in line:
          format_line(comments); formatted = 1; break
    if not formatted:
       if line.isspace(): line = line[:alt-2]
       if alt > len(line.strip()): position = indent + len(line.strip())
       line = re.sub('([ \t]*)(.*)[ \t]*(\r?\n|$)', r'\1'+cleft+r'\2', line).rstrip() + cright
       if indent <= position: position += len(cleft)
       if CONSOLE_DEBUG: console.write('\n\n\n>commenting line...  '+cleft+cright)
    editor.replaceLine(currentLine, line)
    editor.setSel(position, position)
else:
    #selection modification: if tags detected clear, else comment
    if MULTIPLE_CLEANING:
      for comments in commentTags:
        if formatted: break
        counter = 0; modrange = 0;
        list, rem = combine_pattern(comments)
        for linePos in range(editor.lineFromPosition(selStart), editor.lineFromPosition(selEnd)+1):
          linebeg, lineend = editor.getLineSelStartPosition(linePos), editor.getLineSelEndPosition(linePos)
          lineStr = editor.getTextRange(linebeg, lineend)
          editor.setTargetStart(linebeg); editor.setTargetEnd(lineend)
          if comments.left in lineStr or comments.right in lineStr:
            if CONSOLE_DEBUG: console.write('\n\n>removing  ')
            for tag in list.findall(lineStr):
              if CONSOLE_DEBUG: console.write((tag[0] or tag[1] or tag[4])+'  '); counter+=1
              modrange+=len("".join(tag));
            if CONSOLE_DEBUG: console.write('\nfrom selection  "'+lineStr.strip()+'"')
            lineStr = rem.sub('', lineStr)
            editor.replaceTarget(lineStr); formatted = 1
            setSelectionArea(selStart, selEnd-modrange)
        if formatted and CONSOLE_DEBUG:
          console.write('\n:total removals '+str(counter)+'  :total range '+str(modrange))
    if not formatted:
      editor.setTargetStart(selEnd); editor.setTargetEnd(selEnd)
      editor.replaceTarget(cright)
      editor.setTargetStart(selStart); editor.setTargetEnd(selStart)
      editor.replaceTarget(cleft)
      if CONSOLE_DEBUG:
        console.write('\n\n>commenting selection...  '+cleft+' '+cright)
      setSelectionArea(selStart, selEnd+len(cleft)+len(cright))


#author: s1w_

#flag_legend_for_greedy_engine: ------------------------------------
# string                action      flag     condition
# |-->      *      -->| rem right   0110 x6
# |abc -->  *      -->| rem right   1110 xE
# |abc -->  *  --> abc| rem right   1111 xF
# |-->      *     <!--| rem right   0100 x4
# |abc -->  *     <!--| rem right   1100 xC
# |abc -->  * <!-- abc| rem right   1101 xD
# |abc <!-- *     <!--| rem right   1000 x8
# |<!--     *      -->| rem right   0010 x2  if right--> === <!---->
# |abc <!-- *      -->| rem right   1010 xA  if right--> === <!---->
# |abc <!-- *  --> abc| rem right   1011 xB  if right--> === <!---->
# |<!--     *  --> abc| rem left    0011 x3  if right--> === <!---->
# |-->      *  --> abc| rem left    0111 x7
# |-->      * <!-- abc| rem left    0101 x5
# |<!--     *     <!--| rem left    0000 x0
# |<!--     * <!-- abc| rem left    0001 x1
# |abc <!-- * <!-- abc| rem left    1001 x9
# |<!--     *      -->| rem both    0010 x2  if right--> != <!---->
# |<!--     *  --> abc| rem both    0011 x3  if right--> != <!---->
# |abc <!-- *      -->| rem both    1010 xA  if right--> != <!---->
# |abc <!-- *  --> abc| rem both    1011 xB  if right--> != <!---->
# where <!----> === -->

#testing_samples: --------------------------------------------------
#    asd <!--asdg ----->  <!----> dd <!--- ---> sdfaf  --->
#    asd <!---- /**<!-- asdg <!--- */dd --->sdfaf ---> ---> asda
#    asd <!----> a<!---->sdg<!---->sd ----->  <!----> dd
#     <!---->        <!--- asdg ----->  <!----> dd <!--- --->
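The flag legend at the end of the script can be read as a 4-bit lookup. The sketch below follows the legend table as written, not necessarily every branch of format_line; the greedy_flag and legend_action names and the three sets are made up for illustration.

```python
# Flags grouped by the action the legend assigns to them.
RIGHT_ONLY = {0x4, 0x6, 0x8, 0xC, 0xD, 0xE, 0xF}
LEFT_ONLY = {0x0, 0x1, 0x5, 0x7, 0x9}
PAIRED = {0x2, 0x3, 0xA, 0xB}   # opening tag on the left, closing on the right


def greedy_flag(left_garbage, left_tag, right_tag, right_garbage, closing='-->'):
    """Build the 4-bit flag the same way the script does."""
    flag = 0
    if left_garbage.strip():
        flag |= 0x8   # text before the left tag
    if closing in left_tag:
        flag |= 0x4   # left tag is a closing marker
    if closing in right_tag:
        flag |= 0x2   # right tag is a closing marker
    if right_garbage.strip():
        flag |= 0x1   # text after the right tag
    return flag


def legend_action(flag, right_is_combo=False):
    """Return 'left', 'right' or 'both' according to the legend table."""
    if flag in RIGHT_ONLY:
        return 'right'
    if flag in LEFT_ONLY:
        return 'left'
    # Remaining flags form a proper pair unless the right tag is '<!---->'.
    if not right_is_combo:
        return 'both'
    return 'left' if flag == 0x3 else 'right'


flag = greedy_flag('abc ', '-->', '-->', ' abc')
print(hex(flag), legend_action(flag))
```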

http://snipplr.com/view/55184/toggle-html-comment--pythonscript-for-notepad/