The text I have:

Amiloride-sensitive cation channel, ASIC3 (also called BNC1 or MDEG) which is an acid-sensitive (proton-gated) homo- or hetero-oligomeric cation (Na+ (high affinity), Ca2+, K+) channel. It associates with DRASIC and ASIC1. It mediates touch sensation, being a mechanosensor) (lead inhibited) (Wang et al., 2006). In pulmonary tissue (lung epithelial cells) it and CFTR interregulate each other (Su et al., 2006). ASIC3 is a sensor of acidic and primary inflammatory pain (Deval et al., 2008).

I'm trying to remove all instances of X-sensitive, X-gated, etc.

My Code:

functional = r'\w{1,}( |-)(inducing|inducible|inhibited|inhibiting|responsive|gated|regulated|activated|receptor|modulated|enhanced|repressed|repressible|sensitive|dependent)'
cleantext=re.sub('\(|\)|\[|\]','',cleantext)
cleantext = re.sub(functional,'',cleantext,re.IGNORECASE)
print cleantext

Sometimes the two words are separated by a space or a dash.

But Python only replaces a few of the instances. This is the output:

cation channel, ASIC3 also called BNC1 or MDEG which is an proton-gated homo- or hetero-oligomeric cation Na+ high affinity, Ca2+, K+ channel. It associates with DRASIC and ASIC1. It mediates touch sensation, being a mechanosensor lead inhibited . In pulmonary tissue lung epithelial cells it and CFTR interregulate each other . ASIC3 is a sensor of acidic and primary inflammatory pain .

Notice that 'proton-gated' is still there? It got rid of 'Amiloride-sensitive' and 'acid-sensitive', but 'lead inhibited' also survived.

But IGNORES 'PROTON-GATED'

Why is this? I have many instances of this where only some of the matches are being replaced.

All 3 Replies

When I change my regex to just:
`functional = r'(inducing|inducible|inhibited|inhibiting|responsive|gated|regulated|activated|receptor|modulated|enhanced|repressed|repressible|sensitive|dependent)'`

only the word 'sensitive' is removed. Everything else just sits there....

I think it is a parity error. Basically, you are looking for 2 consecutive words. Take the sequence `foo inhibited bar baz gated qux`. There are 3 pairs of consecutive words: (foo, inhibited), (bar, baz), (gated, qux). `gated` is not removed because it is not in second position in its pair.
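The pairing idea is easy to check directly. Here is a minimal sketch that runs a cut-down version of the pattern (only two of the keywords, and `[ -]` in place of `( |-)`, both my simplifications) over that exact sequence. As far as I know, `re.sub` scans left to right for non-overlapping matches rather than splitting the text into fixed word pairs first, so this shows whether `gated` really gets skipped:

```python
import re

# The example sequence from the reply above.
sentence = "foo inhibited bar baz gated qux"

# Cut-down version of the original pattern (assumption: two keywords
# are enough to test the pairing hypothesis).
pattern = r'\w+[ -](?:inhibited|gated)'

# re.sub looks for the next match starting wherever the previous one ended.
result = re.sub(pattern, '', sentence)

print(repr(result))
print('gated' in result)
```

If `gated` is gone from the result, the missing replacements in the original post need another explanation.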

Does it look better if you remove re.IGNORECASE?

import re

data = '''\
Amiloride-sensitive cation channel, ASIC3 (also called BNC1 or MDEG) which is an acid-sensitive (proton-gated)'''

cleantext = re.sub(r'\(|\)|\[|\]', '' ,data)
cleantext = re.sub(r'\w{1,}( |-)(gated|sensitive)', '' ,cleantext)
print cleantext.strip()

'''Output-->
cation channel, ASIC3 also called BNC1 or MDEG which is an
'''
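A likely reason why dropping `re.IGNORECASE` helps: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a fourth positional argument is taken as `count`, not `flags`. `re.IGNORECASE` has the integer value 2, which would cap the substitution at two replacements. The sketch below uses a shortened sample string and keyword list (both my own, for illustration) and passes the value as `count=` explicitly to make the effect visible:

```python
import re

# Shortened sample text and keyword list, for illustration only.
text = "Amiloride-sensitive channel, acid-sensitive, proton-gated, lead inhibited"
pattern = r'\w+[ -](?:sensitive|gated|inhibited)'

# re.IGNORECASE is an integer flag whose value is 2.
flag_value = int(re.IGNORECASE)

# What the original call effectively does: the flag lands in the count
# slot, so only the first two matches are replaced.
limited = re.sub(pattern, '', text, count=flag_value)

# Passing the flag by keyword applies it as a flag and leaves count at 0
# (meaning: replace every match).
full = re.sub(pattern, '', text, flags=re.IGNORECASE)

print(flag_value)
print(repr(limited))
print(repr(full))
```

So in the original code, writing `re.sub(functional, '', cleantext, flags=re.IGNORECASE)` should remove every match while still ignoring case.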