Hi there!

I have run into a new problem, this time with the re.findall() function.
The objective of this code is to iterate over the rows of an Excel sheet and write them to another Excel sheet, with the species name and the gene name split into separate columns.

The regular expression seems to work fine for the first 4 rows of the Excel sheet, but the last 3 are not printed, even though they contain the same sort of names as the 4 that work. So it should be working...

Example:
species name:     gene name:
Homo sapiens      CYP2C19
Homo sapiens      CYP2C9
Danio rerio       CYP39A1
Xenopus leavis    CYP39A1    (text is printed until here)
Mus musculus      Cyp2c65
Mus musculus      Cyp2c66
Danio rerio       Cyp2c38

Without re.findall() all the rows are read, so is it a bug in re.findall()?

Can someone tell me what I'm doing wrong?

Here is part of my code (reading only the rows of the Excel sheet):
(see below for the Excel sheet used in this test)

import xlrd
import xlwt
import re

# inputfile:
wb = xlrd.open_workbook('Test_input.xls') 

#Get the first sheet either by index or by name
sh = wb.sheet_by_index(0)

print "Number of rows: %s   Number of cols: %s" % (sh.nrows, sh.ncols)

# Create a output workbook and worksheet
wbk = xlwt.Workbook()
sheet_total = wbk.add_sheet('names total') 
sheet_split = wbk.add_sheet('names split')

#Check the sheet names
wb.sheet_names()

#Algorithm for reading and writing from file to file per row:

#Index individual cells:
rowx = 1
colx = 0
row = 0  # row counter for new Excel sheet
counter_row = 1 # while counter

print 'Printing rows of Excel sheet:'
sheet_total.write(row,0,'Rows') # writes header in new Excel sheet
sheet_split.write(row,0,'Rows')
sheet_split.write(row,1,'Rows')

while counter_row < sh.nrows:
  row_cell = sh.cell(rowx,colx).value
  tuples = re.findall(r'(\w+\s\w+)\s*(CYP\w+)', row_cell)
  print 'TUPLES:', tuples

  rowx += 1
  print 'print_row:', rowx, colx, row_cell
  row += 1

  for tuple in tuples:
    print tuple   ## The whole match, print on sheet 1
    sheet_total.write(row,0,tuple)

    print tuple[0]  ## Species name (group 1), print sheet 2, col 1
    sheet_split.write(row,0,tuple[0])

    print tuple[1]  ## Gene name (group 2), print sheet 2, col 2
    sheet_split.write(row,1,tuple[1])

  if rowx == sh.nrows:
    rowx = 1
    counter_row += 1

wbk.save('reformatted.data.xls')

All 4 Replies

The last rows have mixed case, and you do not set the re.I flag to ignore case.

Thanks pyTony!
What do you mean by mixed case?
Do you mean upper and lower case?
But that is the same for all the rows (that is, for the species names).
So why is it only working for the first 4 rows?

As pyTony said, it only finds the cells containing "CYP" and ignores the cells containing "Cyp".
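
To see the difference without the Excel files, here is a minimal sketch that runs the same pattern over the sample rows from the question. The `rows` list is only a stand-in for the cell values, not part of the original code, and `print()` is written so that it also runs on Python 3.

```python
import re

# Stand-in for the cell values read from the sheet (copied from the example).
rows = [
    "Homo sapiens CYP2C19",
    "Homo sapiens CYP2C9",
    "Danio rerio CYP39A1",
    "Xenopus leavis CYP39A1",
    "Mus musculus Cyp2c65",
    "Mus musculus Cyp2c66",
    "Danio rerio Cyp2c38",
]

pattern = r'(\w+\s\w+)\s*(CYP\w+)'

# Case-sensitive: only gene names starting with upper-case "CYP" match.
sensitive = [row for row in rows if re.findall(pattern, row)]

# Case-insensitive: gene names starting with "Cyp" match as well.
insensitive = [row for row in rows if re.findall(pattern, row, re.IGNORECASE)]

print(len(sensitive))    # 4
print(len(insensitive))  # 7
```

So re.findall() is not skipping rows; it returns an empty list for the rows where the literal "CYP" does not occur.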

Thanks!

This is my new code and it works!

tuples = re.findall(r'(\w+\s\w+)\s*(CYP\w+)', row_cell, re.IGNORECASE)
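
As a side note, the row handling could probably be simplified too. The sketch below is only a suggestion under a few assumptions: every cell holds at most one species/gene pair, so `search()` is used in place of `findall()`; `split_rows` and `cells` are hypothetical names that are not in the code above; and a plain list stands in for the xlrd cell values.

```python
import re

# Compile once; re.IGNORECASE lets "CYP" also match "Cyp".
GENE_PATTERN = re.compile(r'(\w+\s\w+)\s*(CYP\w+)', re.IGNORECASE)

def split_rows(cell_values):
    """Return a list of (species, gene) pairs, one per matching cell."""
    pairs = []
    for value in cell_values:
        match = GENE_PATTERN.search(value)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs

# Hypothetical stand-in for:
#   [sh.cell(rowx, 0).value for rowx in range(1, sh.nrows)]
cells = ["Xenopus leavis CYP39A1", "Mus musculus Cyp2c65", "no gene here"]

for row, (species, gene) in enumerate(split_rows(cells), start=1):
    # With xlwt, sheet_split.write(row, 0, species) and
    # sheet_split.write(row, 1, gene) would go here.
    print("%d %s %s" % (row, species, gene))
```

A for loop over the rows visits each row exactly once, so the separate rowx and counter_row counters are no longer needed.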