Hi, I have a huge file (over 60 GB) which has lines in the following consistent format.

"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"

The problem is that a few lines in this file have a line break precisely after the 3rd entry, like this:

"entry1";;"entry2";;"entry3\n
";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"

I need to delete that extra newline and concatenate the line below it with the above one so that it becomes a complete line again. I've come up with the following so far.

in_file = open('myhugetextfile.txt')
out_file = open('mycleaneduptextfile.txt','w')

#go through the file line by line

for line in in_file:

    #split on ;; and check if the length is less than 11 entries long
    #if length is less than 11, it means the line has an unnecessary newline in it

    if len(line.split(';;')) < 11:
        #strip the unneeded newline char off the end of the line
        line = line.strip('\n')

        #Read the next line (incomplete) and store it
        newline = in_file.readline()

        #now join the original broken line and the next line
        repaired_line = line + newline

        #Write it to a new file
        out_file.write(repaired_line)

    else:
        #If there are no breaks in the line, just write it out to the new file
        out_file.write(line)

However, when I run this I get a

"ValueError: Mixing iteration and read methods would lose data"

Is my program logic correct or am I doing this the wrong way?

Any help would be appreciated.

Thanks,
Adi


A tip:

for line in in_file.readlines():

Cheers and Happy coding

Store the "short" line and blank it after writing. If there isn't a short line, the store variable will be empty.

in_file = open('myhugetextfile.txt')
out_file = open('mycleaneduptextfile.txt','w')
 
#go through the file line by line
repaired_line = "" 
for line in in_file:

    #split on ;; and check if the length is less than 11 entries long
    #if length is less than 11, it means the line has an unnecessary newline in it

    if len(line.split(';;')) < 11:
        #strip the unneeded newline char off the end of the line
        line = line.strip('\n')

        #  store this value
        repaired_line = line

    else:
        out_file.write(repaired_line + line)

        ## blank repaired_line after every write
        repaired_line = ""

Good solution, though I would do

if line.count(';;') < 4:

That is probably a lot cheaper than splitting every line of a 60 GB file. Also, the write should happen when the incomplete second line (the one right after the short line) is read, not when the next complete line is read. As written, the code overwrites the first short piece with the second short piece.
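
To make the counting idea concrete, here is a minimal sketch. It assumes the format from the question: a complete line has 11 entries and therefore 10 ';;' separators, the first piece of a broken line has 2, and the second piece has 8. The function name classify and the labels it returns are made up for illustration.

```python
def classify(line):
    """Label a line by how many ';;' separators it has (hypothetical helper)."""
    separators = line.count(';;')
    if separators >= 10:
        return 'complete'   # all 11 entries are on this line
    if separators < 5:
        return 'head'       # broken after the 3rd entry: only 2 separators
    return 'tail'           # the remaining piece: 8 separators


# Build one complete line and the two pieces of a broken one.
full = ';;'.join('"entry%d"' % i for i in range(1, 12))
head = '"entry1";;"entry2";;"entry3'
tail = '";;' + ';;'.join('"entry%d"' % i for i in range(4, 12))

print(classify(full), classify(head), classify(tail))
```

Note that count() only scans the string once and builds no list, which is where the saving over split() would come from.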

Regarding this tip:

for line in in_file.readlines():

Do you have 60 GB+ of RAM? Lucky guy ;) readlines() builds a list of every line in the file before the loop even starts.
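
One way to keep memory use flat, and also to sidestep the ValueError from the original post, might be to pull the continuation line from the same iterator with next() instead of calling readline(). The sketch below assumes a broken line is always followed immediately by its second piece and that the file has no blank lines. repair_lines is a made-up name, and io.StringIO only stands in for the real file.

```python
import io


def repair_lines(lines):
    """Yield lines, stitching together any line broken after the 3rd entry."""
    iterator = iter(lines)
    for line in iterator:
        if line.count(';;') < 5:
            # First piece of a broken line: drop its newline and append the
            # next line, taken from the same iterator (no readline() call).
            line = line.rstrip('\n') + next(iterator, '')
        yield line


# Stand-in for the real file: one broken line followed by one good line.
sample = io.StringIO(
    '"e1";;"e2";;"e3\n'
    '";;"e4";;"e5";;"e6";;"e7";;"e8";;"e9";;"e10";;"e11"\n'
    '"e1";;"e2";;"e3";;"e4";;"e5";;"e6";;"e7";;"e8";;"e9";;"e10";;"e11"\n'
)
for fixed in repair_lines(sample):
    print(fixed, end='')
```

Because the generator only ever holds the current line (plus one continuation), it should behave the same on an open file object as on the small sample.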

Here is a test of woooee's code to prove my analysis (I cannot edit my post anymore):

lines='''"entry1";;"entry2";;"entry3
";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"'''
repaired_line = "" 
for line in lines.split():

    #split on ;; and check if the length is less than 11 entries long
    #if length is less than 11, it means the line has an unnecessary newline in it

    if len(line.split(';;')) < 11:
        #strip the unneeded newline char off the end of the line
        line = line.strip('\n')
        #  store this value
        repaired_line = line
    else:
        print(repaired_line + line)
        ## blank repaired_line after every write
        repaired_line = ""
"""Output:
";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11""entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"""

Corrected version 1:

lines='''"entry1";;"entry2";;"entry3
";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"'''
repaired_line = "" 
for line in lines.split():
    if line.count(';;') < 5:
        #strip the unneeded newline char off the end of the line
        line = line.strip('\n')
        #  store this value
        repaired_line = line
    else:
        print(repaired_line + line)
        ## blank repaired_line after every write
        repaired_line = ""
"""Output:
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"""

Second version:

lines='''"entry1";;"entry2";;"entry3
";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"'''
repaired_line = "" 
for line in lines.split():
    semicount = line.count(';;')
    if semicount < 5:
        #strip the unneeded newline char off the end of the line
        line = line.strip('\n')
        #  store this value
        repaired_line = line
    elif semicount < 10:
        print(repaired_line + line)
    else:
        print(line)
""" Output:
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
"""

It was a tip as I said, and you can read the lines individually as he does now.

And your code still uses my 60 GB+ RAM stick. :)
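
For the real 60 GB file, the same counting logic could run straight from one file to another, so that only one line is held in memory at a time. This is a sketch under assumptions, not a tested production script: repair_file is a made-up name, and it assumes the only damage is a single newline after the 3rd entry. The demonstration at the bottom uses a small temporary file in place of the real one.

```python
import os
import tempfile


def repair_file(src_path, dst_path):
    """Copy src to dst, joining lines that were split after the 3rd entry."""
    pending = ''  # first piece of a broken line, waiting for its second piece
    with open(src_path) as in_file, open(dst_path, 'w') as out_file:
        for line in in_file:              # lazy: reads one line at a time
            semicount = line.count(';;')
            if semicount < 5:
                # first piece: keep it without its newline, write nothing yet
                pending = line.rstrip('\n')
            elif semicount < 10:
                # second piece: write the joined line and reset the store
                out_file.write(pending + line)
                pending = ''
            else:
                # complete line: copy it through unchanged
                out_file.write(line)


# Small demonstration on a temporary file instead of the real 60 GB one.
with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, 'broken.txt')
    dst = os.path.join(folder, 'clean.txt')
    with open(src, 'w') as f:
        f.write('"e1";;"e2";;"e3\n'
                '";;"e4";;"e5";;"e6";;"e7";;"e8";;"e9";;"e10";;"e11"\n'
                '"e1";;"e2";;"e3";;"e4";;"e5";;"e6";;"e7";;"e8";;"e9";;"e10";;"e11"\n')
    repair_file(src, dst)
    with open(dst) as f:
        print(f.read(), end='')
```

With the file names from the first post, the call would be repair_file('myhugetextfile.txt', 'mycleaneduptextfile.txt').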

Hey thanks everyone!
