Gribouillis 1,391 Programming Explorer Team Colleague

The sendto() documentation says

socket.sendto(string[, flags], address): Send data to the socket. The socket should not be connected to a remote socket, since the destination socket is specified by address.

So perhaps you should remove the first call to connect().
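
To illustrate, here is a minimal sketch of an unconnected UDP exchange on the loopback interface. The address, the payload and the variable names are my own assumptions, not taken from your program; the point is only that sendto() receives the destination and no connect() call is made.

```python
import socket

# Receiver: bind to an ephemeral port on the loopback interface.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
address = receiver.getsockname()

# Sender: no connect() call, the destination is passed to sendto().
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = sender.sendto(b"hello", address)

# recvfrom() returns the payload and the address of the sender.
data, peer = receiver.recvfrom(1024)
print(sent, data)

sender.close()
receiver.close()
```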

Gribouillis 1,391 Programming Explorer Team Colleague

Probably delete the equal sign at line 7.

Gribouillis 1,391 Programming Explorer Team Colleague

You mean perhaps something like

def local_hot_list(filename, day_to_extract):
    f1 = h5py.File(filename, 'r')
    data = f1['Data']
    temp = data['Temperature']
    temp_day = numpy.array(temp[day_to_extract])
    press_head = data['Pressure_Head']
    press_day = numpy.array(press_head[day_to_extract])
    geo = f1['Geometry']
    nodes = geo['orig_node_number']
    orig_nodes = numpy.array(nodes['0'])
    return numpy.column_stack((orig_nodes,press_day,temp_day))

complete_hot_list = numpy.vstack(tuple(local_hot_list(name, day_to_extract) for name in local_files))

?

Gribouillis 1,391 Programming Explorer Team Colleague

I don't know how glade works, but I recently hand coded a GtkListStore, and I remember that the method treeview.add_column() must be called on each column that you want to see to correctly initialize the treeview. Can you check this in your code ?

edit: sorry, it's append_column()

Gribouillis 1,391 Programming Explorer Team Colleague

I found a simple way to load or reload a firefox tab from python, using a firefox add-on called 'remote control'. With this add-on enabled in a single firefox tab, your python program can automatically change the browser's page content. See this snippet. There is also this module but I haven't tried it yet ;)

Gribouillis 1,391 Programming Explorer Team Colleague

There is also numpy.loadtxt() to read the file

    import numpy
    with open("numfile.txt") as ifh:
        arr = numpy.loadtxt(ifh, usecols = (2, 3), dtype=float, delimiter=" , ", skiprows = 2)
    print arr

""" my output -->
[[  2.00000000e+00  -1.00000000e-04]
 [  2.00000000e+00   1.69973300e-01]
 [  2.00000000e+00   3.36706700e-01]
 [  2.00000000e+00   5.00040000e-01]]
"""

Use matplotlib to plot the curves.

ceck30s commented: Thank you for letting me know about numpy.txt(). It works! +0
Gribouillis 1,391 Programming Explorer Team Colleague

You can change to another port, perhaps avoiding the well-known ports 0 - 1023. For example, port 23 is normally used for a telnet service. I don't think there is a limit on the data files (either the number of files or their size), but if you have a huge file, read it in chunks!
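
Reading in chunks could look like the following sketch. The function name and the default chunk size are arbitrary choices of mine.

```python
def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes from a file."""
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk_size)
            if not block:
                # read() returns an empty bytes object at end of file
                break
            yield block
```

Each block can then be handed to the socket, for example with sock.sendall(block), so that the whole file never sits in memory at once.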

Gribouillis 1,391 Programming Explorer Team Colleague

Well, here is a working example. First the server program

#!/usr/bin/python3
# -*-coding: utf8-*-
# Title: theserver.py

from bottle import route, run
from bottle import static_file

@route('/static/<filename>')
def server_static(filename):
    return static_file(filename, root='./my_static_files')

if __name__ == "__main__":
    run(host='localhost', port=8080)

Then the client program

#!/usr/bin/python3
# -*-coding: utf8-*-
# Title: theclient.py

from urllib.request import urlopen

if __name__ == "__main__":
    data = urlopen("http://localhost:8080/static/foobar.txt").read()
    print(data)

Also I created a directory my_static_files with a data file foobar.txt.

Gribouillis 1,391 Programming Explorer Team Colleague

You could use a small web server like bottle. The server side is almost trivial (define a function to serve static files), the client side is straightforward using module urllib. Your app could start/stop the server process and manage the content of the directory where it would store the static files.

Gribouillis 1,391 Programming Explorer Team Colleague

My preferred way is to use the module I uploaded in this post

from whatever import path    
p = path("C:")/"foo"/"bar"/"qux"

Although usually I don't call it 'module whatever' but module 'kernilis.path'.
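
For readers on Python 3.4 or newer, the standard library module pathlib offers the same "/" syntax, so no extra module is needed. This is only a sketch of the equivalent call, not the module mentioned above; PureWindowsPath is used so that it gives the same result on any operating system.

```python
from pathlib import PureWindowsPath

# "C:/" (with the slash) is the root of drive C; a bare "C:" would be
# interpreted as a drive-relative path.
p = PureWindowsPath("C:/") / "foo" / "bar" / "qux"
print(p)
```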

Gribouillis 1,391 Programming Explorer Team Colleague

You can try this too

import os
p = os.path.join('C:\\', 'Users', 'spreston', 'Desktop', 'HeadFirst', 'Chapter 3')
Gribouillis 1,391 Programming Explorer Team Colleague

Here is more robust code to read the number from user input.

def findx(): # This function asks an integer from the user, it does not need a parameter.
    while True: # This means 'repeat indefinitely' (unless a break or return or raise statement is met)
        try:
            z = int(input("Enter: "))
        except (ValueError, TypeError): # Your code can handle more than one exception at the same time
            print("An integer is required to proceed.")
        else:
            if z <= 0: # the code can handle the negative case
                print("A positive integer is required to proceed.")
            else:
                return z # exit the function findx(), returning a value

def main():
    x = findx() # this is how you catch the value returned by findx()
    print("x = ", x) # python converts x to str for you. No need to write str(x)

main()
input()

I don't understand exactly what you want to do with this number x. Your
formula

((x-n)*2)+(n-2)

looks false if we are speaking about number of bits.

The number of bits of an integer usually means the number of digits it takes when it is written in base 2. For example, the number 2387 is written 100101010011 in base 2, and one says that the number of bits of 2387 is 12 because this representation has twelve digits. Is that what you mean?
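
If this is the meaning you have in mind, python can compute it directly. A short check with the number from the example:

```python
n = 2387
print(bin(n))           # binary digits of n, prefixed with '0b'
print(n.bit_length())   # number of binary digits of n
```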

Gribouillis 1,391 Programming Explorer Team Colleague

It seems to me that

MyList = custom_list_type("Mylist", (check_int, check_int, check_str))

is not more difficult to write than

Mylist = List(Int,Int,Str)

This feature may be part of the enthought tool suite, but it's not included in python. The pythonic way of thinking is to avoid type checking.
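
As a small illustration of that last point (my own example, the function name is hypothetical): instead of testing the type of the argument up front, the code simply attempts the conversion and lets it fail if the value is unsuitable.

```python
def to_port(value):
    """Return value as a TCP port number, or raise ValueError."""
    port = int(value)          # accepts 8080, "8080", 8080.0, ...
    if not 0 <= port <= 65535:
        raise ValueError("port out of range: %r" % (value,))
    return port

print(to_port("8080"), to_port(80))
```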

snippsat commented: The pythonic way of thinking is to avoid type checking. +1 +9
Gribouillis 1,391 Programming Explorer Team Colleague

Well, here is how you could use the abstract base classes in module collections to implement such custom types. Here is a list type whose item types are given. Notice that this class uses 'converters' instead of types, which means that complex behavior can be implemented with regard to the conditions on items; for example, one could require that an item be a floating point number between 0 and 1, etc.

#!/usr/bin/env python
# -*-coding: utf8-*-
# Title: customtp.py
# Author: Gribouillis
# Created: 2012-07-06 09:17:00.649267 (isoformat date)
# License: Public Domain
# Use this code freely.

import collections

def custom_list_type(name, converters):

    class tp(collections.MutableSequence):

        def __init__(self, iterable):
            L = list(iterable)
            if len(L) != len(self.converters):
                raise TypeError(("Invalid number of items", len(L)))
            self.data = list(c(x) for c, x in zip(self.converters, L))

        # sized methods
        def __len__(self):
            return len(self.data)

        # iterable methods
        def __iter__(self):
            return iter(self.data)

        # container methods
        def __contains__(self, value):
            return value in self.data

        # sequence methods
        def __getitem__(self, index):
            return self.data[index]

        # mutable sequence methods
        def __setitem__(self, index, value):
            self.data[index] = self.converters[index](value)
        def __delitem__(self, index):
            raise TypeError("Can not remove item")
        def insert(self, index, item):
            raise TypeError("Can not insert item")

        # repr
        def __repr__(self):
            return "{name}({items})".format(name = self.__class__.__name__, items = repr(list(self)))

    tp.converters = tuple(converters)
    tp.__name__ = name
    return tp

if __name__ == "__main__":

    # the first custom type attempts to convert invalid values (for example str to int)

    cuslist = custom_list_type("Mylist", (int, int, str))

    L = cuslist([3, "4", "hello"])
    L[1] = 39.14
    try:
        L[0] = "joe"
    except ValueError:
        print "test joe passed" …
TrustyTony commented: Neat! +12
Gribouillis 1,391 Programming Explorer Team Colleague

You can try something like this

#!/usr/bin/env python
# -*-coding: utf8-*-
# Title: remcol.py

# WARNING: UNTESTED

import os

class Error(Exception):
    pass

def remove_column(srcdir, dstdir = None):
    if dstdir is None:
        dstdir = os.path.join(os.path.expanduser("~"), "columnremoved")
    if os.path.exists(dstdir):
        # this guarantees that this function won't overwrite anything on the disk
        raise Error(("destination directory exists", dstdir))
    else:
        os.mkdir(dstdir)
    for filename in os.listdir(srcdir):
        if not filename.endswith(".txt"):
            continue
        inname = os.path.join(srcdir, filename)
        outname = os.path.join(dstdir, filename)
        with open(outname, "w") as ofh:
            with open(inname, "r") as ifh:
                for line in ifh:
                    line = line.split(",", 1)[1]
                    ofh.write(line)
    return dstdir

if __name__ == "__main__":
    dstdir = remove_column(r"/path/to/the/directory/containing/the/3200/files")
    print("output files written to directory '%s'" % dstdir)
Gribouillis 1,391 Programming Explorer Team Colleague

Also

'\t'.join(str(x) for x in mylist)
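
A quick check with a list of mixed types (the sample values are mine); the str() call is what allows non-string items to be joined.

```python
mylist = [1, 2.5, "abc"]
line = "\t".join(str(x) for x in mylist)
print(repr(line))
```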
Gribouillis 1,391 Programming Explorer Team Colleague

It's always been like this: if you evaluate a python expression in a namespace, the __builtins__ dictionary is inserted in that namespace. For example

In [3]: D = dict()

In [4]: D.keys()
Out[4]: []

In [5]: eval("1+2", D)
Out[5]: 3

In [6]: D.keys()
Out[6]: ['__builtins__']

I think it's done at line 4702 in ceval.c.
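
The same behaviour can be observed in Python 3 with a short script. The second part is an addition of mine: if the dictionary already contains a '__builtins__' key, nothing is inserted.

```python
namespace = {}
result = eval("1 + 2", namespace)
print(result, list(namespace))

# A namespace that already has a '__builtins__' entry is left as it is.
restricted = {"__builtins__": {}}
print(eval("1 + 2", restricted), list(restricted))
```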

Gribouillis 1,391 Programming Explorer Team Colleague

I don't understand why you open the logfile in read mode in the first case. You can try the following, which should create an error file with a traceback if the logfile does not open

from datetime import datetime
import os
import traceback

logfilename = os.path.join('C:\\', 'Dir1', 'Sub One', 'Sub3', 'Sub4', 'Sub Five', 'logfile.txt')
mode = 'a' if os.path.exists(logfilename) else 'w+'
try:
    log = open(logfilename, mode)
except Exception:
    with open(os.path.join('C:\\', 'ERROR.txt'), 'w') as ofh:
        traceback.print_exc(file = ofh)
Gribouillis 1,391 Programming Explorer Team Colleague

A circle with center (a,b) and radius r has the equation (x-a)^2 + (y-b)^2 = r^2. If the horizontal line has the equation y = c, the intersection points satisfy (x-a)^2 = r^2 - (c-b)^2. When the right-hand side is nonnegative, this gives

x = a +/- sqrt(r**2 - (b-c)**2)
y = c

You could ask the user to enter the center, radius and the number c.
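
The formula translates into a small function. This is a sketch with names of my choosing; it returns zero, one or two points depending on the sign of r**2 - (c-b)**2.

```python
from math import sqrt

def circle_hline_intersections(a, b, r, c):
    """Intersections of the circle of center (a, b), radius r with the line y = c."""
    d = r * r - (c - b) ** 2
    if d < 0:
        return []                 # the line misses the circle
    if d == 0:
        return [(a, c)]           # the line is tangent to the circle
    h = sqrt(d)
    return [(a - h, c), (a + h, c)]

print(circle_hline_intersections(0, 0, 5, 3))
```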

Gribouillis 1,391 Programming Explorer Team Colleague

In linux I used pyinotify which works very well.

Gribouillis 1,391 Programming Explorer Team Colleague

You must not compare strings but datetime instances

now_date = datetime.now()
dir_date = datetime.strptime(fname, fm)
del_date = now_date - timedelta(days = 14)

if  dir_date <= del_date :
    print "Folder %s is older than 14 days" % fname
vegaseat commented: nice +14
Gribouillis 1,391 Programming Explorer Team Colleague

Use datetime module

>>> from datetime import datetime, timedelta
>>> fm = "%d-%b-%y"
>>> delta = timedelta(days=14)
>>> d = datetime.strptime("28-FEB-12", fm)
>>> (d - delta).strftime(fm)
'14-Feb-12'
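
Wrapped in a function, the test for "older than 14 days" could look like this sketch. The function name and the explicit now argument are my additions (passing now makes the function easy to test). Note that %b depends on the locale, so the month abbreviations must match the one in use.

```python
from datetime import datetime, timedelta

def is_older_than(name, days, now, fmt="%d-%b-%y"):
    """True if the date encoded in name is at least `days` days before now."""
    return datetime.strptime(name, fmt) <= now - timedelta(days=days)

now = datetime(2012, 3, 20)
print(is_older_than("28-Feb-12", 14, now), is_older_than("10-Mar-12", 14, now))
```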
Gribouillis 1,391 Programming Explorer Team Colleague

I tried to do it using itertools.groupby(), but hihe's method is faster. The reason is probably that groupby() needs an initial sort, while filling a dict doesn't. Here is the comparison

#!/usr/bin/env python
# -*-coding: utf8-*-
# compare two implementations of creating a sorted index list

data_list = ['the', 'house', ',', 'the', 'beer']

from itertools import count, groupby
from operator import itemgetter
ig0 = itemgetter(0)
ig1 = itemgetter(1)

def score(item):
    return (-len(item[1]), item[0])

def grib_func(data_seq):
    L = sorted(zip(data_seq, count(0)))
    L = ((key, [x[1] for x in group]) for key, group in groupby(L, key = ig0))
    return sorted(L, key = score)

# hihe's code    
def create_index_dict(data_list):
    index_dict = {}
    for ix, word in enumerate(data_list):
        index_dict.setdefault(word, []).append(ix)
    return index_dict

def hihe_func(data_seq):
    return sorted(create_index_dict(data_seq).items(), key = score)

# comparison code

print grib_func(data_list)
print hihe_func(data_list)

from timeit import Timer
for name in ("hihe_func", "grib_func"):
    tm = Timer("%s(data_list)"%name, "from __main__ import %s, data_list"%name)
    print "{0}: {1}".format(name, tm.timeit())

""" my output -->
[('the', [0, 3]), (',', [2]), ('beer', [4]), ('house', [1])]
[('the', [0, 3]), (',', [2]), ('beer', [4]), ('house', [1])]
hihe_func: 11.2509949207
grib_func: 18.3761279583
"""
HiHe commented: thanks +6
vegaseat commented: thanks for timing this +14
Gribouillis 1,391 Programming Explorer Team Colleague

Did you read the sticky thread projects-for-the-beginner which contains many ideas in this direction ?

Gribouillis 1,391 Programming Explorer Team Colleague

All this would be more serious if you could describe the exact syntax rules of the input string.

Gribouillis 1,391 Programming Explorer Team Colleague

Yes, your statements are correct. Here is how you could implement the cpp static field behavior in python using an accessor function instead of an attribute

from __future__ import print_function
from functools import partial

# preliminary definitions

_default = object()

def _helper_static_cpp(container, value = _default):
    if value is _default:
        return container[0]
    else:
        container[0] = value

def static_cpp(initializer = None):
    return partial(_helper_static_cpp, [initializer])

# class example imitating a cpp static field with an accessor function

class A(object):
    istat = static_cpp(111)

if __name__ == '__main__':
    a = A()
    print(A.istat(), a.istat())
    A.istat(112)
    print(A.istat(), a.istat())
    a.istat(113)
    print(A.istat(), a.istat())
    b = A()
    b.istat(114)
    print(A.istat(), a.istat(), b.istat())

""" my output -->
111 111
112 112
113 113
114 114 114
"""
HiHe commented: nice help +5
Gribouillis 1,391 Programming Explorer Team Colleague

for Python3 you need to remove the main part or change the print statements

I posted a new version as a code snippet . It includes a bugfix with regard to rounding the last bit.

Gribouillis 1,391 Programming Explorer Team Colleague

What do you mean when you say that you get more precise floats than using the struct module ? Can you give a code example ?

Gribouillis 1,391 Programming Explorer Team Colleague

I suggest writing it as

while((tr < row) && (grid[tr][tc]=='0'||grid[tr][tc]==s[i]) && (count<strlen(s)))
       { ++count;
         ++tr;
       }

If tr < row fails, grid[tr] will not be computed :)
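
The same short-circuit rule exists for python's and operator, which makes it easy to try out. In this sketch (function and variable names are mine), the bounds test comes first, so the indexing on the right is never evaluated with an index that is too large.

```python
def count_free_cells(column, start):
    """Count consecutive '0' cells in column, starting at index start."""
    i = start
    # i < len(column) is tested first; column[i] is only evaluated when it holds.
    while i < len(column) and column[i] == "0":
        i += 1
    return i - start

print(count_free_cells("0001", 0), count_free_cells("000", 5))
```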

Gribouillis 1,391 Programming Explorer Team Colleague

I wrote a nice class to convert between various ieee754 formats

#!/usr/bin/env python
# -*-coding: utf8-*-
# Title: anyfloat.py
# Author: Gribouillis for the python forum at www.daniweb.com
# Created: 2012-05-02 06:46:42.708131 (isoformat date)
# License: Public Domain
# Use this code freely.

from collections import namedtuple
from math import isnan
import struct
import sys

if sys.version_info < (2, 7):
    raise ImportError("Module anyfloat requires python 2.7 or newer.")


class anyfloat(namedtuple("anyfloat", "sign log2 mantissa")):
    """A class storing real numbers independently from the ieee754 format.

    This class stores a real number as a triple of integers (sign, log2, mantissa)
    where sign is always 0 or 1 and:
        a) mantissa == -2 is used to represent NaN values. In this case, sign == 0 and log2 == 0.
           There is only one NaN value in this representation.
        b) mantissa == -1 is used to represent +Infinity if sign == 0 and -Infinity if sign == 1.
           In this case, log2 == 0
        c) mantissa == 0 is used to represent 0.0 if sign == 0 and -0.0 if sign == 1. In this
           case, log2 == 0
        d) mantissa > 0 is used to represent any other real number with a finite number of binary
           digits. The real number x corresponding to the anyfloat instance is mathematically
                x = +/- pow(2, log2) * y
           where y is the number in [1, 2[ whose binary digits are the binary digits of the mantissa.
           For example the real number corresponding to anyfloat(1, 5, 39) …
Gribouillis 1,391 Programming Explorer Team Colleague

I once wrote this function to get the ieee754 representation of a float

def ieee754(x):
    """Return a string of 0 and 1 giving the ieee754 representation of the float x
    """
    import struct
    from binascii import hexlify
    p = struct.pack("d", x)
    s = bin(int(b"1" + hexlify(p), 16))[3:]
    return " ".join(reversed([s[i:i+8] for i in xrange(0, len(s), 8)]))

The following test shows that the 16 bits version is very close to the 64 bits version. It means that you should be able to find the 3 numbers easily

0011110000000000
00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000
1.0 1.0
0011110000000001
00111111 11110000 00000100 00000000 00000000 00000000 00000000 00000000
1.0009765625 1.0009765625
1100000000000000
11000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
-2.0 -2.0
0111101111111111
01000000 11101111 11111100 00000000 00000000 00000000 00000000 00000000
65504 65504.0
0000010000000000
00111111 00010000 00000000 00000000 00000000 00000000 00000000 00000000
6.103515625e-05 6.103515625e-05
0000001111111111
00111111 00001111 11111000 00000000 00000000 00000000 00000000 00000000
6.097555160522461e-05 6.097555160522461e-05
0000000000000001
00111110 01110000 00000000 00000000 00000000 00000000 00000000 00000000
5.960464477539063e-08 5.960464477539063e-08
0000000000000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0.0 0.0
1000000000000000
10000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
-0.0 -0.0
0111110000000000
01000000 11110000 00000000 00000000 00000000 00000000 00000000 00000000
65536 inf
1111110000000000
11000000 11110000 00000000 00000000 00000000 00000000 00000000 00000000
-65536 -inf
0011010101010101
00111111 11010101 01010100 00000000 00000000 00000000 00000000 00000000
0.333251953125 0.3333333333333333

Apparently, removing bits [2, 8[ and truncating brings back the 16 bits float.
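
To check such conversions independently, note that since Python 3.6 the struct module has an 'e' format for 16-bit floats. The sketch below (function names are mine) decodes a bit pattern both with struct and with the usual half-precision formula: 1 sign bit, 5 exponent bits, 10 fraction bits. Unlike the raw formula, it treats an all-ones exponent as infinity or NaN.

```python
import struct

def half_from_bits(bits):
    """Decode a string of sixteen 0/1 characters as an IEEE 754 binary16 value."""
    raw = int(bits.replace(" ", ""), 2).to_bytes(2, "big")
    return struct.unpack(">e", raw)[0]

def half_by_formula(bits):
    """Same decoding done by hand from the sign, exponent and fraction fields."""
    n = int(bits.replace(" ", ""), 2)
    sign, exponent, fraction = n >> 15, (n >> 10) & 0x1F, n & 0x3FF
    if exponent == 0x1F:                       # all ones: infinity or NaN
        value = float("inf") if fraction == 0 else float("nan")
    elif exponent:                             # normal number
        value = 2.0 ** (exponent - 25) * (1024 + fraction)
    else:                                      # subnormal number or zero
        value = 2.0 ** -24 * fraction
    return -value if sign else value

for pattern in ("0 01111 0000000000", "0 11110 1111111111", "0 01101 0101010101"):
    print(pattern, half_from_bits(pattern), half_by_formula(pattern))
```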

Gribouillis 1,391 Programming Explorer Team Colleague

Here is a way using my snippet on non recursive tree traversal 1.4

from functools import partial
import walktree as wt

def subn(letters, k, node):
    if len(node) == k:
        return
    for x in letters:
        yield node + x

for path in wt.walk("", partial(subn, "ABC", 3), wt.event(wt.leaf)):
    print path[-1]

"""my output -->
AAA
AAB
AAC
ABA
ABB
ABC
ACA
ACB
ACC
BAA
BAB
BAC
BBA
BBB
BBC
BCA
BCB
BCC
CAA
CAB
CAC
CBA
CBB
CBC
CCA
CCB
CCC
"""
Gribouillis 1,391 Programming Explorer Team Colleague

I have a starting point using module bitstring (from pypi) and the wikipedia page

import bitstring

test_data = [ 
    ("0 01111 0000000000", 1),
    ("0 01111 0000000001", 1.0009765625),
    ("1 10000 0000000000", -2),
    ("0 11110 1111111111", 65504),
    ("0 00001 0000000000", 2.0 ** (-14)),
    ("0 00000 1111111111", 2.0**(-14) - 2.0**(-24)),
    ("0 00000 0000000001", 2.0**(-24)),
    ("0 00000 0000000000", 0.0),
    ("1 00000 0000000000", -0.0),
    ("0 11111 0000000000", float("infinity")),
    ("1 11111 0000000000", -float("infinity")),
    ("0 01101 0101010101", 1.0/3),
]

fmt = ['uint:1', 'uint:5', 'uint:10']

for u, res in test_data:
    a, b, c = (int(x, 2) for x in u.split())
    s = bitstring.pack(fmt, a, b, c)
    print s.bin
    s, e, m = s.unpack(fmt)
    if e:
        v = 2 ** (e - 25) * (1024 + m)
    else:
        v = 2 ** (-24) * m
    if s:
        v = -v
    print repr(v), repr(float(res))

"""my output -->
0011110000000000
1.0 1.0
0011110000000001
1.0009765625 1.0009765625
1100000000000000
-2.0 -2.0
0111101111111111
65504 65504.0
0000010000000000
6.103515625e-05 6.103515625e-05
0000001111111111
6.097555160522461e-05 6.097555160522461e-05
0000000000000001
5.960464477539063e-08 5.960464477539063e-08
0000000000000000
0.0 0.0
1000000000000000
-0.0 -0.0
0111110000000000
65536 inf
1111110000000000
-65536 -inf
0011010101010101
0.333251953125 0.3333333333333333
"""
Gribouillis 1,391 Programming Explorer Team Colleague

Take care that you have strings, not integer values, so '9' would be maximum at column 1 and '10000' would be minimum at column 2.

Oh yes it is true. Here is the corrected version

def extremes(records):
    for key, group in itt.groupby(records, itemgetter(0)):
        M, m = (int(x) for x in next(group)[1:3])
        for record in group:
            M, m = max(M, int(record[1])), min(m, int(record[2]))
        yield (key, str(M), str(m))

print list(extremes(valuearray))
weblover commented: thank you +2
Gribouillis 1,391 Programming Explorer Team Colleague

A better solution would probably be to write a generator function like this one

def extremes(records):
    for key, group in itt.groupby(records, itemgetter(0)):
        M, m = next(group)[1:3]
        for record in group:
            M, m = max(M, record[1]), min(m, record[2])
        yield (key, M, m)

print list(extremes(valuearray))

Its advantage over the previous versions is that it does not store more than one record at a time. The argument can be any iterable (not necessarily a list). For example, one could read the records from a file and write the result to another file, and very little memory would be used. It also works if there are more than 3 columns. Notice that the initial data needs to be sorted on the key.
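
Here is a self-contained Python 3 sketch of such a generator, with two assumptions of mine: the columns are compared as integers through int(), and the sample rows (invented for the example) are sorted on the key before being passed in.

```python
from itertools import groupby
from operator import itemgetter

def extremes(records):
    """Yield (key, max of column 1, min of column 2) for each run of equal keys."""
    for key, group in groupby(records, key=itemgetter(0)):
        first = next(group)
        hi, lo = int(first[1]), int(first[2])
        for record in group:
            # only the current record and the two running values are kept
            hi = max(hi, int(record[1]))
            lo = min(lo, int(record[2]))
        yield (key, hi, lo)

rows = [("B", "9", "10000"), ("A", "21", "45"), ("A", "100", "7"), ("B", "34", "53")]
result = list(extremes(sorted(rows, key=itemgetter(0))))
print(result)
```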

Gribouillis 1,391 Programming Explorer Team Colleague

Here is one way to do it

from operator import itemgetter
import itertools as itt

valuearray = [['A', '21', '45'], ['A', '12', '23'], 
              ['A', '54', '21'], ['A', '15', '54'], 
              ['B', '23', '53'], ['B', '34', '53'], 
              ['B', '32', '54'], ['B', '24', '13'], 
              ['C', '31', '43'], ['C', '42', '54'], 
              ['C', '35', '54'], ['C', '12', '11']]

result = list()
for key, group in itt.groupby(valuearray, itemgetter(0)):
    # replace valuearray with sorted(valuearray, key=itemgetter(0))
    # if valuearray is not initially sorted.
    u, v = itt.tee(group)
    u, v = max(u, key=itemgetter(1))[1], min(v, key = itemgetter(2))[2]
    result.append((key, u, v))

print result
""" my output -->
[('A', '54', '21'), ('B', '34', '13'), ('C', '42', '11')]
"""

I don't understand which result you want when there are more than 3 columns. What should the extra columns contain in the result ?

An alternative is

result = list()
for key, group in itt.groupby(valuearray, itemgetter(0)):
    u, v = zip(*(t[1:3] for t in group))
    result.append((key, max(u), min(v)))
Gribouillis 1,391 Programming Explorer Team Colleague

n i have to make mini project in python.... n i have completed it half bt i cannot do the remaining... plx help me anyone bcx i hv 2 submit the project on monday

Start a new thread describing your project and post the code you have so far, someone may help you.

Gribouillis 1,391 Programming Explorer Team Colleague

This is a typical use case for the __new__() method:

class BalancedTernary(long):
    def __new__(cls, n):
        instance = long.__new__(cls,
            balanced_ternary_value(n) if isinstance(n, str) else n)
        return instance

    def __repr__(self):
        return make_balanced(self)

    __str__ = __repr__

Your code seems to work now.

TrustyTony commented: Thanks for __new__ teaching! +12
Gribouillis 1,391 Programming Explorer Team Colleague

There is no such rule. import myScript reads the file on the disk if there is no module in sys.modules['myScript']. This module may exist even if the name myScript is not defined in the current global namespace. Apparently, your app closes and reopens the shell window in the same python process. One thing you could do is

import myScript
reload(myScript)
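
In Python 3, reload() is no longer a builtin; it lives in module importlib. A sketch of a helper doing the same thing (the name fresh_import is hypothetical, and colorsys only serves as a harmless example module):

```python
import importlib
import sys

def fresh_import(name):
    """Import module `name`, re-executing its source if it was already loaded."""
    if name in sys.modules:
        return importlib.reload(sys.modules[name])
    return importlib.import_module(name)

module = fresh_import("colorsys")
print(module.__name__)
```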
Gribouillis 1,391 Programming Explorer Team Colleague

You can use sub() with a function as argument

import re
from functools import partial

repl_dict = {'cat': 'Garfield', 'dog': 'Oddie' }

def helper(dic, match):
    word = match.group(0)
    return dic.get(word, word)

word_re = re.compile(r'\b[a-zA-Z]+\b')
text = "dog ate the catfood and went to cat's bed to see dog dreams on caterpillars"

print word_re.sub(partial(helper, repl_dict), text)

""" my output -->
Oddie ate the catfood and went to Garfield's bed to see Oddie dreams on caterpillars
"""
TrustyTony commented: Elegant partial+function +12
Gribouillis 1,391 Programming Explorer Team Colleague

Normally in windows, if you click on a .py file, it runs the file in a cmd console. If you want to run the program without console, you can rename it with the .pyw extension.

Gribouillis 1,391 Programming Explorer Team Colleague

Gribouillis thank you for all your effort. At the moment I am snowed under but this has to be done by next week so I will definitely look at it. I'm told we use lxml here at work. I definitely have to dissect what you have done here. Thanks again

lxml is very good, but I think it needs to store the whole parse tree.

Gribouillis 1,391 Programming Explorer Team Colleague

Please add id_saved = '' between line 61 and 62 :)

Gribouillis 1,391 Programming Explorer Team Colleague

This new code generates the index at 13 MB/s with a few assumptions. It should handle the 4GB in a little more than 5 minutes. It uses this code snippet http://www.daniweb.com/software-development/python/code/418239/1783422#post1783422

#!/usr/bin/env python
# -*-coding: utf8-*-
# Title: dups2.py
# Author: Gribouillis

"""Generate the index file with regexes and chunked input and output

    This code does not parse xml, but it assumes that:
    * records are delimited by <tag> and </tag> items, and that these items
        are only used with this meaning in the file.
    * within <tag> and </tag> sections, record's identity is delimited by <id> and </id>
        tags containing an integer value, and these items are only used with this meaning in the file.
    
    The code contains a few assert statements to check these assumptions.
"""

import re
from writechunks import MB, ChunkedOutputFile

class State:
    BASE = 0
    TAG = 1
    ID = 2
    TAGEND = 3

expected_state = {
    '<tag>': State.BASE,
    '<id>': State.TAG,
    '</id>': State.ID,
    '</tag>': State.TAGEND,
}

def next_state(state):
    return (state + 1) % 4

tag = re.compile("</?(?:tag|id)>")

def main2(input_filename, input_chunk, ofh):
    with open(input_filename) as ifh:
        state = State.BASE
        offset = 0
        last_end = 0
        id_saved = ''
        tail = ''
        while True:
            s = ifh.read(input_chunk)
            if s:
                if tail:
                    s = tail + s
            else:
                ofh.write("%d\teof\n" % (offset + len(tail)))
                return
            size = len(s)
            for match in tag.finditer(s):
                t = match.group(0)
                assert expected_state[t] == state
                last_end = match.end()
                if state == State.TAG:
                    begin_id = last_end
                elif state == State.ID:
                    id = id_saved …
Gribouillis 1,391 Programming Explorer Team Colleague

A variation on woooee's idea

def F1(v, u, a, t):
    args = (v, u, a, t)
    if args.count(None) != 1:
        raise TypeError("Exactly one argument of F1() must be None")
    index = args.index(None)
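
To show where this pattern leads, here is a complete sketch under a loud assumption: the formula behind F1 is not shown here, so purely for illustration I take v = u + a*t. The function name is hypothetical too. The index of the None argument selects which rearrangement of the formula is used.

```python
def solve_v_u_a_t(v=None, u=None, a=None, t=None):
    """Solve the ASSUMED relation v = u + a*t for the one argument left as None."""
    args = (v, u, a, t)
    if args.count(None) != 1:
        raise TypeError("Exactly one argument must be None")
    index = args.index(None)
    if index == 0:
        return u + a * t
    if index == 1:
        return v - a * t
    if index == 2:
        return (v - u) / t
    return (v - u) / a

print(solve_v_u_a_t(u=2.0, a=3.0, t=4.0))
```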
Gribouillis 1,391 Programming Explorer Team Colleague

Here is a tested version of my code. It generated an 11 MB index file out of a 136 MB input file on my machine in 30 seconds, meaning that you can expect about 15 minutes for 4 GB. Note that the input data was obtained by duplicating the data you provided. In your actual file, the 'lorem ipsum' and 'blah blah' parts may be longer, which means a relatively smaller index file.

The input file was read by chunks of 32 MB. You can probably increase this value.

I think writing the index file is a very useful step. The index file will be much shorter than your input file, and it contains everything you need to sort and select items and prepare the output step without the need to read the big file every time you want to test the program.
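
As an example of what the index allows, this sketch finds the duplicate ids. It assumes the index format "offset TAB id" per record with a final "offset TAB eof" line, and it takes the start of the next entry as the end of a record; the function name and the sample values are mine.

```python
from collections import defaultdict

def find_duplicates(index_lines):
    """Map each id occurring more than once to its list of (start, end) byte ranges."""
    entries = [line.rstrip("\n").split("\t") for line in index_lines if line.strip()]
    ranges = defaultdict(list)
    # pair every entry with the following one; the eof line only serves as an end
    for (start, ident), (end, _) in zip(entries, entries[1:]):
        ranges[ident].append((int(start), int(end)))
    return {k: v for k, v in ranges.items() if len(v) > 1}

sample = ["0\t32444\n", "172\t32344\n", "347\t32444\n", "520\teof\n"]
print(find_duplicates(sample))
```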

Gribouillis 1,391 Programming Explorer Team Colleague

You must not read the file line by line. For 3 GB, if each line is 70 B long, it means about 50 million calls to read(). I don't know how many operations it means on your hard drive, but it is probably evil. You must read the input file in large chunks. In the same way, you must write the output file in large chunks. The advantage of my file adapter is that it allows you to transparently read the file in arbitrarily sized chunks.
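
Line-oriented information can still be computed from large blocks. As a sketch (the function name and the 8 MB default are my choices), this counts the newline characters of a file while issuing only one read() call per chunk:

```python
def count_lines(path, chunk_size=8 * 1024 * 1024):
    """Count newline bytes in a file, reading chunk_size bytes at a time."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk_size)
            if not block:
                return total
            total += block.count(b"\n")
```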

Gribouillis 1,391 Programming Explorer Team Colleague

It occurred to me that if the previous code is too slow, we can try to speed it up by writing 1 MB chunks at a time to the index file. For this you can use this class

from cStringIO import StringIO

MB = 1024 * 1024  # same value as the MB constant of the indexing script

class Output(object):
    def __init__(self, ofh, chunk = 1 * MB):
        self.ofh = ofh
        self.io = StringIO()
        self.sz = 0
        self.chunk = chunk
        
    def write(self, s):
        self.io.write(s)
        self.sz += len(s)
        if self.sz > self.chunk:
            self.ofh.write(self.io.getvalue())
            self.io = StringIO()
            self.sz = 0
            
    def flush(self):
        if self.sz:
            self.ofh.write(self.io.getvalue())
            self.io = None
            self.sz = 0
        self.ofh.flush()
            
    def __enter__(self):
        return self
        
    def __exit__(self, *args):
        self.flush()

And replace the run() method in class IndexCreator with

def run(self, input_filename, output_filename, chunk_size):
        self.parser = p = xml.parsers.expat.ParserCreate()
        self.state = 0
        with open(output_filename, "w") as output:
            with Output(output) as ofh:
                self.ofh = ofh
                p.StartElementHandler = self.start_element
                p.EndElementHandler = self.end_element
                p.CharacterDataHandler = self.char_data
                p.ParseFile(self.source(input_filename, chunk = chunk_size))
                self.eof_byte = self.byte_index() - len(self.SUFFIX)
                ofh.write("%d\teof\n" % self.eof_byte)
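
In Python 3, where cStringIO no longer exists, a similar effect might be obtained without a wrapper class by passing a large buffering value to open(). This is a sketch under that assumption, not a measured replacement for the class above; write_index is a hypothetical helper.

```python
MB = 1024 * 1024

def write_index(path, rows, buffer_size=1 * MB):
    """Write (offset, ident) rows as tab-separated lines through a large buffer."""
    with open(path, "w", buffering=buffer_size) as ofh:
        for offset, ident in rows:
            # small writes accumulate in the buffer and reach the disk in big blocks
            ofh.write("%d\t%s\n" % (offset, ident))
```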
Gribouillis 1,391 Programming Explorer Team Colleague

This code should run against the big file and create a shorter index file which you could use to find duplicate id's and select which items you want to rewrite

from itertools import chain
from adaptstrings import adapt_as_opener
import xml.parsers.expat

KB = 1024
MB = 1024 * KB

class IndexCreator(object):
    PREFIX = "<document>\n"
    SUFFIX = "</document>"

    def __init__(self):
        self.parser = None
        self.in_id = None
        self.ofh = None
        self.state = None
        
    def byte_index(self):
        return self.parser.CurrentByteIndex - len(self.PREFIX)
    
    def run(self, input_filename, output_filename, chunk_size):
        self.parser = p = xml.parsers.expat.ParserCreate()
        self.state = 0
        with open(output_filename, "w") as ofh:
            self.ofh = ofh
            self.in_id = False
            p.StartElementHandler = self.start_element
            p.EndElementHandler = self.end_element
            p.CharacterDataHandler = self.char_data
            p.ParseFile(self.source(input_filename, chunk = chunk_size))
            self.eof_byte = self.byte_index() - len(self.SUFFIX)
            ofh.write("%d\teof\n" % self.eof_byte)



    def start_element(self, name, attrs):
        if name == 'tag':
            assert self.state == 0
            self.ofh.write(str(self.byte_index()))
            self.state = 1
        elif name == 'id':
            assert self.state == 1
            self.in_id = True
            self.state = 2
            
    def end_element(self, name):
        if name == 'tag':
            assert self.state == 4
            self.state = 0
        elif name == 'id':
            assert self.state == 3
            self.in_id = False
            self.state = 4
            
    def char_data(self, data):
        if self.in_id:
            assert self.state == 2
            self.ofh.write('\t%d\n' % int(data))
            self.state = 3

    @adapt_as_opener
    def source(self, filename, mode='r', chunk= 1 * MB):
        yield self.PREFIX
        with open(filename, mode) as ifh:
            while True:
                s = ifh.read(chunk)
                if s:
                    yield s
                else:
                    break
        yield self.SUFFIX

if __name__ == "__main__":
    ic = IndexCreator()
    ic.run("dups.txt", "index.txt", 8 * MB)

""" After running this, the content of index.txt is

0       32444
172     32344
347 …
Gribouillis 1,391 Programming Explorer Team Colleague

It won't work on the big file because you can't store that much data in memory. I'm going to post some code to create an index file first.