I'm writing a script that will search files that contain certain keywords, and then output the line to a text file.

The problem I'm having is that the script is opening every type of file, including executables.

The script seems to work fine, as in it doesn't crash when it opens these types of files, but the problem I'm having is with the output.

When the script outputs text from some type of non-text-based file, I cannot open the file with gedit. It gives me the "gedit has not been able to detect the character encoding." error.

So here is my question.

Either A) the script needs to open and search only text-based files, or B) I need to force the script to write its output in a specific encoding so that I can open and read the output file.

I tried having the script filter by file extension, but there are far too many extensions to list, so I was hoping for a more elegant solution.

One more piece of information: this script will be used to search files on both Windows and *NIX boxes (but mainly Windows). However, the script itself will be run on Linux.

Thanks.

Jason

The problem is that there are other types of text-based files, such as html, csv, inf, even bat files.

I guess I should repost with a different title, but then is there a way to at least force Python to use a specific output format so that I can open it in a text editor (such as gedit)?

I meant this Linux command:
http://linux.about.com/library/cmd/blcmdl1_file.htm

I think he also needs a Windows method.

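If you want to write in a specific encoding, you can do this:

import io

# mytext: the raw bytes collected from the searched files
text = mytext.decode("utf-8")  # raises UnicodeDecodeError if mytext is not valid UTF-8
# "utf-8-sig" writes the UTF-8 BOM first, which makes the encoding obvious
out = io.open("myfile.txt", "w", encoding="utf-8-sig")
out.write(text)
out.close()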

If you write binary data into the file, no choice of encoding will make the file readable.
That also means you can use decoding as a test: try to decode a byte string (unicode() in Python 2, or the string's decode() method). If it raises UnicodeDecodeError, the data is not valid text in that encoding and is probably binary.
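
A minimal sketch of that decoding test in Python 3, assuming UTF-8 is the encoding of interest. The function names and the extra NUL-byte check are illustrative assumptions, not something from this thread:

```python
def looks_like_text(data, encoding="utf-8"):
    """Heuristic: True if the bytes decode cleanly and contain no NUL byte."""
    if b"\x00" in data:
        # NUL bytes are rare in text files but common in executables and archives.
        return False
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def sample_is_text(path, sample_size=4096):
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as handle:
        return looks_like_text(handle.read(sample_size))
```

This is only a heuristic. UTF-16 files, which are common on Windows, contain NUL bytes and would be rejected, and a sample that ends in the middle of a multi-byte character is also reported as non-text.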

The poster said that the script will run on a Linux machine.

file is a standard Unix utility, and you can also install it on Windows (for example, through the GnuWin32 port):

For example, here is file run in my shedskin directory:

Microsoft Windows XP [versio 5.1.2600]
(C) Copyright 1985 - 2001 Microsoft Corp.

D:\Tony\shedskin-0.3\shedskin>file --help
Usage: file [OPTION...] [FILE...]
Determine type of FILEs.

      --help                 display this help and exit
  -v, --version              output version information and exit
  -m, --magic-file LIST      use LIST as a colon-separated list of magic
                               number files
  -z, --uncompress           try to look inside compressed files
  -b, --brief                do not prepend filenames to output lines
  -c, --checking-printout    print the parsed form of the magic file, use in
                               conjunction with -m to debug a new magic file
                               before installing it
  -e, --exclude TEST         exclude TEST from the list of test to be
                               performed for file. Valid tests are:
                               ascii, apptype, compress, elf, soft, tar, tokens, troff
  -f, --files-from FILE      read the filenames to be examined from FILE
  -F, --separator STRING     use string as separator instead of `:'
  -i, --mime                 output MIME type strings (--mime-type and
                               --mime-encoding)
      --apple                output the Apple CREATOR/TYPE
      --mime-type            output the MIME type
      --mime-encoding        output the MIME encoding
  -k, --keep-going           don't stop at the first match
  -n, --no-buffer            do not buffer output
  -N, --no-pad               do not pad output
  -0, --print0               terminate filenames with ASCII NUL
  -p, --preserve-date        preserve access times on files
  -r, --raw                  don't translate unprintable chars to \ooo
  -s, --special-files        treat special (block/char devices) files as
                             ordinary ones
  -C, --compile              compile file specified by -m
  -d, --debug                print debugging messages

D:\Tony\shedskin-0.3\shedskin>file *
anaprof;            data
anaprof_rv_3_fi;    data
anaprof_tv_2_fi;    data
bin;                directory
cpp;                directory
crypt.cpp;          ASCII C program text, with CRLF line terminators
crypt.hpp;          ASCII C program text, with CRLF line terminators
crypt.py;           ASCII text, with CRLF line terminators
crypt.pyd;          PE32 executable for MS Windows (DLL) (console) Intel 80386 32-bit
crypt_shedskin.zip; Zip archive data, at least v2.0 to extract
crypt_time.py;      ASCII Java program text, with CRLF line terminators
est.xor;            data
isprime.cpp;        ASCII C program text, with CRLF line terminators
isprime.hpp;        ASCII C program text, with CRLF line terminators
isprime.py;         ASCII English text, with CRLF line terminators
isprime.pyd;        PE32 executable for MS Windows (DLL) (console) Intel 80386 32-bit
kokeilu.cpp;        ASCII C program text, with CRLF line terminators
kokeilu.hpp;        ASCII C program text, with CRLF line terminators
kokeilu.py;         Non-ISO extended-ASCII Java program text, with CRLF line terminators
kokeilu2.py;        ASCII Java program text, with CRLF line terminators
lib;                directory
life.py;            ASCII Java program text, with CRLF line terminators
lifelib.cpp;        ASCII C program text, with very long lines, with CRLF line terminators
lifelib.hpp;        ASCII C program text, with CRLF line terminators
lifelib.py;         ASCII Java program text
lifelib.pyc;        python 2.6 byte-compiled
lifelib.pyd;        PE32 executable for MS Windows (DLL) (console) Intel 80386 32-bit
Makefile;           ASCII make commands text, with CRLF line terminators
sana.cpp;           ASCII C program text, with very long lines, with CRLF line terminators
sana.exe;           PE32 executable for MS Windows (console) Intel 80386 32-bit
sana.hpp;           ASCII C program text, with very long lines, with CRLF line terminators
sana.py;            Non-ISO extended-ASCII Java program text, with CRLF line terminators
SanaS.ico;          MS Windows icon resource - 1 icon
sanas.zip;          Zip archive data, at least v1.0 to extract
SanaS60.ico;        MS Windows icon resource - 1 icon
SanaSC.cpp;         ASCII C program text, with very long lines, with CRLF line terminators
SanaSC.exe;         PE32 executable for MS Windows (console) Intel 80386 32-bit
SanaSC.hpp;         ASCII C program text, with very long lines, with CRLF line terminators
SanaSC.py;          Non-ISO extended-ASCII Java program text, with CRLF line terminators
SanaSC.ss.py;       Non-ISO extended-ASCII Java program text, with CRLF line terminators
Saved;              directory
shedskin;           directory
shedskin.bat;       DOS batch file text
source;             directory
test.py;            ASCII text
text_100kb.txt;     UTF-8 Unicode (with BOM) C program text, with CRLF line terminators

D:\Tony\shedskin-0.3\shedskin>

It is a little funny that it labels Python files as Java text, but it is good enough, I think.
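
One way the search script could use this is to call file through subprocess and skip anything it reports as binary. The sketch below is Python 3 and assumes a file build that supports the -b and --mime-encoding options listed in the help output above; the function names are hypothetical:

```python
import subprocess


def is_text_encoding(file_output):
    """Interpret the output of `file -b --mime-encoding`."""
    # file prints "binary" when it cannot identify a text encoding.
    return file_output.strip() != "binary"


def file_is_text(path):
    """Run the file utility on path and report whether it looks like text."""
    result = subprocess.run(
        ["file", "-b", "--mime-encoding", "--", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return is_text_encoding(result.stdout)
```

Starting one process per file is slow on large trees, so this suits a first experiment more than a scan of a whole network share.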

You can also use the same magic database from Python through the python-magic package: http://pypi.python.org/pypi/python-magic/0.1

Proper link to the home page: http://github.com/ahupp/python-magic

The poster said that the script will run on a Linux machine.

He also said it would mainly be run on Windows machines,
so your cross-platform solution would work.

search files on both Windows and *NIX boxes (but mainly Windows). However, the script itself will be run on Linux.

I interpreted this to mean that there is a network of machines, some of which ('boxes') are Windows-based, while the machine executing the script is a Linux server.

The link did not come out well, so I am reposting it:
http://github.com/ahupp/python-magic

Installation instructions for Windows (my main platform on the family computer) from that page:

# python setup.py install

= Installation on Win32 =

You need magic1.dll from http://gnuwin32.sourceforge.net/, grab the
binaries and dependencies ZIP-file, extract magic1.dll, regex2.dll
and zlib1.dll and put it in C:\Windows\System32. You also need a
magic file from Linux, compatible with file version 5.0.

To build and install run:

# python setup.py install

= Example Usage =

>>> import magic
>>> m = magic.Magic()
>>> m.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> m.from_buffer(open("testdata/test.pdf").read(1024))
'PDF document, version 1.2'

# For MIME types
>>> mime = magic.Magic(mime=True)
>>> mime.from_file("testdata/test.pdf")
'application/pdf'
>>>
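
Building on the MIME example above, a search script could keep only the files whose MIME type starts with text/. This is a Python 3 sketch; the helper names are hypothetical, and the lookup is passed in as a function so the filter does not depend on how the type is obtained:

```python
def is_text_mime(mime_type):
    """True for MIME types in the text/ family, such as text/plain or text/html."""
    return mime_type.startswith("text/")


def filter_text_files(paths, mime_lookup):
    """Keep only the paths whose MIME type, as reported by mime_lookup, is text."""
    return [path for path in paths if is_text_mime(mime_lookup(path))]
```

With python-magic installed, passing the from_file method of magic.Magic(mime=True) as mime_lookup should give the behaviour shown in the README excerpt. Some text-like formats may be reported under application/ rather than text/, so the prefix test may need a few extra cases.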

tonyjv is correct. This script will be run from Linux, but will be used against network systems, which include Windows and *NIX boxes.

For now, I ended up going by file extension since the majority of our network systems are running Windows.
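
A minimal sketch of what such an extension filter might look like in Python 3. The allow-list below is hypothetical and only reuses the extensions mentioned earlier in the thread:

```python
import os

# Hypothetical allow-list; extend it to match the files on your own network.
TEXT_EXTENSIONS = {".txt", ".html", ".csv", ".inf", ".bat"}


def has_text_extension(path):
    """True if the file name ends in one of the allowed extensions."""
    extension = os.path.splitext(path)[1].lower()
    return extension in TEXT_EXTENSIONS
```

Lower-casing the extension matters here because Windows file names often use upper-case extensions such as .TXT.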

But I'll try out your tips to see if they might work a little better for me.

Thanks for the info.