So, in an exercise in futility, I decided to write a script that takes either a file or a string, finds patterns in the words, and displays the results in a nice, human-friendly form.

Right now, it simply searches forwards and backwards(ish), but I'm wondering if there is a cleaner way to iterate over the middle part of the words without manually slicing them.

Any help is much appreciated:

import sys

try:
    # Use the file named on the command line, if one was given and can be opened.
    document = open(sys.argv[1], 'r')
except (IndexError, OSError):
    # No usable command-line argument, so ask the user what to do.
    selection = input('Would you like to: \n1) Input a file name relative to path.\n2) Input a string\nSelection: ')

    if selection == '1':
        doc = input('File: ')
        try:
            document = open(doc, 'r')
        except OSError as e:
            print(str(e) + " File Not Found...")
            sys.exit(1)
    elif selection == '2':
        document = input("String: ")
    else:
        print("There are only 2 choices here...")
        sys.exit(1)

# Define characters to ignore (editable, so certain characters can be allowed)
extra_chars = "!@#$%^&*()-_=+?><,.;:'\"[]{}|\\/"

patterns = dict()

try:
    # A file object has read(); a plain string does not.
    words = document.read().split()
except AttributeError:
    words = document.split()

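for word in words:
    # Strip the ignored characters from the word.
    for char in extra_chars:
        word = word.replace(char, "")

    # Count every prefix: word[:1], word[:2], ..., the whole word.
    for i in range(1, len(word) + 1):
        p = word[:i]
        patterns[p] = patterns.get(p, 0) + 1

    # Count every suffix: word[-1:], word[-2:], ..., the whole word.
    # Note: the whole word is reached by both loops, so it is counted twice.
    for i in range(1, len(word) + 1):
        p = word[-i:]
        patterns[p] = patterns.get(p, 0) + 1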


for k, v in sorted(patterns.items()):
    print(k +": "+str(v))

sys.exit()

All 9 Replies

hmmm.. ok.. clarification on lowercase-word-generator

First, if the reference is to strip(string.punctuation+....) I intentionally made the ignore list so it can be edited and we can allow certain characters. What if we want to find a pattern of "#word#" for whatever reason, or something similar? But I doubt that's what we were going for, so on to the next part....

you use:

while line:

From my understanding, this will run until the end of the line (being, it would go through word by word until the end of the 'list').

In my case, are you suggesting I use a similar technique, and then split the word forwards and backwards with each pass (much like you do for [...])? If so, won't that duplicate results? How would I compensate for that?

I looked through a few other snippets (the cracking caesar one was quite interesting), and I understand the concept behind the recursive matching, but the purpose of this project is to find random matches, that are not in any way pre-defined.

So.. with that in mind, is this a bit of a goofy exercise, or is this something that will be useful outside of a "can I do it?" context? I will still pursue it just for kicks, but getting a more experienced opinion would be nice.

Thanks!

Ryan

Do you need the front parts first, rather than getting the word split at every position?

If not, you can just loop the split point from 1 up to the length of the current word and, in the inner loop, yield both the starting and ending parts instead of yielding the words as a whole.
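A rough sketch of that idea, in case it helps. The name split_parts is made up for illustration, and this version assumes you want both parts non-empty, so the split point stops one short of the word length:

```python
def split_parts(word):
    """Yield (start, end) for each split point inside the word.

    split_parts is a hypothetical helper name. Split points run from
    1 to len(word) - 1, so neither part is ever empty.
    """
    for i in range(1, len(word)):
        yield word[:i], word[i:]


for start, end in split_parts("word"):
    print(start, end)
```

Each split gives one prefix and one suffix in a single pass, so the whole word is never counted twice the way it is with two separate loops.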

Maybe you can share your goals of this exercise, so maybe we can respond with something appropriate to the use case you have in mind.

  while line:

Tests that line is not an empty string.
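To illustrate, here is a minimal sketch of that kind of loop. read_words is an assumed name, not the original snippet, and io.StringIO just stands in for an open file:

```python
import io


def read_words(stream):
    """Yield lowercase words, reading the stream one line at a time.

    read_words is a hypothetical name used only for this sketch.
    """
    line = stream.readline()
    while line:  # readline() returns '' only at end of file
        for word in line.split():
            yield word.lower()
        line = stream.readline()


# A blank line is "\n", which is not empty, so the loop keeps going.
sample = io.StringIO("One two\n\nThree\n")
print(list(read_words(sample)))
```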

Let me rephrase - while line: is meant to check if there is something there for the cursor to use.

So the idea is simple - you are given a file that has some form of repetition (or a string, or whatever), but you do not know what that repetition is. The goal is to find that repetition, or any repeated patterns, for whatever reason - trying to break a cipher by finding repeated patterns for small words, looking for textual clues in an essay... whatever the reason, the exercise is simply to find patterns as they emerge, useful or not. It is up to the user to determine the usefulness of the information.
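One way this could look, as a sketch only: count every substring of every word (middle parts included, not just the front and back) and keep the ones seen more than once. The function name and the min_len parameter are my own assumptions, and it gets slow on very long words because the number of substrings grows with the square of the word length:

```python
from collections import Counter


def repeated_substrings(text, min_len=2):
    """Return {substring: count} for substrings seen more than once.

    repeated_substrings and min_len are hypothetical names for this sketch.
    Substrings are taken within each whitespace-separated word.
    """
    counts = Counter()
    for word in text.split():
        n = len(word)
        for start in range(n):
            # Every slice that begins at start and has at least min_len characters.
            for end in range(start + min_len, n + 1):
                counts[word[start:end]] += 1
    return {sub: c for sub, c in counts.items() if c > 1}


for sub, count in sorted(repeated_substrings("banana bandana").items()):
    print(sub + ": " + str(count))
```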

Ironically enough, I was just reading about how zip compression does this, but I'm not sure how...

Interesting! I never thought of using regex for this... would make total sense, I guess... I shall refactor and see how I do.. thanks for the tip! :)
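For what it's worth, here is roughly what I have in mind for a first attempt. This is only a guess at how regex could be applied, not necessarily what was suggested: a backreference inside a lookahead picks out any run of two or more word characters that shows up again later in the text. The names REPEAT and repeated_with_regex are made up, and the pattern can be slow on large inputs:

```python
import re

# A run of 2+ word characters that occurs again somewhere later in the text.
# REPEAT and repeated_with_regex are hypothetical names for this sketch.
REPEAT = re.compile(r'(\w{2,})(?=.*\1)', re.DOTALL)


def repeated_with_regex(text):
    """Return the runs found, in order of first appearance (non-overlapping)."""
    return REPEAT.findall(text)


print(repeated_with_regex("the cat and the hat"))
```

Because matches do not overlap, this reports fewer patterns than counting every substring would, so it is more of a quick scan than a full tally.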

Out of curiosity, is this the way that virus scanners work? Or is it a bit more complicated than that?

I see no connection between viruses and repeated strings, although of course a virus, once it has spread, sits at the beginning of many files.
