Hi all,

I need a Perl script that can open files (given as command-line arguments) and extract and print any dates or times found in them. The dates and times can be in any reasonable format.

The problem is that I don't know how to print the matching part of the file once I find it. Any ideas?

-Skyrim

Let's start by assuming all dates will look like 'mm/dd/yyyy' and all times will look like 'hh:mm'. You need to express these two patterns as regular expressions, aka regexes. Hopefully, no date or time will extend beyond one record so you can read each file one line at a time. (Otherwise you would have to 'slurp' each input file so the entire file goes into a scalar variable.) Look for matches in each input record you read from your file(s) in such a way as to capture the matching portions into variables which you can print. Easier said than done, right? But once you have a script that does this, and you need to find all the other reasonable formats that dates and times could have in the file you will be reading, you will simply create and incorporate additional regex patterns into your script.

Let me begin by saying that I started Perl about 2 1/2 days ago and the only thing I've done with it is make a script to monitor changes in a given directory. So I have to ask: what does 'slurp' mean? Also, I know WHAT I need to do; I'm just stuck on HOW. I know how to use regular expressions to find a match, I just don't know how to collect it. I've searched online but nothing I've found is relevant enough to be useful.

Ah, well, if you are simply trying to print the results, it is very simple:

if ($_ =~ /(foo)bar(baz)/) {
	print "$1 $2\n"; # Prints 'foo baz' when $_ contains 'foobarbaz'
}

See? You can put groups of parentheses around certain parts of the regular expression to 'capture' them in Perl variables. The first set stores its match in $1, the second in $2, etc.
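Applied to the 'mm/dd/yyyy' and 'hh:mm' formats suggested earlier in the thread, capturing might look roughly like this. It is only a sketch: the names find_dates_and_times, $date_re and $time_re are made up for illustration, the two patterns cover just those two assumed formats, and the sample lines stand in for records you would normally read with while (<>).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed formats only: mm/dd/yyyy for dates and hh:mm for times.
my $date_re = qr{\b\d{2}/\d{2}/\d{4}\b};
my $time_re = qr{\b\d{1,2}:\d{2}\b};

# Return every date or time found in one line, in order of appearance.
sub find_dates_and_times {
    my ($line) = @_;
    my @found;
    # /g in a while loop moves on to the next match on each iteration.
    while ($line =~ /($date_re|$time_re)/g) {
        push @found, $1;    # $1 holds the text matched by the parentheses
    }
    return @found;
}

# Sample lines standing in for records read from a file.
my @sample = (
    "Meeting on 04/23/2010 at 14:30.\n",
    "Nothing to see here.\n",
);
for my $line (@sample) {
    print "$_\n" for find_dates_and_times($line);
}
```

The loop is used instead of a single if so that a line containing several dates or times yields all of them, not just the first.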

'Slurping' is a method of reading a file; normally in Perl a file is read one line at a time, or the entire file is read into an array, where the 0th index is the first line, the first index is the second line, etc. Slurping is when you read the entire file into a scalar variable.
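For comparison, here is a rough sketch of slurping next to reading into an array. The helper names slurp_file and read_lines are invented for this example; the mechanism itself is Perl's input record separator $/, which makes a read return the whole file when it is undefined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a whole file into one scalar ("slurp" it).
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;               # undefine the input record separator in this scope
    my $content = <$fh>;    # a single read now returns the entire file
    close($fh);
    return $content;
}

# Read the same file into an array, one line per element.
sub read_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my @lines = <$fh>;      # list context: every line at once
    close($fh);
    return @lines;
}

# Report on each file named on the command line.
for my $path (@ARGV) {
    my $text  = slurp_file($path);
    my @lines = read_lines($path);
    printf "%s: %d characters, %d lines\n", $path, length($text), scalar(@lines);
}
```

Slurping only matters here if a date or time could be split across two lines; otherwise reading line by line is simpler and uses less memory.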

How you write the Perl script depends on what assumptions, if any, you can make about the input files. If they are generated by another computer program, such as one making an activity log or list of transactions, then there should be a finite number of possible ways the dates and times can be formatted and where they will occur in each line. If, however, they are a collection of documents or web pages from a variety of sources composed in free-form text then you will have to write a fairly complex script with many regex patterns to distinguish the dates and times from everything else.

You can read about creating regex patterns to match a valid date at http://www.regular-expressions.info/dates.html

There are many modules freely available on CPAN that do some kind of parsing of dates and/or times. I have no experience with parsing dates in Perl and so I don't know if any of these modules will do what you want.

I can't give you an example file as I haven't been provided with one myself. All I've been told is that the script should be able to find ANY date or time in ANY format used by English-speaking people. This could be dd/mm/yy or mm/dd/yy or 1st Jan 2010 or Jan 1st 2010 etc., and times can be in 12- or 24-hour format. Dodgy dates are acceptable, like 29/02/2010 (2010 is not a leap year) or 31/4/2010 (April only has 30 days).

I hope whoever asked you for this will pay you by the hour, because they have given you a lot of work. Here is a simplified example of a script that matches only two date formats. It expects at least one filename following the program name when you run it on the command line. For example: perl FindDates.pl Dates_1.txt

#!/usr/bin/perl
#FindDates.pl
use 5.006;
use strict;
use warnings;

my $d1 = qr{(?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01])[- /.](?:19|20)\d\d}; #Matches mm/dd/yyyy
my $d2 = qr{(?:0[1-9]|[12][0-9]|3[01])[- /.](?:0[1-9]|1[012])[- /.](?:19|20)\d\d}; #Matches dd/mm/yyyy
my @dates;

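# <> reads each line of every file named on the command line.
while (<>){
	# With no capture groups, a /g match in list context returns every whole match on the line.
	my @d = m/$d1 | $d2/gx;
	push @dates, @d;
}

print "$_\n" foreach @dates;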

The file 'Dates_1.txt', which I used for testing, contains the following data:

Let's go somewhere on 04/23/2010 or 23/04/2010 as the British would say.
This would be so much easier if we could assume all the dates would
be at the start or the end of each line.
12/25/2009 Like that.
Or if we could assume that every date would occur between the xth and yth
position on every line, as 06/30/2005 is found from position 27 through 36.
Oh well... 2/2/2002.
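To cover some of the other formats mentioned in this thread (1st Jan 2010, Jan 1st 2010, dates without leading zeros, 12- or 24-hour times), more patterns can be added in the same way. What follows is only a sketch under assumptions: the name extract_dates_and_times and the pattern list are invented for illustration, only '/' is accepted as a numeric separator, and nothing here checks that a date is valid, which fits the 'dodgy dates are acceptable' requirement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Building blocks (illustrative; English month names and common abbreviations).
my $month = qr{(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?
               |July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?
               |Nov(?:ember)?|Dec(?:ember)?)}xi;
my $day   = qr{\d{1,2}(?:st|nd|rd|th)?}i;
my $year  = qr{\d{4}};

my @patterns = (
    qr{\b$day\s+$month\s+$year\b},                   # 1st Jan 2010
    qr{\b$month\s+$day,?\s+$year\b},                 # Jan 1st 2010
    qr{\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b},         # 31/4/2010, 2/2/02
    qr{\b\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]m)?\b}i,  # 14:30, 2:30 pm
);

# Join the patterns into one alternation; each keeps its own flags.
my $joined = join '|', @patterns;
my $any    = qr{$joined};

# Return every date or time found in the given text, in order of appearance.
sub extract_dates_and_times {
    my ($text) = @_;
    my @found = $text =~ /$any/g;   # no capture groups, so whole matches come back
    return @found;
}

my $sample = "Due 1st Jan 2010 or Jan 1st 2010, say 31/4/2010 at 2:30 pm or 14:30.";
print "$_\n" for extract_dates_and_times($sample);
```

Each new format becomes one more entry in @patterns, so the matching loop itself never has to change.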

Thanks guys. All this should help a lot.

No, I'm not getting paid as this is a uni assignment. Don't worry, you haven't done it for me or anything. It's just one part of it and you've given me a nice solid starting point to get it done. Hopefully this is all I need to get me through it.

Thanks,

-Skyrim
