I'm new to Perl, with some experience in Python. I've been tasked with creating a script to remove comments from files. Basically, I'd like to remove <!-- comment here --> from every line that has it, across an assortment of files in a directory tree. Some comments may span multiple lines.

I've done some research, and the script I found seems to remove everything. My test file looks like:

some code <!-- this is a comment-->
more code

And after I run the script, the file is empty.

script:

#!/usr/bin/perl

use File::Find;
use strict;

my $directory = "/home/rmcleod/perltest";
find (\&process, $directory);

sub process
{
    my @outlines;
    my $line;
   
    if ($File::Find::name=~/\.xsd$/) {
        open (FILE, $File::Find::name) or
        die "Cannot open file $!";

        print "\n". $File::Find::name. "\n";
        while ($line = <FILE>){
            foreach ($line =~ /<!--.*?(^>]*)-->/is) {
                push(@outlines, $line);
            }

        }

        close FILE;
        open(OUTFILE, ">$File::Find::name") or
        die "Cannot open file:$!";

        print (OUTFILE my @outlines);
        close (OUTFILE);

        undef (@outlines);
    }

}

I've played around with it a bit. The foreach was originally an if statement, and before that it was just the bare $line =~ regex, which is what the guide I used had. But my limited knowledge of Perl has kept me from experimenting any further.

Thanks

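use strict;
use warnings;
open(FILE,"<com.txt") or die "Cannot open com.txt: $!";
open (OUT,">comout.txt") or die "Cannot open comout.txt: $!";
while(<FILE>){
	chomp;
	s/<!--[^>]*-->//g; # remove comments that start and end on this line
	print OUT "$_\n";
}
close FILE;
close OUT;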

To make sure that comments that span multiple lines are matched:

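use strict;
use warnings;
undef($/); # no record separator, so <FILE> reads the whole file at once
open(FILE,"<com.txt") or die "Cannot open com.txt: $!";
open (OUT,">comout.txt") or die "Cannot open comout.txt: $!";
my $file=<FILE>;
$file=~s/<!--[^>]*-->//g; # [^>] also matches newlines
print OUT $file;
$/="\n"; # restore the default record separator
close FILE;
close OUT;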

I tried the line-by-line version, and the only thing I changed was to make OUT the same file as IN. This causes all the data to be removed. Is there a way to get it to write the necessary data back to the same file, or to have it remove only the unnecessary data? The only other way I can think of is renaming the output file back to the original name with an OS mv.

You can't read from a file while you are also writing to it this way: opening it with ">" empties it before anything has been read. You will need to do the following to get that to work. Open the file, read it, close it and then open it again for writing.

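use strict;
use warnings;
undef($/); # read the whole file in one go
open(FILE,"<com.txt") or die "Cannot open com.txt: $!";
my $file=<FILE>;
close FILE; # finish reading before the file is reopened for writing
open (OUT,">com.txt") or die "Cannot open com.txt: $!";
$file=~s/<!--[^>]*-->//g;
print OUT $file;
$/="\n";
close OUT;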

Ah, that works well, and avoids the hassle of having to mv the file. Thanks.

The main issue I had, removing part of a line, has been solved. However, I haven't been able to adapt the script you wrote to run on multiple files with the same extension.

Here's what I have:

use strict;
use warnings;
use File::Find;

undef($/);
my $directory = "/home/user/perltest";
find (\&process, $directory);
sub process
{
    if ($File::Find::name=~/\.xsd$/) {
        open (FILE, $File::Find::name);
        my $file = <FILE>;
        
        print "\n". $File::Find::name. "\n";
        $file = ~s/<!--[^>]*-->//g;
        close FILE;
        open (OUT, ">", $File::Find::name);
        print OUT $file;
        $/="\n";
        close OUT;
    }
}

What happens here is that it writes a series of digits to the file and deletes everything else. The file basically looks like:

2304985201274924

I imagine I need to change either "sub process" or how find is used. The files themselves are found properly, as shown by the first print statement.

I have a file in each subdirectory of perltest:

1/test1.xsd 2/test1.xsd 3/test1.xsd
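
One likely cause of the digits, offered as a sketch rather than a confirmed diagnosis: in the script above there is a space inside the binding operator ("$file = ~s/.../" instead of "$file =~ s/.../"). Perl parses that as an assignment of the bitwise complement (~) of a substitution performed on $_, so $file ends up holding a large integer. The sample strings below are made up for illustration, and $_ stands in for whatever File::Find has placed there.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the real script $_ is set by File::Find.
my $text = "code <!-- note --> more";
$_ = "dir<!-- x -->name";

# With the stray space this is: $wrong = ~( s/<!--.*?-->//gs ), applied to $_.
# The substitution edits $_ and returns a count; ~ turns the count into a big integer.
my $wrong = $text;
$wrong = ~s/<!--.*?-->//gs;

# With the binding operator written as one token, the substitution edits $right.
my $right = $text;
$right =~ s/<!--.*?-->//gs;

print "wrong: $wrong\n";   # a long run of digits, platform dependent
print "right: $right\n";
print "\$_ is now: $_\n";  # $_ was changed as a side effect
```

If that is what happened, writing =~ without the space should be enough to make the digits go away.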

Here are some updated changes:

#!/usr/bin/perl -w
use strict;
use warnings;
use File::Find;

undef($/);
my $directory = "/home/user/perltest";
find (\&subr, $directory);

sub subr
{
    if ($File::Find::name=~/.*\.xsd$/) {
        open (FILE, "<", $File::Find::name);
        my $f=<FILE>;
        print $f;
        $f=~s/<!--[^>]*-->//g;
        close FILE;
        open (OUT, ">", $File::Find::name);
        print OUT $f;
        $/="";
        close OUT;
    }
}

Well, this works, but it reports "Use of uninitialized value $f" on lines 15, 16 and 19. It looks initialized to me. They are just warnings, so not a huge deal.

Try the following. I made a couple of changes to your script, indicated by comments. I changed the regex slightly because $f=~s/<!--[^>]*-->//g; will not remove comments if the character '>' occurs anywhere between the comment tags. It's better to use a dot that represents all characters.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

#undef($/); Better to take care of $/ in subr as local variable

my $directory = "/home/user/perltest";
find (\&subr, $directory);

sub subr
{
    if ($File::Find::name=~/.*\.xsd$/) {
        open (FILE, "<", $File::Find::name);
        local $/;
        my $f=<FILE>;
        print $f;
        $f=~s/<!--.*-->//gs; #s option means dot (.) includes newline character
        close FILE;
        open (OUT, ">", $File::Find::name);
        print OUT $f;
        close OUT;
    }
}

Ah, thanks, no warnings there.

Good call on the regex. I did initially have it like you suggested, but changed it somewhere along the way when I was having problems.

Ah, the regex needs the [^>], otherwise it fails to span multiple lines.

Hmm, it's having trouble spanning multiple lines, which doesn't make sense, because it works in my rx toolkit. The way I had it before, with [^>], spanned multiple lines, but there are a couple of comments with a > in them.

Sorry for so many posts, but:

$f=~s/<!--[\w\W]*-->//gi;

works
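
For anyone comparing the patterns tried so far, here is a small sketch of how each one behaves on a multi-line comment that also contains a '>'. The sample string is invented, and it holds only one comment, so greedy versus lazy matching makes no difference here. One possible explanation for the dot version failing is that the s flag was left off, but that is a guess.

```perl
use strict;
use warnings;

# Hypothetical sample: one comment that spans two lines and contains '>'.
my $xml = "<a/>\n<!-- first\nline > second -->\n<b/>\n";

# Dot without /s: '.' does not match a newline, so nothing is removed.
(my $dot_only = $xml) =~ s/<!--.*-->//g;

# Dot with /s: '.' matches a newline too, so the comment is removed.
(my $dot_s = $xml) =~ s/<!--.*-->//gs;

# [\w\W] matches any character, newline included, with or without /s.
(my $any_class = $xml) =~ s/<!--[\w\W]*-->//g;

# [^>] matches newlines but gives up at the '>' inside the comment.
(my $not_gt = $xml) =~ s/<!--[^>]*-->//g;

for my $case (['.*  no /s', $dot_only], ['.*  with /s', $dot_s],
              ['[\w\W]*', $any_class], ['[^>]*', $not_gt]) {
    my ($label, $result) = @$case;
    printf "%-12s %s\n", $label, $result eq $xml ? "unchanged" : "comment removed";
}
```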

#!/usr/bin/perl
#remove_comments.pl
use strict;
use warnings;

#Put sample xsd file contents into a string for purpose of testing
my $f = <<END;
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<!-- definition of simple elements -->
<xs:element name="orderperson" type="xs:string"/>
<xs:element name="name" type="xs:string"/>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<!--Here is a multi-line
comment to remove
for test purposes -->
<xs:element name="country" type="xs:string"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="note" type="xs:string"/>
<xs:element name="quantity" type="xs:positiveInteger"/>
<xs:element name="price" type="xs:decimal"/>

<!-- definition of attributes -->
<xs:attribute name="orderid" type="xs:string"/>

<!-- definition of complex elements -->
<xs:element name="shipto">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="name"/>
      <xs:element ref="address"/>
      <xs:element ref="city"/>
      <xs:element ref="country"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="item">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="title"/>
      <xs:element ref="note" minOccurs="0"/>
      <xs:element ref="quantity"/>
      <xs:element ref="price"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

END

$f=~s/<!--[\w\W]*-->//gi;#regex has greedy quantifier *
print $f;

Running the above gives the following output:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">


<xs:element name="shipto">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="name"/>
      <xs:element ref="address"/>
      <xs:element ref="city"/>
      <xs:element ref="country"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="item">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="title"/>
      <xs:element ref="note" minOccurs="0"/>
      <xs:element ref="quantity"/>
      <xs:element ref="price"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Looks like it removed way too much code along with the comments! The regex I suggested had the same flaw, which I realised only after further testing today. Let's try adding a ? after the * to make the quantifier lazy so it won't remove so much. Change your regex to this:

$f=~s/<!--[\w\W]*?-->//gi;#regex has lazy quantifier *?

After making this change, running the test program gives the following output:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">


<xs:element name="orderperson" type="xs:string"/>
<xs:element name="name" type="xs:string"/>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>

<xs:element name="country" type="xs:string"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="note" type="xs:string"/>
<xs:element name="quantity" type="xs:positiveInteger"/>
<xs:element name="price" type="xs:decimal"/>


<xs:attribute name="orderid" type="xs:string"/>


<xs:element name="shipto">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="name"/>
      <xs:element ref="address"/>
      <xs:element ref="city"/>
      <xs:element ref="country"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="item">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="title"/>
      <xs:element ref="note" minOccurs="0"/>
      <xs:element ref="quantity"/>
      <xs:element ref="price"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The result looks better to me, but before trying this on your directory tree I hope you have a good backup. :)

Haha, thanks. This is the issue I'm currently running into: it was removing everything up to the last -->. I'll try this and let you know.

Yep, all appears to be well with the few files I looked at. Thanks for the help, much appreciated :)
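
On the backup point, here is a minimal sketch of one way to keep a copy of each file before rewriting it. It is not the script from this thread: the names strip_comments_with_backup and strip_tree and the .bak suffix are assumptions made up for illustration. It uses File::Copy's copy for the backup, the lazy regex from above (written with a dot and the s flag), and File::Find's no_chdir option so that $File::Find::name can be opened directly.

```perl
use strict;
use warnings;
use File::Find;
use File::Copy qw(copy);

# Hypothetical helper: keep "<file>.bak", then strip comments from the file.
sub strip_comments_with_backup {
    my ($path) = @_;

    copy($path, "$path.bak") or die "Cannot back up $path: $!";

    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my $content = do { local $/; <$in> };   # slurp; $/ is restored afterwards
    close $in;

    $content =~ s/<!--.*?-->//gs;           # lazy match, dot also matches newlines

    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $content;
    close $out or die "Cannot close $path: $!";
}

# Hypothetical driver: process every .xsd file below $dir.
sub strip_tree {
    my ($dir) = @_;
    find({
        no_chdir => 1,                      # stay put so the full path stays valid
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path && $path =~ /\.xsd\z/;
            strip_comments_with_backup($path);
        },
    }, $dir);
}

# Example call (path taken from the thread, uncomment to run):
# strip_tree("/home/user/perltest");
```

Note that a second run overwrites each .bak with the already stripped file, so this only protects the first pass.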
