I am trying to parse an HTML file, but I am unable to remove some of the spaces using \s (the character class that matches whitespace).

use strict;
use warnings;
open(FILE,"<paragraph.txt")|| die "Can't open paragraph.txt: $!";
my @file = <FILE>;
my $all = join("",@file);
$all =~ s/\n/ /g;
$all =~ s/\./\. /g;
$all =~ s/\s\s*/ /g;
open (FIL,">paraone.txt")||die "Can't open paraone.txt: $!";
print FIL $all;
close(FILE);
close(FIL);

I have attached paragraph.txt, which contains spaces that I cannot remove. For example:
<p style='text-align:justify'> this space  Confusion and ensuing controversy also arose from the PANDAS
Please help...

All 2 Replies

According to my text editor they are line feeds (0x0A). Perhaps this link can help.

Your file contains ordinary spaces plus other bytes that only look like spaces:

$ od -tx1 -c -A d paragraph.txt | more
0000000  3c  70  20  73  74  79  6c  65  3d  27  74  65  78  74  2d  61
          <   p       s   t   y   l   e   =   '   t   e   x   t   -   a
0000016  6c  69  67  6e  3a  6a  75  73  74  69  66  79  27  3e  a0  a0
          l   i   g   n   :   j   u   s   t   i   f   y   '   > 240 240
0000032  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  20  43  6f
        240 240 240 240 240 240 240 240 240 240 240 240 240       C   o
0000048  6e  66  75  73  69  6f  6e  20  61  6e  64  20  65  6e  73  75
          n   f   u   s   i   o   n       a   n   d       e   n   s   u

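Notice that what you think is a space (0x20) is really 0xa0, which is the non-breaking space in Latin-1 and Windows-1252. By default, \s does not match that byte in a plain (non-UTF-8) string, which is why your substitution leaves it alone. Use the script below as a starting point. I think there are other non-ASCII bytes to filter out as well. For example, after the word "infection" there is a 0x94, which will also need to be filtered out.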

#!/usr/bin/perl
use strict;
use warnings;
open(FILE,"<paragraph.txt")|| die "Can't open paragraph.txt: $!";
my @file = <FILE>;
my $all = join("",@file);
print $all;
$all =~ s/\n/ /g;
$all =~ s/\./\. /g;
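# ---> Notice the use of \xa0: the character class collapses any run of
# 0xa0 bytes and/or whitespace, even when no ordinary space follows the 0xa0s
$all =~ s/[\xa0\s]+/ /g;  # <------ Look here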
print $all;
open (FIL,">paraone.txt")||die "Can't open paraone.txt: $!";
print FIL $all;
close(FILE);
close(FIL);
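
To follow up on the remark about other non-ASCII bytes such as 0x94, here is a minimal sketch of how you might collapse the spaces and list whatever high bytes remain. The helper names collapse_spaces and non_ascii_counts and the sample string are made up for illustration; they are not from the original script. It assumes the file is read as raw bytes, as in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse any run of whitespace and/or 0xA0 bytes
# into a single ASCII space.
sub collapse_spaces {
    my ($text) = @_;
    $text =~ s/[\xa0\s]+/ /g;
    return $text;
}

# Hypothetical helper: count every byte above 0x7F, keyed by its hex value,
# so you can see what else (for example 0x94) still needs handling.
sub non_ascii_counts {
    my ($text) = @_;
    my %count;
    $count{ sprintf '0x%02x', ord $1 }++ while $text =~ /([\x80-\xff])/g;
    return \%count;
}

# Made-up sample resembling the start of paragraph.txt.
my $sample = "<p>\xa0\xa0\xa0 Confusion and  ensuing\x94";

my $seen = non_ascii_counts($sample);
print "$_ seen $seen->{$_} time(s)\n" for sort keys %$seen;

# The 0x94 byte is left in place here; decide separately how to treat it.
print collapse_spaces($sample), "\n";
```

Running the counter on the real file first should show which bytes besides 0xa0 are present, so you can decide whether to delete them or map them to ASCII equivalents.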