Unable to remove spaces. What exactly are these spaces?
I am trying to parse an HTML file, but I am unable to remove the spaces using \s (the character class that matches whitespace).
use strict;
use warnings;
open(FILE, "<", "paragraph.txt") || die "Can't open paragraph.txt: $!";
my @file = <FILE>;
my $all = join("",@file);
$all =~ s/\n/ /g;
$all =~ s/\./\. /g;
$all =~ s/\s\s*/ /g;
open(FIL, ">", "paraone.txt") || die "Can't open paraone.txt: $!";
print FIL $all;
close(FILE);
close(FIL);
I have attached paragraph.txt, which contains the spaces that I cannot remove.
this space Confusion and ensuing controversy also arose from the PANDAS
Please help...
$ od -tx1 -c -A d paragraph.txt | more
0000000  3c  70  20  73  74  79  6c  65  3d  27  74  65  78  74  2d  61
          <   p       s   t   y   l   e   =   '   t   e   x   t   -   a
0000016  6c  69  67  6e  3a  6a  75  73  74  69  66  79  27  3e  a0  a0
          l   i   g   n   :   j   u   s   t   i   f   y   '   > 240 240
0000032  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  20  43  6f
        240 240 240 240 240 240 240 240 240 240 240 240 240       C   o
0000048  6e  66  75  73  69  6f  6e  20  61  6e  64  20  65  6e  73  75
          n   f   u   s   i   o   n       a   n   d       e   n   s   u
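Notice that what you think is a space (0x20) is really 0xa0 (shown as octal 240 in the od output). That byte is a non-breaking space in Latin-1, and \s does not match it in a plain byte string like the one this script reads, so you have to match \xa0 explicitly. Use this to start making it work. I think there are other non-ASCII bytes to filter out as well. For example, after the word "infection" there is a 0x94, which will also need to be filtered out.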
#!/usr/bin/perl
use strict;
use warnings;
open(FILE, "<", "paragraph.txt") || die "Can't open paragraph.txt: $!";
my @file = <FILE>;
my $all = join("",@file);
print $all;
$all =~ s/\n/ /g;
$all =~ s/\./\. /g;
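# ---> Note the \xa0 in the character class: it catches the 0xa0 bytes that \s misses,
#      even when a run of them is not followed by ordinary whitespace
$all =~ s/[\xa0\s]+/ /g; # <------ Look here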
print $all;
open(FIL, ">", "paraone.txt") || die "Can't open paraone.txt: $!";
print FIL $all;
close(FILE);
close(FIL);
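
If you want to check which other non-ASCII bytes are in the file before deciding what to filter, something along these lines may help. This is only a rough sketch: the helper names non_ascii_counts and normalize_spaces are made up for illustration, the sample string is invented rather than taken from paragraph.txt, and deleting every remaining high byte is an assumption that may be too aggressive if you want to keep characters such as curly quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: count each non-ASCII byte (0x80-0xff) in a byte string.
sub non_ascii_counts {
    my ($text) = @_;
    my %count;
    while ($text =~ /([\x80-\xff])/g) {
        my $key = sprintf("0x%02x", ord($1));
        $count{$key}++;
    }
    return \%count;
}

# Hypothetical helper: treat 0xa0 like ordinary whitespace, collapse runs
# to a single space, then drop any other non-ASCII bytes.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/[\s\xa0]+/ /g;    # whitespace and/or 0xa0 runs become one space
    $text =~ tr/\x80-\xff//d;    # assumption: remaining high bytes (e.g. 0x94) are unwanted
    return $text;
}

# Invented sample data, not the real file contents.
my $sample = "<p>\xa0\xa0\xa0 Confusion and\n ensuing \x94infection\x94 controversy";

my $counts = non_ascii_counts($sample);
print "$_: $counts->{$_}\n" for sort keys %$counts;
print normalize_spaces($sample), "\n";
```

Running the count first tells you whether 0xa0 and 0x94 are the only odd bytes, or whether there are more that need their own handling.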