I am trying to parse an HTML file, but I am unable to remove some of the spaces using \s (the character class that matches whitespace).

use strict;
use warnings;
open(FILE,"<paragraph.txt")|| die "Can't open paragraph.txt";
my @file = <FILE>;
my $all = join("",@file);
$all =~ s/\n/ /g;
$all =~ s/\./\. /g;
$all =~ s/\s\s*/ /g;
open (FIL,">paraone.txt")||die "Can't open paraone.txt";
print FIL $all;
close(FILE);
close(FIL);

I have attached paragraph.txt, which contains spaces that I cannot remove.
<p style='text-align:justify'> this space  Confusion and ensuing controversy also arose from the PANDAS
Please help...

Edited by PhoenixInsilico: n/a

Attachments
<p style='text-align:justify'> Confusion and ensuing controversy also arose from the PANDAS criterion requiring that symptom onset and exacerbations be temporally associated with a GAS infection<sup>1</sup>. Difficulties establishing this association were predicted by observations of SC patients, where the onset of chorea typically may lag behind the inciting GAS infection              by 4  6 months or longer.
Two years later
Last Post by histrungalot

Your file contains more than ordinary spaces:

$ od -tx1 -c -A d paragraph.txt | more
0000000  3c  70  20  73  74  79  6c  65  3d  27  74  65  78  74  2d  61
          <   p       s   t   y   l   e   =   '   t   e   x   t   -   a
0000016  6c  69  67  6e  3a  6a  75  73  74  69  66  79  27  3e  a0  a0
          l   i   g   n   :   j   u   s   t   i   f   y   '   > 240 240
0000032  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  20  43  6f
        240 240 240 240 240 240 240 240 240 240 240 240 240       C   o
0000048  6e  66  75  73  69  6f  6e  20  61  6e  64  20  65  6e  73  75
          n   f   u   s   i   o   n       a   n   d       e   n   s   u
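
If od is not available, the same inspection can be sketched in Perl itself. This is only an illustration: the helper name non_ascii_bytes is made up for this example, and it assumes the text is handled as raw bytes (for a real file, open it with the '<:raw' layer before reading).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, byte value] pairs for every
# byte above 0x7f in the given byte string.
sub non_ascii_bytes {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /([\x{80}-\x{ff}])/g) {
        # pos() points just past the matched byte, so subtract 1.
        push @found, [ pos($bytes) - 1, ord($1) ];
    }
    return @found;
}

# Sample data imitating the start of paragraph.txt (not the real file).
my $sample = "<p>\x{a0}\x{a0} Confusion";
printf "offset %d: 0x%02x\n", @$_ for non_ascii_bytes($sample);
```

For the sample string this reports 0xa0 at offsets 3 and 4, which is the same kind of information the od dump gives.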

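Notice that what you think is a space (0x20) is really the byte 0xa0 (a no-break space in Latin-1 and Windows-1252), which your \s did not match. Use the script below as a starting point. I think there are other non-ASCII bytes to filter out as well. For example, after "infection" there is a 0x94, which will also need to be filtered out.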

#!/usr/bin/perl
use strict;
use warnings;
# Read the whole input file as raw bytes.
open(my $in, '<:raw', 'paragraph.txt') or die "Can't open paragraph.txt: $!";
my $all = do { local $/; <$in> };
close($in);
print $all;
$all =~ s/\n/ /g;
$all =~ s/\./. /g;
# ---> Notice the use of \xa0
# Collapse any run of whitespace and/or 0xa0 bytes into one space.
$all =~ s/[\s\xa0]+/ /g;  # <------ Look here
print $all;
open(my $out, '>:raw', 'paraone.txt') or die "Can't open paraone.txt: $!";
print $out $all;
close($out);
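
The remaining non-ASCII bytes mentioned above (such as 0x94) could be filtered in the same pass. The following is a minimal sketch, not a definitive fix: clean_text is a hypothetical helper name, and it assumes the file uses a single-byte encoding such as Latin-1 or Windows-1252. On UTF-8 input, deleting bytes like this would damage multi-byte characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: operates on a raw byte string.
sub clean_text {
    my ($text) = @_;
    $text =~ s/\./. /g;                            # space after each period, as in the script above
    $text =~ s/[\x{80}-\x{9f}\x{a1}-\x{ff}]//g;    # drop non-ASCII bytes other than 0xa0
    $text =~ s/[\s\x{a0}]+/ /g;                    # collapse whitespace and 0xa0 runs to one space
    return $text;
}

# Sample data imitating paragraph.txt (not the real file).
my $sample = "<p>\x{a0}\x{a0}\x{a0} Confusion\x{a0}and  infection\x{94}.\nNext";
print clean_text($sample), "\n";   # prints "<p> Confusion and infection. Next"
```

Deleting bytes outright is the bluntest option. If a byte stands for a meaningful character (a quotation mark or a dash, for instance), substituting an ASCII equivalent may be preferable to removing it.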