
Unable to remove spaces. What exactly are these spaces?

I am trying to parse an HTML file, but I am unable to remove some of the spaces using \s (the character class that matches whitespace).

use strict;
use warnings;
open(FILE,"<paragraph.txt")|| die "Can't open paragraph.txt";
my @file = <FILE>;
my $all = join("",@file);
$all =~ s/\n/ /g;
$all =~ s/\./\. /g;
$all =~ s/\s\s*/ /g;
open (FIL,">paraone.txt")||die "Can't open paraone.txt";
print FIL $all;
close(FILE);
close(FIL);


I have attached paragraph.txt, which contains the spaces that I cannot remove. For example:
 this space  Confusion and ensuing controversy also arose from the PANDAS
Please help...

Attachments paragraph.txt (0.44KB)
PhoenixInsilico
 

According to my text editor they are line feeds (0x0A). Perhaps this link can help.

pritaeas
 

Your file contains ordinary spaces and other bytes as well:

$ od -tx1 -c -A d paragraph.txt | more
0000000  3c  70  20  73  74  79  6c  65  3d  27  74  65  78  74  2d  61
          <   p       s   t   y   l   e   =   '   t   e   x   t   -   a
0000016  6c  69  67  6e  3a  6a  75  73  74  69  66  79  27  3e  a0  a0
          l   i   g   n   :   j   u   s   t   i   f   y   '   > 240 240
0000032  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  a0  20  43  6f
        240 240 240 240 240 240 240 240 240 240 240 240 240       C   o
0000048  6e  66  75  73  69  6f  6e  20  61  6e  64  20  65  6e  73  75
          n   f   u   s   i   o   n       a   n   d       e   n   s   u

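Notice that what you think is a space (0x20) is really 0xa0, which is the non-breaking space in Latin-1 and Windows-1252 (what &nbsp; usually turns into). Since you are reading raw bytes, \s does not match it. Use this as a starting point. I think there are other non-ASCII bytes to filter out as well; for example, after the word "infection" there is a 0x94, which will also need to be filtered out.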

#!/usr/bin/perl
use strict;
use warnings;

# Read the whole file as raw bytes.
open(my $in, '<', 'paragraph.txt') or die "Can't open paragraph.txt: $!";
my $all = join('', <$in>);
close($in);
print $all;

$all =~ s/\n/ /g;   # line feeds become spaces
$all =~ s/\./. /g;  # put a space after every period
# ---> Notice the use of \xa0
$all =~ s/[\xa0\s]+/ /g;  # <------ Look here: collapse any run of 0xa0 and whitespace
print $all;

open(my $out, '>', 'paraone.txt') or die "Can't open paraone.txt: $!";
print $out $all;
close($out);
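
If you want to check for stray bytes from inside Perl instead of od, something along these lines might help. It is only a rough sketch: non_ascii_bytes and normalize_spaces are names I made up for illustration, and dropping every remaining high byte (such as the 0x94) is an assumption that may not suit your data. If the file is really Windows-1252, decoding it properly could be the better route.

```perl
use strict;
use warnings;

# Hypothetical helper: count each distinct non-ASCII byte in a string.
# Returns a hash reference keyed by the byte value in hex, e.g. '0xa0'.
sub non_ascii_bytes {
    my ($bytes) = @_;
    my %count;
    while ($bytes =~ /([^\x00-\x7f])/g) {
        my $key = sprintf('0x%02x', ord($1));
        $count{$key}++;
    }
    return \%count;
}

# Hypothetical helper: collapse runs of whitespace and 0xa0 into one space,
# then delete any other non-ASCII bytes.
sub normalize_spaces {
    my ($bytes) = @_;
    $bytes =~ s/[\s\xa0]+/ /g;    # ordinary whitespace plus 0xa0
    $bytes =~ s/[^\x00-\x7f]//g;  # assumption: remaining high bytes are unwanted
    return $bytes;
}

# Made-up sample resembling the start of paragraph.txt.
my $sample = "<p>\xa0\xa0\xa0 Confusion and\n ensuing infection\x94 controversy";

my $seen = non_ascii_bytes($sample);
print "$_ occurs $seen->{$_} time(s)\n" for sort keys %$seen;
print normalize_spaces($sample), "\n";
```

Running non_ascii_bytes on the real file contents first shows which bytes are there, so you can decide whether to delete them or map them to something readable.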
histrungalot
 

This question has already been solved
