Regular expressions - HTML - to TXT

Question

Lisa_14 0 Newbie Poster

7 Years Ago

Hello friends, I'm trying to remove all tags from a Wikipedia entry, leaving a simple text file. I downloaded an HTML file from Wikipedia and ran it through my program.
But the tags are not removed properly; only nonsense comes out. Where is my mistake?
Java:

import java.util.Scanner;
import java.io.File;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Reg {
    public static void main(String[] args)  throws Exception {
        File file = new File("test.html");
        Path path = Paths.get(args[0]);
        byte[] raw = Files.readAllBytes(path);
        String text = new String(raw, "UTF8");
        text = text.replaceAll("<script.*>.*</script>", "");
        text = text.replaceAll("<.*>", "");
        text = text.replaceAll("</.*>", "");
        PrintWriter output = new PrintWriter("test.txt");
        output.print(text);
    }
}

html-css java xml



All 2 Replies

JamesCherrill 4,733 Most Valuable Poster

7 Years Ago

Line 13 (File file = new File("test.html");): you create a File but never use it.
Line 14 (Path path = Paths.get(args[0]);): what value are you passing as the (first) run-time argument?

Your regex searches for a < followed by any number of any character (including >), and because * is greedy it matches everything from the first < to the last > on each line.
I'm no regex expert, but I used "<[^>]*>" to match a < followed by any number of characters EXCEPT >, followed by a >, which worked for me. I'm sure our experts here can show you a better way to construct your regex.
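
To make the difference concrete, here is a minimal sketch comparing the greedy pattern from the question with the "<[^>]*>" pattern. The class name TagStripper, the method stripTags, and the sample HTML string are hypothetical, chosen only for illustration. The script pattern is one possible way to drop script blocks that span several lines, not necessarily the best one.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class TagStripper {
    // (?is): case-insensitive, and '.' also matches line breaks, so a script
    // block spanning several lines is removed. '*?' is reluctant, so the
    // match stops at the first closing </script> tag.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    // '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        String withoutScripts = SCRIPT.matcher(html).replaceAll("");
        return TAG.matcher(withoutScripts).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        // Greedy pattern from the question: one match from the first '<' to the last '>'.
        System.out.println("[" + html.replaceAll("<.*>", "") + "]"); // prints []
        // Negated character class: each tag is matched on its own.
        System.out.println("[" + stripTags(html) + "]");             // prints [Hello world]
    }
}
```

Separately from the regex, the program in the question never closes its PrintWriter, so test.txt may end up empty or truncated. Opening the writer in a try-with-resources block would take care of that.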



pty 882 Posting Pro · Answer 1 · 2018-02-22T12:37:34+00:00

Also, using regular expressions to parse HTML on a large scale is a bad idea.
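
A small sketch of one way this goes wrong, assuming the "<[^>]*>" pattern from the earlier reply: a > inside an attribute value ends the match too early. The class name RegexLimit, the method naiveStrip, and the sample markup are hypothetical.

```java
// Hypothetical demo class, for illustration only.
public class RegexLimit {
    // Strips anything that looks like <...>, with no knowledge of HTML syntax.
    static String naiveStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The '>' inside the quoted title attribute is legal HTML,
        // but the regex treats it as the end of the tag.
        String html = "<a title=\"x > y\">link</a>";
        System.out.println("[" + naiveStrip(html) + "]"); // prints [ y">link]
    }
}
```

A real HTML parser tracks quoting, comments, and entities, which a single pattern like this does not.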
