
Hello friends, I'm trying to remove all tags from a Wikipedia entry, leaving a plain text file. I downloaded an HTML file from Wikipedia and ran it through my program.
But the tags are not removed properly. Instead, only nonsense comes out. Where is my mistake?
Java:

import java.util.Scanner;
import java.io.File;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Reg {
    public static void main(String[] args)  throws Exception {
        File file = new File("test.html");
        Path path = Paths.get(args[0]);
        byte[] raw = Files.readAllBytes(path);
        String text = new String(raw, "UTF8");
        text = text.replaceAll("<script.*>.*</script>", "");
        text = text.replaceAll("<.*>", "");
        text = text.replaceAll("</.*>", "");
        PrintWriter output = new PrintWriter("test.txt");
        output.print(text);
    }
}

Edited by happygeek: spam link deleted

Line 13: you create a File but never use it.
Line 14: what value are you passing as the (first) runtime argument?

Your regex searches for a < followed by any number of any character (including >). Because .* is greedy, it basically matches everything from the first < to the last > on a line, including the text in between.
I'm no regex expert, but I used "<[^>]*>" to match a < followed by any number of any character EXCEPT >, followed by >, and that worked for me. I'm sure our experts here can show you a better way to construct your regex.
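
To make the difference concrete, here is a minimal sketch that puts the greedy pattern from the question next to the "<[^>]*>" version. The class name StripTags, the method names, and the two command-line arguments (input file, output file) are my own assumptions for illustration and are not taken from the original program. The script pattern with the (?is) flags and the reluctant .*? is likewise just one possible way to drop multi-line script blocks. Regular expressions are not an HTML parser, so comments, style blocks and entities such as &amp; are left as they are.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripTags {

    // Pattern from the question: ".*" is greedy, so one match runs from the
    // first '<' to the LAST '>' on the same line and swallows the text between.
    public static String stripGreedy(String html) {
        return html.replaceAll("<.*>", "");
    }

    // Suggested fix: "[^>]*" cannot cross a '>', so each match is a single tag.
    public static String stripTags(String html) {
        // (?i) ignores case, (?s) lets '.' match line breaks as well, and the
        // reluctant ".*?" stops at the first closing </script>.
        String text = html.replaceAll("(?is)<script\\b[^>]*>.*?</script>", "");
        return text.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java StripTags input.html output.txt
        if (args.length < 2) {
            System.err.println("usage: java StripTags <input.html> <output.txt>");
            return;
        }
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // Files.write opens, writes and closes the file in one call.
        Files.write(Paths.get(args[1]), stripTags(html).getBytes(StandardCharsets.UTF_8));
    }
}
```

With the input "a <b>bold</b> word", stripGreedy returns "a  word" (the word "bold" is lost), while stripTags returns "a bold word". Note also that the program in the question never closes or flushes its PrintWriter, so buffered output may never reach test.txt. Writing through Files.write, or using try-with-resources around the writer, avoids that.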

Edited by JamesCherrill
