Hello friends, I'm trying to remove all tags from a Wikipedia entry so that only a plain text file is left. I downloaded an HTML file from Wikipedia and ran it through my program.
But the tags are not removed properly; only nonsense comes out. Where is my mistake?
Java:

import java.util.Scanner;
import java.io.File;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Reg {
    public static void main(String[] args)  throws Exception {
        File file = new File("test.html");
        Path path = Paths.get(args[0]);
        byte[] raw = Files.readAllBytes(path);
        String text = new String(raw, "UTF8");
        text = text.replaceAll("<script.*>.*</script>", "");
        text = text.replaceAll("<.*>", "");
        text = text.replaceAll("</.*>", "");
        PrintWriter output = new PrintWriter("test.txt");
        output.print(text);
    }
}

All 2 Replies

Line 13: you create a File but never use it.
Line 14: what value are you passing as the (first) runtime argument?

Your regex searches for a < followed by any number of any character (including >). Because .* is greedy, it matches everything from the first < to the last > on a line, including the text between the tags.
I'm no regex expert, but I used "<[^>]*>" to match a < followed by any number of any character EXCEPT >, followed by a >, which worked for me. I'm sure our experts here can show you a better way to construct your regex.
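
For illustration, here is a minimal sketch of how that pattern could be applied to the program above. The class name TagStripper, the method stripTags, and the second command-line argument for the output file are my own assumptions, not part of the original post. I also added style to the script pattern as a guess at what a Wikipedia page needs, and used (?is) with a reluctant .*? so that a script block spanning several lines is removed as one unit. Writing with Files.write avoids an unclosed PrintWriter, whose buffered output may otherwise never reach the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TagStripper {

    // Removes script/style elements together with their content,
    // then removes every remaining tag. Entities such as &amp; are left as-is.
    static String stripTags(String html) {
        // (?is): case-insensitive, and '.' also matches line breaks.
        // .*? is reluctant, so each match ends at the first matching closing tag.
        String text = html.replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", "");
        // '<', then any characters except '>', then '>'.
        return text.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java TagStripper input.html output.txt
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        String html = new String(raw, StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), stripTags(html).getBytes(StandardCharsets.UTF_8));
    }
}
```

This still treats HTML as flat text, so a > inside an attribute value or a comment can confuse it.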

Also, using regular expressions to parse HTML on a large scale is a bad idea.
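
As a rough sketch of the parser-based alternative, the JDK itself ships an HTML parser in the Swing packages that reports text through a callback. The class name ParserTextExtractor and the method extractText are hypothetical names of mine. Note that the Swing parser is built around HTML 3.2, so I would only assume it copes approximately with a modern Wikipedia page; a dedicated third-party library such as jsoup is the more common choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ParserTextExtractor {

    // Collects the text nodes reported by the parser, separated by single spaces.
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                out.append(data).append(' ');
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The parser decides where a tag starts and ends, so the code never has to guess at tag boundaries with a pattern.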
