preprocessing with html file

Question

aseeman 0 Newbie Poster

9 Years Ago

Hi everybody,
I have a file (newfolder.html) and I want to preprocess its content: tokenization, removing stop words, and counting the number of words. I know how to do these operations on a text file (.txt), but now I have to do them on an HTML file.
How can I do it?
Thanks

blackmiau 0 Junior Poster

9 Years Ago

You just put what you have on the text file, using the HTML tags.

JorgeM 958 Problem Solver

9 Years Ago

I know how to do these operations if I have a text file(.txt)

An HTML file is text as well, so have you tried the same approach you would use for a text file? HTML is plain text, just saved with a .html extension so the computer knows which application to open it with. I'm not sure I understand your question.

aseeman 0 Newbie Poster · Answer 1 · 2014-05-23T11:45:56+00:00

Would you explain it more?

aseeman 0 Newbie Poster · Answer 2 · 2014-05-29T03:11:10+00:00

    using System.IO;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            // using blocks close both files even if an exception is thrown
            using (StreamReader file = new StreamReader("e:\\New folder.html"))
            using (StreamWriter w = new StreamWriter("e:\\new folder1.txt"))
            {
                string line;
                while ((line = file.ReadLine()) != null)
                {
                    // Remove HTML tags (only tags that open and close on this line)
                    string value = Regex.Replace(line, @"<.*?>", string.Empty);

                    // Remove everything but letters, numbers and whitespace characters
                    value = Regex.Replace(value, @"[^\w\s]", string.Empty);

                    // Remove multiple whitespace characters
                    value = Regex.Replace(value, @"\s+", " ");

                    // Split the line into words and write one word per line
                    char[] delimiters = { ' ', ',', '\r', '.' };
                    string[] word = value.Split(delimiters);
                    for (int i = 0; i < word.Length; i++)
                        w.WriteLine(word[i]);
                }
            }
        }
    }

aseeman 0 Newbie Poster · Answer 3 · 2014-05-29T03:15:37+00:00

Hi,
As you recommended, I treated the file as text and used the code above. But my output is strange; I think that's because I'm not familiar with HTML tags.
My HTML file is like this:
    Mining Software Repositories
    From Wikipedia, the free encyclopedia
    This article provides insufficient context…

And I want to have this output:

    Mining
    Software
    Repositories
    From
    Wikipedia
    .
    .
    .
But my output is like this:
   Mining
   Software
   Repositories
   Wikipedia
   the
   free
   encyclopedia

   alangaralangkkarabalangmznalangpsalangurtextdecorationnone
   cache
   key

   this
   article
   . . .
What's the problem?
Thanks in advance.

hericles 289 Master Poster Featured Poster · Answer 4 · 2014-05-29T03:39:52+00:00

Clearly, part of the HTML source is being left behind after you strip the tags. The last part of the random-looking string is "textdecorationnone", which I presume was originally "text-decoration:none" and is, of course, CSS.
So have a look at the actual source of your HTML and see what else you need to consider.
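
One way to act on this advice is to remove whole <style> and <script> elements (contents included) before stripping the remaining tags, and to run the regular expressions over the entire document rather than line by line, since these blocks usually span several lines. The following is only a minimal sketch under those assumptions: the names HtmlText, StripHtml, and Tokenize are hypothetical helpers, not something from the thread, and the sample HTML string is made up. Regular expressions are a rough tool for HTML, so a real HTML parser may be the safer choice for messy pages.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlText
{
    // Hypothetical helper (not from the thread): returns the visible text of an HTML string.
    public static string StripHtml(string html)
    {
        // Singleline lets '.' match newlines, so multi-line blocks are removed whole.
        const RegexOptions opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b.*?</\1\s*>", " ", opts);
        // Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", opts);
        // Drop the remaining tags, keeping the text between them.
        text = Regex.Replace(text, @"<[^>]*>", " ", opts);
        // Turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(text);
    }

    // Hypothetical helper: splits text into words, dropping punctuation and empty entries.
    public static string[] Tokenize(string text)
    {
        text = Regex.Replace(text, @"[^\w\s]", " ");
        // A null separator array means "split on any whitespace".
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample input standing in for the real file.
        string html = "<html><head><style>\na { text-decoration: none; }\n</style></head>"
                    + "<body><h1>Mining Software Repositories</h1>"
                    + "<p>From Wikipedia, the free encyclopedia</p></body></html>";

        // Print one word per line; the CSS rule does not appear in the output.
        foreach (string word in HtmlText.Tokenize(HtmlText.StripHtml(html)))
            Console.WriteLine(word);
    }
}
```

For the file in the question, the sample string could be replaced with File.ReadAllText("e:\\New folder.html") (from System.IO), which reads the whole document at once. Note that this sketch replaces punctuation with a space instead of deleting it, so a hyphenated term splits into two words rather than fusing into one.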
