Hi everybody,
I have an HTML file (newfolder.html) and I want to preprocess its content: tokenization, removing stop words, and counting the number of words. I know how to do these operations on a text file (.txt), but now I have to do them on an HTML file.
How can I do it?
Thanks

I know how to do these operations if I have a text file(.txt)

An HTML file is text as well, so have you tried using the same approach as you would for a text file? HTML is plain text, just saved with a .html extension so the computer knows which application to open it with. I'm not sure I understand your question.

    int i;
    string line;
    StreamReader file = new StreamReader("e:\\New folder.html");
    StreamWriter w = new StreamWriter("e:\\new folder1.txt");
    while ((line = file.ReadLine()) != null)
    {
        string value;

        // Remove HTML tags
        value = Regex.Replace(line, @"<.*?>", string.Empty);

        // Remove everything but letters, numbers and whitespace characters
        value = Regex.Replace(value, @"[^\w\s]", string.Empty);

        // Collapse runs of whitespace into a single space
        value = Regex.Replace(value, @"\s+", " ");

        char[] delimiters = { ' ', ',', '\r', '.' };
        string[] word = value.Split(delimiters);
        for (i = 0; i < word.Length; i++)
            w.WriteLine(word[i]);
    }

    file.Close();
    w.Close();
Hi,
As you recommended, I treated it as a text file and used the code above. But my output is strange; I think that's because I'm not familiar with HTML tags.
My HTML file looks like this:
    Mining Software Repositories
    From Wikipedia, the free encyclopedia
    This article provides insufficient context…

And I want to have this output:

    Mining
    Software
    Repositories
    From
    Wikipedia
    .
    .
    .
but my output is like this:
   Mining
   Software
   Repositories
   Wikipedia
   the
   free
   encyclopedia

   alangaralangkkarabalangmznalangpsalangurtextdecorationnone
   cache
   key

   this
   article
   . . .
What's the problem?
Thanks in advance.

There is clearly some code in the HTML that is being left behind after you strip the tags. The last part of the random-looking string is
"textdecorationnone", which I presume was "text-decoration:none" before the punctuation was removed, and that is, of course, CSS.
So have a look at the actual source of your HTML and see what else you need to handle.
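
To make that concrete, here is a minimal sketch of one way to handle it, assuming the leftover text comes from <style> and <script> blocks (the actual page source isn't shown, so that is a guess). The class name HtmlText and the method ExtractWords are made up for illustration. The idea is to remove comments and whole script/style elements first, and only then strip the remaining tags. Regular expressions are not a real HTML parser, so treat this as an approximation that works on reasonably well-formed pages.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlText
{
    // Returns the visible words of an HTML string, in document order.
    public static string[] ExtractWords(string html)
    {
        // Singleline lets '.' match newlines, so blocks spanning several lines are caught.
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Drop comments, then <script> and <style> elements together with their content.
        string text = Regex.Replace(html, @"<!--.*?-->", " ", opts);
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ", opts);

        // Replace the remaining tags with a space so neighbouring words do not merge.
        text = Regex.Replace(text, @"<[^>]*>", " ", opts);

        // Turn entities such as &amp; or &nbsp; into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        // Remove everything but letters, digits, underscores and whitespace.
        text = Regex.Replace(text, @"[^\w\s]", string.Empty);

        // A null separator array splits on any whitespace; empty entries are dropped.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Note that this needs the whole document in one string, for example from File.ReadAllText, rather than one line at a time: a style or script block usually spans several lines, so a per-line regex never sees both its opening and closing tag. If the pages are messy, a proper HTML parser such as the HtmlAgilityPack library is likely to be more robust than regular expressions.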
