I am building a program that will 'integrate' with a website I have built. The user wants a client-based app that can upload text from a .doc file to the website.

The website uses a database to store the text in HTML format.
The HTML produced by Word is bloated and pretty useless to any program other than Word (but then you already know that).

What I want is to take the HTML output from Word, strip out all the unneeded tags, and send the result to the DB (that last bit I can do).

Or even better, take the text and formatting straight from a .doc and strip all formatting other than the basics (bold, italics, underlines, tables, etc.).
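For the first option (cleaning up Word's HTML), a whitelist filter is one possible approach: keep only the tags you care about, drop every attribute, and throw away the contents of style and head blocks. Below is a minimal sketch in Python using the standard library's html.parser; the same idea should carry over to VB.NET. The names `ALLOWED_TAGS`, `BasicFormattingFilter` and `strip_word_html` are made up for illustration, and the tag list is an assumption about what counts as "basic" formatting.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist of "basic" formatting tags; adjust to taste.
ALLOWED_TAGS = {"p", "br", "b", "strong", "i", "em", "u",
                "table", "tr", "td", "th"}
# Elements whose entire content is discarded (CSS, scripts, metadata).
SKIPPED_CONTENT = {"head", "style", "script", "title", "xml"}
# Whitelisted tags that never get a closing tag.
VOID_TAGS = {"br"}


class BasicFormattingFilter(HTMLParser):
    """Re-emits whitelisted tags without attributes, plus the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # pieces of the cleaned output
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes (class, style, lang, ...) are dropped on purpose.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif (self.skip_depth == 0 and tag in ALLOWED_TAGS
              and tag not in VOID_TAGS):
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text arrives unescaped, so escape it again for output.
            self.parts.append(escape(data, quote=False))


def strip_word_html(html_text):
    """Return html_text reduced to whitelisted tags and plain text."""
    parser = BasicFormattingFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Calling `strip_word_html` on the text of a saved Word HTML file returns a string that keeps the paragraph, bold/italic/underline and table structure but loses the span wrappers and inline styles. Tags outside the whitelist are removed while their text is kept, so nothing the author typed should disappear.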

Apart from a short look at VB.NET in college, I haven't really used it. I have no idea if this is feasible, and Google doesn't bring up much useful information.

Any ideas?

All 3 Replies

The HTML produced by Word is bloated and pretty useless to any program other than Word (but then you already know that).

Yes, I do know that very well.

The first solution that came to mind is to search for a third-party component that would "parse" Word documents into plain HTML.

But if I got it right, your customer saves Word files as HTML, right? There are two types of HTML file Word can produce (see 'Save as type'): 'Web Page' and 'Web Page, Filtered'. Could your customer use the latter option? The file still contains some MS 'formatting', but the result is less bloated.

Thanks.

Just looked at that, and the filtered option actually outputs reasonable HTML!

One more question now:
Can I get VB to open a document, save it as filtered HTML (say, in the application's path), and then read the contents of the filtered document so they can be sent as POST data to the server? The people who will be updating the site know very little; they just type up the files and 'can't' learn new things (I do hate working with people who say can't before making any attempt).

So basically,
1. Open a .doc
2. Save as filtered .html
3. Get the .html
4. (Processing)
5. Delete the .html file
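Steps 3 to 5 can be sketched without Word being involved at all. Steps 1 and 2 need Word automation (as far as I know, Word's object model has a filtered HTML save format, `wdFormatFilteredHTML`), so they are left out here. This is a rough Python sketch, not VB, just to show the shape of it. The function names, the `content` field name and the cp1252 default encoding are all assumptions; check what encoding Word actually writes on your machines.

```python
import os
from urllib.parse import urlencode


def read_then_delete(html_path, encoding="cp1252"):
    """Steps 3 and 5: return the file's text, then always remove the file.

    The cp1252 default is an assumption about how Word saved the file.
    """
    try:
        with open(html_path, encoding=encoding) as handle:
            return handle.read()
    finally:
        # Runs after the file is closed, even if reading failed.
        if os.path.exists(html_path):
            os.remove(html_path)


def build_post_body(html_text, field_name="content"):
    """Step 4: form-encode the HTML so it can be sent as POST data.

    field_name is a hypothetical form field; use whatever the site expects.
    """
    return urlencode({field_name: html_text}).encode("ascii")
```

The returned bytes are suitable for a request with the `application/x-www-form-urlencoded` content type. Deleting in a `finally` block means the temporary .html file does not pile up in the application's path when something goes wrong while reading.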

I remember reading about a component on the Microsoft MSDN site which handled Word files, but I can't remember the name of it :(
