Hi everyone,

I'm a JavaScript/ActionScript/Java/Ruby developer who's been given a project that must be done in SharePoint. One of the tasks is to write a console application that replicates data from the secure SharePoint site and pushes it out to a public site in a read-only format. Some of the data is transformed before it's sent to the public site, and somewhere during this transformation or file transfer process I'm getting a bunch of weird characters in my content.

I'm using the HtmlAgilityPack to read and navigate the content of the SharePoint site, and a bunch of XPath transformations and regexes to handle the transformations of the content itself. Things like spaces, empty divs, quotation marks, etc. in the content are changing into things that look like this:

double quotes become "â€"

single quotes become "’"

"<div> </div>" becomes "<div>Â </div>"

Can anyone explain to me why this is happening? This is my first SharePoint project, first .NET project in about 8 years, and first C# project. I've been impressed with the speed of development so far, and really hope I can figure out where these strange characters are coming from! If this project goes well, I may actually consider developing in C# on a regular basis. Any help is greatly appreciated!

Can you upload a small sample project demonstrating this behavior? Include a data file that is being translated and the libraries required to compile the project. This could be an encoding issue, but it's hard to say since you're doing transformations on the data.
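
For what it's worth, here is a minimal sketch of how an encoding mismatch can produce that kind of output. It encodes a non-breaking space as UTF-8 and then decodes the bytes with a single-byte encoding. The class and method names (MojibakeDemo, MisDecode) are made up for illustration, and ISO-8859-1 (code page 28591) is an assumption chosen because it is available on every .NET runtime. The "â€" sequences look more like Windows-1252, which differs from ISO-8859-1 only in the 0x80-0x9F range.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: encode the text as UTF-8, then decode those
    // bytes with the wrong encoding (ISO-8859-1, code page 28591).
    public static string MisDecode(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(28591).GetString(utf8Bytes);
    }

    static void Main()
    {
        // U+00A0 (non-breaking space) is the two bytes C2 A0 in UTF-8.
        string original = "<div>\u00A0</div>";
        string garbled = MisDecode(original);

        // Each byte is read as its own character, so an extra U+00C2 ("Â")
        // appears in front of the non-breaking space.
        Console.WriteLine(garbled == "<div>\u00C2\u00A0</div>"); // prints True
    }
}
```

If the real data shows the same pattern, the bytes themselves are probably fine and the problem is in whichever step turns bytes into a string.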

Hi and thanks.

I've a frozen version of my project at:


You will not be able to access the remote files on our secure SharePoint server, but the unaltered files are included in the /unaltered directory, so you can see the source files there.

You should be able to easily track a single file through the process. An HtmlAgilityPack.HtmlDocument is created, the remote HTML is loaded into it, the document is parsed, and then uploaded to its new location on the public server. Somewhere along the way the strange characters are being introduced. It doesn't appear to be in the parse function or other transformation functions, as I can comment those functions out and the strange characters are still in the resulting files.

Thanks again for any help, let me know if you have any questions about specific bits.

Excellent. I am about to leave for the day, but I will take a look at this in the morning and see what we can do to help. I haven't opened the project yet, but it's normally like pulling teeth to get this much code and sample data :)

The site isn't responding to my GET requests.

Strange, the file is out there...

That's alright, I figured out the problem. It looks like there was a text encoding problem with the WebClient object. I fixed this by writing the content from the WebClient to a byte array, then creating a string from that by running it through a UTF8Encoding object. Here's the code:

            // Requires: using System.Net; using System.Text; using HtmlAgilityPack;
            // files_to_update and full_temp_path are defined elsewhere in the project.
            WebClient client = new WebClient();
            client.Credentials = CredentialCache.DefaultCredentials;  // use the current user's credentials
            string base_url = "http://sharepoint.dis.wa.gov/athena/";
            UTF8Encoding utf = new UTF8Encoding();  // used to decode the downloaded bytes
            foreach (string file in files_to_update) {  // loop through each file
                string full_path = base_url + file + ".aspx";
                HtmlDocument document = new HtmlDocument();  // create a document
                byte[] content_b = client.DownloadData(full_path);  // download the raw bytes
                string content = utf.GetString(content_b);  // decode the bytes explicitly as UTF-8
                document.LoadHtml(content);  // load the decoded HTML into the HtmlDocument
                document.Save(full_temp_path);  // write the result to a temp file
            }
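
A possible alternative, in case it helps anyone who finds this thread later: WebClient has an Encoding property that it uses when downloading and uploading strings, so setting it to UTF-8 may remove the need for the byte array. This is only a sketch. The class and method names (Utf8Download, CreateUtf8Client) are hypothetical, the URL comes from the command line, and I have not run it against a SharePoint site.

```csharp
using System;
using System.Net;
using System.Text;

static class Utf8Download
{
    // Hypothetical helper: build a WebClient that treats string content as UTF-8.
    public static WebClient CreateUtf8Client()
    {
        WebClient client = new WebClient();
        client.Credentials = CredentialCache.DefaultCredentials; // current user's credentials
        client.Encoding = Encoding.UTF8; // encoding used by DownloadString / UploadString
        return client;
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: Utf8Download <url>");
            return;
        }

        using (WebClient client = CreateUtf8Client())
        {
            // DownloadString decodes the response body into a string for us.
            string html = client.DownloadString(args[0]);
            Console.WriteLine("Downloaded " + html.Length + " characters");
        }
    }
}
```

The byte-array version above is the more explicit of the two, since it always decodes the raw bytes as UTF-8 regardless of how the client is configured.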

Good catch. Thank you for posting the solution back on the thread.

Please mark this thread as solved (as you answered your own question :P) and good luck!
