Greetings all,

I'm actually proud to have a question I've never seen asked here. I've got a 1.5GB text file (YES, it's really 1.5GB) where each line needs a set of modifications. I've got the application running in a single thread: each line is read in, the modification is made, and the line is written back out.

This is fine, but it's slow. It took about an hour for the application to traverse this file. The I/O itself is very fast; it's just that the processing takes a second to do. So here's what I'd like to try:

While the reader has text in it, read a line.
Start a new thread to process the line, write the modified line to a file.

Sounds simple enough. The only thing is that each "line" could be anywhere from 1KB to a couple hundred MB apiece. This isn't a problem with the single-threaded application, but it is an issue when you're dealing with multiple threads, even on a quad-proc machine with 4GB of RAM.

Basically, I'd like to dynamically create a new thread to deal with each line that is in the TextReader. Once the thread finishes, it should write its data to file and release its resources. Beyond that, I'd like to find a way to do it without using too much memory, or to deal with OutOfMemory exceptions "cleanly", i.e., without loss of data from the TextReader.

This might sound kind of complicated, I don't know. Can anyone suggest a direction to roll, or even a better strategy?

A dynamic count of threads isn't always a good idea; in fact, it's often a bad idea. Since you have a quad-core machine, you should start with 4 threads (one per core), then try 8, 16, 20, etc. until you see the logarithmic (diminishing-returns) effect of too many threads. Adding more and more threads won't always make things faster, as context switching at the CPU level isn't cheap.
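
As a rough illustration of the fixed-count approach, here is a minimal sketch in Java (not the .NET code from the question; the same shape should carry over to a .NET thread pool). The class and method names are hypothetical. It assumes a pool of `workers` threads and a small bounded queue, so the reader cannot run far ahead of the workers; when the queue is full, the reading thread processes the line itself. Output order is completion order, not input order.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

class FixedPoolLineProcessor {
    // Hypothetical helper: reads lines, transforms each one on a fixed-size
    // pool, and writes the results in whatever order they finish.
    static void processLines(BufferedReader in, Writer out,
                             UnaryOperator<String> transform, int workers)
            throws IOException, InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(workers),           // at most `workers` lines waiting
                new ThreadPoolExecutor.CallerRunsPolicy());  // queue full: reader runs the task itself

        String line;
        while ((line = in.readLine()) != null) {
            final String current = line;
            pool.execute(() -> {
                String result = transform.apply(current);
                synchronized (out) {                         // one writer at a time
                    try {
                        out.write(result);
                        out.write('\n');
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            });
        }
        pool.shutdown();                                     // no new tasks; let queued ones finish
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    }
}
```

Starting with `workers` set to the core count and timing a few larger values is one way to find the point where extra threads stop helping.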

What's interesting is that I did take that approach, and it did seem to work. However, I had some concerns. What if, say, 20 threads per core worked on one machine but not on another?

I wound up using the ThreadPool. With a 4-core processor it made 1000 threads available. Oddly enough, it worked like a charm; I didn't run into any OutOfMemory exceptions. I know it could have been a fluke, so I'm still going to investigate a good way to manage memory in this environment. I don't want to have to write my own TextReader to properly seek when an exception happens, but if I must, I must...

How do you manage the threads? Are you just triggering and then joining them?

This link may help: http://www.cs.cf.ac.uk/Dave/C/node29.html I also work with threads; in my case the threads need to be concurrent (and they share information between them), just like small daemons inside the process.

Good luck!

Thanks!

The only thing concurrent about my threads, really, is that they share a single text reader and text writer. I deal with that using the Synchronized() methods on each of those. The order in which I write the processed lines to file doesn't matter much. It just matters that they all get in.

I'm managing the threads "Automagically", I guess you could say. I'm using ThreadPool.QueueUserWorkItem to add jobs to the thread pool. Then the thread pool runs jobs as the threads become available. It's actually working now, which I found kind of funny... Still doesn't seem like the memory management issue is being addressed. But hey, in this case I'm just going to run with it for now, because it's showing results that are satisfactory enough... for now.
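
One possible way to address the memory side, sketched in Java rather than .NET and using hypothetical names throughout: give the reader a fixed "budget" of permits and charge each line roughly one permit per 1024 characters before handing it to the pool. The reader then blocks while too much data is queued or in flight, instead of running into an out-of-memory condition. The 1024-character unit and the cap for oversized lines are assumptions made for illustration; a line larger than the whole budget is simply allowed to run alone. Note that the line just read is already in memory before the budget is checked, so this limits queued and in-flight data, not the single largest line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

class MemoryBudgetedProcessor {
    // Cost of a line in permits: one per 1024 chars (rounded up by one),
    // capped at the full budget so an oversized line can still run alone.
    static int permitsFor(int lineLength, int budgetPermits) {
        int cost = lineLength / 1024 + 1;
        return Math.min(cost, budgetPermits);
    }

    static void processLines(BufferedReader in, Writer out, UnaryOperator<String> transform,
                             int workers, int budgetPermits)
            throws IOException, InterruptedException {
        Semaphore budget = new Semaphore(budgetPermits);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            String line;
            while ((line = in.readLine()) != null) {
                final String current = line;
                final int cost = permitsFor(current.length(), budgetPermits);
                budget.acquire(cost);          // reader waits until enough earlier work has finished
                pool.execute(() -> {
                    try {
                        String result = transform.apply(current);
                        synchronized (out) {   // one writer at a time
                            out.write(result);
                            out.write('\n');
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    } finally {
                        budget.release(cost);  // give the memory budget back
                    }
                });
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        }
    }
}
```

Because only the reading thread ever acquires permits, a large request cannot be starved by smaller ones; it just waits for the in-flight lines to drain.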
