Hello,

I have a very strange problem. I have written code that I believe uses multicore/parallel threading to make a task run faster by using more cores in the computer.

My computer is a server:
32 cores (2 CPU with 16 cores each)
32 GB RAM

Referring to my code:
- If I use 1 process (1 core), the task takes 12 minutes.
- If I use 2 processes (2 cores), the task takes 7 minutes. (This is the kind of gain I am after with multithreading.)
- If I use 4 processes (4 cores), the task also takes 7 minutes. (How is this possible? Shouldn't it take 3.5 minutes?)
- If I use 25 processes (25 cores), the task takes 18 minutes. (What happens with 25 cores? It takes even longer.)

As seen, if I use 25 processes it takes even longer, which is my very big question mark here. I have no idea why this is happening.

The thing is that I have 50 files to read, so with 1 core every file takes: 12 minutes / 50 files = about 14 seconds.
Now I start 25 backgroundworkers and send 2 files to each one in the hope that this will go much faster. In theory, all 25 backgroundworkers should complete in approximately 14 seconds x 2 = 28 seconds.
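
The estimate above can be written out as a small calculation. This is only a sketch of the ideal case, assuming the files are equally expensive and the workers never slow each other down; the names `IdealEstimate` and `SecondsForWorkers` are made up for illustration. With the figures from this post it gives 14.4 seconds per file and 28.8 seconds for 25 workers.

```csharp
using System;

static class IdealEstimate
{
    // Ideal wall-clock seconds if the files are spread evenly over the workers
    // and the workers do not interfere with each other (an assumption, not a measurement).
    public static double SecondsForWorkers(double totalMinutes, int fileCount, int workers)
    {
        double secondsPerFile = totalMinutes * 60.0 / fileCount;
        int filesPerWorker = (fileCount + workers - 1) / workers; // round up
        return secondsPerFile * filesPerWorker;
    }

    static void Main()
    {
        Console.WriteLine(SecondsForWorkers(12, 50, 1));  // all 50 files on one worker
        Console.WriteLine(SecondsForWorkers(12, 50, 25)); // 2 files on each of 25 workers
    }
}
```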

All backgroundworkers do work and in the end they produce the same result as 1 backgroundworker doing all 50 files. But it takes 18 minutes.

I simply can't understand why this is happening or what I am doing wrong. There is no blocking code in the backgroundworker, such as lock objects that let only one worker run at a time, so there is nothing that should stop the workers from running freely.

My code looks like the example below:

//Form1.cs
using System;
using System.Windows.Forms;
namespace HelloWorld
{
    public partial class Form1 : Form
    {
        private void button1_Click(object sender, EventArgs e)
        {
            //Use 1 core out of 32 cores (This process will take 12 minutes)
            int numberOfProcesses = 1;

            //Use 25 cores out of 32 cores (This process will take 18 minutes which is VERY strange? This should go faster than 12 minutes?)
            //When I use 25 cores. I divide the work in 25 pieces, so each process only needs to do 2 tasks instead of 50 tasks when only using 1 core.
            numberOfProcesses = 25;

            for (int i = 0; i < numberOfProcesses; i++)
            {
                HelloWorld.Instance ts = new HelloWorld.Instance();

                //1 core sends along 50 files. 25 cores sends along 2 different files each loop
                //I have not send along anything here for the purpose of the example and structure of code
                ts.doSomething();
            }
        }
    }
}

//Instance.cs
using System;
using System.Collections.Generic;
using System.ComponentModel; //BackgroundWorker and DoWorkEventArgs live here
using System.IO;
namespace HelloWorld
{
    public class Instance
    {
        public void doSomething()
        {

            BackgroundWorker bgWorker = new BackgroundWorker();
            bgWorker.WorkerSupportsCancellation = true;
            bgWorker.WorkerReportsProgress = true;
            bgWorker.DoWork += new DoWorkEventHandler(bgWorker_DoWork);
            bgWorker.ProgressChanged += (sender, args) => { progressCallBack(args.UserState.ToString()); };
            bgWorker.RunWorkerCompleted += (sender, args) => { completionCallBack(args.Result.ToString()); };


            //Start backgroundWorker
            if (bgWorker.IsBusy == false)
            {
                bgWorker.RunWorkerAsync();
            }
        }

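        private void bgWorker_DoWork(object sender, DoWorkEventArgs e)
        {
            String getline = "";
            //The using blocks close the stream and reader even if the work throws
            using (FileStream fs = new FileStream("C:/file1.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (StreamReader reader = new StreamReader(fs))
            {
                while (reader.Peek() >= 0)
                {
                    getline = reader.ReadLine();

                    //Do ALOT of work here
                    //This takes 12 minutes with 1 Core
                }
            }
            //Set a result so that args.Result is not null in RunWorkerCompleted
            e.Result = "done";
        }

        //The callbacks were not shown in the original example; these are empty placeholders
        private void progressCallBack(string message) { }
        private void completionCallBack(string result) { }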
    }
}
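
The example above leaves out how the 50 files are handed to the workers. One possible way to split them is sketched here, assuming a simple round-robin split; the class name `FileBatches` and the numbered file names are hypothetical, since the post only shows "C:/file1.txt".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FileBatches
{
    // Deal the files out round-robin so every worker gets a near-equal share.
    public static List<List<string>> Split(IList<string> files, int workers)
    {
        var batches = new List<List<string>>();
        for (int w = 0; w < workers; w++)
        {
            batches.Add(new List<string>());
        }
        for (int i = 0; i < files.Count; i++)
        {
            batches[i % workers].Add(files[i]);
        }
        return batches;
    }

    static void Main()
    {
        // Hypothetical file names, for illustration only.
        List<string> files = Enumerable.Range(1, 50).Select(n => "C:/file" + n + ".txt").ToList();
        List<List<string>> batches = Split(files, 25);
        Console.WriteLine(batches.Count + " batches, first has " + batches[0].Count + " files");
    }
}
```

Each batch could then be passed to its worker with bgWorker.RunWorkerAsync(batch) and read back inside DoWork through e.Argument.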

Edited 1 Year Ago by Darth Vader

I can think of 2 areas that have impact on time.

If this is a classic HDD, then seek times pile up as more files are open. Seek times are measured in milliseconds, so seeking around that many files could result in longer processing times. If you could read a lot of lines at a time, you would avoid a lot of seeks.

The second, which is less likely, is that you are going over the number of threads in one CPU. You may find this not to be true here, but I want to toss it out there.

More threads would mean more seeks. At some point the CPU gain is lost in IO.
Bob

I have a RevoDrive in my computer which reads/writes at 1250 MB/s, versus an HDD that only reads/writes at about 100 MB/s. So my drive is almost 3 times faster than an SSD, which is extremely fast, so this can't be the main problem in this case. Also, 95% of the time-consuming work is in the code itself, in other words Substring, calculations, IndexOf etc., so it depends on CPU cycles.

(Added text:
I just tried reading all the txt files into memory before the process runs, so I now use a for loop over the text files in memory.
I get the same slow result as in the benchmark below. So the hard drive is not the problem.)

As I have 32 cores, and as the benchmarks below show, the problem already occurs when I run the backgroundworker with 4 cores, which really confuses me. It sounds strange that I would have gone over the limit of threads that exist in my computer with only 4 cores.

  • If I use 1 process (1 core), the task takes 12 minutes.
  • If I use 2 processes (2 cores), the task takes 7 minutes. (This is the kind of gain I am after with multithreading.)
  • If I use 4 processes (4 cores), the task also takes 7 minutes. (How is this possible? Shouldn't it take 3.5 minutes?)
  • If I use 25 processes (25 cores), the task takes 18 minutes. (What happens with 25 cores? It takes even longer.)

Edited 1 Year Ago by Darth Vader

I have made a simple test that really shows the difference and the problem.
Using the code below, change:
The number "10" in numberOfTasks = numberOfTasks / 10; (how many threads run at the same time)

And comment out thread-start lines so that only that many threads are started: thread = new Thread(doSomething); thread.Start();

Below are the benchmark results, which are a big question mark:

            Thread thread = new Thread(doSomething); thread.Start(); //1 core: 19.51 seconds
            thread = new Thread(doSomething); thread.Start(); //2 core: 11.44 seconds (Should be: 9.75 seconds)
            thread = new Thread(doSomething); thread.Start(); //3 core: 9.0 seconds (Should be: 6.5 seconds)
            thread = new Thread(doSomething); thread.Start(); //4 core: 7.2 seconds (Should be: 4.9 seconds)
            thread = new Thread(doSomething); thread.Start(); //5 core: 6.4 seconds (should be: 3.9 seconds)
            thread = new Thread(doSomething); thread.Start(); //6 core
            thread = new Thread(doSomething); thread.Start(); //7 core
            thread = new Thread(doSomething); thread.Start(); //8 core
            thread = new Thread(doSomething); thread.Start(); //9 core
            thread = new Thread(doSomething); thread.Start(); //10 core: 4.8 seconds (should be: 1.9 seconds)

-------------------------------------------------------------------------------------------------
Now comes the mystery. If I instead open 10 instances of the complete application and run "1 task" in all of them at the very same time, which is equivalent to using 10 cores/threads and dividing numberOfTasks by 10, it takes only 2 seconds INSTEAD OF 4.8 seconds. That is what it should take, since 19.51 / 10 = approx. 2 seconds.

This means that my computer is able to handle 10 tasks at the same time at 10 times the speed, so why doesn't this work when starting 10 threads as in the code below? Why does it take 4.8 seconds and not 2 seconds as it should?
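
To put numbers on the gap, the timings in the benchmark comments can be turned into a speedup (single-thread time divided by multi-thread time) and an efficiency (speedup divided by the number of threads). The sketch below uses only the figures quoted in this post; the class name `Speedup` is made up. It works out to about 1.7 times faster with 2 threads and about 4.1 times faster with 10 threads, which is roughly 41% of perfect scaling.

```csharp
using System;

static class Speedup
{
    // Speedup is the single-thread time divided by the time with n threads.
    public static double Of(double singleSeconds, double parallelSeconds)
    {
        return singleSeconds / parallelSeconds;
    }

    // Efficiency is the speedup per thread; 1.0 would be perfect scaling.
    public static double Efficiency(double singleSeconds, double parallelSeconds, int threads)
    {
        return Of(singleSeconds, parallelSeconds) / threads;
    }

    static void Main()
    {
        // Timings copied from the benchmark comments in this post.
        int[] threads = { 2, 3, 4, 5, 10 };
        double[] seconds = { 11.44, 9.0, 7.2, 6.4, 4.8 };
        for (int i = 0; i < threads.Length; i++)
        {
            Console.WriteLine("{0} threads: speedup {1:F2}, efficiency {2:F2}",
                threads[i], Of(19.51, seconds[i]), Efficiency(19.51, seconds[i], threads[i]));
        }
    }
}
```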

        private void button1_Click(object sender, EventArgs e)
        {
            numberOfTasks = 10000000;


            //Divide by 10 if using 10 threads etc to get below results
            //Divide by 5 if using 5 threads etc to get below results
            numberOfTasks = numberOfTasks / 10;

            Thread thread = new Thread(doSomething); thread.Start(); //1 core: 19.51 seconds
            thread = new Thread(doSomething); thread.Start(); //2 core: 11.44 seconds (Should be: 9.75 seconds)
            thread = new Thread(doSomething); thread.Start(); //3 core: 9.0 seconds (Should be: 6.5 seconds)
            thread = new Thread(doSomething); thread.Start(); //4 core: 7.2 seconds (Should be: 4.9 seconds)
            thread = new Thread(doSomething); thread.Start(); //5 core: 6.4 seconds (should be: 3.9 seconds)
            thread = new Thread(doSomething); thread.Start(); //6 core
            thread = new Thread(doSomething); thread.Start(); //7 core
            thread = new Thread(doSomething); thread.Start(); //8 core
            thread = new Thread(doSomething); thread.Start(); //9 core
            thread = new Thread(doSomething); thread.Start(); //10 core: 4.8 seconds (should be: 1.9 seconds)
        }


        int numberOfTasks = 10000000;
        void doSomething()
        {
            DateTime start = DateTime.Now;
            String Stringtest = "123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789";
            String getline = "";String newLine = "";
            for (int i = 0; i < numberOfTasks; i++)
            {
                getline = Stringtest;
                for (int i2 = 0; i2 < 50; i2++)
                {
                    newLine = getline.Substring(i2, 2);
                }
            }
            DateTime end = DateTime.Now;
            TimeSpan span = new TimeSpan();
            span = end - start;

            //Show
            MessageBox.Show(span.TotalSeconds.ToString());
        }
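
The benchmark above reports one time per thread through a message box. A possible way to get a single wall-clock figure for the whole run is sketched below, assuming it is acceptable to start all threads, wait for them with Join, and time the lot with a Stopwatch. `ThreadTiming`, `Work` and `Run` are hypothetical names, and the iteration count is kept small so the sketch finishes quickly.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

static class ThreadTiming
{
    // Same kind of inner loop as doSomething above: 50 Substring calls per iteration.
    // Returns the number of characters produced so the work has a visible result.
    public static int Work(int iterations)
    {
        string text = string.Concat(Enumerable.Repeat("123456789", 12)); // 108 characters
        int chars = 0;
        for (int i = 0; i < iterations; i++)
        {
            for (int i2 = 0; i2 < 50; i2++)
            {
                chars += text.Substring(i2, 2).Length;
            }
        }
        return chars;
    }

    // Splits totalIterations over threadCount threads and times the whole run.
    public static TimeSpan Run(int threadCount, int totalIterations)
    {
        int perThread = totalIterations / threadCount;
        Thread[] threads = new Thread[threadCount];
        Stopwatch watch = Stopwatch.StartNew();
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() => { Work(perThread); });
            threads[t].Start();
        }
        foreach (Thread thread in threads)
        {
            thread.Join(); // wait until every thread has finished
        }
        watch.Stop();
        return watch.Elapsed;
    }

    static void Main()
    {
        foreach (int threadCount in new[] { 1, 2, 4 })
        {
            Console.WriteLine(threadCount + " thread(s): " + Run(threadCount, 200000).TotalSeconds + " s");
        }
    }
}
```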

Edited 1 Year Ago by Darth Vader

I can't tell if you've read prior discussions on how speed does not scale linearly with more cores/threads. 2 threads are not twice as fast. The OS has work to do, and that piles up and slows us down.

In one old app I didn't use new Thread but simply launched my app again with some parameters so it knew what to do. That way the OS would pick which CPU to run it on. And yes, it worked out great.
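
A rough sketch of that "launch the app again with parameters" idea is shown here. It assumes each copy is told which share of the work it owns through two command-line arguments (worker index and worker count); the executable name "Worker.exe" and the names `Launcher`, `ForWorker` and `LaunchAll` are hypothetical. Main only prints one command line, so nothing is started unless LaunchAll is called with a real path.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class Launcher
{
    // Describes one child process that handles share `index` out of `count`.
    public static ProcessStartInfo ForWorker(string exePath, int index, int count)
    {
        ProcessStartInfo info = new ProcessStartInfo(exePath, index + " " + count);
        info.UseShellExecute = false;
        return info;
    }

    // Starts `count` copies of the executable and waits for all of them to exit.
    public static void LaunchAll(string exePath, int count)
    {
        List<Process> processes = new List<Process>();
        for (int i = 0; i < count; i++)
        {
            processes.Add(Process.Start(ForWorker(exePath, i, count)));
        }
        foreach (Process process in processes)
        {
            process.WaitForExit();
            process.Dispose();
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 2)
        {
            // Worker mode: this copy was started with "index count".
            Console.WriteLine("worker " + args[0] + " of " + args[1]);
            return;
        }
        // Launcher mode: show what would be started for the first of 10 workers.
        ProcessStartInfo info = ForWorker("Worker.exe", 0, 10);
        Console.WriteLine(info.FileName + " " + info.Arguments);
    }
}
```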

You may want to watch what's running on which CPU the next time.
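
Besides watching a tool such as Task Manager, a program can report how many logical processors it sees and which of them it is allowed to run on. The sketch below is only one way to look at this, assuming a machine with at most 64 logical processors (the affinity value is a single bit mask) and a platform where Process.ProcessorAffinity is supported; `CoreInfo` and `CountAllowedCores` are made-up names.

```csharp
using System;
using System.Diagnostics;

static class CoreInfo
{
    // Counts the bits set in an affinity mask; each set bit is one allowed logical processor.
    public static int CountAllowedCores(long mask)
    {
        ulong bits = (ulong)mask;
        int count = 0;
        while (bits != 0)
        {
            count += (int)(bits & 1UL);
            bits >>= 1;
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine("Logical processors seen: " + Environment.ProcessorCount);
        try
        {
            using (Process current = Process.GetCurrentProcess())
            {
                long mask = current.ProcessorAffinity.ToInt64();
                Console.WriteLine("Processors this process may use: " + CountAllowedCores(mask));
            }
        }
        catch (PlatformNotSupportedException)
        {
            Console.WriteLine("Processor affinity is not available on this platform.");
        }
    }
}
```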

>> In one old app I didn't use new Thread but simply launched my app again with some parameters so it knew what to do. That way the OS would pick which CPU to run it on

This is what I am thinking of doing too. Like you, I have discovered that launching the app again does double the speed. I tried just now to launch the app 8 times and it was 8 times faster, almost to the second.

I do have 32 cores and 32 GB RAM in my system. So in that sense I have resources.
But as you mentioned: "OS would pick which CPU to run it on"
How does that work? Is one application "assigned" to a CPU, and then to 1 of the 32 cores?
When we create a "new Thread", isn't the thread "assigned" to a separate CPU in the same way?
It seems that something different is happening here. I am very curious what that is, so that it is possible to sort out why.

Edited 1 Year Ago by Darth Vader

Glad my old app idea worked out for you. As to the new Thread question, I never dove deep into that area, except to think that Microsoft's C# engine was the reason behind it all. Launching the app, while a bit more piggy on RAM, seemed to work better for me too. Sorry, but I am not on MSFT's compiler team, just a coder.

There are probably a few things in play here. First, you have to know that there is a hierarchy of resources on computers: resources that are fast and plentiful are at the top of the chart, and resources at the bottom of the chart are slow and limited. In books this is typically represented in a triangle-like fashion. I can't find the exact chart right now, but this is what it attempts to convey:

CPU -Very fast, lots of resources
RAM
Hard Disk
Network -Very slow, limited resources

I don't think it is incorrect to say that hard disks are always slower than the CPU. You also have to take into account that your program appears to be doing no buffering. Hard disk reads are really slow, so programmers designed buffering so that they could read whole gobs of data in one go instead of doing slow single-line reads from files. The BufferedStream class reads n bytes into memory and lets you use it just like another stream, so you could use your ReadLine calls with it and still have similar code. After you exhaust your n bytes it will grab n more, or something along those lines. If you wrapped your file stream in a buffered stream you could probably speed things up, or you could read the file all in one go with a method in the C#/.NET API. Essentially, without buffering you have to ask the hard disk to seek to the file each time you want to access it, so essentially every time you call ReadLine. Way back when, they would call ReadLine and the like expensive OS calls. So every time one of your threads asks for a line, you are essentially telling your hard disk to seek to that location, which constitutes a lot of overhead.
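
A minimal sketch of wrapping the file stream in a BufferedStream, as described above, could look like this. The 1 MB buffer size is an arbitrary assumption and `BufferedRead` is a made-up name; FileStream and StreamReader also keep small buffers of their own, so the gain may be modest and is worth measuring.

```csharp
using System;
using System.IO;

static class BufferedRead
{
    // Counts the lines in a file while reading through a BufferedStream with a 1 MB buffer.
    public static int CountLines(string path)
    {
        int count = 0;
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream buffered = new BufferedStream(fs, 1 << 20))
        using (StreamReader reader = new StreamReader(buffered))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                count++; // the per-line work would go here
            }
        }
        return count;
    }

    static void Main()
    {
        // Write a small temporary file so the sketch can run on its own.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "first", "second", "third" });
        Console.WriteLine(CountLines(path));
        File.Delete(path);
    }
}
```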

Another thing that you have to understand about threading is that threads are very expensive resources. That is one of the reasons people sometimes use thread pools for their processing: a pool allows threads to be reused. There is high overhead in thread creation. I might be wrong in this next statement, but if you think about it, you seem to be essentially creating a new "runtime", so it would make sense that threads are expensive (don't step all over me, I'm sure there is a more accurate way of describing this). I made an object-oriented brute force program in C#, which was very interesting and used all that fun crypto API stuff and threading. I also noticed that the multithreaded version would sometimes execute slower than the single-threaded version, and for very short passwords it would always execute slower. Back to the drawing board, essentially.

Know that threading is not always slower; lots of times it's faster. You just have to balance the cost of thread creation against the amount of work done on each thread.

One final observation I have is that you do not necessarily know which processor a thread is running on. There is no way to know really, it is an abstraction, and if the OS you are running on decides that it shouldn't bother firing up additional processors to handle its work load, it is going to dump them all on one processor (of course all the processors are running anyway, but I think you get my point). For quick tasks this would be acceptable, even though your threads would essentially be running serially. The more work you assign to threads, or the longer they run, the more likely I think the OS is to reshuffle them to other processors, but this is just speculation. You don't know which processor the thread will run on; the OS just makes a good guess on where to put it, not always the best guess.
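
As an illustration of the thread pool idea, here is a minimal sketch that queues many small jobs onto the .NET thread pool instead of creating one thread per job, and waits for them with a CountdownEvent. The job itself (adding a number to a total) is a stand-in for real work, and `PoolDemo` and `SumOnPool` are hypothetical names.

```csharp
using System;
using System.Threading;

static class PoolDemo
{
    // Queues `jobs` small work items on the shared thread pool and waits for all of them.
    // Each job adds its own index to a running total.
    public static long SumOnPool(int jobs)
    {
        long total = 0;
        using (CountdownEvent done = new CountdownEvent(jobs))
        {
            for (int j = 0; j < jobs; j++)
            {
                int job = j; // copy the loop variable for the closure
                ThreadPool.QueueUserWorkItem(state =>
                {
                    Interlocked.Add(ref total, job); // thread-safe addition
                    done.Signal();                   // mark this job as finished
                });
            }
            done.Wait(); // block until every job has signalled
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumOnPool(100)); // sum of 0..99
    }
}
```

The pool reuses a limited set of threads for all the jobs, so the cost of creating a thread is not paid once per job.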

Edited 1 Year Ago by overwraith: more content, better info

I'm a bit late here, but the disk has to be the issue.

If you copy a file, e.g. a video file, from one drive to another manually with Explorer, it will go at a certain rate. If you then copy 5 more at the same time, the speed of each will drop. The bottleneck is the disk.

>> Hard disk reads are really slow, so programmers designed buffering so that they could read whole gobs of data all in one go

>> it shouldn't bother firing up additional processors to handle its work load, it is going to dump them all on one processor

Thanks for the great answers. They really helped me understand this better.
Following the arguments above, I have read the data from the file into memory (a "string"). That should take up approx. 15 MB.
Then I use StringReader to read it line by line. So I believe this will go much faster than "telling the hard disk to seek to that location".
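
A minimal sketch of that approach, assuming the whole file fits comfortably in memory: the text is loaded once with File.ReadAllText and then walked line by line with a StringReader. `InMemoryLines` and `CountLines` are made-up names, and the temporary file stands in for the real data file.

```csharp
using System;
using System.IO;

static class InMemoryLines
{
    // Walks a string line by line without touching the disk.
    public static int CountLines(string text)
    {
        int count = 0;
        using (StringReader reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                count++; // the per-line work would go here
            }
        }
        return count;
    }

    static void Main()
    {
        // Write a small temporary file so the sketch can run on its own.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "first", "second", "third" });

        string text = File.ReadAllText(path); // one read of the whole file
        Console.WriteLine(CountLines(text));
        File.Delete(path);
    }
}
```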

It also seems that the application may dump all threads onto one processor, which makes the threads go slower.
So in combination with the StringReader, I will create a separate application that only executes this thread. My test showed that if I launch my app 10 times, the work goes 10 times faster, so it seems that the OS is much better at assigning apps to different processors.
I hope to be able to launch about 20-25 apps with this thread to make it more than 20 times faster, as they will work in parallel.

However, you also don't know whether the applications will all execute on different processors. It is all up to the OS's discretion.

That is interesting. At the moment I have Windows Enterprise 64 bit. I chose that version of Windows as it can handle two CPUs.
I have 2 CPUs with 16 cores each, for a total of 32 cores.

I have 4 CPU sockets on the motherboard, so I will add 2 more CPUs with 16 cores each and have a total of 64 cores in the end.
The reason I built this computer is this very thread, which I need to execute in parallel to make it faster.

TO THE QUESTION:
The computer I have is a server, and Windows has server versions, for example "Windows Server 2012 R2".
I will need to upgrade to that version later, as it can handle 4 CPUs while Windows Enterprise can only "see" 2 CPUs.
I wonder if Windows Server 2012 R2 is better at executing work on different processors?

Edited 1 Year Ago by Darth Vader

Sorry but I didn't dive into how C# makes and runs code in depth.

In short, it compiles to byte code (remember Java?) and then the CLR runs that byte code. Your thread runs inside that CLR, so to escape the CLR we launch another .EXE, and there we are at full speed, tapping all cores.

This would not be OS dependent but due to the nature of the CLR and C#, and to the code not being native x86/x64 code after compiling.

Hopefully this will be enough for you to figure out that threads in C# have nothing to do with threads on the OS.

About CPUs and Windows Versions, there can be limits which are well published but I get the feeling you never figured out how C# works under the hood. Hope this helps.

I googled "Windows Enterprise 64 bit." I didn't find what that was except it would cost me a suitcase of green.
