Can anyone please explain why, according to this article, an x86 processor can't simply read 4 bytes from an odd address and instead has to make 2 reads and then shift bytes to assemble the correct value? I don't see why a processor can't read from an odd address just fine!?

That's just how processors are built. When one talks about the "natural word size" of an architecture, it relates to the size of the registers and the width of the data path to memory. CPUs are expected to spend the greatest part of their time doing operations on numbers of the natural word size, so it is normal to expect that they will be optimized in some way to read, write, and operate on those data elements faster than on any other "more unusual" data elements. This is why it is often recommended to use the natural word size (int) even if a char or short would be enough to store your largest number: unless you are storing a very large number of them, the smaller types can cost you a performance penalty in memory alignment, conversion, and addressing.

Think of memory alignment as if you had a big warehouse full of boxes that are all of the same "standard" size. If you didn't store things in boxes, it would be hard to find anything in the mess of all things. If you have boxes, you can assign an address to each of them and put them in order in the warehouse, so it is easy to find. Say you want to store things that are each smaller than the standard box size (say you have a bunch of jelly beans, a bunch of raisins and a bunch of M&Ms). Without memory alignment (i.e. a very stupid compiler), this would mean that you put all the jelly beans in box 1, then fill the remaining space with half of the raisins, and the other half goes in box 2, then the M&Ms fill the remainder of box 2. This is compact storage, but if someone asks you to get all the raisins out (or the jelly beans, or the M&Ms) you will have to pick the two boxes and sort them out again. With memory alignment, you just put each thing in its own box. That means, you will need 3 boxes, and will have some waste of space, but fetching any one of the things is much easier.

The moral of the story is that you do need to pick a box size, and that box size will never fit perfectly to what you want to store (and if boxes are too small (1 byte), it will take more time to sort them out and find the correct one, if boxes are too big, you end up wasting a lot of space), but you are almost always better off aligning the stored elements with the natural word size, even if it wastes some memory. In the case of CPUs, the box size is picked as the most useful size (which is the size used for pointers and integers, and floating-point values are usually an integer multiple of that size as well).

If you really want to know why it is more efficient or required to be that way, then study computer engineering.

Great information :) I really liked how you put your own simple twist on it :] I still wish I knew why odd addresses cause unaligned memory accesses :( I do understand what you mean about "working with the natural word size of the processor".

The "natural word size" is the 32-bit in "32-bit architectures" or the 64-bit in "64-bit architectures". It is just the size of the normal integral types which are used all over the place in a program (e.g. pointers and integers).

>>I still wish I knew why odd addresses cause unaligned memory accesses

It's the other way around. Misaligned data causes "odd" addresses. Normally, compilers will align the data to avoid "odd" addresses (and by "odd", I mean an address that is not an integral multiple of the size of the thing being read, typically the natural word size of the architecture, not the regular meaning of "odd numbers"). For example, consider this class:

class Foo {
  char c;
  short s;
  int i;
};

On a 32-bit architecture, the int type will normally be 4 bytes, the short type 2 bytes, and the char type 1 byte. So, one could expect that the above class would take 7 bytes of memory. However, the compiler is allowed to take the architecture of the computer into account and place each member at a multiple of its own size: c goes at offset 0, one unused byte follows so that s can start at offset 2, and i starts at offset 4. With the members in this order, that typically leads to a total size of 8 bytes (1 of which is not used). If the members were declared in the order char, int, short instead, the same rules would typically give 12 bytes (5 of which are not used). This avoids any "odd" addressing, and will generally be more efficient. There are a few ways to force a compiler not to do this alignment (packing pragmas or attributes), but these are compiler-specific and should not be used unless you have a really good reason to do so. And some compilers won't do alignment by default, depending on target architecture specifications.
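To see what your own compiler actually does with that class, here is a minimal sketch. It declares the same members as a struct (so they are public and offsetof can inspect them), and the helper names padding_bytes and members_are_aligned are mine, purely for illustration. The exact sizes and offsets are implementation-defined, so the code computes them rather than assuming them.

```cpp
#include <cstddef>   // offsetof, std::size_t

// Same members as the Foo class in the post, as a struct so they are public.
struct Foo {
    char  c;
    short s;
    int   i;
};

// Number of bytes the compiler added as padding:
// total size minus the sum of the member sizes.
constexpr std::size_t padding_bytes() {
    return sizeof(Foo) - (sizeof(char) + sizeof(short) + sizeof(int));
}

// True if every member starts at a multiple of its own alignment
// requirement, and the whole struct is a multiple of its alignment
// (so that elements of a Foo array stay aligned too).
constexpr bool members_are_aligned() {
    return offsetof(Foo, s) % alignof(short) == 0
        && offsetof(Foo, i) % alignof(int) == 0
        && sizeof(Foo) % alignof(Foo) == 0;
}
```

Printing sizeof(Foo), offsetof(Foo, s), offsetof(Foo, i) and padding_bytes() from main is a quick way to check the layout on your own machine.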

Taking up my warehouse-with-boxes analogy, you have to realize that memory close to a CPU is not just like a warehouse from which boxes have to be fetched, it is like a highly sophisticated, automated manufacturing plant where the number of boxes that go through the assembly line per second has to be maximized. The choice of the size of the boxes and how they are filled (aligned or not) becomes crucial for optimal performance. You have to remember that a computer chip is an electronic device with many data channels and control systems in place (e.g. if you need one channel per bit (parallel bus), then having 64 channels instead of 32 gives you a much higher data rate, but it also means that you always have to read 64 bits at a time).

It just seems weird that if you truly want to read 4 bytes from memory address 0x00000007, the processor must read 0x00000004-0x00000007, then make another read of 0x00000008-0x0000000B, then shift the unneeded bytes off! I was wondering why you can't just say "hey man, let's read 4 bytes starting at address 0x00000007" ;) See what I mean :P?

Because misalignment isn't a regularly occurring event, computers are optimized for the aligned case, that is, for data stored at addresses that are a multiple of the word size.

I think the problem is that you are looking at this problem with an oversimplified model of what a computer does and how memory is fetched. I cannot claim to be an expert, but I can try to explain my understanding of it in general terms.

What you are saying is perfectly logical when you think of memory fetching as "send a guy to the warehouse to fetch a collection of bits for you". In that very simple model, it makes sense that alignment should not matter. But, this simple model is flawed and impractical in so many ways.

Obviously, for speed, it makes no sense to "send a guy to fetch", then wait for the data to come back, and send him again on another fetch or deposit "mission". What you want, for speed, is a continuous flow of data in and out of the warehouse. So, for that, you would try to look ahead in time to see what data items you might want in the near future and start asking for them right away (just like a restaurant doesn't wait until it runs out of food to order more from its suppliers). To look ahead at future instructions, the CPU uses an "instruction pipeline" (so it is not one instruction executing at a time, it is a whole series of instructions continuously flowing through that pipeline, where one instruction is executed, but many before and after that one are being decoded, inspected, fetched, and finalized). And to retrieve memory beforehand, the CPU uses cache memory, where chunks of data that are relevant for the near future are temporarily stored physically closer to the CPU. But, still, this does not require much care for alignment.

The second flaw in your very simple model is about the means of transport of information. If you ask the guy to fetch X amount of bits, how is he going to transport them? Just like in a normal warehouse setting, it is more efficient to have some dedicated piece of machinery to transport the goods, like, for example, a fork-lift or a conveyor belt. In any case, they most often require a fixed-size container or bundle (or pallet) for the goods. In a computer architecture, these "conveyor belts" are called buses (as in: front-side bus, PCI bus, system bus, etc.), and they, too, are designed to transport, very efficiently, some regularly sized chunks of data (called "words"). Some buses are serial, sending one bit at a time, in which case the words don't need to have a fixed size (although it is simpler and more efficient if they do), but these are used more in peripherals (USB, RS232, I2C, etc.). Most buses at the very core of the system are parallel, which means one wire for each bit, and enough wires to send one complete word at a time. Obviously, this requires a fixed size for the words, and it is also much faster than a serial bus. But still, that does not imply that alignment is necessary, only that you always have to read a fixed amount of data at a time (i.e. to fetch 1 byte, you still need to fetch 4 through the bus and discard the other 3). Obviously, the larger the word size, the larger the throughput, in theory. In practice, if you are always fetching much more memory than you need, you don't gain much. In fact, you lose performance in post-processing and generally waste memory with a lot of padding.
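To make the "fetch 4, discard 3" point concrete, here is a small toy model, not how any real memory controller is programmed. WordMemory, fetch_word and read_byte are names I made up for the illustration, and I'm assuming 32-bit words with little-endian byte numbering (byte 0 is the least significant byte of the word).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model: a memory that can only hand out whole,
// aligned 32-bit words, like a 32-wire parallel bus.
struct WordMemory {
    const std::uint32_t* words;
    std::size_t word_count;

    // The only operation the "bus" supports: one full word per request.
    std::uint32_t fetch_word(std::size_t word_index) const {
        assert(word_index < word_count);
        return words[word_index];
    }
};

// Read the single byte at byte address addr.
// The whole containing word crosses the bus; three of its four bytes
// are then thrown away on the CPU side.
inline std::uint8_t read_byte(const WordMemory& mem, std::size_t addr) {
    std::uint32_t word = mem.fetch_word(addr / 4);          // which box
    unsigned shift = static_cast<unsigned>(addr % 4) * 8;   // which byte in the box
    return static_cast<std::uint8_t>((word >> shift) & 0xFFu);
}
```

For example, with the two words 0x44332211 and 0x88776655 in memory, read_byte at address 5 fetches the second word and keeps only 0x66.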

The last major flaw in your simple model is that you assume that pre-processing can be done at the warehouse, which is not the case. Memory addressing is very "dumb", because it has to be very fast. There is no point in having highly optimized CPUs with 3 levels of caching, instruction pipelining, branch prediction, and very fast system buses, if the guy at the warehouse has to do a bunch of fiddling around with the data there before putting it on the bus. It is easier to store all the memory in chunks of sizes that already match the bus on which they will go (word size) and not have to "construct" words at the warehouse. Just like the box/warehouse analogy, we know already that boxes have to be shipped ahead of time (for caching) and that they need to be expedited in a standard box-size (words), then it only makes sense that all the data already be packed in boxes at the warehouse such that all the guy stationed there has to do is ship the right boxes when requested. If you require the guy to either pack new boxes for every request (leaving the goods in the warehouse all unpacked) or to unpack two boxes to pack a new one, it will slow down the entire process and be the bottleneck of the entire system. Just putting the requested boxes on the conveyor belt is much faster, even if it means shipping more boxes than needed.

The point is, computer engineers, over time, have deduced that the most streamlined approach was to favor physically aligned memory storage, no pre-processing before putting data on the bus (just "dumb" address-based switches), and let the CPU deal with misaligned data and padding (after all, the CPU is optimized for doing logical operations like this).
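As a rough illustration of what "letting the CPU deal with misaligned data" amounts to, here is a sketch of the two-reads-and-shift procedure from the original question, modeled in software. This is only a toy: read_u32_any is a hypothetical name, the "memory" is just an array of aligned 32-bit words, and I'm assuming little-endian byte order as on x86. Real hardware does the equivalent in circuitry, not in code like this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model: the memory only delivers aligned 32-bit words.
// Returns the 4-byte little-endian value that starts at byte address addr,
// using one aligned fetch if addr is a multiple of 4 and two otherwise.
inline std::uint32_t read_u32_any(const std::uint32_t* words,
                                  std::size_t word_count,
                                  std::size_t addr) {
    std::size_t index = addr / 4;                       // word holding the first byte
    unsigned offset = static_cast<unsigned>(addr % 4);  // 0 means aligned

    assert(index < word_count);
    std::uint32_t low = words[index];                   // first aligned fetch
    if (offset == 0) {
        return low;                                     // aligned: done in one fetch
    }

    assert(index + 1 < word_count);
    std::uint32_t high = words[index + 1];              // second aligned fetch

    unsigned shift = offset * 8;                        // 8, 16 or 24 bits
    // Drop the unneeded low bytes of the first word, then fill the
    // top of the result with the leading bytes of the second word.
    return (low >> shift) | (high << (32 - shift));
}
```

With memory holding the bytes 00, 01, 02, ... in order (words 0x03020100, 0x07060504, 0x0B0A0908), reading at address 7 fetches the words covering 0x04-0x07 and 0x08-0x0B and combines them into 0x0A090807, while reading at address 4 needs only one fetch.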