Here is the thing with "drawing a line". Draw a simple Cartesian plot. Let 1 = true and -1 = false. Put a dot on the graph for each possible input pair: (1,1), (1,-1), (-1,-1), (-1,1). According to XOR logic, only the points (-1,1) and (1,-1) yield "true". Can you draw a single straight line on that graph such that the two "true" points are on one side of it and the two "false" points are on the other? You can't: the "true" points sit on opposite corners of the square, and so do the "false" points.
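If you want the algebra behind why no such line exists, here is a short derivation. The symbols w_1, w_2 and b are my own notation for an arbitrary line w_1 x_1 + w_2 x_2 + b = 0, with the "true" points assumed to lie on the positive side (swapping sides just flips every inequality and gives the same contradiction).

```latex
\begin{align*}
\text{true points:}\quad
  (1,-1)  &: \;  w_1 - w_2 + b > 0 \\
  (-1,1)  &: \; -w_1 + w_2 + b > 0
  && \Rightarrow\ \text{adding the two: } 2b > 0 \\[4pt]
\text{false points:}\quad
  (1,1)   &: \;  w_1 + w_2 + b < 0 \\
  (-1,-1) &: \; -w_1 - w_2 + b < 0
  && \Rightarrow\ \text{adding the two: } 2b < 0
\end{align*}
```

Both 2b > 0 and 2b < 0 cannot hold, so no choice of w_1, w_2, b works.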

The point with perceptrons is that all they can do is compute a linear combination of the inputs (a weighted sum plus a bias) and use that to fire a 0 or 1 output (or pass it through some other activation function). Setting a linear combination of the coordinate variables equal to a constant is exactly how a line is defined (in this case, the coordinate variables are the inputs). So, literally, the only thing that a perceptron can do is draw a line and tell you on which side of it the input lies (0 or 1). This is why we say that a perceptron can only deal with linearly separable problems: that's all a perceptron does.
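To make that concrete, here is a minimal sketch of a two-input perceptron with a hard threshold. The function name `perceptron`, the hand-picked AND weights, and the small weight grid are all my own choices for illustration. The brute-force search is not a proof, it just shows that nothing on that grid reproduces XOR, while AND is separated by a single line.

```python
import itertools

def perceptron(x1, x2, w1, w2, b):
    """Return 1 if (x1, x2) lies on the positive side of the line
    w1*x1 + w2*x2 + b = 0, otherwise 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# The four input points, with 1 = true and -1 = false.
points = [(1, 1), (1, -1), (-1, -1), (-1, 1)]

# AND is linearly separable: the line x1 + x2 - 1 = 0 isolates (1, 1).
and_outputs = [perceptron(x1, x2, 1.0, 1.0, -1.0) for x1, x2 in points]

# XOR targets for the same points (true only when the inputs differ).
xor_targets = [0, 1, 0, 1]

# Try every (w1, w2, b) on a coarse grid from -2 to 2 in steps of 0.5.
grid = [i * 0.5 for i in range(-4, 5)]
xor_solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if [perceptron(x1, x2, w1, w2, b) for x1, x2 in points] == xor_targets
]

print("AND outputs:", and_outputs)
print("Lines that reproduce XOR:", len(xor_solutions))
```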

If you have a second layer in your ANN, you can get it to do two linear separations (e.g., two lines) and then combine those two outputs to figure out if the XOR output should be 1 or 0. So, with one hidden layer, you can solve the XOR problem. It's that simple.
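Here is a rough sketch of that two-line construction, with weights I picked by hand (nothing is learned here, and the names `step` and `xor_net` are just mine). Each hidden unit draws one line that cuts off one of the "true" corners, and the output unit ORs the two together.

```python
def step(z):
    """Hard threshold: fire 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden unit 1: line x1 - x2 - 1 = 0, fires only for (1, -1).
    h1 = step(x1 - x2 - 1)
    # Hidden unit 2: line -x1 + x2 - 1 = 0, fires only for (-1, 1).
    h2 = step(-x1 + x2 - 1)
    # Output unit: fires if either hidden unit fired (a logical OR).
    return step(h1 + h2 - 0.5)

# The four input points, with 1 = true and -1 = false.
points = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
xor_outputs = [xor_net(x1, x2) for x1, x2 in points]

for (x1, x2), out in zip(points, xor_outputs):
    print((x1, x2), "->", out)
```

The output unit is itself just another perceptron, only now it works on the hidden outputs (h1, h2), where the problem has become linearly separable.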

You must understand that ANNs are not magical. I used to think, when I first got interested in this, that ANNs were like "simulating a brain" and that they could do awesome things. In reality, they aren't, and they can't. Yes, they kinda work the same as the neurons in a brain, but the big difference is that the structure of the neurons in the brain is far deeper (many, many layers), with cross-layer connections, with cycles, with states and dynamics. In other words, a brain-like ANN goes way beyond our current capabilities, both in working out the math and in simulating it.

You must treat ANNs as just one of many techniques in machine learning, and often not a particularly good one. Bayesian inference methods, support-vector machines, clustering methods, locally-weighted regressions, Q-learning, genetic algorithms, etc., are among the many methods that work very well and are used extensively these days (especially in my field of robotics). You don't see people using ANNs too much.

As for an explanation of the learning methods: for ANNs, the learning is almost always gradient descent, with the gradients computed by so-called "back-propagation" (which is just the chain rule applied layer by layer). You must see it as just that, a gradient descent method; forget about the fluff and the stupid vocabulary used by ANN fanatics.
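To show how little there is to it, here is a bare-bones sketch of gradient descent on a single sigmoid unit learning AND (a linearly separable problem, so one unit is enough). The squared-error loss, the learning rate of 0.5, the 2000 iterations and all the variable names are my own assumptions for the example, not anything canonical. In a multi-layer network, "back-propagation" is the same chain-rule step repeated once per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# AND with inputs 1 = true, -1 = false, and targets 1 / 0.
data = [((1, 1), 1), ((1, -1), 0), ((-1, -1), 0), ((-1, 1), 0)]

def mean_squared_error(w1, w2, b):
    return sum((sigmoid(w1 * x1 + w2 * x2 + b) - t) ** 2
               for (x1, x2), t in data) / len(data)

w1, w2, b = 0.0, 0.0, 0.0
learning_rate = 0.5  # assumed value, picked by hand
initial_loss = mean_squared_error(w1, w2, b)

for _ in range(2000):
    grad_w1 = grad_w2 = grad_b = 0.0
    for (x1, x2), t in data:
        y = sigmoid(w1 * x1 + w2 * x2 + b)
        # Chain rule: d(loss)/dz = d(loss)/dy * dy/dz
        delta = 2.0 * (y - t) * y * (1.0 - y) / len(data)
        grad_w1 += delta * x1
        grad_w2 += delta * x2
        grad_b += delta
    # Step downhill along the negative gradient.
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    b -= learning_rate * grad_b

final_loss = mean_squared_error(w1, w2, b)
predictions = [1 if sigmoid(w1 * x1 + w2 * x2 + b) > 0.5 else 0
               for (x1, x2), _ in data]

print("loss before:", initial_loss, "loss after:", final_loss)
print("predictions:", predictions)
```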

Get a machine learning textbook and go through it in order. By the time you get to ANNs, you'll see what I mean.