I have noticed a few questions popping up from time to time on the subject of formal languages in general, and on regular languages in particular. As I'm pretty bored and enjoy the study of formal languages to a reasonable extent, I thought I might try to give a quick-and-dirty treatment of the subject here. Who knows, maybe somebody will find it useful some day.
If a moderator sees this and feels it is appropriate to move to the Tutorials section, that might not be a bad idea. Then again, that's not my call.
An Introduction to Alphabets, Strings, and Formal Languages
We'll start at the very beginning of the topic of formal languages. What is a formal language? To be prepared to answer this question, we must first introduce a few other concepts with which programmers are likely already familiar, if informally.
We will call an alphabet any finite, non-empty set of symbols. I won't attempt a rigorous definition of symbol here, but don't worry about it, it's usually not too hard to tell what is a symbol and what isn't. I'll be sure to make clear what I'm talking about.
We will call a string over an alphabet any finite sequence of symbols from that alphabet. For notational convenience we will omit braces and commas.
Example: Let the alphabet E be the set of symbols {a, b, c}. One string s over E is (a, b, b, a, c) (from now on, we will write this as abbac).
Before moving on, it will be useful to define some terminology concerning strings.
We will call the length of a string the number of symbols which it contains. We will denote the length of string s by len(s).
We will call the concatenation of two strings s and r that string t which consists of the symbols of s followed by the symbols of r. In other words, if s = (s1, s2, ..., sn) and r = (r1, r2, ..., rm) then t = (s1, s2, ..., sn, r1, r2, ..., rm). We will denote concatenation of strings either by the dot operator (s.r) or simply by juxtaposition (sr).
We will call the empty string the string of length zero (which contains no symbols). We will denote the empty string by the tilde ~.
We are now ready to define what a formal language is.
We will call a language over an alphabet E any (possibly infinite) set of strings over E.
Example: Let the alphabet E be the set of symbols {0, 1}. One language L over E is {0, 00, 000} and another language R is the set of all strings over E which contain exactly one occurrence of the symbol 1.
We will normally hold the alphabet E fixed in our discussion of any given language. Therefore we will not dwell on operations on alphabets; they are the same as operations on sets, except that empty and infinite sets are never allowed to be alphabets.
We will however discuss a few of the operations on languages, which include all of the operations on sets and then some.
We will call the union of two languages L and R over an alphabet E that language S which contains all the strings in L as well as all the strings in R. Therefore any string in the union must be in either L or R. We will denote this by L union R.
We will call the intersection of two languages L and R over an alphabet E that language S which contains only the strings which are both in L and in R. Therefore any string in the intersection must be in both L and R. We will denote this by L intersect R.
We will call the difference of two languages L and R over an alphabet E that language S which contains only those strings which are in L but not in S. Therefore any string in the difference must be in L but not in S. We will denote this by L diff R.
We will call the concatenation of two languages L and R over an alphabet E that language S which contains any string l.r (the concatenation of strings l and r) such that l is any string from L and r is any string from R. We will denote this by either L.R or juxtaposition (LR).
We will call the empty language the language containing no strings. We will denote the empty language by {}.
We will call the Kleene closure of a language L that language S which equals the following: {~} union L union L.L union L.L.L union ... In other words, it is the language containing any string which can be formed by concatenating elements of L any finite number of times. we will denote the Kleene closure of L by L*.
We will also allow the alphabet E to be considered as a language consisting of strings of length one. This allows us to define the set of all strings over an alphabet E very simply: it is the Kleene closure of E, E*.
We will call the complement of a language L over E that language S = E* - L. In words, it is the set of all strings over alphabet E which are not in L.
This concludes the introductory part of the tutorial on formal languages. By now, the reader should know very generally what symbols, alphabets, strings, and languages are, as well as a few of the most common operations on them.
I encourage other posters to add exercises or problems for the reader as I go along, although I will likely come back and do this at a later date. Please feel free to point out any mistakes I might have made, and suggestions for improvement are similarly welcome.
The next installment will deal with regular languages, regular expressions, and finite automata. It will likely be in at least three parts - regular expressions, finite automata, and the equivalence of the two and the class of languages they define.