I have noticed a few questions popping up from time to time on the subject of formal languages in general, and on regular languages in particular. As I'm pretty bored and enjoy the study of formal languages to a reasonable extent, I thought I might try to give a quick-and-dirty treatment of the subject here. Who knows, maybe somebody will find it useful some day.

If a moderator sees this and feels it is appropriate to move to the Tutorials section, that might not be a bad idea. Then again, that's not my call.

An Introduction to Alphabets, Strings, and Formal Languages

We'll start at the very beginning of the topic of formal languages. What is a formal language? To be prepared to answer this question, we must first introduce a few other concepts with which programmers are likely already familiar, if informally.

We will call an alphabet any finite, non-empty set of symbols. I won't attempt a rigorous definition of symbol here, but don't worry about it, it's usually not too hard to tell what is a symbol and what isn't. I'll be sure to make clear what I'm talking about.

We will call a string over an alphabet any finite sequence of symbols from that alphabet. For notational convenience we will omit braces and commas.

Example: Let the alphabet E be the set of symbols {a, b, c}. One string s over E is (a, b, b, a, c) (from now on, we will write this as abbac).

Before moving on, it will be useful to define some terminology concerning strings.

We will call the length of a string the number of symbols which it contains. We will denote the length of string s by len(s).

We will call the concatenation of two strings s and r that string t which consists of the symbols of s followed by the symbols of r. In other words, if s = (s1, s2, ..., sn) and r = (r1, r2, ..., rm) then t = (s1, s2, ..., sn, r1, r2, ..., rm). We will denote concatenation of strings either by the dot operator (s.r) or simply by juxtaposition (sr).

We will call the empty string the string of length zero (which contains no symbols). We will denote the empty string by the tilde ~.

We are now ready to define what a formal language is.

We will call a language over an alphabet E any (possibly infinite) set of strings over E.

Example: Let the alphabet E be the set of symbols {0, 1}. One language L over E is {0, 00, 000} and another language R is the set of all strings over E which contain exactly one occurrence of the symbol 1.

We will normally hold the alphabet E fixed in our discussion of any given language. Therefore we will not dwell on operations on alphabets; they are the same as operations on sets, except that empty and infinite sets are never allowed to be alphabets.

We will however discuss a few of the operations on languages, which include all of the operations on sets and then some.

We will call the union of two languages L and R over an alphabet E that language S which contains all the strings in L as well as all the strings in R. Therefore any string in the union must be in either L or R. We will denote this by L union R.

We will call the intersection of two languages L and R over an alphabet E that language S which contains only the strings which are both in L and in R. Therefore any string in the intersection must be in both L and R. We will denote this by L intersect R.

We will call the difference of two languages L and R over an alphabet E that language S which contains only those strings which are in L but not in S. Therefore any string in the difference must be in L but not in S. We will denote this by L diff R.

We will call the concatenation of two languages L and R over an alphabet E that language S which contains any string l.r (the concatenation of strings l and r) such that l is any string from L and r is any string from R. We will denote this by either L.R or juxtaposition (LR).

We will call the empty language the language containing no strings. We will denote the empty language by {}.

We will call the Kleene closure of a language L that language S which equals the following: {~} union L union L.L union L.L.L union ... In other words, it is the language containing any string which can be formed by concatenating elements of L any finite number of times. we will denote the Kleene closure of L by L*.

We will also allow the alphabet E to be considered as a language consisting of strings of length one. This allows us to define the set of all strings over an alphabet E very simply: it is the Kleene closure of E, E*.

We will call the complement of a language L over E that language S = E* - L. In words, it is the set of all strings over alphabet E which are not in L.

This concludes the introductory part of the tutorial on formal languages. By now, the reader should know very generally what symbols, alphabets, strings, and languages are, as well as a few of the most common operations on them.

I encourage other posters to add exercises or problems for the reader as I go along, although I will likely come back and do this at a later date. Please feel free to point out any mistakes I might have made, and suggestions for improvement are similarly welcome.

The next installment will deal with regular languages, regular expressions, and finite automata. It will likely be in at least three parts - regular expressions, finite automata, and the equivalence of the two and the class of languages they define.

mrnutty commented: Let me offset that for you then. +5

Recommended Answers

All 8 Replies

Upon rereading, I realized I left out a few things that will be important later. This is the first of what might be several addenda to An Introduction to Alphabets, Strings, and Formal Languages.

We will call s' a substring of s if s' occurs anywhere within s. That is, if s' is a substring of s if and only if s can be written as a concatenation of the form rs't for arbitrary strings r and t. We call s' a prefix of s if and only if s can be written as a concatenation of the form s't for arbitrary string t, and a postfix of s if and only if s can be written as a concatenation of the form rs' for arbitrary string t. Note: prefixes and postfixes are substrings.

We will say that a language L is a subset of language R if and only if every string in L is also in R.

We will say that a language L is equal to a language R if and only if L is a subset of R and R is a subset of L.

We will also allow some shorthand notations. By L^n (where L is a language and n is natural number) we will denote that language T which contains all strings which can be formed by concatenating any n strings of L.

There may be further addenda, but for now this is all I can think of.

Upon rereading, I realized I left out a few points that typically give students trouble in the early stages. This is the second of what might be several addenda to An Introduction to Alphabets, Strings, and Formal Languages.

I think it is wise to give a few properties of the operations mentioned previously for those who may be unaccustomed to dealing with sets.

First, let us discuss strings. In what follows, let s, t, r, u, v, w, x, y, and z be strings over any alphabet.
1. ~.s = ~s = s (right concatenation with empty string)
2. s.~ = s~ = s (left concatenation with empty string)
3. s.(r.t) = s(rt) = (sr)t = (s.r).t (associativity of concatenation)
4. ~ is a substring, prefix, and postfix of s
5. s is a substring, prefix, and postfix of s
6. if s is a (prefix, postfix, substring) of t, and t is a (prefix, postfix, substring) of r, then it follows that s is a (prefix, postfix, substring) of r.

Next, we will discuss languages. In what follows, let L, R, S, and T be languages over any alphabet.
1. {}* = {~}
2. {~}* = {~}
3. {}.L = {}L = {} (right concatenation with the empty language)
4. L.{} = L{} = {} (left concatenation with the empty language)
5. {~}.L = {~}L = L (right concatenation with the language consisting of the empty string)
6. L.{~} = L{~} = L (left concatenation with the language consisting of the empty string)
7. L.(R.T) = L(RT) = (LR)T = (L.R).T (associativity of concatenation)
8. We will denote the complement of a language L with respect to alphabet E by complement_E L.
9. complement_E {} = E*, complement_E E* = {}
10. if complement_E L = R then complement_E R = L (symmetry of complement)

These are just some of the most commonly confusing properties. For the full set of properties, a more thorough reference should be consulted, including set theoretic material. Remember, languages are just special kinds of sets, and strings are just special kinds of sequences, so anything that applies in the general case also applies in the specific cases.

Ughh. No thanks. I had enough discrete mathematics class at my school for my lifetime.
Anyways, looks good. Keep it up. Might read it fully, when I got the time.

Alright, as I can't think of any further addenda for the first part, I suppose it's time to get started on the second. We will start with a discussion of Regular Expressions.

Regular Expressions

We will call a regular expression over alphabet E a string over the alphabet {(, ), +, *, ., -estr-, -eset-} union E (where the symbols (, ), +, *, ., -estr-, and -eset- are not in E) which has a finite derivation using the following rules:
1. -eset- is a regular expression.
2. -estr- is a regular expression.
3. r is a regular expression where r is any symbol from E.
4. (r+s) is a regular expression where r and s are regular expressions.
5. (r*) is a regular expression where r is a regular expression.
6. (r.s) is a regular expression where r and s are regular expressions.
7. Nothing else is a regular expression.

Example 1:
Let E = {a, b, c}. The following is a regular expression:
(((-estr-+a).b)+((-eset-*).(c*)))

Example 2:
Here is the derivation of the regular expression in Example 1:
1. (((-estr-+a).b)+((-eset-*).(c*))) is a regular expression if ((-estr-+a).b) and ((-eset-*).(c*)) are by rule (4)
2. ((-estr-+a).b) is a regular expression if (-estr-+a) and b are by rule (6)
3. (-estr-+a) is a regular expression if -estr- and a are by rule (4).
4. -estr- is a regular expression by rule (2)
5. a is a regular expression by rule (3)
6. b is a regular expression by rule (3)
7. ((-eset-*).(c*)) is a regular expression if (-eset-*) and (c*) are by rule (6)
8. (-eset-*) is a regular expression if -eset- is by rule (5)
9. -eset- is a regular expression by rule (1)
10. (c*) is a regular expression if c is by rule (5)
11. c is a regular expression by rule (3)

Example 3:
Here is another derivation of the regular expression in Example 1:
1. (r+s) is a regular expression for regular expressions r and s by rule (4)
2. ((t.u)+(v.w)) is a regular expression for regular expressions t, u, v, w by two applications of rule (6)
3. (((x+x').b)+((y*).(y'*)) is a regular expression for regular expressions x, x', y, y' by rule (4) and two applications of rule (5)
4. (((-estr-+a).b)+((-eset-*).(c*))) is a regular expression by rule (1), (2), and two applications of rule (3)

Now that we know what a regular expression is, we will answer the question of why regular expressions are useful. We may define an interpretation of regular expressions which associates with each regular expression over E some language over E. We will denote the language defined by a regular expression r as L(r). The language defined by a regular expression is defined as follows:
1. L(-eset-) = {}
2. L(-estr-) = {~}
3. L(r) = {r} where r is any symbol from E.
4. L(r+s) = L(r) union L(s)
5. L(r*) = L(r)*
6. L(r.s) = L(r).L(s) = L(r)L(s)

We will usually not require that regular expressions be fully parenthesized; * has the highest precedence, followed by ., followed by +. We will also not usually require that the . symbol be written out; juxtaposition will be used instead. Both + and . are assumed to be left associative by default.

Example 4:
The regular expression (((-estr-+a).b)+((-eset-*).(c*))) can be written as (-estr-+a)b+-eset-*c* using our new conventions.

Example 5:
The regular expression a*b+c-estr-+-eset-* can be written using the strict conventions by replacing the . symbol in place of juxtaposition and by inserting parentheses as appropriate. After replacing . we arrive at a*.b+c.-estr-+-eset-*. After replacing parentheses we arrive at ((((a*).b)+(c.-estr-))+(-eset-*)).

We will call two regular expressions s and r equivalent if and only if L(s) = L(r). We will call two regular expressions equal if and only if s = r.

Complicated regular expressions can often be reduced to simpler equivalent forms by using rules derived from the properties of operations on languages. This will be discussed in greater detail in a later installment. That being said, that's all I've got for now. Others are welcome to interject with questions, comments, or exercises. I will come back at the end and try to make a lot of exercises for each section... and work a few, perhaps.

I see that my first post received a vote of -1... I would encourage readers who have a problem with any of this to let me know what the problem is. I welcome suggestions, corrections, etc. Unless I get some written feedback, I will have to assume that the guy who gave the rating is the lazy student who trashed my rep by voting down all of my posts he could find, or some professional who doesn't have an appreciation for theory but still visits the CS forum.

Again, feedback - constructive or otherwise - is welcome. I'd prefer more than a down vote with no explanation; not for my sake, but for the children's.

Or, you know, just keep coming back to the thread and giving the first post a -1 vote. Whatever.

Or, you know, just keep coming back to the thread and giving the first post a -1 vote. Whatever.

Don't sweat it. Some people just don't appreciate good writing. Usually, these are immature
little kids, that hasn't gone through puberty yet, who still thinks the word 'boobies' is funny.

'boobies' is funny.

It is a pretty funny word. Oops. I think I just overplayed my hand.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.