I want to create a word processor. It seems like a useful project because I have strong feelings about how modern big word processors like MS Word and Open Office don't do things right, so my solution is to make one with more modest features but with the huge advantage of working the way I want it to work. It seems like a fun and rewarding project, but it is also challenging and I am doing it single-handedly, so I look to the internet for help and ideas.

A project like this is too big and complicated to just leap into. I need to break it into small, well-designed pieces that I can work on independently and build up to a working word processor. This seems obvious, and it is a basic principle of good design, but it is troublesome because a word processor seems so monolithic by nature. I can't see the natural dividing lines where I should break up my project, and every time I attempt to break it into pieces, the pieces end up confused and tightly coupled. I think I may need advice from someone wiser than myself in the ways of object-oriented design, since I'm sure that object-oriented programming is the way to go here.

Best of all would be to look at the design of an existing word processor, perhaps something open-source, but it's really not easy to get something good. I don't want something monolithic that I can't understand, and I don't want a mere text-editor that's so simple that it could never be expanded to work on proper documents. I intend to use a piece table data structure for the text, because that is naturally superior for large documents, but examples I find on the web seem to like using gap buffers or other inferior structures. A gap buffer may work, but it involves copying text around needlessly.

The principles of a piece table are simple enough that there's no reason not to use it, in my opinion, but using it tends to flavor everything else in the project. A piece table tends to be represented using a linked list, and so it is unsuited to using character index numbers for access. When you are using a gap buffer, you can access the 521st character in the document (for example) in constant time, but when you have a piece table you would need to search from the beginning of the document to find which character is the 521st. Fortunately there is nothing in the nature of a word processor that says being able to look up the 521st character should be fast. A word processor only needs to know where the currently visible part of the document starts and iterate through the document from that point until the screen is filled with text, which is more suited to a linked list than an array. Unfortunately, the best examples that I've found on the internet so far throw character index numbers all over the place, making it very difficult to use any of it in a piece-table design. Most frustrating of all are the piece tables that conform to a character-index interface, and thereby practically destroy the advantages of the piece table.
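To make the idea concrete, here is a rough C++ sketch of the piece table as I understand it (just an illustration I put together, not code from any existing editor; the names and the choice of std::list are mine):

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Illustrative piece-table sketch. The document text lives in two buffers:
// the read-only original file contents and an append-only buffer holding
// everything typed during this session. A "piece" is a (buffer, offset,
// length) slice, and the document is the ordered list of pieces. Editing
// never moves existing text; it only splits pieces and splices new ones in.
struct PieceTable {
    enum class Source { Original, Added };

    struct Piece {
        Source      source;  // which buffer this slice refers to
        std::size_t offset;  // where the slice starts in that buffer
        std::size_t length;  // how many characters it covers
    };

    std::string original;     // loaded from disk, never modified
    std::string added;        // grows as the user types
    std::list<Piece> pieces;  // pieces in document order

    // Insert typed text at a position given as (piece iterator, offset
    // within that piece). Only the piece list is restructured.
    void insert(std::list<Piece>::iterator at, std::size_t offsetInPiece,
                const std::string& text) {
        Piece newPiece{Source::Added, added.size(), text.size()};
        added += text;
        if (at != pieces.end() && offsetInPiece > 0) {
            // Split the existing piece around the insertion point.
            Piece tail{at->source, at->offset + offsetInPiece,
                       at->length - offsetInPiece};
            at->length = offsetInPiece;
            at = pieces.insert(std::next(at), tail);  // 'at' now points at the tail
        }
        pieces.insert(at, newPiece);  // new text goes in front of the tail
    }
};
```

The property I care about is that inserting text never copies existing characters around; it only appends to the add buffer and splices or splits entries in the piece list.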

Even if it makes the project more difficult, I am committed to making something of quality, something that uses a piece table, even if that means I can't take advantage of existing libraries that depend on character index numbers. I have reluctantly discarded more than one such library.

With the right design, any project, no matter how huge, becomes simple. I'd just like to find something that clearly shows how a word processor is organized. I imagine it would take the form of a graph that shows data flowing along arrows through nodes that manipulate the data, like you would have with a compiler or a graphics renderer. Or perhaps it should be represented as layers, like an operating system. Has anyone ever seen a picture of a word processor broken down into pieces like that?

I don't know much specifically about word processors, but I would certainly expect all of them to be using some form of the MVC pattern (Model-View-Controller). I would also expect that all word processors use some form of a markup language to represent the actual text along with all the meta-data (formatting, images, labels, cross-references, etc.); that's the "Model". This marked-up text would certainly be stored internally to the word processor in the form of a graph, not just a simple buffer of text characters. Creating a complete specification for such a markup language, even for simple documents, is quite an endeavour. I would recommend you pick one that already exists and base your graph structure on that. Then, the "Controller" part would be merely about creating functions to do graph surgery, which has its challenges, but at least it is incremental and modular in nature.
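Just to illustrate what I mean by the "Model", here is a made-up C++ sketch (the type names are invented for the example; this is not how LyX or MS Word actually define their internals):

```cpp
#include <memory>
#include <string>
#include <vector>

// Illustrative sketch of a document "Model" in an MVC split.
// Real word processors define far richer node sets than this.
struct Node {
    virtual ~Node() = default;
};

// A run of plain text with uniform formatting attached.
struct TextRun : Node {
    std::string text;
    std::string fontFamily;
    bool        bold   = false;
    bool        italic = false;
};

// A structural element (paragraph, section, table cell, ...) that owns an
// ordered list of children. The whole document is then the root Element.
struct Element : Node {
    std::string tag;                              // e.g. "paragraph", "section"
    std::vector<std::unique_ptr<Node>> children;  // document order
};

// The "Controller" becomes a set of operations that edit this tree in place
// (split a paragraph at the cursor, toggle bold over a range, insert a
// figure node, ...), and the "View" walks the tree to lay it out on screen.
```

Everything the user does then becomes a small, testable operation on this tree, which is what makes the project incremental.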

I also don't like typical word processors like MS Word or LibreOffice, and for that reason I, like many others in academic fields, use LaTeX instead. LaTeX is usually written directly in LaTeX commands, as opposed to using some word processor. I would highly suggest that you base your markup language on LaTeX (or a subset of it); that's going to take care of a significant part of this project. On that front, you might want to take a look at LyX, which is a simple word processor based on LaTeX.

By the way, I don't buy your arguments about using a "piece table". I think you are buying the hype that linked lists are so much better because they lead to less copying of memory when things are inserted into them. You are on the right track when you say that, for the most part, the word processor only has to manipulate the data around where the user is currently editing. However, IMO, you took a wrong turn after that. You remark, correctly, that you don't need random access (i.e., looking up any character in the text by index), and that a linked-list structure would be all you need. But then you jump to the conclusion that it would be better to use a linked-list structure because it would save some copying. First, saving on the copying is not as trivial as it sounds when you have highly dynamic data. With a naive implementation, you will save on the data copying, but it might lead to significant overhead in memory allocations and artificially inflate your memory consumption, while, on the other hand, even a naive implementation of a segmented dynamic array would easily outperform it, even with its inevitable copying at log-N intervals (which is rather insignificant in practice, just showing up as little blips in performance results). And a more sophisticated implementation of a segmented dynamic array, such as one which uses a cache-aware or cache-oblivious segmentation strategy, would be unrivaled in performance.
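To be clear about what I mean by a segmented dynamic array, here is a deliberately naive C++ sketch (the chunk size of 64 is arbitrary and only for illustration, and the class name is mine):

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Naive segmented dynamic array: the text is held as a list of small
// contiguous chunks, each kept below a fixed capacity. Inserting into the
// middle only shifts characters within one small chunk, and a chunk that
// grows past the limit is split in two.
class SegmentedText {
    static constexpr std::size_t kMaxChunk = 64;  // tuning knob, arbitrary here
    std::list<std::string> chunks_;

public:
    SegmentedText() : chunks_(1) {}  // start with one empty chunk

    // Insert 'text' at 'offsetInChunk' inside the chunk pointed to by 'it'.
    void insert(std::list<std::string>::iterator it, std::size_t offsetInChunk,
                const std::string& text) {
        it->insert(offsetInChunk, text);
        if (it->size() > kMaxChunk) {
            // Split the oversized chunk roughly in half.
            std::size_t half = it->size() / 2;
            std::string tail = it->substr(half);
            it->resize(half);
            chunks_.insert(std::next(it), std::move(tail));
        }
    }
};
```

Insertion only shuffles characters inside one small chunk, so the copying the piece table is supposed to save you from stays bounded and cheap, while traversal still walks mostly contiguous memory.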

Second, the main reason for using a contiguous buffer (even with "gaps") is not random access, it is locality of reference, which is by far the main advantage of contiguous buffers. As you correctly remarked, most of a word processor's needs probably don't require random access, and it's just used as an available luxury. However, you do need to read, write, parse and manipulate a lot of data around the current edit point, or traversal point. These operations are usually simple and quick, but their performance can be completely destroyed by cache misses. To prevent this, all the chunks of data that will be manipulated in the near future should be physically close to each other (in RAM) such that they all get cached together, and all operations can be done without a miss. The huge problem with linked-list style structures is that the links lead to random and far-reaching jumps in the memory requested from RAM, which throws the CPU prefetcher off and causes vastly more cache misses. Linked lists are only recommended for large objects which are expensive to copy, because only then can the drawbacks of linked lists be compensated by the savings in copying. And we're not talking about a few dozen characters as "a large object"; maybe a few hundred bytes might start to qualify, and at that point you'd better start talking about a segmented dynamic array implementation, which is really a list of dynamic arrays whose sizes are kept within a constant (or recursive) bound. Such a segmented array can certainly be used to implement this "gap buffer" concept, and I wouldn't be surprised if most decent word processors do exactly that.
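And for completeness, the "gap buffer" idea itself, again as a toy C++ sketch (no growing logic, just the core moves):

```cpp
#include <cstddef>
#include <vector>

// Toy gap buffer, for illustration only. The text lives in one contiguous
// array with an unused "gap" kept at the cursor. Typing fills the gap;
// moving the cursor slides characters across the gap. Either way, the
// working set stays in one contiguous region, which is what keeps the
// cache (and the prefetcher) happy.
class GapBuffer {
    std::vector<char> buf_;
    std::size_t gapStart_ = 0;  // first unused cell
    std::size_t gapEnd_   = 0;  // one past the last unused cell

public:
    explicit GapBuffer(std::size_t capacity)
        : buf_(capacity), gapEnd_(capacity) {}

    // Slide the gap so that it sits at logical text position 'pos'.
    void moveGapTo(std::size_t pos) {
        while (gapStart_ > pos)             // move gap left
            buf_[--gapEnd_] = buf_[--gapStart_];
        while (gapStart_ < pos)             // move gap right
            buf_[gapStart_++] = buf_[gapEnd_++];
    }

    // Insert a character at the cursor. A real implementation would grow
    // the buffer when the gap closes; that is omitted to keep this short.
    void insert(char c) {
        if (gapStart_ == gapEnd_) return;   // gap full in this sketch
        buf_[gapStart_++] = c;
    }
};
```

Each segment of the segmented array above could be a small gap buffer like this, with the gap parked wherever the user is currently typing.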

You have given me a large amount to think about. You must be right that a linked list suffers from poor locality of reference compared to an array. It might be necessary to temporarily store the currently visible part of the document in an array-like structure if I do end up using a piece-table linked list.

On the other hand, there is also no doubt that I will be dealing with large objects here. The goal is a word processor, and that means it's intended for a document, not merely a few lines of text. A few hundred bytes is the least I can expect to deal with. Your post alone is better described as a few thousand bytes, and I'd call that a very small document. I would consider my word processor seriously broken if it couldn't support viewing and editing the entire content of Tolstoy's War and Peace (a few megabytes) without problems. I will look into the benefits of segmented dynamic arrays.

LyX looks amazing. I can't thank you properly for pointing it out to me. I still want to make my own, but from brief inspection LyX is far closer to how a word processor should be than any I've seen before, and best of all it is open source so I should be able to learn from how it was made. If it turns out to be piece-table based, then that would be the icing on the cake.

A few hundred bytes is the least I can expect to deal with.

By a few hundred bytes, I didn't mean that the entire document is only a few hundred bytes; I meant that the chunks that make up your "piece table" (i.e., the pieces) are a few hundred bytes each, at least. This basically comes down to how fine-grained your pieces are. At one extreme, you could make a "piece" represent each individual addition / deletion of text, which would probably average a few dozen characters each (e.g., like what would typically show up in different colors when you turn "track changes" on in MS Word). At the other extreme, you would store the entire text in one contiguous (gap) buffer. Neither of these solutions would be good when working with a large document. The optimum certainly lies somewhere in the middle, i.e., a sequence of (gap) buffer segments which are maintained at a reasonable size (merged when they shrink too small, split when they grow too big), which I would roughly estimate to be around 1000 or 2000 characters on average.
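In code, the maintenance rule I have in mind would look roughly like this (a sketch only, with std::string chunks standing in for your pieces and the thresholds picked from my rough estimate above):

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Keep every chunk between roughly kMinChunk and kMaxChunk characters:
// split an oversized chunk in half, merge two neighbours whose combined
// size is still below the minimum. Assumes list semantics so iterators
// stay valid across insert/erase.
constexpr std::size_t kMinChunk = 1000;
constexpr std::size_t kMaxChunk = 2000;

inline void rebalance(std::list<std::string>& chunks) {
    for (auto it = chunks.begin(); it != chunks.end(); ++it) {
        if (it->size() > kMaxChunk) {
            // Oversized: split roughly in half so both halves stay in range.
            std::size_t half = it->size() / 2;
            chunks.insert(std::next(it), it->substr(half));
            it->resize(half);
        } else {
            auto next = std::next(it);
            if (next != chunks.end() && it->size() + next->size() < kMinChunk) {
                // Two small neighbours: merge them and drop the second.
                *it += *next;
                chunks.erase(next);
            }
        }
    }
}
```

Run something like this after edits touch a chunk, and the pieces stay big enough for good locality but small enough that local edits remain cheap.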

If it turns out to be piece-table based, then that would be the icing on the cake.

I doubt that. The LaTeX format, which I encourage you to learn / play with, is essentially in the form of a tree. This is also similar to the way HTML works. And I think that the docx format for MS Word is also an XML-based tree structure. It is only natural that this would be loaded and manipulated as a tree structure by whatever software uses it. Of course, that doesn't say anything about the underlying storage, which is independent of the tree structure. The body of the text (or fields) would probably be stored as simple (gap) buffers, since individual paragraphs wouldn't be very big anyway. If I had to guess, I would imagine that LyX uses a tree structure where each node is a LaTeX command (or environment), and fields (like text, titles, labels, etc.) are stored with a simple (gap) buffer. At least, that's the most natural way to go, and it is pretty close to optimal in terms of dealing with large documents, since individual paragraphs or fields would rarely exceed a couple thousand characters (because that would be horrible writing style).

Storing the entire text in a (gap) buffer is probably much more common in (enhanced) text editors, like emacs, vim, kate, gedit, notepad++, etc., where there is no formatting (switching fonts, making titles, sections, figures, tables, citations, bibliographies, math equations, etc. ... all the things that LaTeX or word-processor formats do beyond raw ASCII text).
