22 May 2003 Thatcher Ulrich http://tulrich.com I've gradually come around to the belief that reference-counting is the best way to do garbage collection. I spend a little time thinking about GC in game engines, and also in Lua, and have done some work and experimenting in the past. I used to read all the GC papers and conclude that incremental generational mark-sweep was the holy grail, and that ref-counting was lame due to the counter overhead and the problem of collecting cycles. But incremental generational mark-sweep is a tremendously complicated thing, rarely seen in captivity, that needs all kinds of cooperation from your run-time system. Meanwhile ref-counting has three gigantic advantages going for it: * implementation is really simple and reliable * it's deterministic * aside from cycles, garbage gets collected very early Because of the second two properties, you can easily use it for precious resources like file handles or whatever and not be totally at the mercy of the mysterious workings of some background GC. Plus there are the existence proofs: Perl and Python both use ref-counting, and those are both very practical and successful systems. Not to mention the last two game engines I've worked on, where I have to admit, despite my initial skepticism, C++ smart-pointer ref-counting has worked quite well. That leaves the drawbacks. I've decided the counter overhead doesn't really matter; most sources who cite counter overhead as a problem are probably thinking about GC'ing Lisp cons-cells, which are tiny little structures, where the overhead is obviously unacceptable. But, I believe that's irrelevant, because cons-cells are not a good basic data structure at all; structs, resizable arrays and hash-tables are what you want. They're more efficient and easier to use for almost all purposes. An extra int for a ref-count in your larger heap objects is not bad, especially when you compare against the typical incremental mark-sweep, which needs two pointers per heap object to make a doubly-linked list. The cycles problem is slightly stickier. Pure ref-counting, with no attempt to collect cycles, is very appealing because it's so simple and elegant and deterministic. And actually many systems get away with pure ref-counting. But it does mean the programmer has to be a little bit aware, and sometimes be careful to NULL out pointers for the benefit of GC, or do other hacks. Then I found this quote from Ken Thompson (http://www.computer.org/computer/thompson.htm), and it crystalized my little rant about GC: ---- Thompson, talking about the Limbo language in Inferno: In addition, the language implementation -- and again I don't want to take any credit -- doesn't have big mark-and-sweep type garbage collection. It has reference counting: If you open something and then return, it's gone by reference count. Thus, you don't have high and low watermarks because 99 percent of the garbage goes away as soon as it is dereferenced. If you store a null in a pointer, it chases the pointer and all that stuff goes. If you make cycles, there is a background distributed mark-and-sweep algorithm that just colors the next step a little bit at a time. It doesn't have to be very aggressive because there is not much garbage around in most applications: People really don't leave dangling loop structures that require this kind of algorithm. So you can devote just one percent of your time in the background looking for garbage without these monster mark-and-sweep things. So, again, it's pragmatic. It's not the theoretical top-of-the-line garbage collection paper. It's just a way of doing it that seems to be very, very effective. ---- end Thompson quote More reading ------------ This is the start of a thread on the "gclist", a garbage-collection mailing list: http://lists.tunes.org/archives/gclist/2002-August/002341.html I also discovered this amusing and typically opinionated posting by Linus Torvalds on the gcc mailing list: http://gcc.gnu.org/ml/gcc/2002-08/msg00544.html