A Favorite Data Structure

January 17th, 2009

One of my favorite data structures is the Ullman set. (This name isn’t standard. As far as I know, the idea originated in a problem in Aho, Hopcroft and Ullman’s Data Structures and Algorithms.) Like bit sets, Ullman sets represent sets over a dense range of integers. Let’s fix the range from 0 to N-1 and call N the capacity of the set. Let n denote the cardinality of the set. Like bit sets, Ullman sets use O(N) storage and support constant time add, delete and membership test. However, unlike bit sets, Ullman sets support construction, clear (delete all elements) and cardinality in constant time and iteration over the members of the set in O(n). How does this magical data structure work?

Prototyping in C, the Ullman set

struct UllmanSet
{
  unsigned n;
  unsigned *members;
  unsigned *position;
};

has three fields: n is the cardinality of the set, members contains the members of the set, stored in positions 0 to n-1, and finally, position maps an element to its position in members. The Ullman set maintains the following invariant: for all 0 < = i < n,

  position[members[i]] = i.

In particular, if n == 0, then contents of the vectors are unconstrained. When the Ullman set is constructed, cardinality n is set to zero and the vectors are allocated but left uninitialized. This makes construction a constant time operation.

To check if m is a member of the set, we just need to verify that position[m] is in the correct range (that is, in [0, n)) and m is the actual value stored at that position[m]. Here is the code for membership test:

int contains (UllmanSet *s, unsigned m)
{
  return s->position[m] < s->n
   && s->members[s->position[m]] == m
}

From here, implementing the rest of the operations should be straightforward. (This would be a good point to stop reading and implement the rest of the Ullman set operations yourself.) For the lazy:

void add (UllmanSet *s, unsigned m)
{
  if (contains (s, m))
    return;
  s->position[m] = s->n;
  s->member[s->n] = m;
  s->n ++;
}
void del (UllmanSet *s, unsigned m)
{
  unsigned p = position[m];
  /* nothing to do if m is not in s */
  if (p >= s->n || s->members[p] != m)
    return;
  /* delete the last element */
  s->n --;
  /* if that wasn't m, move that element (aka x) into m's position */
  if (p != s->n)
    {
      unsigned x = s->member[s->n];
      s->position[x] = p;
      s->member[p] = x;
    }
}
void clear (UllmanSet *s)
{
  s->n = 0;
}
unsigned cardinality (UllmanSet *s)
{
  return s->n;
}
/* return the nth element of `s' */
unsigned nth (UllmanSet *s, unsigned p)
{
  assert (p < s->n);
  return s->member[p];
}

The idiom to iterate through a set is:

  UllmanSet *s;
  int i;
  ...
  for (i = 0; i < cardinality (s); i ++)
    ... nth (s, i) ...

This is all described (in more detail with performance numbers and applications) in this paper by Preston Briggs and Linda Torczon. Not surprisingly, they were working on compilers and used Ullman sets in the algorithm to construct the interference graph for a graph coloring register allocator.

By adding a function (pop) that removes and returns the most recently added element, the Ullman set becomes what I call a “set stack.” A set stack is a stack such that each element on the stack appears at most once. A set stack can be useful, for example, to maintain a work list in an iterative algorithm like dataflow analysis. Here is pop:

unsigned pop (UllmanSet *s)
{
  assert (s->n > 0);
  return s->members[-- s->n];
}

The same idea can be used to build a map from a dense integer range, which I call an Ullman map:

struct UllmanMap
{
  unsigned n;
  unsigned *position;
  unsigned *keys;
  T *values;
};

UllmanMap *create (unsigned N);
void contains (UllmanMap *m, unsigned k);
void add (UllmanMap *m, unsigned k, T v);
void del (UllmanMap *m, unsigned k);
T lookup (UllmanMap *m, unsigned k);
void clear (UllmanSet *m);
unsigned nth_key (UllmanSet *m, unsigned p);
T nth_value (UllmanSet *m, unsigned p);
unsigned cardinality (UllmanMap *m);

The Ullman map maintains the analogous invariant: for 0 < = i < n,

  position[keys[i]] = i.

Like the Ullman set, the keys in the map are stored in keys[i] for 0 < = i < n. The value associated to the key stored at keys[i] is stored at values[i].

Not only does the Ullman set (and map) support clear in constant time, but it supports setting the cardinality in constant time. When used in this way, I think of the Ullman set as supporting save and restore of the current state. cardinality returns the current state, and

void restore (UllmanSet *s, unsigned old_n)
{
  assert (old_n < = s->n);
  s->n = old_n;
}

sets the Ullman set back to a saved state. This is useful in algorithms that do backtracking. For example, I used UllmanMaps in a pattern matching engine that supported backtracking and variable unification. The variable bindings were stored in an UllmanMap and the backtracking operator simply saved and, in the case of failure, restored the binding map cardinalities.

So, what’s one of your favorite data structures?

Better banking

July 12th, 2007

Schooling

April 30th, 2007

Health

April 25th, 2007

Latex

April 24th, 2007

Will program for math lessons

April 23rd, 2007

Lucky You

April 3rd, 2007

Grobner bases IV: Grobner bases and Buchberger’s algorithm

March 8th, 2007

Don’t need no stinkin’ Y

February 28th, 2007

Proof F[x] is a PID

February 26th, 2007