Distortion Wizard

Arithmetic Coding With Rust

Arithmetic coding is an entropy coding technique. You can use it to compress text and binaries alike.

There are some interesting projects I've been thinking of working on, but most of them seem to depend on some form of entropy coding, so I thought, "why not implement it myself?" And I did.

The idea of entropy coding is to spend fewer bits on common symbols and more bits on rare ones, approaching the "entropy" of the entire message. Usually what's meant is a kind of average entropy, but there are other ways of thinking about it as well. This always involves probability modeling or prediction. Arithmetic coding uses these modeled probabilities to encode the symbols. It is not the only technique for this; there are others, such as Huffman coding and ANS. I may look at those later.
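To make the bound concrete, here's a small sketch that computes the Shannon entropy of a byte message; an ideal entropy coder spends about -log2(p) bits on a symbol of probability p, so the average over the message is the target to approach. The function name is mine, not from the actual code.

```rust
// Shannon entropy of a byte message, in bits per symbol.
// An ideal entropy coder spends about -log2(p) bits on a symbol
// of probability p; this is the average of that over the message.
fn entropy_bits_per_symbol(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // "aaab": p(a) = 3/4, p(b) = 1/4, so roughly 0.811 bits/symbol,
    // far below the 8 bits/symbol of the raw byte encoding.
    let h = entropy_bits_per_symbol(b"aaab");
    println!("{h:.3}");
}
```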

Arithmetic coding is interesting in that the implementation requires a little something extra. The basic principle is to encode the conditional probabilities (supplied by the model) into one long number within the range [0,1). The principle is sound, but the limited precision of floating point becomes a problem for anything but the shortest of messages. That's why practical implementations are built on integer arithmetic, which provides just enough precision.
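In the integer version, the unit interval is mapped onto a range of integers, and the quarter-point constants that show up in the snippets below fall out of that mapping. A sketch of how they might be defined, following the CACM87 convention (the `Value` type, `CODE_BITS`, and the 32-bit width are assumptions; the article itself uses 16 code bits):

```rust
// A sketch of the integer range setup, CACM87-style. TOP plays the
// role of "1.0", and the quarter constants split the range the same
// way [0,1) is split in the floating-point picture.
type Value = u64;

const CODE_BITS: u32 = 32;               // width of the coding registers (assumed)
const TOP: Value = (1 << CODE_BITS) - 1; // largest code value, i.e. "1.0"
const FIRST_QTR: Value = TOP / 4 + 1;    // 1/4 of the range
const HALF: Value = 2 * FIRST_QTR;       // 1/2 of the range
const THIRD_QTR: Value = 3 * FIRST_QTR;  // 3/4 of the range

fn main() {
    println!("{FIRST_QTR} {HALF} {THIRD_QTR} {TOP}");
}
```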

For my implementation, I followed the well-known CACM87 article on arithmetic coding, which details a similar implementation in C. It's in fact the only kind of implementation I've seen, although once you've seen it, you could probably figure out other ways of doing it as well. My implementation is otherwise the same, except I replaced the probability modeling data structures used in the article with a Fenwick tree. You can read about Fenwick trees from Mr. Fenwick himself, here. Long story short, the Fenwick tree is a simple, elegant way to store cumulative frequencies. Even though it provides logarithmic access and updates, I'm not sure it offers much of a speed advantage here: the number of distinct symbols is usually small, so most data structures you could use would fit in the cache anyway. You'd have to measure it and see. I used it simply because it's so easy to implement.
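For a flavor of how little code a Fenwick tree takes, here's a minimal sketch of one over symbol frequencies; this is the idea, not the exact structure from my crate. Both operations walk the tree by manipulating the lowest set bit of the index.

```rust
// A minimal Fenwick (binary indexed) tree over symbol frequencies.
// `add` bumps one symbol's count and `sum` returns a cumulative
// frequency; both run in O(log n).
struct Fenwick {
    tree: Vec<u64>, // 1-based internal indexing
}

impl Fenwick {
    fn new(n: usize) -> Self {
        Fenwick { tree: vec![0; n + 1] }
    }

    /// Add `delta` to the frequency of symbol `i` (0-based).
    fn add(&mut self, i: usize, delta: u64) {
        let mut i = i + 1;
        while i < self.tree.len() {
            self.tree[i] += delta;
            i += i & i.wrapping_neg(); // step to the next covering node
        }
    }

    /// Cumulative frequency of symbols 0..=i.
    fn sum(&self, i: usize) -> u64 {
        let mut i = i + 1;
        let mut s = 0;
        while i > 0 {
            s += self.tree[i];
            i -= i & i.wrapping_neg(); // strip the lowest set bit
        }
        s
    }
}

fn main() {
    let mut f = Fenwick::new(4);
    f.add(0, 2); // symbol 0 seen twice
    f.add(2, 5); // symbol 2 seen five times
    println!("{} {}", f.sum(1), f.sum(3));
}
```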

While I don't intend to teach exactly how arithmetic coding works (the articles explain it in detail), here are some snippets. You can usually recognize CACM87-style implementations from this bit:

// Cumulative frequencies bracketing symbol `s`, plus the total.
let lower = self.model.sum(s-1);
let upper = self.model.sum(s);
let denom = self.model.total();

// Narrow [low, high] to the sub-range allotted to `s`.
let range = (self.high - self.low) + 1;
self.high = self.low + (range * upper) / denom - 1;
self.low  = self.low + (range * lower) / denom;
loop {
    if self.high < HALF {
        // Entirely in the lower half: the next bit is 0.
        self.write_bit_plus_pending(false, bits_out)?;
    } else if self.low >= HALF {
        // Entirely in the upper half: the next bit is 1.
        self.write_bit_plus_pending(true, bits_out)?;
    } else if (self.low >= FIRST_QTR) && (self.high < THIRD_QTR) {
        // Straddling the middle: defer the bit and expand around HALF.
        self.pending += 1;
        self.low -= FIRST_QTR;
        self.high -= FIRST_QTR;
    } else {
        break;
    }
    // Scale the remaining range back up by a factor of two.
    self.low = (self.low << 1) & TOP;
    self.high = (self.high << 1) & TOP | 1;
}
This is where a symbol gets encoded. The decoder mirrors the encoder:
// Infer which symbol's sub-range contains the current code value.
let range = (self.high - self.low) + 1;
let denom = self.model.total();
let cum = (((self.value - self.low) + 1) * denom - 1) / range;
let s = self.model.upper(cum);
let upper = self.model.sum(s);
let lower = self.model.sum(s-1);

// Narrow [low, high] exactly as the encoder did.
self.high = self.low + (range * upper) / denom - 1;
self.low  = self.low + (range * lower) / denom;

loop {
    if self.high < HALF {
        // Lower half: nothing to subtract before scaling.
    } else if self.low >= HALF {
        // Upper half: shift everything down by HALF.
        self.value -= HALF;
        self.low -= HALF;
        self.high -= HALF;
    } else if (self.low >= FIRST_QTR) && (self.high < THIRD_QTR) {
        // Straddling the middle: expand around HALF, as the encoder did.
        self.value -= FIRST_QTR;
        self.low -= FIRST_QTR;
        self.high -= FIRST_QTR;
    } else {
        break;
    }
    // Scale up and shift in the next bit of the code value.
    self.low <<= 1;
    self.high = (self.high << 1) | 1;

    if let Some(b) = bits_in.next_bit()? {
        self.value = (self.value << 1) | b as Value;
    } else {
        // Past the end of input: pad with "garbage" zero bits.
        self.value <<= 1;
    }
}
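The `write_bit_plus_pending` helper the encoder calls isn't shown above. The idea is that each time the middle-straddling case fires, a bit is deferred; when the next bit is finally resolved, it's written out followed by that many copies of its complement. A sketch of just that logic (the `BitWrite` trait is a stand-in for whatever bit sink the real code uses, which also returns a `Result`):

```rust
// Sketch of the pending-bit output used by CACM87-style encoders.
// After the straddling case increments `pending`, the next resolved
// bit is emitted followed by `pending` copies of its complement.
trait BitWrite {
    fn write_bit(&mut self, bit: bool);
}

// A Vec<bool> works as a toy bit sink for demonstration.
impl BitWrite for Vec<bool> {
    fn write_bit(&mut self, bit: bool) {
        self.push(bit);
    }
}

struct Encoder {
    pending: u32, // bits deferred by the middle-straddling case
}

impl Encoder {
    fn write_bit_plus_pending(&mut self, bit: bool, out: &mut impl BitWrite) {
        out.write_bit(bit);
        while self.pending > 0 {
            out.write_bit(!bit); // flush deferred bits as the complement
            self.pending -= 1;
        }
    }
}

fn main() {
    let mut enc = Encoder { pending: 2 };
    let mut out: Vec<bool> = Vec::new();
    // Resolving a 1 after two deferred bits emits 1, then 0, 0.
    enc.write_bit_plus_pending(true, &mut out);
    println!("{out:?}");
}
```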
The implementation is adaptive, in that it updates its frequencies as it encounters more symbols. There's a simple bit of driver code that you can easily separate from the library it uses. You can download the whole thing from here. The work is licensed under the MIT license.
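The adaptive part is the easy part: after each symbol is coded, both the encoder and the decoder bump that symbol's frequency, so their probability estimates stay in lockstep with no side channel. A sketch of the idea with a plain frequency array (the names are illustrative, not from the actual crate):

```rust
// Sketch of an adaptive order-0 model. Every symbol starts with a
// count of 1 so nothing has zero probability, and `update` is called
// after each coded symbol, identically on both sides.
struct Model {
    freq: Vec<u64>, // per-symbol counts
}

impl Model {
    fn new(n: usize) -> Self {
        Model { freq: vec![1; n] } // every symbol starts possible
    }

    fn total(&self) -> u64 {
        self.freq.iter().sum()
    }

    /// Cumulative frequency of symbols 0..=s.
    fn sum(&self, s: usize) -> u64 {
        self.freq[..=s].iter().sum()
    }

    /// Called after coding symbol `s`, on both encoder and decoder.
    fn update(&mut self, s: usize) {
        self.freq[s] += 1;
    }
}

fn main() {
    let mut m = Model::new(2);
    m.update(0);
    m.update(0);
    // Symbol 0 now carries frequency 3 out of a total of 4.
    println!("{}/{}", m.sum(0), m.total());
}
```

In the real implementation the Fenwick tree plays the role of `freq`, making `sum` and `update` logarithmic instead of linear.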

More efficient implementations use various hacks to dispense with the slow integer division, replacing it with bitwise operations and possibly a lookup table, at some risk of losing accuracy in the probability model. In my experience, they're also somewhat obscure. Additionally, there are better models you could use, or possibly a binary or a range coder. Although range coders are more efficient, I'm not too interested in those just yet. I may look at the binary arithmetic coder later, if I find a cool application for it (it's called binary simply because there are only two symbols from which to choose). For now, the regular version of the arithmetic family of coders is enough.

Update: the code is now available on crates.io, here.