Converting a stream of binary digits to a stream of base $n$ digits

2012-06-11

James Coglan asked on twitter:

https://twitter.com/jcoglan/status/212163174626115584

https://twitter.com/jcoglan/status/212163325579108353

So you have an infinite stream of uniform random binary digits, and want to use it to produce an infinite stream of uniform random base $n$ digits.

The obvious really easy way to do it is to find the smallest $k$ such that $2^{k} >= n$ , and generate numbers in the range $0 \dots 2^{k} - 1$ .

If the number you generate is less than $n$ , yield it, otherwise chuck it away.

This works, but has the problem that you might go a long time before you generate a number that you don’t throw away.

So what can we do with the numbers that are thrown away?

Subtract $n$ from them, and use them as a stream of infinite base $(2^{k} - n)$ digits.

You can then do the same trick of generating enough of those digits until you can generate numbers greater than or equal to $n$ . Numbers less than $n$ are yielded; otherwise, the trick is repeated yet again!

Example

Here’s an example, generating base $5$ digits:

The smallest power of $2$ greater than $5$ is $2^{3} = 8$ . So take digits from the binary stream three at a time.

There are three unusable numbers we can generate: $5, 6, 7$ . So subtract $5$ from those, giving $0, 1$ or $2$ , and treat them as digits from a stream of base $3$ digits.

The smallest power of $3$ greater than $5$ is $3^{2} = 9$ , so take digits from that stream two at a time. Numbers $5, 6, 7, 8$ from that range can’t be used, so subtract $5$ from them, giving $0, 1, 2$ or $3$ , and treat them as base $4$ digits.

The smallest power of $4$ greater than $5$ is $4^{2} = 16$ , so take digits from that stream two at a time. The offcasts from that stream should be considered as base $16 - 5 = 11$ digits. $11$ is bigger than $5$ already, so take digits from that stream one at a time.

The offcasts from the base $11$ stream look like base $6$ digits. There’s only one unusable number from that range — $5$ — which doesn’t give you any information, so stop being clever at that point and throw away those numbers.

Now, let’s use the binary stream $110101011111001$

My first three digits give $rand (0 \dots 2^{3}) = 110_{2} = 4 + 2 + 0 = 6.$ That’s not a base $5$ digit, so add $6 - 5 = 1$ to the stream of base $3$ digits.

My next three digits give $rand (0 \dots 2^{3}) = 101_{2} = 4 + 0 + 1 = 5.$ Again, that isn’t a base 5 digit, so add $5 - 5 = 0$ to the stream of base $3$ digits.

We now have two base 3 digits, $1$ and $0$ , which means it yields $rand (0 \dots 3^{2}) = 10_{3} = 3 + 0 = 3$ . That’s in the range we want, so yield it.

Now we need to take three more binary digits: $rand (0 \dots 2^{3}) = 011_{2} = 0 + 2 + 1 = 3.$ That’s in the range we want again, so yield it.

More binary: $rand (0 \dots 2^{3}) = 111_{2} = 4 + 2 + 1 = 7.$ Too big, so add $7 - 5 = 2$ to the base $3$ stream.

$rand (0 \dots 2^{3}) = 001_{2} = 1.$ That’s fine, so yield it.

At the moment, our stream of base $5$ digits is $[3, 3, 1]$ and we have a $2$ in the stream of base $3$ digits. We can keep generating numbers like this indefinitely, and we hardly ever throw information away. It would probably take a lot more steps to get to the point where we use the bigger accumulators.

Working code

I’ve written some Python code which implements the algorithm above to produce an infinite stream of digits of any base, from a stream of binary digits provided by Python’s random module.

[gist id=2911970 file=basestream.py]

The Python code has some debug lines in it – turn those on to get a narration of what it does.

I hope that answers your question, James!

I should mention that this method is based on the first bit of this page. I couldn’t think of the right words to use to find a proper crypto textbook on the subject, which must surely exist.

Comments

Colin Beveridge

2012-06-12 11:56:38

I think this is information theory, and you might want to have a look at the kind of algorithm used for zipping files for clues.

Assuming you want letters to be equally likely, if you have an alphabet of k letters, and generate n random bits, you should (in principle) be able to encode a message of length $\frac{n}{\log_{2} (k)}$ (i.e., with 32 letters and 10 bits, you could encode two letters).

Quite how you do it, I’m not sure – I suspect you’d need a splitting tree with rules to use if you get to a non-letter node.

Christian Perfect

2012-06-12 12:13:10

That’s a good point. I did some tests to look at $| digits produced | \div \frac{n}{\log_{2} (k)}$ , on an input of 10,000 binary digits.

base 2: 1.000000 base 3: 0.597055 base 4: 1.000000 base 5: 0.589305 base 6: 0.709314 base 7: 0.818063 base 8: 0.999900 base 9: 0.607041 base 10: 0.659071 base 11: 0.717832 base 12: 0.767182 base 13: 0.775612 base 14: 0.863127 base 15: 0.915775 base 16: 1.000000 base 17: 0.622521 base 18: 0.643002 base 19: 0.662677

The worst-performing streams are bases 3 and 5, generating lower than 60% of the upper bound of digits.

Colin Beveridge

2012-06-12 13:30:59

An on-the-fly thought: can you consider the binary stream as a binary ‘decimal’ and simply convert it into a ternary (or base n) ‘decimal’?

For example: if you have 0.101110, you can place limits on it digit by digit:

0.1 means the number is between 1/2 and 1
0.10 … 2/4 to 3/4
0.101 … 5/8 to 6/8
0.1011 … 11/16 to 12/16 (in particular, between 2/3 and 1 so first ternary place is 2; between 6/9 and 7/9 so second ternary place is 0)
0.10111 … 23/32 to 24/32
0.101110 … 46/64 to 47/64 (between 19/27 and 20/27, so third ternary place is 2).

And so on – my hunch is that that’s as efficient as it gets.

Colin Beveridge

2012-06-12 20:53:46

Right: correction, that third ternary place is 1.

I put some code together to do this, and it gets pretty close to the theoretical maximum (at least for bases 3, 5 and 26).

Apologies for the length of it – let me know if it’s better in another format.

from random import *
from math import *

def hcf(no1,no2):

while no1!=no2:
if no1>no2:
no1-=no2
elif no2>no1:
no2-=no1
return no1

class Frac:
def __init__(self, top, bot):
self.top = top
self.bot = bot
self.cancel()

def cancel(self):
if self.top == 0:
return self
h = hcf(self.top, self.bot)
self.top /= h
self.bot /= h
return self

def average(self, other):
return Frac(1,2) * (self + other)

def __add__(self, other):
return Frac( self.top * other.bot + self.bot * other.top, self.bot * other.bot )

def __sub__(self, other):
return Frac( self.top * other.bot – self.bot * other.top, self.bot * other.bot )

def __mul__(self, other):
return Frac( self.top * other.top, self.bot * other.bot )

def __cmp__(self, other):
return cmp(self.top * other.bot, other.top * self.bot)

def __repr__(self):
return “(%d / %d)” % (self.top, self.bot)

def to_base(binstring, base = 26):
DIGITS = “0123456789abcdefghijklmnopqrstuvwxyz”
upper = Frac(1,1)
lower = Frac(0,1)
inp = “”
out = “”
count = 0
while binstring:
x = binstring[0]
inp += x
binstring = binstring[1:]

ave = upper.average(lower)
if x == “1”:
lower = ave
else:
upper = ave

go = True
while go:
go = False

for i in range(base):
f0 = Frac(i, base)
f1 = Frac(i+1, base)
#print f0, lower, upper, f1
if f0 < lower and upper 0.5):
bs += “1”
else:
bs += “0”

to_base(bs, 5)

Jan Van lent

2012-06-13 00:12:22

Coding theory is indeed relevant here.
The first method is closely related to Huffman coding.
You pad the initial alphabet with symbols that will not get emitted and then construct the binary Huffman tree.
A standard method to increase the efficiency is to encode pairs, triples, etc.
For this application you can also encode a list with each symbol repeated twice, three times, etc.

The method interpreting the binary stream as a number in [0, 1) is essentially arithmetic coding. This can be implemented using only integers (range coding).

Alex Fink

2014-07-04 17:44:17

[Whoops, just noticed I’m two years late, not three weeks…]

The method from your original posted can be refined to wring all the entropy out in a less drastic way than Colin’s arithmetic coding; this refinement is also what your “random bit unbiasing” link does. Namely, the portion of information you’re discarding is whether each attempt of a given accumulator to emit a digit succeeds or whether it passes through to the next one.

In your example, converting a binary string 110.101.011.111.001 to base 5, you should be not just yielding [3,1] directly and passing [1,0,2] to a base 3 accumulator, but also passing [1,1,0,1,0] to a new biased accumulator of binary digits which are 1 with probability 3/8.