James Coglan asked on twitter:

Suggestions wanted: how to turn an indefinite stream of bits (01110010…) into a stream of [A-Z] where letters are evenly distributed.

— James Coglan (@jcoglan) June 11, 2012

And I don’t mean just select character codes, I mean select from an arbitrary-sized character set of any length.

— James Coglan (@jcoglan) June 11, 2012

So you have an infinite stream of uniform random binary digits, and want to use it to produce an infinite stream of uniform random base $n$ digits.

The obvious really easy way to do it is to find the smallest $k$ such that $2^k>=n$, and generate numbers in the range $0 \dots 2^k-1$.

If the number you generate is less than $n$, yield it, otherwise chuck it away.

This works, but has the problem that you might go a long time before you generate a number that you don’t throw away.

So what can we do with the numbers that are thrown away?

Subtract $n$ from them, and use them as a stream of infinite base $(2^k-n)$ digits.

You can then do the same trick of generating enough of those digits until you can generate numbers greater than or equal to $n$. Numbers less than $n$ are yielded; otherwise, the trick is repeated yet again!

### Example

Here’s an example, generating base $5$ digits:

The smallest power of $2$ greater than $5$ is $2^3 = 8$. So take digits from the binary stream three at a time.

There are three unusable numbers we can generate: $5,6,7$. So subtract $5$ from those, giving $0,1$ or $2$, and treat them as digits from a stream of base $3$ digits.

The smallest power of $3$ greater than $5$ is $3^2 = 9$, so take digits from that stream two at a time. Numbers $5,6,7,8$ from that range can’t be used, so subtract $5$ from them, giving $0,1,2$ or $3$, and treat them as base $4$ digits.

The smallest power of $4$ greater than $5$ is $4^2 = 16$, so take digits from that stream two at a time. The offcasts from that stream should be considered as base $16-5 = 11$ digits. $11$ is bigger than $5$ already, so take digits from that stream one at a time.

The offcasts from the base $11$ stream look like base $6$ digits. There’s only one unusable number from that range — $5$ — which doesn’t give you any information, so stop being clever at that point and throw away those numbers.

Now, let’s use the binary stream $110101011111001$

My first three digits give $\operatorname{rand}(0\dots2^3) = 110_2 = 4+2+0 = 6.$ That’s not a base $5$ digit, so add $6-5 = 1$ to the stream of base $3$ digits.

My next three digits give $\operatorname{rand}(0\dots2^3) = 101_2 = 4+0+1 = 5.$ Again, that isn’t a base 5 digit, so add $5-5 = 0$ to the stream of base $3$ digits.

We now have two base 3 digits, $1$ and $0$, which means it yields $\operatorname{rand}(0\dots3^2) = 10_3 = 3+0 = 3$. That’s in the range we want, so yield it.

Now we need to take three more binary digits: $\operatorname{rand}(0\dots2^3) = 011_2 = 0+2+1 = 3.$ That’s in the range we want again, so yield it.

More binary: $\operatorname{rand}(0\dots2^3) = 111_2 = 4+2+1 = 7.$ Too big, so add $7-5 = 2$ to the base $3$ stream.

$\operatorname{rand}(0\dots2^3) = 001_2 = 1.$ That’s fine, so yield it.

At the moment, our stream of base $5$ digits is $[3,3,1]$ and we have a $2$ in the stream of base $3$ digits. We can keep generating numbers like this indefinitely, and we hardly ever throw information away. It would probably take a lot more steps to get to the point where we use the bigger accumulators.

### Working code

I’ve written some Python code which implements the algorithm above to produce an infinite stream of digits of any base, from a stream of binary digits provided by Python’s random module.

The Python code has some debug lines in it – turn those on to get a narration of what it does.

I hope that answers your question, James!

I should mention that this method is based on the first bit of this page. I couldn’t think of the right words to use to find a proper crypto textbook on the subject, which must surely exist.

I think this is information theory, and you might want to have a look at the kind of algorithm used for zipping files for clues.

Assuming you want letters to be equally likely, if you have an alphabet of k letters, and generate n random bits, you should (in principle) be able to encode a message of length $\frac{n}{\log_2(k)}$ (i.e., with 32 letters and 10 bits, you could encode two letters).

Quite how you do it, I’m not sure – I suspect you’d need a splitting tree with rules to use if you get to a non-letter node.

June 12, 2012 @ 11:56 am

|That’s a good point. I did some tests to look at $|\textrm{digits produced}| \div \frac{n}{\log_2(k)}$, on an input of 10,000 binary digits.

base 2: 1.000000

base 3: 0.597055

base 4: 1.000000

base 5: 0.589305

base 6: 0.709314

base 7: 0.818063

base 8: 0.999900

base 9: 0.607041

base 10: 0.659071

base 11: 0.717832

base 12: 0.767182

base 13: 0.775612

base 14: 0.863127

base 15: 0.915775

base 16: 1.000000

base 17: 0.622521

base 18: 0.643002

base 19: 0.662677

The worst-performing streams are bases 3 and 5, generating lower than 60% of the upper bound of digits.

June 12, 2012 @ 12:13 pm

|An on-the-fly thought: can you consider the binary stream as a binary ‘decimal’ and simply convert it into a ternary (or base n) ‘decimal’?

For example: if you have 0.101110, you can place limits on it digit by digit:

0.1 means the number is between 1/2 and 1

0.10 … 2/4 to 3/4

0.101 … 5/8 to 6/8

0.1011 … 11/16 to 12/16 (in particular, between 2/3 and 1 so first ternary place is 2; between 6/9 and 7/9 so second ternary place is 0)

0.10111 … 23/32 to 24/32

0.101110 … 46/64 to 47/64 (between 19/27 and 20/27, so third ternary place is 2).

And so on – my hunch is that that’s as efficient as it gets.

June 12, 2012 @ 1:30 pm

|Right: correction, that third ternary place is 1.

I put some code together to do this, and it gets pretty close to the theoretical maximum (at least for bases 3, 5 and 26).

Apologies for the length of it – let me know if it’s better in another format.

from random import *

from math import *

def hcf(no1,no2):

while no1!=no2:

if no1>no2:

no1-=no2

elif no2>no1:

no2-=no1

return no1

class Frac:

def __init__(self, top, bot):

self.top = top

self.bot = bot

self.cancel()

def cancel(self):

if self.top == 0:

return self

h = hcf(self.top, self.bot)

self.top /= h

self.bot /= h

return self

def average(self, other):

return Frac(1,2) * (self + other)

def __add__(self, other):

return Frac( self.top * other.bot + self.bot * other.top, self.bot * other.bot )

def __sub__(self, other):

return Frac( self.top * other.bot – self.bot * other.top, self.bot * other.bot )

def __mul__(self, other):

return Frac( self.top * other.top, self.bot * other.bot )

def __cmp__(self, other):

return cmp(self.top * other.bot, other.top * self.bot)

def __repr__(self):

return “(%d / %d)” % (self.top, self.bot)

def to_base(binstring, base = 26):

DIGITS = “0123456789abcdefghijklmnopqrstuvwxyz”

upper = Frac(1,1)

lower = Frac(0,1)

inp = “”

out = “”

count = 0

while binstring:

x = binstring[0]

inp += x

binstring = binstring[1:]

ave = upper.average(lower)

if x == “1″:

lower = ave

else:

upper = ave

go = True

while go:

go = False

for i in range(base):

f0 = Frac(i, base)

f1 = Frac(i+1, base)

#print f0, lower, upper, f1

if f0 < lower and upper 0.5):

bs += “1″

else:

bs += “0″

to_base(bs, 5)

June 12, 2012 @ 8:53 pm

|Coding theory is indeed relevant here.

The first method is closely related to Huffman coding.

You pad the initial alphabet with symbols that will not get emitted and then construct the binary Huffman tree.

A standard method to increase the efficiency is to encode pairs, triples, etc.

For this application you can also encode a list with each symbol repeated twice, three times, etc.

The method interpreting the binary stream as a number in [0, 1) is essentially arithmetic coding. This can be implemented using only integers (range coding).

June 13, 2012 @ 12:12 am