James Coglan asked on twitter:
Suggestions wanted: how to turn an indefinite stream of bits (01110010…) into a stream of [AZ] where letters are evenly distributed.
— James Coglan (@jcoglan) June 11, 2012
And I don’t mean just select character codes, I mean select from an arbitrarysized character set of any length.
— James Coglan (@jcoglan) June 11, 2012
So you have an infinite stream of uniform random binary digits, and want to use it to produce an infinite stream of uniform random base $n$ digits.
The obvious really easy way to do it is to find the smallest $k$ such that $2^k>=n$, and generate numbers in the range $0 \dots 2^k1$.
If the number you generate is less than $n$, yield it, otherwise chuck it away.
This works, but has the problem that you might go a long time before you generate a number that you don’t throw away.
So what can we do with the numbers that are thrown away?
Subtract $n$ from them, and use them as a stream of infinite base $(2^kn)$ digits.
You can then do the same trick of generating enough of those digits until you can generate numbers greater than or equal to $n$. Numbers less than $n$ are yielded; otherwise, the trick is repeated yet again!
Example
Here’s an example, generating base $5$ digits:
The smallest power of $2$ greater than $5$ is $2^3 = 8$. So take digits from the binary stream three at a time.
There are three unusable numbers we can generate: $5,6,7$. So subtract $5$ from those, giving $0,1$ or $2$, and treat them as digits from a stream of base $3$ digits.
The smallest power of $3$ greater than $5$ is $3^2 = 9$, so take digits from that stream two at a time. Numbers $5,6,7,8$ from that range can’t be used, so subtract $5$ from them, giving $0,1,2$ or $3$, and treat them as base $4$ digits.
The smallest power of $4$ greater than $5$ is $4^2 = 16$, so take digits from that stream two at a time. The offcasts from that stream should be considered as base $165 = 11$ digits. $11$ is bigger than $5$ already, so take digits from that stream one at a time.
The offcasts from the base $11$ stream look like base $6$ digits. There’s only one unusable number from that range — $5$ — which doesn’t give you any information, so stop being clever at that point and throw away those numbers.
Now, let’s use the binary stream $110101011111001$
My first three digits give $\operatorname{rand}(0\dots2^3) = 110_2 = 4+2+0 = 6.$ That’s not a base $5$ digit, so add $65 = 1$ to the stream of base $3$ digits.
My next three digits give $\operatorname{rand}(0\dots2^3) = 101_2 = 4+0+1 = 5.$ Again, that isn’t a base 5 digit, so add $55 = 0$ to the stream of base $3$ digits.
We now have two base 3 digits, $1$ and $0$, which means it yields $\operatorname{rand}(0\dots3^2) = 10_3 = 3+0 = 3$. That’s in the range we want, so yield it.
Now we need to take three more binary digits: $\operatorname{rand}(0\dots2^3) = 011_2 = 0+2+1 = 3.$ That’s in the range we want again, so yield it.
More binary: $\operatorname{rand}(0\dots2^3) = 111_2 = 4+2+1 = 7.$ Too big, so add $75 = 2$ to the base $3$ stream.
$\operatorname{rand}(0\dots2^3) = 001_2 = 1.$ That’s fine, so yield it.
At the moment, our stream of base $5$ digits is $[3,3,1]$ and we have a $2$ in the stream of base $3$ digits. We can keep generating numbers like this indefinitely, and we hardly ever throw information away. It would probably take a lot more steps to get to the point where we use the bigger accumulators.
Working code
I’ve written some Python code which implements the algorithm above to produce an infinite stream of digits of any base, from a stream of binary digits provided by Python’s random module.
[gist id=2911970 file=basestream.py]The Python code has some debug lines in it – turn those on to get a narration of what it does.
I hope that answers your question, James!
I should mention that this method is based on the first bit of this page. I couldn’t think of the right words to use to find a proper crypto textbook on the subject, which must surely exist.
I think this is information theory, and you might want to have a look at the kind of algorithm used for zipping files for clues.
Assuming you want letters to be equally likely, if you have an alphabet of k letters, and generate n random bits, you should (in principle) be able to encode a message of length $\frac{n}{\log_2(k)}$ (i.e., with 32 letters and 10 bits, you could encode two letters).
Quite how you do it, I’m not sure – I suspect you’d need a splitting tree with rules to use if you get to a nonletter node.
June 12, 2012 @ 11:56 am
That’s a good point. I did some tests to look at $\textrm{digits produced} \div \frac{n}{\log_2(k)}$, on an input of 10,000 binary digits.
base 2: 1.000000
base 3: 0.597055
base 4: 1.000000
base 5: 0.589305
base 6: 0.709314
base 7: 0.818063
base 8: 0.999900
base 9: 0.607041
base 10: 0.659071
base 11: 0.717832
base 12: 0.767182
base 13: 0.775612
base 14: 0.863127
base 15: 0.915775
base 16: 1.000000
base 17: 0.622521
base 18: 0.643002
base 19: 0.662677
The worstperforming streams are bases 3 and 5, generating lower than 60% of the upper bound of digits.
June 12, 2012 @ 12:13 pm
An onthefly thought: can you consider the binary stream as a binary ‘decimal’ and simply convert it into a ternary (or base n) ‘decimal’?
For example: if you have 0.101110, you can place limits on it digit by digit:
0.1 means the number is between 1/2 and 1
0.10 … 2/4 to 3/4
0.101 … 5/8 to 6/8
0.1011 … 11/16 to 12/16 (in particular, between 2/3 and 1 so first ternary place is 2; between 6/9 and 7/9 so second ternary place is 0)
0.10111 … 23/32 to 24/32
0.101110 … 46/64 to 47/64 (between 19/27 and 20/27, so third ternary place is 2).
And so on – my hunch is that that’s as efficient as it gets.
June 12, 2012 @ 1:30 pm
Right: correction, that third ternary place is 1.
I put some code together to do this, and it gets pretty close to the theoretical maximum (at least for bases 3, 5 and 26).
Apologies for the length of it – let me know if it’s better in another format.
from random import *
from math import *
def hcf(no1,no2):
while no1!=no2:
if no1>no2:
no1=no2
elif no2>no1:
no2=no1
return no1
class Frac:
def __init__(self, top, bot):
self.top = top
self.bot = bot
self.cancel()
def cancel(self):
if self.top == 0:
return self
h = hcf(self.top, self.bot)
self.top /= h
self.bot /= h
return self
def average(self, other):
return Frac(1,2) * (self + other)
def __add__(self, other):
return Frac( self.top * other.bot + self.bot * other.top, self.bot * other.bot )
def __sub__(self, other):
return Frac( self.top * other.bot – self.bot * other.top, self.bot * other.bot )
def __mul__(self, other):
return Frac( self.top * other.top, self.bot * other.bot )
def __cmp__(self, other):
return cmp(self.top * other.bot, other.top * self.bot)
def __repr__(self):
return “(%d / %d)” % (self.top, self.bot)
def to_base(binstring, base = 26):
DIGITS = “0123456789abcdefghijklmnopqrstuvwxyz”
upper = Frac(1,1)
lower = Frac(0,1)
inp = “”
out = “”
count = 0
while binstring:
x = binstring[0]
inp += x
binstring = binstring[1:]
ave = upper.average(lower)
if x == “1”:
lower = ave
else:
upper = ave
go = True
while go:
go = False
for i in range(base):
f0 = Frac(i, base)
f1 = Frac(i+1, base)
#print f0, lower, upper, f1
if f0 < lower and upper 0.5):
bs += “1”
else:
bs += “0”
to_base(bs, 5)
June 12, 2012 @ 8:53 pm
Coding theory is indeed relevant here.
The first method is closely related to Huffman coding.
You pad the initial alphabet with symbols that will not get emitted and then construct the binary Huffman tree.
A standard method to increase the efficiency is to encode pairs, triples, etc.
For this application you can also encode a list with each symbol repeated twice, three times, etc.
The method interpreting the binary stream as a number in [0, 1) is essentially arithmetic coding. This can be implemented using only integers (range coding).
June 13, 2012 @ 12:12 am
[Whoops, just noticed I’m two years late, not three weeks…]
The method from your original posted can be refined to wring all the entropy out in a less drastic way than Colin’s arithmetic coding; this refinement is also what your “random bit unbiasing” link does. Namely, the portion of information you’re discarding is whether each attempt of a given accumulator to emit a digit succeeds or whether it passes through to the next one.
In your example, converting a binary string 110.101.011.111.001 to base 5, you should be not just yielding [3,1] directly and passing [1,0,2] to a base 3 accumulator, but also passing [1,1,0,1,0] to a new biased accumulator of binary digits which are 1 with probability 3/8.
July 4, 2014 @ 5:44 pm
