Converting a stream of binary digits to a stream of base digits
James Coglan asked on twitter:
https://twitter.com/jcoglan/status/212163174626115584
https://twitter.com/jcoglan/status/212163325579108353
So you have an infinite stream of uniform random binary digits, and want to use it to produce an infinite stream of uniform random base
The obvious really easy way to do it is to find the smallest
If the number you generate is less than
This works, but has the problem that you might go a long time before you generate a number that you don’t throw away.
So what can we do with the numbers that are thrown away?
Subtract
You can then do the same trick of generating enough of those digits until you can generate numbers greater than or equal to
Example
Here’s an example, generating base
The smallest power of
There are three unusable numbers we can generate:
The smallest power of
The smallest power of
The offcasts from the base
Now, let’s use the binary stream
My first three digits give
My next three digits give
We now have two base 3 digits,
Now we need to take three more binary digits:
More binary:
At the moment, our stream of base
Working code
I’ve written some Python code which implements the algorithm above to produce an infinite stream of digits of any base, from a stream of binary digits provided by Python’s random module.
[gist id=2911970 file=basestream.py]The Python code has some debug lines in it – turn those on to get a narration of what it does.
I hope that answers your question, James!
I should mention that this method is based on the first bit of this page. I couldn’t think of the right words to use to find a proper crypto textbook on the subject, which must surely exist.
Comments
Comments
Colin Beveridge
I think this is information theory, and you might want to have a look at the kind of algorithm used for zipping files for clues.
Assuming you want letters to be equally likely, if you have an alphabet of k letters, and generate n random bits, you should (in principle) be able to encode a message of length (i.e., with 32 letters and 10 bits, you could encode two letters).
Quite how you do it, I’m not sure – I suspect you’d need a splitting tree with rules to use if you get to a non-letter node.
Christian Perfect
That’s a good point. I did some tests to look at , on an input of 10,000 binary digits.
base 2: 1.000000
base 3: 0.597055
base 4: 1.000000
base 5: 0.589305
base 6: 0.709314
base 7: 0.818063
base 8: 0.999900
base 9: 0.607041
base 10: 0.659071
base 11: 0.717832
base 12: 0.767182
base 13: 0.775612
base 14: 0.863127
base 15: 0.915775
base 16: 1.000000
base 17: 0.622521
base 18: 0.643002
base 19: 0.662677
The worst-performing streams are bases 3 and 5, generating lower than 60% of the upper bound of digits.
Colin Beveridge
An on-the-fly thought: can you consider the binary stream as a binary ‘decimal’ and simply convert it into a ternary (or base n) ‘decimal’?
For example: if you have 0.101110, you can place limits on it digit by digit:
0.1 means the number is between 1/2 and 1
0.10 … 2/4 to 3/4
0.101 … 5/8 to 6/8
0.1011 … 11/16 to 12/16 (in particular, between 2/3 and 1 so first ternary place is 2; between 6/9 and 7/9 so second ternary place is 0)
0.10111 … 23/32 to 24/32
0.101110 … 46/64 to 47/64 (between 19/27 and 20/27, so third ternary place is 2).
And so on – my hunch is that that’s as efficient as it gets.
Colin Beveridge
Right: correction, that third ternary place is 1.
I put some code together to do this, and it gets pretty close to the theoretical maximum (at least for bases 3, 5 and 26).
Apologies for the length of it – let me know if it’s better in another format.
from random import *
from math import *
def hcf(no1,no2):
while no1!=no2:
if no1>no2:
no1-=no2
elif no2>no1:
no2-=no1
return no1
class Frac:
def __init__(self, top, bot):
self.top = top
self.bot = bot
self.cancel()
def cancel(self):
if self.top == 0:
return self
h = hcf(self.top, self.bot)
self.top /= h
self.bot /= h
return self
def average(self, other):
return Frac(1,2) * (self + other)
def __add__(self, other):
return Frac( self.top * other.bot + self.bot * other.top, self.bot * other.bot )
def __sub__(self, other):
return Frac( self.top * other.bot – self.bot * other.top, self.bot * other.bot )
def __mul__(self, other):
return Frac( self.top * other.top, self.bot * other.bot )
def __cmp__(self, other):
return cmp(self.top * other.bot, other.top * self.bot)
def __repr__(self):
return “(%d / %d)” % (self.top, self.bot)
def to_base(binstring, base = 26):
DIGITS = “0123456789abcdefghijklmnopqrstuvwxyz”
upper = Frac(1,1)
lower = Frac(0,1)
inp = “”
out = “”
count = 0
while binstring:
x = binstring[0]
inp += x
binstring = binstring[1:]
ave = upper.average(lower)
if x == “1”:
lower = ave
else:
upper = ave
go = True
while go:
go = False
for i in range(base):
f0 = Frac(i, base)
f1 = Frac(i+1, base)
#print f0, lower, upper, f1
if f0 < lower and upper 0.5):
bs += “1”
else:
bs += “0”
to_base(bs, 5)
Jan Van lent
Coding theory is indeed relevant here.
The first method is closely related to Huffman coding.
You pad the initial alphabet with symbols that will not get emitted and then construct the binary Huffman tree.
A standard method to increase the efficiency is to encode pairs, triples, etc.
For this application you can also encode a list with each symbol repeated twice, three times, etc.
The method interpreting the binary stream as a number in [0, 1) is essentially arithmetic coding. This can be implemented using only integers (range coding).
Alex Fink
[Whoops, just noticed I’m two years late, not three weeks…]
The method from your original posted can be refined to wring all the entropy out in a less drastic way than Colin’s arithmetic coding; this refinement is also what your “random bit unbiasing” link does. Namely, the portion of information you’re discarding is whether each attempt of a given accumulator to emit a digit succeeds or whether it passes through to the next one.
In your example, converting a binary string 110.101.011.111.001 to base 5, you should be not just yielding [3,1] directly and passing [1,0,2] to a base 3 accumulator, but also passing [1,1,0,1,0] to a new biased accumulator of binary digits which are 1 with probability 3/8.