How I worked out what character set web Excel uses for its CSV export
I'm collecting registrations for a work conference through Microsoft Forms, because that's the tool my employer gives me.
I want to get the data out of Excel as soon as possible, so I clicked the button to export it as a .CSV file.
I loaded that in Python, and got an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 4071: invalid continuation byte
That told me that 4071 bytes into the file, it found a byte with hexadecimal value E5, which can't appear there in valid UTF-8. Looking in Excel, that byte was supposed to represent the character å in somebody's name.
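To see why that trips up a UTF-8 decoder: å is a single byte, E5, in Latin-1-style single-byte encodings, but two bytes in UTF-8, and a lone E5 announces the start of a three-byte UTF-8 sequence. A quick illustration (latin-1 here is just a stand-in single-byte charset):

print('å'.encode('latin-1'))  # b'\xe5' -- one byte
print('å'.encode('utf-8'))    # b'\xc3\xa5' -- two bytes

# 0xE5 claims two continuation bytes follow; 'n' isn't one, so this fails
# with "invalid continuation byte", just like my file did.
b'\xe5n'.decode('utf-8')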
After a bit of googling, I discovered that Excel's CSV export doesn't use UTF-8. Apparently the desktop version has a UTF-8 option, but I'm using the web version on Linux and I couldn't see anything other than a single "Export to CSV" option.
So, I set about trying to work out what character set Excel uses.
Python 3 has a nice system for dealing with different character sets: you can read a file in as a bytes object, and then use the decode method to try to convert it to a Unicode string, using whatever character set you specify.
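In code, the failing attempt from above looks like this (a minimal sketch):

# Open in binary mode so Python doesn't try to decode anything yet.
with open('registration.csv', 'rb') as f:
    data = f.read()  # data is a bytes object

text = data.decode('utf-8')  # raises UnicodeDecodeError at the first bad byte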
First, I tried cp1258, picking at random from a list of code pages. That decoded the å, but then I got another error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9a in position 6270: character maps to <undefined>
The byte 9A was supposed to represent the character š.
With a list of about 200 registrations to work through, with names from all over the world, I thought that I'd better speed things up with some code.
I don't happen to know the names of Windows code pages off by heart, but at least I know they exist, so I took the easy option: I found a list of encodings understood by Python and narrowed it down iteratively with this Python script:
2025/01/excel-charset.py
# The list of charsets understood by Python 3.13, omitted for brevity here.
charsets = ['ascii', 'big5', 'big5hkscs', 'cp037', ...]

# Read the CSV file as bytes.
with open('registration.csv', 'rb') as f:
    d = f.read()

# Try to decode the byte at position ``i`` in the file using the given charset.
def try_read(charset, i):
    s = d[i:i+1]
    try:
        return s.decode(charset)
    except UnicodeDecodeError:
        return None

# Positions of bytes that have caused problems, and what they should decode to.
strings = [(4071, 'å'), (6270, 'š')]

# Work through each of the problems and narrow down the list of charsets.
sets = charsets[:]
for i, char in strings:
    sets = [s for s in sets if try_read(s, i) == char]
    print(i, char, sets)
There were two options that could decode both problem characters properly: cp1252 and cp1254.
I picked the second one, and it turned out that there were no other problems (or at least, no more bytes that the charset leaves undefined).
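With a charset settled, the file opens cleanly by passing it straight to open (a minimal sketch; the csv module is my assumption about how the rest of the processing goes):

import csv

# Pass the charset to open() and the file decodes without errors.
with open('registration.csv', newline='', encoding='cp1254') as f:
    rows = list(csv.reader(f))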
Now that I'm writing this up, I thought I'd look up what those code pages are.
cp1252, or Windows-1252, is apparently the code page used in the Americas, Western Europe, and Oceania. cp1254 is a slight variant used for Turkish.
So it's likely that Excel uses Windows-1252, for me at least.
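Out of curiosity, a quick sketch like this asks Python exactly where the two code pages disagree:

# Compare how cp1252 and cp1254 decode each of the 256 possible byte values.
for b in range(256):
    raw = bytes([b])
    decoded = []
    for charset in ('cp1252', 'cp1254'):
        try:
            decoded.append(raw.decode(charset))
        except UnicodeDecodeError:
            decoded.append(None)
    if decoded[0] != decoded[1]:
        print(f'{b:#04x}: cp1252={decoded[0]!r} cp1254={decoded[1]!r}')

Only a handful of byte values differ, mostly where cp1254's Turkish letters replace cp1252's Icelandic ones.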