
How I worked out what character set web Excel uses for its CSV export

I'm collecting registrations for a work conference through Microsoft Forms, because that's the tool my employer gives me.

I want to get the data out of Excel as soon as possible, so I clicked the button to export it as a .CSV file.

I loaded that in Python, and got an error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 4071: invalid continuation byte

That told me that 4071 bytes into the file, it found a byte with hexadecimal value E5, which couldn't appear at that position if the file was encoded as UTF-8.

Looking in Excel, that byte was supposed to represent the character å in somebody's name.

After a bit of googling, I discovered that Excel's CSV export doesn't use UTF-8. Apparently the desktop version has a UTF-8 option, but I'm using the web version on Linux and I couldn't see anything other than a single "Export to CSV" option.

So, I set about trying to work out what character set Excel uses.

Python 3 has a nice system for dealing with different character sets: you can read in a file as a bytes object, and then use the decode method to try to convert it to a Unicode string, using whatever character set you give it.
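For example, a minimal sketch of that process, using the same file as in the script below:

# Read the raw bytes of the file, without assuming any encoding.
with open('registration.csv', 'rb') as f:
    raw = f.read()

# Try to decode the bytes as UTF-8, and report where that fails.
try:
    text = raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(f"Not valid UTF-8: byte {raw[e.start]:#04x} at position {e.start}")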

First, I tried cp1258, picking at random from a list of code pages. That decoded the å, but then I got another error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9a in position 6270: character maps to <undefined>

The byte 9A was supposed to represent the character š. With a list of about 200 registrations to work through, with names from all over the world, I thought that I'd better speed things up with some code.

I don't happen to know the names of Windows code pages off by heart but at least I know they exist, so I took the easy option and found a list of encodings understood by Python and narrowed it down iteratively with this Python script:

2025/01/excel-charset.py

# The list of charsets understood by Python 3.13, omitted for brevity here.
charsets = ['ascii', 'big5', 'big5hkscs', 'cp037', ...]

# Read the CSV file as bytes
with open('registration.csv','rb') as f:
    d = f.read()

# Try to decode the single byte at position ``i`` in the file using the given charset.
def try_read(charset, i):
    s = d[i:i+1]
    try:
        return s.decode(charset)
    except UnicodeDecodeError:
        return None

# Positions of characters that have caused problems, and what they should decode to.
strings = [(4071,'å'), (6270,'š')]

# Work through each of the problems and narrow down the list of charsets.
sets = charsets[:]
for i, char in strings:
    sets = [s for s in sets if try_read(s, i) == char]
    print(i, char, sets)

There were two options that could decode both problem characters properly: cp1252 and cp1254. I picked the second one, and it turned out that there were no other problems (or at least, no more bytes that the charset can't decode).

Now that I'm writing this up, I thought I'd look up what those code pages are. cp1252, or Windows-1252, is apparently the code page used in the Americas, Western Europe, and Oceania. cp1254 is a slight variant used for Turkish.

So it's likely that Excel uses Windows-1252, for me at least.
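Having worked that out, reading the file from Python is just a case of passing the encoding to open. A quick sketch, assuming cp1252:

import csv

# Open the exported file with the charset that worked, rather than the default UTF-8.
with open('registration.csv', newline='', encoding='cp1252') as f:
    for row in csv.DictReader(f):
        print(row)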

A script to convert a WCAG-EM report from JSON to reStructuredText for a VPAT

At work, I've had to produce VPAT documents for the software I'm responsible for.

The VPAT template asks you to list, for each of the WCAG criteria, whether you support it or not, or if it doesn't apply.

The W3C have made a WCAG-EM Report Tool which helps you to work through the WCAG criteria and make notes about whether they're satisfied.

At the end, you can download a copy of the report in either summarised HTML format, or a JSON file with all the data you entered.

The first time I did a VPAT, I mostly manually converted the information from the WCAG-EM report to a reStructuredText table, to go in our Sphinx documentation.

Now I'm doing it a second time, I know I don't want to waste my time doing that!

So I've written a Python script which takes in the JSON file from the report tool and prints out the tables for the VPAT template, in reStructuredText format.
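As a rough illustration of the idea (the JSON field names here are assumptions about the report format, not the real WCAG-EM schema), the conversion boils down to something like this:

import json

# Load the report downloaded from the WCAG-EM Report Tool.
with open('report.json') as f:
    report = json.load(f)

# Print a reStructuredText list-table with one row per criterion.
# The keys used below are placeholders; the real report uses EARL vocabulary.
print('.. list-table:: WCAG criteria')
print('   :header-rows: 1')
print()
print('   * - Criterion')
print('     - Result')
print('     - Remarks')
for sample in report.get('auditSample', []):
    criterion = sample.get('test', {}).get('title', '')
    result = sample.get('result', {}).get('outcome', {}).get('title', '')
    remarks = sample.get('result', {}).get('description', '')
    print(f'   * - {criterion}')
    print(f'     - {result}')
    print(f'     - {remarks}')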

Read more…

My development PC's homepage lists all the servers I'm running

A lot of what I do on a computer involves web development. For work, I develop several Django projects, whose development servers I have to run locally while I'm working on them. And for client-side stuff, I always run a simple HTTP server to serve static files, because browsers apply a lot of security restrictions to pages loaded through file://.

For years, I would type in things like http://localhost:8000 into my browser's address bar, like a chump. Then one day, a lightbulb turned on and I realised that since I already have an HTTP server running on port 80, I could make its homepage be a list of links to the ports I usually run servers on.

A little while later, another, brighter lightbulb turned on and I realised that the homepage could be a script which scans every port to automatically find every server I'm running.
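The full post goes into the details, but the core idea is simple: probe each local port and link to whatever responds. A rough sketch of that, where the port range and the title extraction are assumptions rather than what the real script does:

import re
import socket
from urllib.request import urlopen

# Ports a development server might be listening on; the range is an assumption.
candidate_ports = range(8000, 8100)

for port in candidate_ports:
    # Quick TCP probe to see whether anything is listening on this port.
    with socket.socket() as sock:
        sock.settimeout(0.05)
        if sock.connect_ex(('127.0.0.1', port)) != 0:
            continue
    # Something answered: try to fetch the page and pull out its <title>.
    try:
        html = urlopen(f'http://localhost:{port}', timeout=0.5).read().decode('utf-8', 'replace')
        match = re.search(r'<title>(.*?)</title>', html, re.S)
        label = match.group(1).strip() if match else f'Port {port}'
    except OSError:
        label = f'Port {port}'
    print(f'<li><a href="http://localhost:{port}">{label}</a> (port {port})</li>')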

Screenshot of the homepage: a webpage with two columns. The first lists "moodle" and "wordpress" under the heading "On port 80". The other has the header "Other open ports" and there are two list items. The first reads, "Directory listing for /", with port: 1535, cwd: ~/websites/checkmyworking.com, pid: 63072. The second reads, "Numbas development server", port: 8000, cwd: ~/numbas/editor, pid: 45650.

Read more…

Finding all my recent git commits

At the moment I'm writing up one of my regular development update posts on the Numbas blog. I try to write one every couple of months. The posts act as a changelog for the various projects related to Numbas.

It's been just over seven months since the last one, because of everything, so this time there's a lot to talk about.

Read more…

Experiences writing an ActivityPub server in Python with Django

This post is about how I wrote an ActivityPub server using the Django Python framework, to provide a fediverse account on the same domain as a WordPress blog.

My motivation

On The Aperiodical, we do a monthly post collecting bits of maths news that we've seen. This is the compromise we came to as we realised that we're too old and busy to keep up with writing in-depth posts about individual things any more.

When we started doing this, I set up a /news slash command in our Slack channel which would take a URL and some explanatory text, and add it to the current draft post. Slack insists you give a username to the account that replies to this command, so we have our happy little Aperiodipal secretary.

Since we only publish the news posts once a month, we sometimes miss out on spreading the word about time-limited events, such as deadlines to register for conferences or mathematical holidays. And when there's Big Maths News such as a proof of an old conjecture, it'd be nice to put something out immediately rather than waiting until it's old news.

So I thought it would be a good idea to automatically toot on the fediverse each time one of us adds an item to the news post, as part of the /news Slack command.

Ideally, I'd like the fediverse account to belong to the aperiodical.com domain, instead of a Mastodon instance such as mathstodon.xyz.

That meant I'd have to serve the ActivityPub protocol on aperiodical.com. This is the kind of thing you do when you're really more curious about how the protocol works than about making a pragmatic decision on the best course of action.
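To give a flavour of what serving the protocol involves: the first thing another server does when it looks up an account is a WebFinger request. A minimal Django view for that might look like the sketch below; the account name and actor URL are placeholders, not what aperiodical.com really serves:

from django.http import HttpResponseNotFound, JsonResponse
from django.urls import path

# Placeholder account details for the sketch.
ACCOUNT = 'acct:news@aperiodical.com'
ACTOR_URL = 'https://aperiodical.com/activitypub/actor'

def webfinger(request):
    # Other servers ask /.well-known/webfinger?resource=acct:user@domain
    # to find the ActivityPub actor document for an account.
    if request.GET.get('resource') != ACCOUNT:
        return HttpResponseNotFound()
    return JsonResponse({
        'subject': ACCOUNT,
        'links': [
            {
                'rel': 'self',
                'type': 'application/activity+json',
                'href': ACTOR_URL,
            },
        ],
    }, content_type='application/jrd+json')

urlpatterns = [
    path('.well-known/webfinger', webfinger),
]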

Read more…

Mastodon admin experiences

Mastodon has really taken off this month, as a result of Twitter collapsing.

The instance I run with Colin Wright, mathstodon.xyz, has grown from about a thousand active users to just over 5,000 as I write.

There are lots of people running small Mastodon instances who suddenly need to support lots more activity than they're used to. I've had to learn a lot about making a webserver run at scale, so I thought it would be worth writing down what I've learnt.

I didn't take proper notes while fixing things, so I've probably forgotten some important non-obvious stuff. Soz!

These are just the things that stuck in my mind recently.

Read more…

Can I also use?

Screenshot of the "Can I also use" tool. The screen is divided into segments: the left half shows a list of browsers, versions, and features requiring that version. On the right is a tree of features along with a search box, and a list of selected features.

While updating Numbas, I wanted to know which new CSS and JavaScript features we could use while maintaining compatibility with almost all browsers in use. Until now, we'd been stuck hanging on to support for Internet Explorer 11, but it seems to have finally faded away into insignificance.

So we needed a new target for compatibility. We agreed that aiming for compatibility with 95% of devices in use would be a reasonable target.

I have been using caniuse.com to check individual features against their usage stats. It's a really useful site!

But our policy shouldn't just be "any feature supported by 95% of browsers can be used": if two features each have 95% support, the set of browsers supporting both features could be anywhere between 90% and 95%, since each feature excludes at most 5% of browsers. With more features, the intersection of all those sets could end up being much lower than our target, and make Numbas inaccessible for a significant number of people.

I looked around for something that would let me specify a collection of features and see what percentage of browsers support them all. I'm sure it exists, but I couldn't find it, so I spent a happy day making it.

The tool is online at somethingorotherwhatever.com/can-i-also-use.

I gathered MDN's compatibility data and caniuse.com's usage data, and presented it as a massive searchable tree of features you can tick off. The other half of the screen shows you which browsers support all of the selected features, and their relative usage and release dates. For each browser, you're shown which features force that particular version, so you know what to drop in order to be supported by older versions.
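At heart, the calculation is just set intersection. A small sketch of the idea, with invented browser data standing in for the real MDN and caniuse datasets:

# Invented data for illustration: which browser versions support each feature,
# and what share of global usage each version has.
support = {
    'css-grid': {'chrome 120', 'firefox 121', 'safari 17'},
    'container-queries': {'chrome 120', 'safari 17'},
}
usage = {'chrome 120': 60.0, 'firefox 121': 5.0, 'safari 17': 20.0, 'ie 11': 0.3}

selected = ['css-grid', 'container-queries']

# The browsers that support every selected feature are the intersection of the per-feature sets.
supported_everywhere = set.intersection(*(support[f] for f in selected))

# Add up the usage share of those browsers to see what fraction of users is covered.
coverage = sum(usage[b] for b in supported_everywhere)
print(supported_everywhere, f'{coverage:.1f}% of usage')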

I've realised that I have to be quite careful about which features I tick when using this tool, particularly with respect to CSS: browsers usually silently ignore CSS rules that they don't support, so I should only tick features that are absolutely required for the page to function.

There's a function to download the set of selected features so you can document what your project requires. I haven't yet added a way to load that file back in, but next time I need to do this job I will, so I don't have to start again from scratch!