# Pattern-matching syntax trees

I’m doing some very fun things with pattern-matching syntax trees today.

For context: I write a computer-based assessment system called Numbas, and it’s focused on assessing mathematics. A big part of that assessment involves asking the student to enter an algebraic expression as the answer to a question. How do you decide whether to mark the student’s answer as correct or not? Historically, there have been two approaches, given a representative “correct” expression written by the teacher:

- Put the student’s answer and the correct answer into a canonical form. If their syntax trees are exactly the same, then the student’s answer is the same as the teacher’s. This is the model used by, for example, STACK.
- Identify the variables used in the student’s answer and the teacher’s. Pick a few random numbers as values for those variables, and evaluate both expressions. If they come out to the same number, then the student’s answer is the same as the teacher’s. This is the model used by systems in the CALM line.

The second method looks like a hack: it’s possible to get a false-positive result by luck, and it tells you nothing about the form of the student’s answer. When you’re asking questions about basic algebra, the form is very important: has the student really factorised that quadratic, or have they just retyped the \(ax^2+bx+c\) they were given?

By comparison, the first method seems more intellectually honest – it looks deterministic, and it treats the expressions semantically instead of boiling them down to a single number. But it’s really hard to implement: a canonical form that works for *every* equivalent version of an expression is both technically impossible to write, and hard to get a good approximation to. Additionally, while you can inspect the canonicalised version of the student’s expression to see if, for example, the \(x^2\) term is present, you will inevitably lose important information about the form of the student’s original answer.

I’ve come to the opinion that the questions *“is the student’s answer equivalent to mine?”* and *“what form is the student’s answer?”* should be considered separately. The first one is almost always enough to decide whether the student’s answer is “correct”, and then the answer to the second can be used to give feedback when the student’s answer isn’t what you expected.

So, since its inception, Numbas has followed the CALM model of using the second method – evaluating over a random selection of points – to mark algebraic expressions. But it hasn’t been able to say anything about the form of the student’s answer, and that’s a big gap. The reason it can’t do that is that I don’t think rearranging into a canonical form is the way to go, because of the drawbacks I described above.

Numbas does have a simplification algorithm hiding inside it, however: since its first release, Numbas has used a very weak pattern-matching algorithm to implement its simplification rules for displayed maths. A pattern is an expression in one or more variables, and a candidate expression matches that pattern if its syntax tree overlaps with that of the pattern, with variable capturing any subtree. For example, the expression `sqrt(3*x)*sqrt(2)`

matches the pattern `sqrt(x)*sqrt(y)`

, with `x=3*x`

and `y=2`

. That’s just about good enough for basic simplification rules, which head towards a canonical form but don’t necessarily get there, but if you want to interpret the meaning of the expression you need something more powerful.

I’ve always thought it would be nice to have an analogue of regular expression pattern matching for syntax trees, to explicitly describe the shapes of trees you’ll accept, and name capturing groups. When marking algebraic expressions, you could get all sorts of information about the form of the student’s answer as it is, instead of trusting to a black-box “simplification” algorithm. I had a bit of a brainwave yesterday, and I’ve got something that looks quite good.

Let’s state what I want from a pattern-matching algorithm:

- The ability to capture named groups from the expression, so I can do things like “get the coefficient of \(x^3\), or even “get all the \(x\) terms”.
- Take into account the laws of commutativity and associativity, so \(x+1\) matches the same things as \(1+x\), without any extra effort on my part.
- A “choice” operation, the equivalent of | in a regular expression, so I can say “match either
*this*or*that*“. - A syntax that isn’t completely torturous to write.

I suppose at some point I’ll want some control over whether to use greedy matching, or things like that, but this’ll do for now.

The reason you can’t use string-based regular expressions is that mathematical notation isn’t a regular language: the first hurdle is nested brackets, and on top of that there are the laws of commutativity and associativity to deal with. I’ve come up with an algorithm that looks at expressions as syntax *trees*, and whenever it comes across a commutative operation in the pattern it does the following:

- Find the commuting terms in the pattern, and in the candidate expression, by traversing the tree for as long as you see the commuting operation, and take every subtree underneath those you visited. (Or: suppose the operation is \(+\); chop out the contiguous tree of \(+\) nodes starting from the root, then each connected component in the remainder is a term of the sum)
- For each term in the candidate expression, compare it with the terms in the pattern that haven’t already been matched. If it matches, capture it in that position.
- If any of the terms in the pattern haven’t been matched, the candidate doesn’t match the pattern.
- For terms in the pattern that capture more than one term in the candidate, collect together all those terms.
- Work out which named capturing groups the captured terms belong to, and return a dictionary of those groups. Each group name should correspond with a single expression tree made up of the terms that matched it, joined together with the commuting operation.

I’ve mentioned named capturing groups but haven’t said how you define them: there’s a new operation `expr;g`

which says that expressions matching `expr`

should be captured in the group named `g`

.

Just adding commutativity and explicitly named capturing groups to my existing pattern-matching algorithm was very useful, but it quickly became clear that I needed more control over what gets matched. I’ll list the commands I’ve added so far, then give a few motivating examples.

`?`

- Match anything
`??`

- Match anything or nothing (optional term)
`expr;g`

- Capture
`expr`

in the group named`g`

. `m_any(expr1,expr2,...)`

- Match any of the expressions
`exprN`

`m_all(expr)`

- Capture all terms matching
`expr`

`m_pm(expr)`

- Capture
`expr`

or`-(expr)`

. `m_not(expr)`

- Match anything
*except*`expr`

`m_uses(name1,name2,...)`

- Match any expression which uses the named variables
`m_commute(expr)`

- Match the terms in
`expr`

in any order, following the laws of commutativity. (Use if you only want to use commutativity in certain places) `m_nothing`

- Match nothing. Useful as an empty term to set up a sum where you want all terms to match one pattern.

(Yes, the `m_whatever(expr)`

syntax is pretty long-winded. I’m quickly running out of sensible symbols to use as operators in JME syntax, so my solution was to invent some new function names which match expressions in different ways. Suggestions for better ways of doing it welcome.)

By the way, there’s an online demo of this system for you to play with.

Suppose you want to get each of the terms in a polynomial. Something like

`m_all(x^?);xs+m_nothing`

is a reasonable start. Except you’d never write `x^1`

, so you need to replace `x^?`

with `m_any(x,x^?)`

to catch either form. And we also need to capture the coefficients, so we need to add a factor of `m_all(??)`

to the \(x\) terms. And finally, the coefficients can be either positive or negative, so wrap the whole lot in `m_pm( )`

, and we end up with

`m_all(m_pm(m_all(??)*m_any(x,x^?)));xs+m_nothing`

This ends up matching all terms with a factor of a power of \(x\) somewhere in them, i.e. things like this:

`x - x + 2x - x*2 + (a+1)x + x^2 + 2x^3 + (1+2)x^(n+1)`

You could then consider each of those terms separately and get the coefficients and degrees:

`m_pm(m_all(??);coefficient * m_any(x,x^?;degree))`

The next example is really simple but it’s sort of what prompted this whole thing. Capture either side of an equation separately:

`?;left = ?;right`

If this matches, that confirms the student wrote an equation, but it doesn’t say any more than that. Testing that the equation holds (under whatever assumptions you give) is not a job for pattern-matching, but you might want to ask some more questions about the form of the answer. For instance: are all the \(x\) terms collected on one side?

`m_uses(x);xside = m_not(m_uses(x));otherside`

The `m_uses(x)`

command matches anything which has a variable `x`

inside it. And `m_not( )`

ensures that there are no \(x\)s on the other side. Because equality is a commutative relation, this pattern matches no matter which side the \(x\) terms are on.

Something that Numbas/CALM has always struggled with is quadratics. For a question like “factorise the quadratic \(ax^2+bx+c\), the form of the student’s answer is obviously a lot more important than the function it describes. CALM’s hacky solution was to add restrictions on the student’s answer considered as a string, for example by requiring brackets and forbidding the exponentiation operator. It should be immediately obvious that that doesn’t really do the job: `(a*x*x+b*x+c)`

passes those criteria. So, here’s a pattern that matches a factorised quadratic:

`(m_pm(??*x);a+?;b) * (m_pm(??*x);c+?;d)`

Here you want to be fairly restrictive about the form of the expression, but there are still quite a few “correct” answers – the student could swap the signs in both factors, or swap the orders of the \(x\) terms and the constants. You might want to also insist that the coefficients are numbers:

`(m_pm(m_number*x);a+m_number;b)*(m_pm(m_number*x);c+m_number;d)`

I’ve made it so `m_number*x`

*does* match `x`

, for convenience.

My last example captures terms of the form \(a \cdot x^ny^m\):

`m_all( m_any( ??x, ??y, ??x^??, ??y^??, m_any(x,x^??)*m_any(y,y^??)*?? ) );terms + m_all(??;rest)`

There are a few ways it could be made more or less restrictive – you could capture any number of factors, or require the coefficients to be numbers. Like traditional regular expressions, pattern expressions can get very fiddly and long-winded indeed when you take them seriously.

This stuff should make its way into Numbas fairly soon, once I’ve tested it a bit more and settled on a good syntax. I’m still not hugely happy with the lengthy `m_any(..)`

syntax.

## Comments

## Comments

Bill Foster

Excellent development!

Two comments:

1.The pattern

m_pm(m_all(??);coefficient * m_any(x,x^?;degree))

cannot capture coefficient or degree when equal to 1

2. The pattern looking for powers of x and y

m_all( m_any( ??x, ??y, ??x^??, ??y^??, m_any(x,x^??)*m_any(y,y^??)*?? ) );terms + m_all(??;rest)

when applied to the example

2x*y^2+(2*b)*x+x*y*3+3x^2y^3+c+x^3+3y^2+x+y+3

Does not give 2*b*x as a ‘term’ – it is included in the ‘rest’. However on changing to b*x it is included in the terms.

Christian Perfect

To be more precise, you don’t capture anything when the coefficient or degree are missing, so you could infer that it’s 1 in that case.

You can fix the second one by using

`m_all(??)`

for the coefficients.