26 Jul

I’m working on a mobile app for photographing sheet music and then playing it.

When I first approached this, I considered using a Hough transform, which is a mathematical tool for finding lines in an image. It produces a matrix in (m, c) space, where each cell counts the evidence for a line with a given slope (m) and y offset (c).
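To give a feel for what that matrix holds, here is a minimal Python sketch of Hough voting in slope/offset space. It is not the code I tried in the app; the function name `hough_mc`, the handful of candidate slopes and the toy pixel list are all made up for illustration.

```python
from collections import Counter

def hough_mc(points, slopes, c_range):
    """Each dark pixel (x, y) votes for every line y = m*x + c through it."""
    votes = Counter()
    for x, y in points:
        for m in slopes:
            c = round(y - m * x)  # the offset this pixel implies for slope m
            if c in c_range:
                votes[(m, c)] += 1
    return votes

# Five dark pixels on the line y = 0.5*x + 2, plus one stray pixel.
points = [(0, 2), (2, 3), (4, 4), (6, 5), (8, 6), (3, 9)]
slopes = [-0.5, 0.0, 0.5]  # hypothetical candidate slopes
votes = hough_mc(points, slopes, range(0, 12))

# The cell with the most votes is the strongest line in the image.
best_line, count = votes.most_common(1)[0]
print(best_line, count)
```

Every dark pixel has to vote once per candidate slope, so the work grows with the number of pixels times the number of slopes tested.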

I could then use the matrix to figure out what was being shown on the sheet.

That method is very computationally expensive.

After considering it, trying it, then abandoning it, a better solution came to me while I was thinking about something totally different.

Sheet music is composed of mostly horizontal lines, while everything that is not a horizontal line is part of the notation itself.

So, all I need to do is first locate the horizontal lines, and everything else will be easy to find.

The first problem, then, is how to make sure that the sheet itself is level.

What I ended up doing was to average the colours along each ‘y’ coordinate (each row) of the image, measure the mean difference between those row averages and the overall average, and try offsetting one side of the sheet up and down until that mean difference reached its maximum.

This is easier to understand visually.

Let’s consider this image. As humans, we find it easy to spot the skew and fix it, but a computer is not so intuitive.

Here is the same image with the pixels along each “y” coordinate averaged across “x” (motion-blurred, basically). That’s a simple average of each row, and there already appears to be a pattern.
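The row averaging can be sketched in a few lines of Python. This assumes the image is a plain list of rows of grayscale values (0 for black, 255 for white); the name `row_averages` and the tiny 4x4 image are only for illustration.

```python
def row_averages(image):
    """Return the mean brightness of each row (each y coordinate)."""
    return [sum(row) / len(row) for row in image]

# Tiny 4x4 grayscale image; row 1 is a solid "staff line".
image = [
    [255, 255, 255, 255],
    [0, 0, 0, 0],
    [255, 255, 255, 255],
    [255, 0, 255, 255],
]
averages = row_averages(image)
print(averages)
```

A level staff line collapses to one very dark row average, while a stray mark only dims its row slightly.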

Next we shift/skew one side of the image up or down a few pixels and test it again. In my tests, I use a naive “brute-force” test of all offsets from -15 to +15. Here are the blurs for a -11 offset and a +11 offset, shown in that order.
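One way to do that shift, as a rough Python sketch: keep the left edge fixed and slide each column vertically by a share of the offset that grows towards the right edge. The function name `shear`, the white fill for uncovered pixels and the sign convention (a positive offset moves the right edge down) are my assumptions here.

```python
def shear(image, offset, fill=255):
    """Shift columns vertically: left edge stays put, right edge moves by `offset` rows."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for x in range(width):
        # Each column moves in proportion to its distance from the left edge.
        dy = round(offset * x / (width - 1)) if width > 1 else 0
        for y in range(height):
            src = y - dy
            if 0 <= src < height:
                out[y][x] = image[src][x]
    return out

# A 5x5 image with one line that climbs 2 rows from left to right.
tilted = [[0 if y == 3 - round(2 * x / 4) else 255 for x in range(5)]
          for y in range(5)]
level = shear(tilted, 2)  # pushing the right edge down 2 rows levels the line

# The brute-force step: one sheared candidate per offset from -15 to +15.
candidates = {offset: shear(tilted, offset) for offset in range(-15, 16)}
```

In the toy image, `level` ends up with the whole line sitting in row 3, which is what a correct offset should do to a staff line.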

Obviously, the correct one is the -11 one. But how do we tell a computer what the “obvious” solution is?

Well, the right answer is probably to come up with a way to measure which one is more “noisy”, but I couldn’t think of a simple way to do that.

Instead, what I did was to measure the average colour in each image, use that average to find the mean difference in each image (how far from the average “gray” each line is), and pick the image with the highest mean difference.
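Here is a small Python sketch of that scoring step. I am assuming “mean difference” means the mean absolute difference between each row average and the overall average; the function names and the three lists of row averages are invented for the example, not measurements from the real image.

```python
def mean_difference(averages):
    """Mean absolute distance of each row average from the overall average."""
    overall = sum(averages) / len(averages)
    return sum(abs(a - overall) for a in averages) / len(averages)

def best_offset(profiles):
    """Pick the offset whose row averages have the highest mean difference."""
    return max(profiles, key=lambda offset: mean_difference(profiles[offset]))

# Hypothetical row averages for three candidate offsets.
profiles = {
    -11: [255, 0, 255, 0, 255, 255],      # crisp: staff lines are fully dark rows
    0: [200, 150, 200, 150, 200, 220],    # partly smeared
    11: [190, 180, 185, 190, 180, 185],   # smeared into near-uniform gray
}
winner = best_offset(profiles)
print(winner)
```

When the staff lines are level, each row is either almost all ink or almost all paper, so the row averages sit far from the overall gray and the score is high.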

Having found the right offset (-11), we then simply shift the pixels in the image by that much (in Y and X space), and end up with the straightened image, shown here after the original.

The next task is to fix skewing, but it will use basically the same technique.

demo