W(a|o)ndering

April 10, 2013 at 6:24pm

Characteristic Fun

Today I heard about Kaggle from a classmate. It’s a lovely little system whereby companies submit data analysis problems for data scientists (and hobbyists) to play with. Think of it as something like TopCoder, but for interesting data sets.

So, after perusing their site for a bit, I found my way through the active competitions section to a tutorial competition on digit recognition, built on the MNIST handwriting recognition data set.

MNIST provides a training set of 60,000 images and a test set of 10,000. Each image holds a single digit, 0-9, size-normalized to fit a 20x20 pixel box and centered in a 28x28 grayscale grid, which leaves a blank border of roughly 4 pixels around the digit. The images are provided in a custom binary format that packs them into a linear list of pixel buffers. The first task you face is decoding this into a useful format, and as I tend to play in Python a lot, I decided to see how easily one could parse this data.
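To make that layout concrete, here’s a minimal sketch of decoding the image file with nothing but the standard library. It assumes the IDX layout documented on the MNIST site: a 16-byte header of four big-endian 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel in row-major order. The function name `parse_idx_images` and the tiny synthetic buffer are made up for illustration.

```python
import struct

IMAGE_MAGIC = 2051  # magic number of an IDX image file (assumed per the MNIST docs)

def parse_idx_images(raw):
    """Decode an IDX image buffer into a list of images (each a list of rows)."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError("not an IDX image file (magic=%d)" % magic)

    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data length does not match the header")

    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        # Split the flat buffer of one image into rows of pixel values.
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images

# Synthetic stand-in for a real file: two 2x3 "images" with pixel values 0..11.
demo = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 3) + bytes(range(12))
demo_images = parse_idx_images(demo)
```

With a real file you would pass in the bytes read from disk (opened in binary mode) instead of `demo`.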

After a moment of searching, I stumbled across this lovely little piece of code:

But I faced one primary problem… I don’t need CVXOPT, so why should I have to install it just to extract these images? So I set about converting the code to a more SciPy-friendly form. The result is as follows:

As you can see, I switched everything over to NumPy and provided a simple show() function which allows one to verify that they are looking at a correct image.
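For anyone who wants something to poke at, here’s a minimal sketch of what a NumPy-only version can look like; it illustrates the idea and is not the exact listing from this post. It assumes the IDX image layout (a 16-byte big-endian header with magic number 2051, then one unsigned byte per pixel), and both `load_images` and this text-rendering `show()` are hypothetical names. Rendering as text is an assumption that keeps the sketch free of plotting dependencies.

```python
import numpy as np

def load_images(raw):
    """Return a (count, rows, cols) uint8 array from an IDX image buffer."""
    # The header is four big-endian unsigned 32-bit integers.
    header = np.frombuffer(raw, dtype=">u4", count=4)
    magic, count, rows, cols = (int(v) for v in header)
    if magic != 2051:
        raise ValueError("unexpected magic number: %d" % magic)
    # Everything after the 16-byte header is raw pixel data.
    pixels = np.frombuffer(raw, dtype=np.uint8, offset=16)
    return pixels.reshape(count, rows, cols)

def show(image, threshold=128):
    """Render one image as text: '#' where a pixel is at or above threshold."""
    return "\n".join(
        "".join("#" if px >= threshold else "." for px in row)
        for row in image
    )

# Synthetic stand-in for a real file: a single 3x3 image of a plus sign.
raw = np.array([2051, 1, 3, 3], dtype=">u4").tobytes() + bytes(
    [0, 255, 0,
     255, 255, 255,
     0, 255, 0]
)
images = load_images(raw)
picture = show(images[0])
print(picture)
```

If you would rather see real pixels, passing `images[0]` to matplotlib’s `imshow` with a gray colormap does the same sanity check graphically.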

Next step… try my hand at writing some actual recognition algorithms to consume this data.

Notes

  1. echoet posted this