Handling Motion JPEG Streams on iOS

July 02, 2014

I have several Foscam cameras around the outside of my house. They’re very easy to set up, tolerate the outdoor conditions admirably, and are incredibly affordable for what they offer.

As with everything else around my house, I like to build software that customizes my view into my home (or in this case, outside my home). To that end I’ve built an app I call Argos that lets me monitor all sorts of sensors on my property.

Once I installed the first set of cameras, I wanted to implement some views that would display the current video stream from each camera. After looking into the documentation, I discovered that the cameras I have offer two types of video streams: Windows streaming video (ASF) and Motion JPEG.

I don’t have a lot of experience writing software to handle video streams. But as I read the basic description, it seemed that a Motion JPEG stream is just an HTTP stream that continually pushes out a series of JPEG images.

Oh, well that’s easy. Right?

Well, not so fast.

What Is Motion JPEG?

It turns out that there is no such thing as a true Motion JPEG standard. However, there are two typical implementations, Motion JPEG-A and Motion JPEG-B. Motion JPEG-A supports the concept of markers, while Motion JPEG-B does not. This difference is important. For the rest of this discussion, however, all we need to know is that the Foscam camera stream is Motion JPEG-A.

A Motion JPEG-A stream looks (to me) a lot like a multipart email message. There are several sections, each separated by a long string of semi-random characters. Within each section is some encoded (or not) binary data that represents the object in that section. In our case, each section is a JPEG image.

We can see what this looks like by using the curl command:

{% highlight ruby %}
› curl -D - "http://192.168.300.301/videostream.cgi?user=admin&pwd=SECRETS"
HTTP/1.1 200 OK
Server: Netwave IP Camera
Date: Wed, 02 Jul 2014 22:28:03 GMT
Accept-Ranges: bytes
Connection: close
Content-Type: multipart/x-mixed-replace;boundary=ipcamera

--ipcamera
Content-Type: image/jpeg
Content-Length: 43996

????JFIF???!???

[a lot of binary data]

--ipcamera
Content-Type: image/jpeg
Content-Length: 44176

????JFIF???!???

[a lot more binary data]
{% endhighlight %}

It just goes on and on like this.

We can see in the header of the response that the boundary text will be ipcamera, and the two lines following each boundary include the content type and the content length.
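Rather than hard-coding ipcamera, the boundary can be read out of the Content-Type header value. Here is a minimal sketch of that in Swift; the function name `multipartBoundary(fromContentType:)` is my own, and the handling of a quoted boundary value is an assumption about what other servers might send (my camera sends it unquoted).

```swift
import Foundation

/// Extracts the `boundary` parameter from a Content-Type header value,
/// e.g. "multipart/x-mixed-replace;boundary=ipcamera" gives "ipcamera".
/// Hypothetical helper name, not part of any library.
func multipartBoundary(fromContentType value: String) -> String? {
    // Parameters follow the media type, separated by semicolons.
    for part in value.split(separator: ";").dropFirst() {
        let pair = part.trimmingCharacters(in: .whitespaces)
        guard pair.lowercased().hasPrefix("boundary=") else { continue }
        let raw = pair.dropFirst("boundary=".count)
        // Assumption: some servers wrap the value in double quotes; strip them.
        let boundary = raw.trimmingCharacters(in: CharacterSet(charactersIn: "\""))
        return boundary.isEmpty ? nil : boundary
    }
    return nil
}
```

In a URLSession-based client, the header value would typically come from the response's header fields before any image data arrives.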

So How Do We Parse This?

This is the basic approach to parsing a data stream like this:

Read in the first chunk of data
Does the chunk contain a boundary marker?
If so, is that boundary marker the first boundary marker?
If it is the first one, then skip it.
Is there another marker? If so, then we have a complete image.
If we have a complete image, find the start and end of the image, remove those from our buffer, and process the image.
If we do not yet have a complete image, append the data to the buffer, and wait for the next chunk of data.
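The steps above can be sketched roughly as follows. This is not the exact code from my app, just an illustration under a few assumptions: the type `MultipartSplitter` and the helper `body(ofPart:)` are names I made up for this example, each marker on the wire is "--" followed by the boundary text, and the part headers end with a blank line made of CRLF CRLF.

```swift
import Foundation

/// Accumulates chunks and returns every complete part found so far.
/// Illustrative type, not part of any library.
struct MultipartSplitter {
    private var buffer = Data()
    private let marker: Data

    init(boundary: String) {
        // On the wire each boundary line is "--" followed by the boundary text.
        marker = Data(("--" + boundary).utf8)
    }

    /// Appends one chunk, then drains as many complete parts as the buffer holds.
    /// Each returned part still carries its own headers (Content-Type, Content-Length).
    mutating func append(_ chunk: Data) -> [Data] {
        buffer.append(chunk)
        var parts: [Data] = []
        while let first = buffer.range(of: marker),
              let next = buffer.range(of: marker, in: first.upperBound..<buffer.endIndex) {
            // Everything between two markers is one complete part.
            parts.append(buffer.subdata(in: first.upperBound..<next.lowerBound))
            // Drop the consumed bytes but keep the second marker: it opens the next part.
            buffer.removeSubrange(buffer.startIndex..<next.lowerBound)
        }
        return parts
    }
}

/// Strips the part headers. Assumption: the body starts after the first CRLF CRLF.
func body(ofPart part: Data) -> Data? {
    guard let blank = part.range(of: Data("\r\n\r\n".utf8)) else { return nil }
    var body = part.subdata(in: blank.upperBound..<part.endIndex)
    // A trailing CRLF before the next boundary is not part of the image.
    if body.suffix(2) == Data("\r\n".utf8) { body.removeLast(2) }
    return body
}
```

Note that an image only comes out once the marker that follows it has arrived, so this approach always runs one boundary behind the stream. It also rescans the buffer from the start on every chunk, which is simple but not the fastest option.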

The key here is that we never know how many chunks it will take to make one image. In an ideal world we’d just get one chunk per image and we could throw that right into an NSData object and convert it to a UIImage.

Here’s the code I have so far for parsing the Motion JPEG stream:

The heart of the code is in func URLSession(session: NSURLSession!, dataTask: NSURLSessionDataTask!, didReceiveData: NSData!). That’s where we check whether we’ve hit the end of an image and, if so, extract it from the buffer.
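For readers on a current Swift toolchain, the same delegate callback is spelled `urlSession(_:dataTask:didReceive:)`. The sketch below only shows the wiring, not my actual parser: `StreamReceiver`, `extractFrame`, `onFrame`, and `ingest(_:)` are hypothetical names, and `extractFrame` stands in for whatever logic removes one complete image from the buffer.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on non-Apple platforms
#endif

/// Receives stream chunks from URLSession and hands complete frames to `onFrame`.
/// `extractFrame` is a caller-supplied function that removes and returns one
/// complete frame from the buffer, or returns nil if none is complete yet.
final class StreamReceiver: NSObject, URLSessionDataDelegate {
    private var buffer = Data()
    private let extractFrame: (inout Data) -> Data?
    private let onFrame: (Data) -> Void

    init(extractFrame: @escaping (inout Data) -> Data?,
         onFrame: @escaping (Data) -> Void) {
        self.extractFrame = extractFrame
        self.onFrame = onFrame
        super.init()
    }

    // Called by URLSession each time more bytes arrive. A chunk may hold
    // part of a frame, exactly one frame, or several.
    func urlSession(_ session: URLSession, dataTask: URLSessionDataTask, didReceive data: Data) {
        ingest(data)
    }

    /// Kept separate from the delegate callback so it can be exercised without a network.
    func ingest(_ data: Data) {
        buffer.append(data)
        while let frame = extractFrame(&buffer) {
            onFrame(frame)
        }
    }
}
```

To use it, the receiver would be passed as the delegate to `URLSession(configuration:delegate:delegateQueue:)`, and a data task for the camera URL started with `resume()`. Converting each frame to a UIImage and updating the view would then happen inside `onFrame`, on the main queue.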

Bugs

So far the code works fairly well, except that from time to time I get a failure when I attempt to make a UIImage out of the extracted data. I’m not sure if the data coming out of my camera is bad (unlikely) or if I’m just messing up the process of extracting the data (much more likely).
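I haven’t confirmed the cause, but one cheap way to narrow it down would be to check the extracted bytes before decoding them. A JPEG file begins with the start-of-image marker (FF D8) and ends with the end-of-image marker (FF D9), so a slice that fails this check was probably cut in the wrong place. The function name below is hypothetical, and the check assumes the camera adds no padding after the end-of-image marker.

```swift
import Foundation

/// Returns true when `data` starts with the JPEG start-of-image marker (FF D8)
/// and ends with the end-of-image marker (FF D9).
/// Assumption: no trailing padding follows the end-of-image marker.
func looksLikeCompleteJPEG(_ data: Data) -> Bool {
    guard data.count >= 4 else { return false }
    return Array(data.prefix(2)) == [0xFF, 0xD8]
        && Array(data.suffix(2)) == [0xFF, 0xD9]
}
```

Logging the first and last few bytes of every slice that fails this check should show whether stray header text or a leftover CRLF is ending up inside the image data.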

Improvements

What I’m currently not doing, but probably should be doing, is using the Content-Length header to verify the length of the image data before passing it off. I do wonder if that wouldn’t be a far more reliable way to extract the data from the buffer.
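Here is a rough sketch of what that could look like. I haven’t run this against my cameras; `extractImage(from:boundary:)` is a hypothetical name, and the code assumes that each marker is "--" plus the boundary text, that the part headers end with CRLF CRLF, and that every part carries a Content-Length header as in the curl output above.

```swift
import Foundation

/// Tries to cut one image out of `buffer` using the part's Content-Length header.
/// Returns nil (leaving the buffer untouched) until the marker, the headers,
/// and the full body have all arrived.
func extractImage(from buffer: inout Data, boundary: String) -> Data? {
    let marker = Data(("--" + boundary).utf8)
    // Assumption: the header block ends at the first CRLF CRLF after the marker.
    guard let start = buffer.range(of: marker),
          let headerEnd = buffer.range(of: Data("\r\n\r\n".utf8),
                                       in: start.upperBound..<buffer.endIndex)
    else { return nil }

    let headerBytes = buffer.subdata(in: start.upperBound..<headerEnd.lowerBound)
    let headerText = String(decoding: headerBytes, as: UTF8.self)

    // Look for "Content-Length: N" among the header lines (name is case-insensitive).
    var length: Int?
    for line in headerText.split(whereSeparator: { $0.isNewline }) {
        let fields = line.split(separator: ":", maxSplits: 1)
        if fields.count == 2,
           fields[0].trimmingCharacters(in: .whitespaces).lowercased() == "content-length" {
            length = Int(fields[1].trimmingCharacters(in: .whitespaces))
        }
    }
    guard let bodyLength = length, bodyLength >= 0 else { return nil }

    let bodyStart = headerEnd.upperBound
    // Not enough bytes yet: wait for the next chunk.
    guard buffer.endIndex - bodyStart >= bodyLength else { return nil }

    let image = buffer.subdata(in: bodyStart..<(bodyStart + bodyLength))
    // Remove the marker, headers, and body that were just consumed.
    buffer.removeSubrange(buffer.startIndex..<(bodyStart + bodyLength))
    return image
}
```

The appeal of this approach is that the image can be handed off as soon as its last byte arrives, without waiting for the next boundary, and without searching through the image bytes themselves. The catch is that a part with a missing or malformed Content-Length would stall this sketch, so falling back to boundary scanning in that case seems wise.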

Future

Beyond cleaning up the code a bit and trying to make it more reliable, I would love for this view to include some other nice features down the road, like gesture recognizers that let me pan and tilt the camera by gesture. Several apps dedicated to IP camera viewing do this, and it wouldn’t be very difficult at all to get it right.