Magical Hand Gestures with Mediapipe

Julia from The Magicians

I want to do magic

I’ve been fascinated by wizards and magicians since I was very little, but to my disappointment you can’t do real magic. Still, to someone from far enough in the past, our technology would be indistinguishable from magic. Not having completely given up on at least mimicking some aspects of magic for its joyous kinetics, I figured it would be awesome to control technology with hand gestures like finger tutting, The Magicians style. I searched the web for anything of the sort that might already exist, but the technology has only recently become feasible without any additional gear, using nothing but the camera to read hand gestures. This is what Mediapipe supports with its state-of-the-art (SOTA) models for hand and finger tracking.

Getting closer to magic

So how could I use my hands to make gestures at my camera in a way that is neat but also meaningful? At first I figured I’d make a game of Simon Says, where you would have to memorize a sequence of gestures and repeat them in order. But experimenting with the idea, it became clear that a better experience came from not leaning on memory at all, since that would depend on the camera always reading gestures perfectly instead of occasionally giving the user false negatives in gesture classification. So, even though it is a simpler game, the speed run idea was born: cycle through as many gestures as you can in a short amount of time. Helpful goal circles drawn on the camera feed assist with hand placement, so there is nothing to memorize.
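The check behind those goal circles is simple enough to sketch: each circle is a target position on the canvas, and it counts as hit when the matching fingertip landmark lands inside it. A rough illustration in JavaScript (the circle list and landmark indices below are placeholders, not the actual game configuration):

  // Mediapipe returns hand landmarks normalized to [0, 1], so scale them to canvas pixels.
  // These circles are illustrative placeholders, not the real game layout.
  const goalCircles = [
      { x: 160, y: 120, radius: 30, landmarkIndex: 8 }, // index fingertip
      { x: 480, y: 120, radius: 30, landmarkIndex: 4 }, // thumb tip
  ]

  const isCircleHit = (circle, landmarks, width, height) => {
      const tip = landmarks[circle.landmarkIndex]
      const dx = tip.x * width - circle.x
      const dy = tip.y * height - circle.y
      return Math.sqrt(dx * dx + dy * dy) <= circle.radius
  }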

Building the project

Initially I thought of building it for mobile, but I haven’t done much mobile development yet. More importantly, it is more work for the user to set the game up on a phone, since the phone cannot stay in your hand: you need both hands free for the game to work. So it ended up being a JavaScript (React) based solution that works in most modern browsers.

Even though the Mediapipe site has some examples, Mediapipe still seems to be more of a demo of its AI capabilities than something production-ready, and it is still in alpha. Nonetheless, it is an amazing project, and since the game is not too serious, using an alpha version is fine.

Since this blog is written in React, some React features are also used in this example, even though they are definitely not necessary. The starting minimum code:

  import React, { useEffect, useRef } from 'react'
  import { Hands } from "@mediapipe/hands"
  import { Camera } from "@mediapipe/camera_utils"
  import Webcam from "react-webcam"

  const cameraWidth = 640
  const cameraHeight = 480

  // A plain function component hosts the webcam and the canvas (the name is just illustrative)
  const GestureGame = () => {
      const webcamRef = useRef(null)
      const canvasRef = useRef(null)

      // Called by Mediapipe for every processed frame; the drawing code is shown later
      const onResults = (results) => {}

      useEffect(() => {
          const hands = new Hands({
              // The recognition models are fetched from the CDN at runtime (see Problem 1)
              locateFile: (file) => { return `https://cdn.jsdelivr.net/npm/@mediapipe/hands@0.4/${file}` },
          })
          hands.onResults(onResults)
          // modelComplexity has to be set explicitly (see Problem 2)
          hands.setOptions({
              modelComplexity: 1
          })
          if (!!webcamRef.current && !!webcamRef.current.video) {
              // Camera feeds every webcam frame into Hands
              const camera = new Camera(webcamRef.current.video, {
                  onFrame: async () => { await hands.send({ image: webcamRef.current.video }) },
                  width: cameraWidth,
                  height: cameraHeight
              })
              camera.start()
          }
      }, [])

      return (
          <div className="c-app c-default-layout">
              {/* The raw video element stays hidden; everything is drawn onto the canvas */}
              <Webcam
                  hidden={true}
                  autoPlay={false}
                  audio={false}
                  ref={webcamRef}
                  width={cameraWidth}
                  height={cameraHeight}
              />
              <canvas ref={canvasRef} />
          </div>
      )
  }

  export default GestureGame

So you can see that getting hand tracking data from Mediapipe using the camera feed is pretty straightforward, but I still had plenty of problems even getting to that point.

The rest is just drawing the video feed and additional markers on the canvas inside the onResults callback, using the results from Hands.
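The drawing itself follows the shape of the official Mediapipe Hands browser example: paint the current frame onto the canvas, then overlay the detected landmarks with the helpers from @mediapipe/drawing_utils. A minimal sketch of that onResults callback (the goal circles and game logic are left out):

  import { drawConnectors, drawLandmarks } from "@mediapipe/drawing_utils"
  import { HAND_CONNECTIONS } from "@mediapipe/hands"

  const onResults = (results) => {
      const canvas = canvasRef.current
      const ctx = canvas.getContext("2d")
      canvas.width = cameraWidth
      canvas.height = cameraHeight

      // Paint the camera frame first, then the hand skeleton on top of it
      ctx.save()
      ctx.clearRect(0, 0, canvas.width, canvas.height)
      ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height)
      for (const landmarks of results.multiHandLandmarks || []) {
          drawConnectors(ctx, landmarks, HAND_CONNECTIONS, { color: "#00FF00", lineWidth: 3 })
          drawLandmarks(ctx, landmarks, { color: "#FF0000", lineWidth: 1 })
      }
      ctx.restore()
  }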

Problem 1

The first time around I had assumed that installing the @mediapipe/hands package gives you the full solution, meaning that no additional external requests need to be made. Thus, having const hands = new Hands() resulted in a lot of 404 errors for files such as hand_landmark_full.tflite and hands_solution_simd_wasm_bin.js. It appears the library searches for the recognition models, which do not ship with the npm package.

Next, I followed some wrong documentation and assumed that the locateFile parameter function in the Hands constructor should return assets/models/hands/${file}, resulting in similar 404 errors.

Initializing the Hands with

new Hands({locateFile: (file) => { return `https://cdn.jsdelivr.net/npm/@mediapipe/hands@0.4/${file}` }})

solved the issue, as the models were then correctly fetched at runtime.

Problem 2

As I did not feel I needed to provide any options to the Hands model that differ from the defaults, I left it at that. However, when I ran the solution, the file hands_solution_simd_wasm_bin.js threw a pretty nondescript error: Uncaught (in promise) RuntimeError: abort(undefined) at Error.

Eventually, I found out that the API needs the hands.setOptions({ modelComplexity: 1 }) call to be made explicitly. As soon as this was added, everything started working.
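For reference, modelComplexity is the only option I actually needed to set, but the full setOptions surface looks roughly like this (the other values shown are the documented defaults, included only for illustration):

  hands.setOptions({
      maxNumHands: 2,              // documented default
      modelComplexity: 1,          // required explicitly to avoid the wasm abort
      minDetectionConfidence: 0.5, // documented default
      minTrackingConfidence: 0.5,  // documented default
  })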

Problem 3

This one had me quite confused; the behaviour even seems to contradict the documentation. I kept getting the error NotReadableError: Could not start video source, so I figured the browser had to be selecting the wrong video source, or the permissions were incorrect. But the browser behaved correctly, both in Chromium and Firefox. The error appeared even when the Webcam video feed was visible, yet the data sent to Hands was somehow unreadable.

I still haven’t completely fixed this issue; however, the error is silenced by simply reducing the resolution to 640x480. The maximum resolution that appears to work is 640x640, even though the video feed reads at 1280x720 and the camera itself handles 1920x1080. Somehow, my Logitech camera does not play well with Mediapipe. It might have to do with its software (closer to bloatware), which ingests one feed and outputs another, possibly for the sneaky branding in the corner.
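As a workaround I simply pin the requested resolution on the Webcam element as well, via react-webcam’s videoConstraints prop (a sketch, assuming the cameraWidth/cameraHeight constants of 640x480 from the earlier snippet):

  {/* Ask getUserMedia for 640x480 directly so the camera never hands over a larger frame */}
  <Webcam
      hidden={true}
      audio={false}
      ref={webcamRef}
      width={cameraWidth}
      height={cameraHeight}
      videoConstraints={{ width: cameraWidth, height: cameraHeight }}
  />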

What’s next?

I would be happy to see gesture recognition combined with voice recognition, so that a hand gesture and a voice command are merged and classified as a single command with much higher accuracy than either voice or gestures alone. I believe it’s only a matter of time until this becomes a common use case, for instance in home automation. Some signals are better read from the body using the camera, and some are better read from voice. Pointing at something to turn it on is sometimes better than naming it explicitly, for instance turning on specific lights rather than defining different light schemas in the add-on software. It would also cover the case where you cannot really make a sound, like when other people are sleeping. All of this would be a huge boon to accessibility.

Thank you for trying this out!