This guide is an introduction to the MediaPipe Python library on a Raspberry Pi board. It covers installing MediaPipe using pip on a virtual environment and running a gesture recognition example.

MediaPipe is a cross-platform pipeline framework to build custom machine learning (ML) solutions for streaming media (live video). The MediaPipe framework was open-sourced by Google and is currently available in early release.

Prerequisites

Before proceeding:

You need a Raspberry Pi board and a USB Camera.
You should have a Raspberry Pi running Raspberry Pi OS (32-bit or 64-bit).
You should be able to establish a Remote Desktop Connection with your Raspberry Pi – click here for Mac OS instructions.
You should have OpenCV installed on your Raspberry Pi.
Set Up USB Camera for OpenCV Projects with Raspberry Pi.

In our Raspberry Pi projects with a camera, we will be using a regular Logitech USB camera, like the one shown in the picture below.

USB Camera Webcam Raspberry Pi compatible

MediaPipe

MediaPipe is an open-source cross-platform framework for building pipelines to perform computer vision applications built on top of TensorFlow Lite.

MediaPipe has abstracted away the complexities of making on-device ML customizable, production-ready, and accessible across platforms. Using MediaPipe, you can use a simple API that receives an input image and outputs a prediction result.

Top 6

Raspberry Pi eBooks

From Zero to Professional

Shop now

Hand Gesture Example

MediaPipe on device machine learning ml api

Input: image of a person doing the thumbs-up gesture
MediaPipe does all the heavy work for you:
- Detects if there’s a hand in the image provided;
- Then, it detects the hand’s landmarks;
- Creates an embedding vector of the gestures.
Output: classifies the image based on the provided model (detects the thumbs-up gesture).

In summary, here are the MediaPipe key features:

On-device machine learning (ML) solution with simple-to-use abstractions.
Lightweight ML models, all while preserving accuracy.
Domain-specific processing, including vision, text, and audio.
Uses low-code APIs or a no-code studio to customize, evaluate, prototype, and deploy.
End-to-end optimization, including hardware acceleration, all while being lightweight enough to run well on battery-powered devices.

Installing MediaPipe on Raspberry Pi with pip on Virtual Environment (Recommended)

Having a Remote Desktop Connection with your Raspberry Pi, update and upgrade your Raspberry Pi if any updates are available. Run the following command:

sudo apt update && sudo apt upgrade -y

Create a Virtual Environment

We already installed the OpenCV library in a virtual environment in a previous guide. We need to install the MediaPipe library in the same virtual environment.

Enter the following command in a Terminal window to move to the Projects directory on the Desktop:

cd ~/Desktop/projects

Then, you can run the following command to check that the virtual environment is there.

ls -l

Activate the virtual environment projectsenv that was previously created when installing OpenCV:

source projectsenv/bin/activate

Your prompt should change to indicate that you are now in the virtual environment.

Installing the MediaPipe Library

Now that we are in our virtual environment, we can install the MediaPipe library. Run the following command:

pip3 install mediapipe

After a few seconds, the library will be installed (ignore any yellow warnings about deprecated packages).

installing MediaPipe on Raspberry Pi pip3 virtual environment

You have everything ready to start writing your Python code and testing the gesture recognition example.

MediaPipe Example – Gesture Recognition with Raspberry Pi

Having MediaPipe installed, we’ll be running a sample code that does gesture recognition. This script recognizes hand gestures in an image or video format. The default model can recognize seven different gestures in one or two hands:

Thumb up 👍
Thumb down 👎
Victory hand ✌️
Index pointing up ☝️
Raised fist ✊
Open palm ✋
Love-You gesture 🤟

This particular model was created by Google and it went through their rigorous ML Fairness standards and is production-ready.

Gesture Recognition – Python Script

Clone the GitHub repository to your Raspberry Pi with the git command:

git clone https://github.com/RuiSantosdotme/mediapipe.git

Change to the mediapipe/raspberry_pi_gesture_recognizer directory

cd mediapipe/raspberry_pi_gesture_recognizer

Use the ls command to see if you find the files illustrated in the screenshot below:

ls

Check MediaPipe Gesture-recognition example script

Finally, enter the command to install any missing requirements:

sh setup.sh

# Complete project details at https://ebokify.com/install-mediapipe-raspberry-pi/

# Copyright 2023 The MediaPipe Authors. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
# Main scripts to run gesture recognition.

import argparse
import sys
import time

import cv2
import mediapipe as mp

from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.framework.formats import landmark_pb2
mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles


# Global variables to calculate FPS
COUNTER, FPS = 0, 0
START_TIME = time.time()


def run(model: str, num_hands: int,
        min_hand_detection_confidence: float,
        min_hand_presence_confidence: float, min_tracking_confidence: float,
        camera_id: int, width: int, height: int) -> None:
  """Continuously run inference on images acquired from the camera.

  Args:
      model: Name of the gesture recognition model bundle.
      num_hands: Max number of hands can be detected by the recognizer.
      min_hand_detection_confidence: The minimum confidence score for hand
        detection to be considered successful.
      min_hand_presence_confidence: The minimum confidence score of hand
        presence score in the hand landmark detection.
      min_tracking_confidence: The minimum confidence score for the hand
        tracking to be considered successful.
      camera_id: The camera id to be passed to OpenCV.
      width: The width of the frame captured from the camera.
      height: The height of the frame captured from the camera.
  """

  # Start capturing video input from the camera
  cap = cv2.VideoCapture(camera_id)
  cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
  cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)

  # Visualization parameters
  row_size = 50  # pixels
  left_margin = 24  # pixels
  text_color = (0, 0, 0)  # black
  font_size = 1
  font_thickness = 1
  fps_avg_frame_count = 10

  # Label box parameters
  label_text_color = (255, 255, 255)  # white
  label_font_size = 1
  label_thickness = 2

  recognition_frame = None
  recognition_result_list = []

  def save_result(result: vision.GestureRecognizerResult,
                  unused_output_image: mp.Image, timestamp_ms: int):
      global FPS, COUNTER, START_TIME

      # Calculate the FPS
      if COUNTER % fps_avg_frame_count == 0:
          FPS = fps_avg_frame_count / (time.time() - START_TIME)
          START_TIME = time.time()

      recognition_result_list.append(result)
      COUNTER += 1

  # Initialize the gesture recognizer model
  base_options = python.BaseOptions(model_asset_path=model)
  options = vision.GestureRecognizerOptions(base_options=base_options,
                                          running_mode=vision.RunningMode.LIVE_STREAM,
                                          num_hands=num_hands,
                                          min_hand_detection_confidence=min_hand_detection_confidence,
                                          min_hand_presence_confidence=min_hand_presence_confidence,
                                          min_tracking_confidence=min_tracking_confidence,
                                          result_callback=save_result)
  recognizer = vision.GestureRecognizer.create_from_options(options)

  # Continuously capture images from the camera and run inference
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      sys.exit(
          'ERROR: Unable to read from webcam. Please verify your webcam settings.'
      )

    image = cv2.flip(image, 1)

    # Convert the image from BGR to RGB as required by the TFLite model.
    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_image)

    # Run gesture recognizer using the model.
    recognizer.recognize_async(mp_image, time.time_ns() // 1_000_000)

    # Show the FPS
    fps_text = 'FPS = {:.1f}'.format(FPS)
    text_location = (left_margin, row_size)
    current_frame = image
    cv2.putText(current_frame, fps_text, text_location, cv2.FONT_HERSHEY_DUPLEX,
                font_size, text_color, font_thickness, cv2.LINE_AA)

    if recognition_result_list:
      # Draw landmarks and write the text for each hand.
      for hand_index, hand_landmarks in enumerate(
          recognition_result_list[0].hand_landmarks):
        # Calculate the bounding box of the hand
        x_min = min([landmark.x for landmark in hand_landmarks])
        y_min = min([landmark.y for landmark in hand_landmarks])
        y_max = max([landmark.y for landmark in hand_landmarks])

        # Convert normalized coordinates to pixel values
        frame_height, frame_width = current_frame.shape[:2]
        x_min_px = int(x_min * frame_width)
        y_min_px = int(y_min * frame_height)
        y_max_px = int(y_max * frame_height)

        # Get gesture classification results
        if recognition_result_list[0].gestures:
          gesture = recognition_result_list[0].gestures[hand_index]
          category_name = gesture[0].category_name
          score = round(gesture[0].score, 2)
          result_text = f'{category_name} ({score})'

          # Compute text size
          text_size = \
          cv2.getTextSize(result_text, cv2.FONT_HERSHEY_DUPLEX, label_font_size,
                          label_thickness)[0]
          text_width, text_height = text_size

          # Calculate text position (above the hand)
          text_x = x_min_px
          text_y = y_min_px - 10  # Adjust this value as needed

          # Make sure the text is within the frame boundaries
          if text_y < 0:
            text_y = y_max_px + text_height

          # Draw the text
          cv2.putText(current_frame, result_text, (text_x, text_y),
                      cv2.FONT_HERSHEY_DUPLEX, label_font_size,
                      label_text_color, label_thickness, cv2.LINE_AA)

        # Draw hand landmarks on the frame
        hand_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_proto.landmark.extend([
          landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y,
                                          z=landmark.z) for landmark in
          hand_landmarks
        ])
        mp_drawing.draw_landmarks(
          current_frame,
          hand_landmarks_proto,
          mp_hands.HAND_CONNECTIONS,
          mp_drawing_styles.get_default_hand_landmarks_style(),
          mp_drawing_styles.get_default_hand_connections_style())

      recognition_frame = current_frame
      recognition_result_list.clear()

    if recognition_frame is not None:
        cv2.imshow('gesture_recognition', recognition_frame)

    # Stop the program if the ESC key is pressed.
    if cv2.waitKey(1) == 27:
        break

  recognizer.close()
  cap.release()
  cv2.destroyAllWindows()


def main():
  parser = argparse.ArgumentParser(
      formatter_class=argparse.ArgumentDefaultsHelpFormatter)
  parser.add_argument(
      '--model',
      help='Name of gesture recognition model.',
      required=False,
      default='gesture_recognizer.task')
  parser.add_argument(
      '--numHands',
      help='Max number of hands that can be detected by the recognizer.',
      required=False,
      default=1)
  parser.add_argument(
      '--minHandDetectionConfidence',
      help='The minimum confidence score for hand detection to be considered '
           'successful.',
      required=False,
      default=0.5)
  parser.add_argument(
      '--minHandPresenceConfidence',
      help='The minimum confidence score of hand presence score in the hand '
           'landmark detection.',
      required=False,
      default=0.5)
  parser.add_argument(
      '--minTrackingConfidence',
      help='The minimum confidence score for the hand tracking to be '
           'considered successful.',
      required=False,
      default=0.5)
  # Finding the camera ID can be very reliant on platform-dependent methods.
  # One common approach is to use the fact that camera IDs are usually indexed sequentially by the OS, starting from 0.
  # Here, we use OpenCV and create a VideoCapture object for each potential ID with 'cap = cv2.VideoCapture(i)'.
  # If 'cap' is None or not 'cap.isOpened()', it indicates the camera ID is not available.
  parser.add_argument(
      '--cameraId', help='Id of camera.', required=False, default=0)
  parser.add_argument(
      '--frameWidth',
      help='Width of frame to capture from camera.',
      required=False,
      default=640)
  parser.add_argument(
      '--frameHeight',
      help='Height of frame to capture from camera.',
      required=False,
      default=480)
  args = parser.parse_args()

  run(args.model, int(args.numHands), args.minHandDetectionConfidence,
      args.minHandPresenceConfidence, args.minTrackingConfidence,
      int(args.cameraId), args.frameWidth, args.frameHeight)


if __name__ == '__main__':
  main()

Demonstration Gesture Recognition

Having your Virtual Environment activated, run the next command:

python recognize.py --cameraId 0 --model gesture_recognizer.task --numHands 2

You must enter the correct camera ID number for your USB camera; in my case, it’s 0, but you might need to change it. You can find more information about the supported parameters in the documentation.

With the example running, make different gestures in front of the camera. It will detect and identify the gestures (from the list of gestures we’ve seen previously). It can detect gestures in one hand or two hands simultaneously.

Wrapping Up

This tutorial was a quick getting-started guide to MediaPipe with the Raspberry Pi. MediaPipe is an easy-to-use framework that allows you to build machine-learning projects.

In this guide, we tested the hand gesture recognition example. MediaPipe also has other interesting examples, like counting the number of raised fingers on your hand. This can be especially useful in automation projects because it allows you to control something with gestures. For example, turn a specific Raspberry Pi GPIO on when you have one finger raised and turn it off when you have two raised fingers. The possibilities are endless.

We hope you’ve found this tutorial interesting.

If there’s enough interest from our readers in this kind of subject, we intend to create more machine-learning projects using MediaPipe.

If you would like to learn more about the Raspberry Pi, check out our tutorials: