Pipeline — HumanLoop

How It Works

Raw GoPro Video ↓ Quality Check ↓ Report JSON

→ Hand Pose ↓ Keypoints JSON

→ Segmentation ↓ SAM2 Masks ↓ Combined Full JSON

→ Transcription ↓ Narration JSON

Tools

QUALITY CHECK

HomeHands-QC

Automated video quality assessment for egocentric GoPro footage. Checks resolution, FPS, hand detection rate, and blur score.

Input: Raw .mp4 video

Output: Quality report JSON

Model: MediaPipe Hand Landmarker

pip install mediapipe opencv-python
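
The four checks could be rolled up into a single report along these lines. This is a minimal sketch, not the HomeHands-QC implementation: the threshold values, the `build_quality_report` name, and the `mean_sharpness`, `checks`, and `passed` fields are assumptions for illustration, and the blur score is assumed to be a sharpness measure where higher means sharper.

```python
from statistics import mean

# Hypothetical thresholds; the real HomeHands-QC limits are not documented here.
MIN_WIDTH, MIN_HEIGHT = 1280, 720
MIN_FPS = 24.0
MIN_HAND_RATE = 0.5
MIN_SHARPNESS = 100.0


def build_quality_report(width, height, fps, hands_per_frame, sharpness_per_frame):
    """Summarise per-frame measurements into a pass/fail quality report.

    hands_per_frame: number of hands detected in each frame.
    sharpness_per_frame: one sharpness score per frame (higher = sharper).
    """
    total = len(hands_per_frame)
    # Fraction of frames in which at least one hand was detected.
    hand_rate = sum(1 for n in hands_per_frame if n > 0) / total if total else 0.0
    avg_sharpness = mean(sharpness_per_frame) if sharpness_per_frame else 0.0
    checks = {
        "resolution": width >= MIN_WIDTH and height >= MIN_HEIGHT,
        "fps": fps >= MIN_FPS,
        "hand_detection_rate": hand_rate >= MIN_HAND_RATE,
        "blur": avg_sharpness >= MIN_SHARPNESS,
    }
    return {
        "resolution": {"width": width, "height": height},
        "fps": fps,
        "total_frames": total,
        "hand_detection_rate": round(hand_rate, 3),
        "mean_sharpness": round(avg_sharpness, 1),
        "checks": checks,
        "passed": all(checks.values()),
    }
```

Keeping each check as its own boolean makes it easy to see why a clip was rejected instead of only that it was.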

HAND TRACKING

HomeHands-Pose

21-point hand skeleton detection per frame. Detects the left and right hands and reports a confidence score and pixel coordinates for each.

Input: Raw .mp4 video

Output: hand_pose.json per clip

Model: MediaPipe Hand Landmarker

pip install mediapipe
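
MediaPipe reports landmark `x` and `y` as fractions of the frame width and height, so each keypoint needs a conversion step to get the pixel coordinates. The sketch below shows one way to build a per-hand entry with both forms. The helper names and the clamping behaviour are assumptions, not the HomeHands-Pose code; the landmark names follow MediaPipe's 21-point ordering.

```python
# MediaPipe's 21 hand landmarks in index order (0 = WRIST ... 20 = PINKY_TIP).
LANDMARK_NAMES = (
    ["WRIST"]
    + [f"THUMB_{joint}" for joint in ("CMC", "MCP", "IP", "TIP")]
    + [
        f"{finger}_{joint}"
        for finger in ("INDEX_FINGER", "MIDDLE_FINGER", "RING_FINGER", "PINKY")
        for joint in ("MCP", "PIP", "DIP", "TIP")
    ]
)


def to_pixel_keypoint(x, y, z, width, height):
    """Attach pixel coordinates to one normalized landmark.

    Values slightly outside [0, 1] are clamped so px/py always index a valid pixel.
    """
    px = min(max(int(round(x * width)), 0), width - 1)
    py = min(max(int(round(y * height)), 0), height - 1)
    return {"x": x, "y": y, "z": z, "px": px, "py": py}


def hand_entry(label, confidence, landmarks, width, height):
    """Build one hand record from 21 (x, y, z) tuples in MediaPipe order."""
    if len(landmarks) != len(LANDMARK_NAMES):
        raise ValueError(f"expected {len(LANDMARK_NAMES)} landmarks, got {len(landmarks)}")
    keypoints = {
        name: to_pixel_keypoint(x, y, z, width, height)
        for name, (x, y, z) in zip(LANDMARK_NAMES, landmarks)
    }
    return {"label": label, "confidence": confidence, "keypoints": keypoints}
```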

SEGMENTATION

HomeHands-Seg

Pixel-level hand segmentation using SAM2. The wrist coordinates from hand_pose.json are used as point prompts. The right hand is rendered in purple and the left hand in red.

Input: Video + hand_pose.json

Output: Colored PNG frames per clip

Model: SAM2 tiny (Meta AI)

pip install sam2
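
The prompt and the per-hand statistics can be sketched without loading a model. Everything here is illustrative: the RGB shades, the helper names, and the reading of `coverage_pct` as mask pixels over total frame pixels are assumptions, and the mask is taken to be a plain grid of 0/1 values instead of a real SAM2 output array.

```python
# Overlay colours as (R, G, B). The exact shades are assumptions.
HAND_COLORS = {"Right": (128, 0, 128), "Left": (255, 0, 0)}


def wrist_prompt(hand):
    """Return the wrist pixel as one positive point prompt.

    The result is (points, labels), where a label of 1 marks a foreground
    point, the usual convention for SAM-style point prompts.
    """
    wrist = hand["keypoints"]["WRIST"]
    return [[wrist["px"], wrist["py"]]], [1]


def mask_stats(mask, width, height):
    """Count foreground pixels in a binary mask and express them as a share of the frame."""
    pixel_count = sum(sum(1 for value in row if value) for row in mask)
    coverage_pct = round(100.0 * pixel_count / (width * height), 2)
    return {"method": "SAM2", "pixel_count": pixel_count, "coverage_pct": coverage_pct}
```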

AUDIO

HomeHands-Audio

Speech transcription from egocentric narration. Generates subtitles and timestamped narration JSON locally.

Input: Raw .mp4 video with audio

Output: .srt + narration.json + subtitled .mp4

Model: OpenAI Whisper base

pip install openai-whisper
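
Turning transcription segments into the two text outputs is mostly a formatting job. The sketch below assumes each segment is a dict with `start` and `end` in seconds plus a `text` string, which is the shape Whisper returns in its `segments` list. The function names are hypothetical and this is not the HomeHands-Audio code.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as SRT text: index, time range, caption, blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def segments_to_narrations(segments):
    """Convert segments to the id/start/end/text records used in the narration JSON."""
    return [
        {"id": index, "start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for index, seg in enumerate(segments, start=1)
    ]
```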

FULL PIPELINE

HomeHands-Pipeline

Master script that runs all modules automatically on every video in a folder. Produces one combined annotation JSON per clip.

Input: Folder of .mp4 videos

Output: Full annotation JSON per clip

Model: All of the above

python pipeline/run_pipeline.py
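
The batch behaviour described above might look like the following driver loop. It is a sketch under stated assumptions, not the contents of `run_pipeline.py`: the `run_folder` name, the idea that each stage is a callable returning a dict, the `errors` field, and numbering clips `HH_001`, `HH_002`, ... in sorted filename order are all hypothetical.

```python
from pathlib import Path


def run_folder(video_dir, stages):
    """Run every stage on every .mp4 in video_dir and merge the results per clip.

    stages: mapping of stage name -> callable(Path) -> dict of annotation fields.
    Returns a dict keyed by filename with one combined annotation per clip.
    """
    annotations = {}
    videos = sorted(Path(video_dir).glob("*.mp4"))
    for index, video in enumerate(videos, start=1):
        clip = {"clip_id": f"HH_{index:03d}", "filename": video.name}
        for name, stage in stages.items():
            try:
                clip.update(stage(video))
            except Exception as exc:
                # Record the failure and keep going so one bad clip
                # does not stop the rest of the batch.
                clip.setdefault("errors", {})[name] = str(exc)
        annotations[video.name] = clip
    return annotations
```

Recording per-stage errors on the clip, instead of raising, suits unattended runs over a large folder; a stricter script could re-raise after logging.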

Quick Start

# 1. Clone the repository
git clone https://github.com/aneessaheba/Egocentric_Homes
cd Egocentric_Homes

# 2. Install dependencies
pip install mediapipe sam2 openai-whisper opencv-python
brew install ffmpeg  # macOS (Homebrew); on Debian/Ubuntu use: sudo apt install ffmpeg

# 3. Add your videos
cp your_videos/*.mp4 assets/videos/

# 4. Run full pipeline
python pipeline/run_pipeline.py

# Output per video:
# Hand pose JSON     → assets/processed/hand_pose/
# Segmentation PNGs  → assets/processed/segmented/
# Narration JSON     → assets/processed/narrations/
# Combined JSON      → assets/processed/annotations/

Models Used

Model	Task	Made by	Size	License
MediaPipe Hands	Hand tracking	Google	8 MB	Apache 2.0
SAM2 tiny	Segmentation	Meta AI	155 MB	Apache 2.0
Whisper base	Transcription	OpenAI	145 MB	MIT
ffmpeg	Audio extraction	OSS	—	LGPL
OpenCV	Video processing	OSS	—	Apache 2.0

Output Format

{
  "clip_id": "HH_001",
  "filename": "[video_name].mp4",
  "task": "[task_name]",
  "duration_sec": "--",
  "total_frames": "--",
  "fps": 30,
  "resolution": {
    "width": "--",
    "height": "--"
  },
  "hand_detection_rate": "--",
  "narrations": [
    {
      "id": 1,
      "start": "--",
      "end": "--",
      "text": "[narration text]"
    }
  ],
  "frames": [
    {
      "frame_id": 0,
      "timestamp_sec": "--",
      "hands_detected": "--",
      "hands": [
        {
          "label": "Right",
          "confidence": "--",
          "keypoints": {
            "WRIST": {
              "x": "--", "y": "--",
              "z": "--", "px": "--", "py": "--"
            }
          },
          "segmentation": {
            "method": "SAM2",
            "pixel_count": "--",
            "coverage_pct": "--"
          }
        }
      ],
      "narration": "[narration text at this timestamp]"
    }
  ]
}
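
Two small readers show how the combined JSON could be consumed. They assume the `"--"` placeholders above hold numbers in real files, and both function names are hypothetical. The first looks up which narration covers a given time, which is one plausible way the per-frame `narration` field gets filled; the second pulls one hand's wrist path out of the `frames` list.

```python
def narration_at(narrations, timestamp_sec):
    """Return the text of the narration whose [start, end] span contains the timestamp.

    Returns an empty string when no narration covers that moment.
    """
    for narration in narrations:
        if narration["start"] <= timestamp_sec <= narration["end"]:
            return narration["text"]
    return ""


def wrist_track(annotation, label="Right"):
    """Collect (timestamp_sec, px, py) for one hand's wrist across all frames."""
    track = []
    for frame in annotation["frames"]:
        for hand in frame["hands"]:
            if hand["label"] == label:
                wrist = hand["keypoints"]["WRIST"]
                track.append((frame["timestamp_sec"], wrist["px"], wrist["py"]))
    return track
```

Load a file with `json.load` and pass the resulting dict to `wrist_track`, or its `narrations` list to `narration_at`.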
