Considerations:

  1. Detecting moving objects outside the label space of the object detector (e.g., a moose walking on the road, large debris rolling across the road due to wind, etc.)
    1. An unsupervised approach does not give this capability
    2. Some supervised approaches provide instance segmentations and bounding boxes (which can be fed into the tracker) to account for these objects
  2. Which static objects to filter out using motion segmentation?
    1. All
      • The most naïve and the most dangerous option; we do not want to exclude traffic lights, road signs, etc.
      • Easiest to implement
    2. Pedestrians + Cars:
      • Argus++ made this design choice.
      • It should make ACAR’s job easier, and there are not many instances of static pedestrians, but it does raise some safety concerns in my opinion.
    3. Cars only
      • Seems the most logical option

Potential Solutions for us:

  1. Unsupervised motion segmentation
  2. Supervised motion segmentation: generally uses two inputs, RGB and optical flow
    1. Architecture one (YOLACT based); code not available

      1. YOLACT code is publicly available
        1. changes needed to get the desired architecture
          1. add the motion input
          2. fuse the features
          3. add the motion mask
        2. tweak the training (a rough sketch of the alternating schedule is below)
          1. train the semantic head for k steps
          2. train the motion head for k steps
          3. alternatively, replace the semantic head with the motion head
            1. no need to alter the training loop

      (architecture diagram)
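      A rough, hypothetical sketch of the alternating schedule above, assuming a PyTorch-style setup; TwoHeadNet, the toy backbone, and the random tensors are placeholders for the actual YOLACT modules and data loader, not the real code:

      ```python
      import torch
      import torch.nn as nn

      class TwoHeadNet(nn.Module):
          # toy stand-in: shared backbone with a semantic head and a motion head
          def __init__(self):
              super().__init__()
              self.backbone = nn.Conv2d(3, 16, 3, padding=1)   # stand-in for the shared backbone
              self.semantic_head = nn.Conv2d(16, 5, 1)         # semantic classes
              self.motion_head = nn.Conv2d(16, 2, 1)           # moving / static

          def forward(self, x):
              feats = torch.relu(self.backbone(x))
              return self.semantic_head(feats), self.motion_head(feats)

      def set_requires_grad(module, flag):
          for p in module.parameters():
              p.requires_grad = flag

      model = TwoHeadNet()
      optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
      criterion = nn.CrossEntropyLoss()
      k = 100  # steps per head before switching

      def train_head(active, frozen, steps):
          # freeze one head, train the backbone + the other head for `steps` iterations
          set_requires_grad(frozen, False)
          set_requires_grad(active, True)
          for _ in range(steps):
              x = torch.randn(2, 3, 64, 64)                    # dummy batch; swap in the real loader
              sem_t = torch.randint(0, 5, (2, 64, 64))
              mot_t = torch.randint(0, 2, (2, 64, 64))
              sem_out, mot_out = model(x)
              loss = criterion(sem_out, sem_t) if active is model.semantic_head else criterion(mot_out, mot_t)
              optimizer.zero_grad()
              loss.backward()
              optimizer.step()

      # alternate: k steps on the semantic head, then k steps on the motion head
      for _ in range(3):
          train_head(model.semantic_head, model.motion_head, k)
          train_head(model.motion_head, model.semantic_head, k)
      ```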

    2. Architecture two (SOLO based)

      • No image
      • Similar to architecture one; code is available
      • Runs at only ~5 fps, though with better accuracy
    3. Architecture three (SMS Net; fully convolutional)

      1. Code publicly available; uses TensorFlow
      2. Simplest to implement (7 fps)
      3. Only detects moving cars.
      4. Can use it as a litmus test with the detector and tracker to see if motion segmentation is worth it.
      5. Converting the semantic masks to bounding boxes should be trivial. Is it? (see the sketch below)

      (architecture diagram)
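      Point 5 above, sketched: a hedged example of turning a binary motion mask into boxes with connected-component labelling via scipy; the min_area threshold is an assumption:

      ```python
      import numpy as np
      from scipy import ndimage

      def mask_to_boxes(mask, min_area=50):
          """Return [x_min, y_min, x_max, y_max] boxes for each connected blob in a binary mask."""
          labelled, num = ndimage.label(mask)
          boxes = []
          for ys, xs in ndimage.find_objects(labelled):
              if (ys.stop - ys.start) * (xs.stop - xs.start) < min_area:
                  continue  # skip blobs whose bounding box is tiny (likely noise)
              boxes.append([xs.start, ys.start, xs.stop, ys.stop])
          return boxes

      # toy mask with two separate blobs
      mask = np.zeros((100, 100), dtype=np.uint8)
      mask[10:30, 10:40] = 1
      mask[60:90, 50:80] = 1
      print(mask_to_boxes(mask))   # -> [[10, 10, 40, 30], [50, 60, 80, 90]]
      ```

      The catch: touching or occluding instances merge into one component, so per-instance boxes would need instance masks (or the detector's boxes) rather than a single semantic mask.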

Action Plan:

  1. Try SMSNet
  2. Use off-the-shelf FlowNet2 for optical flow computation
  3. Create a custom evaluation script
    1. compares the masks with the detector's bounding boxes and eliminates boxes that are static (see the sketch after this plan)
      1. IoU threshold
      2. also keeps the semantic labels for further identification
    2. compare with ground-truth detections on the ROAD dataset
  4. If good, start with YOLACT based architecture
    1. same inputs and outputs
    2. faster
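
A rough sketch of the evaluation step in 3 above, assuming detections come as (label, [x1, y1, x2, y2]) tuples and one binary motion mask per frame; the coverage threshold, the "cars only" whitelist (consideration 2.3), and the helper names are assumptions, not an existing API:

```python
import numpy as np

MOTION_FILTERED_LABELS = {"car"}   # consideration 2.3: only filter static cars
COVERAGE_THRESHOLD = 0.3           # stand-in for the IoU threshold in the plan

def box_motion_overlap(box, motion_mask):
    """Fraction of the box area covered by the motion mask (simple proxy for the IoU check)."""
    x1, y1, x2, y2 = [int(v) for v in box]
    crop = motion_mask[y1:y2, x1:x2]
    area = max((x2 - x1) * (y2 - y1), 1)
    return crop.sum() / area

def filter_static_boxes(detections, motion_mask):
    """Drop detections of filtered classes whose boxes do not overlap the motion mask."""
    kept = []
    for label, box in detections:
        if label in MOTION_FILTERED_LABELS and box_motion_overlap(box, motion_mask) < COVERAGE_THRESHOLD:
            continue   # static instance of a filtered class -> eliminate
        kept.append((label, box))
    return kept

# toy example: a moving car, a parked car, and a traffic light that is never filtered
mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:40, 20:40] = 1                                  # motion only where the first car is
dets = [("car", [20, 20, 40, 40]), ("car", [60, 60, 80, 80]), ("traffic_light", [5, 5, 10, 15])]
print(filter_static_boxes(dets, mask))                  # the parked car is dropped
```

The same loop can then be run against the ROAD ground-truth boxes to count how many true detections the filter removes, which is the signal for whether motion segmentation is worth keeping in the pipeline.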