---
license: mit
metrics:
- precision
base_model:
- MCG-NJU/videomae-large-finetuned-kinetics
pipeline_tag: video-classification
library_name: transformers
new_version: jatinmehra/Accident-Detection-using-Dashcam
datasets:
- nexar-ai/nexar_collision_prediction
tags:
- vision
---

# 🚗 VideoMAE-2 for Dashcam Collision Prediction

This repository contains a model for predicting vehicle collisions from dashcam footage, developed for the [Nexar Dashcam Collision Prediction Challenge](https://www.kaggle.com/competitions/nexar-collision-prediction). The model finished **11th** on the leaderboard with a score of **0.80** mean Average Precision (mAP).

Training code: [GitHub](https://github.com/Jatin-Mehra119/-Nexar-Dashcam-Crash-Prediction-Challenge)

----------

## 🧠 Model Overview

- **Architecture**: [VideoMAE-2 Large](https://huggingface.co/MCG-NJU/videomae-large-finetuned-kinetics) fine-tuned for binary classification (collision/near-miss vs. normal driving).
- **Feature Extraction**: Uses the [TimeSformer](https://huggingface.co/facebook/timesformer-base-finetuned-k400) feature extractor to preprocess input frames.
- **Input**: 16 frames per video, each resized to 224x224 pixels.
- **Output**: Probability score indicating the likelihood of a collision or near-miss event.

----------

## 📁 Dataset

The model was trained on the [Nexar Collision Prediction Dataset](https://huggingface.co/datasets/nexar-ai/nexar_collision_prediction), which contains:

- 750 non-collision videos
- 400 collision videos
- 350 near-miss videos

Each video is annotated with:

- **Event Type**: Collision, near-miss, or normal driving
- **Event Time**: Timestamp of the (near-)collision
- **Alert Time**: Earliest time at which the event could be predicted

For more details, refer to the [dataset paper](https://arxiv.org/abs/2503.03848).

----------

## 🛠️ Preprocessing Pipeline

1. **Frame Extraction**: Sampled 16 frames per video, focusing on the interval around the alert time.
2. **Feature Extraction**: Applied the TimeSformer feature extractor to obtain pixel values.
3. **Data Augmentation**: Applied transformations such as horizontal flips, rotations, color jitter, and random resized crops.
4. **Normalization**: Normalized with the ImageNet mean and standard deviation.

A hedged inference sketch covering steps 1, 2, and 4 appears at the end of this card.

----------

## 🏋️ Training Details

- **Framework**: PyTorch with Hugging Face Transformers and the Trainer API.
- **Training Configuration**:
  - Batch Size: 4
  - Epochs: 15
  - Learning Rate: 3e-5
  - Weight Decay: 0.01
  - Evaluation Strategy: Per epoch
  - Metric for Best Model: Average Precision
- **Hardware**: Trained on 2x NVIDIA T4 GPUs (~4.5 hours)

A `TrainingArguments` sketch mirroring these settings appears at the end of this card.

----------

## 📊 Evaluation Metrics

The model's performance was evaluated using mean Average Precision (mAP) across different time-to-accident intervals:

- 500 ms
- 1000 ms
- 1500 ms

The final score is the mean of the Average Precision (AP) values at these intervals, rewarding predictions that are both early and accurate (see the metric sketch at the end of this card).

## 📚 Citation

If you use this model or dataset, please cite:

```
@misc{nexar2025dashcamcollisionprediction,
      title={Nexar Dashcam Collision Prediction Dataset and Challenge},
      author={Daniel C. Moura and Shizhan Zhu and Orly Zvitia},
      year={2025},
      eprint={2503.03848},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03848}
}
```
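
----------

## 🔍 Inference Sketch

A minimal end-to-end example of the preprocessing and prediction flow described above. This is a sketch under stated assumptions, not the exact competition code: the checkpoint id used here is the `new_version` repo from the card metadata, the checkpoint is assumed to load as `VideoMAEForVideoClassification`, and label index 1 is assumed to be the positive (collision/near-miss) class.

```python
import av
import numpy as np
import torch
from transformers import AutoImageProcessor, VideoMAEForVideoClassification

# TimeSformer's processor handles the 224x224 resize and ImageNet normalization.
processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
# Assumption: the fine-tuned checkpoint loads as a VideoMAE classifier.
model = VideoMAEForVideoClassification.from_pretrained("jatinmehra/Accident-Detection-using-Dashcam")
model.eval()

def sample_16_frames(path: str) -> list:
    """Decode a video with PyAV and uniformly sample 16 RGB frames."""
    container = av.open(path)
    frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    idx = np.linspace(0, len(frames) - 1, num=16).astype(int)
    return [frames[i] for i in idx]

frames = sample_16_frames("dashcam_clip.mp4")  # hypothetical input file
inputs = processor(frames, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
# Assumption: index 1 = collision/near-miss class.
prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(collision or near-miss) = {prob:.3f}")
```

Note that the sampling here is uniform over the whole clip; the training pipeline instead focused on the window around the alert time.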
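
----------

## ⚙️ Training Configuration Sketch

A hedged `TrainingArguments` sketch reflecting the hyperparameters listed under Training Details. The `output_dir` and metric key are placeholders, and AP would come from a custom `compute_metrics` function (not shown). On older `transformers` versions the argument is named `evaluation_strategy` rather than `eval_strategy`.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="videomae-collision",            # placeholder
    per_device_train_batch_size=4,
    num_train_epochs=15,
    learning_rate=3e-5,
    weight_decay=0.01,
    eval_strategy="epoch",                      # evaluate once per epoch
    save_strategy="epoch",                      # must match the eval strategy
    load_best_model_at_end=True,
    metric_for_best_model="average_precision",  # key returned by compute_metrics
    greater_is_better=True,
)
```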
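
----------

## 🧮 Metric Sketch

A small sketch of the challenge metric as described above: compute Average Precision over per-video collision scores at each time-to-accident horizon, then average the three values. The scores and labels below are hypothetical; in the challenge, the scores at each horizon use only footage available that far before the event.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(scores_by_horizon: dict, labels: np.ndarray) -> float:
    """Average the AP values across time-to-accident horizons (ms)."""
    aps = [average_precision_score(labels, scores)
           for _, scores in sorted(scores_by_horizon.items())]
    return float(np.mean(aps))

# Hypothetical per-video scores at each horizon (1 = collision/near-miss).
labels = np.array([1, 0, 1, 0])
scores = {
    500:  np.array([0.90, 0.20, 0.70, 0.40]),
    1000: np.array([0.80, 0.30, 0.60, 0.50]),
    1500: np.array([0.60, 0.40, 0.50, 0.45]),
}
print(f"mAP = {mean_average_precision(scores, labels):.3f}")
```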