This work presents Video Depth Anything, built on Depth Anything V2, which can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Compared with diffusion-based models, it offers faster inference, fewer parameters, and more accurate, temporally consistent depth.
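For intuition on how a fixed-window video-depth model can cover an arbitrarily long clip, here is a minimal sketch of overlapped-window inference with linear blending. `predict_window` is a hypothetical stand-in for the actual model call, and the blending scheme is a generic one, not necessarily the repo's exact stitching strategy.

```python
import numpy as np

def depth_for_long_video(frames, predict_window, win=32, overlap=8):
    """Run a fixed-length video-depth model over an arbitrarily long clip.

    frames:         (T, H, W, 3) uint8 array of RGB frames
    predict_window: callable mapping a (n, H, W, 3) chunk to (n, H, W) depth
    Overlapping windows are blended with triangular weights so depth stays
    smooth across window boundaries.
    """
    T, H, W, _ = frames.shape
    depth = np.zeros((T, H, W), dtype=np.float32)
    weight = np.zeros((T, 1, 1), dtype=np.float32)
    step = win - overlap
    # Triangular ramp, e.g. [1, 2, ..., 16, 16, ..., 2, 1] for win=32.
    ramp = np.minimum(np.arange(1, win + 1), np.arange(win, 0, -1)).astype(np.float32)
    for start in range(0, max(T - overlap, 1), step):
        end = min(start + win, T)
        pred = predict_window(frames[start:end])  # model call, e.g. Video Depth Anything
        w = ramp[: end - start, None, None]
        depth[start:end] += pred * w
        weight[start:end] += w
    return depth / np.maximum(weight, 1e-6)

# Stand-in predictor for a smoke test; replace with the real model.
dummy = lambda chunk: chunk.mean(axis=-1).astype(np.float32)
video = np.random.randint(0, 255, (100, 64, 64, 3), dtype=np.uint8)
print(depth_for_long_video(video, dummy).shape)  # (100, 64, 64)
```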
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. If you like our project, please give us a star ⭐ on GitHub for the latest updates. 💡 I also have other video-language projects that may interest you: Open-Sora Plan, an open-source large video generation model.
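As a usage illustration, a minimal inference sketch assuming the Hugging Face Transformers integration (`VideoLlavaProcessor` / `VideoLlavaForConditionalGeneration`) and the `LanguageBind/Video-LLaVA-7B-hf` checkpoint; verify class and checkpoint names against the repo before relying on them.

```python
import av          # pip install av, for video decoding
import numpy as np
import torch
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Decode 8 uniformly spaced frames from a local clip.
container = av.open("sample.mp4")
total = container.streams.video[0].frames
indices = set(np.linspace(0, total - 1, num=8, dtype=int).tolist())
frames = [f.to_ndarray(format="rgb24")
          for i, f in enumerate(container.decode(video=0)) if i in indices]

prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=np.stack(frames), return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```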
Our Video-R1-7B obtains strong performance on several video reasoning benchmarks. For example, it attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the proprietary commercial model GPT-4o.
Wan: Open and Advanced Large-Scale Video Generative Models. In this repository, we present Wan2.1, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation.
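As an illustration of invoking such a model, here is a text-to-video sketch that assumes the Diffusers integration of Wan2.1 and the `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` checkpoint; both names are assumptions to confirm against the repo.

```python
# Text-to-video sketch; pipeline class, checkpoint id, and default settings
# are assumptions based on the Diffusers integration of Wan2.1.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

frames = pipe(
    prompt="A cat surfing a wave at sunset, cinematic lighting",
    num_frames=81,          # ~5 s at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "wan_t2v.mp4", fps=16)
```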
We introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in video analysis. It is designed to comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.
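To make the evaluation protocol concrete, here is a small scoring sketch for a Video-MME-style run that reports multiple-choice accuracy per duration bucket; the results-file schema used here is hypothetical, so adapt the field names to the benchmark's official files.

```python
# Multiple-choice accuracy broken down by duration bucket. The JSON schema
# ("duration", "answer", "prediction") is a hypothetical example.
import json
from collections import defaultdict

correct = defaultdict(int)
total = defaultdict(int)
for item in json.load(open("results.json")):
    bucket = item["duration"]            # e.g. "short" / "medium" / "long"
    total[bucket] += 1
    correct[bucket] += item["prediction"] == item["answer"]

for bucket in total:
    print(f"{bucket}: {correct[bucket] / total[bucket]:.1%} ({total[bucket]} items)")
```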
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. This is the repo for the Video-LLaMA project, which empowers large language models with video and audio understanding capabilities.
Online Video Streaming: Unlike previous models that operate in an offline mode (querying/responding to a full video), our model supports online interaction within a video stream. It can proactively update responses during the stream, for example by recording activity changes or helping with next steps in real time.
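To clarify the offline/online distinction, here is a conceptual sketch of a streaming loop in which the model ingests frames one at a time and may speak up without being queried; `StreamingVideoLM` and its methods are hypothetical stand-ins, not the project's real API.

```python
# Conceptual online-streaming loop: the model consumes frames as they arrive
# and decides per step whether to emit a proactive update.
class StreamingVideoLM:
    """Toy stand-in: keeps a running memory of frames and decides per step
    whether an update to the user is warranted."""
    def __init__(self):
        self.memory = []

    def ingest(self, frame):
        self.memory.append(frame)          # real models fold frames into a KV cache

    def maybe_respond(self):
        # A real model scores whether the scene changed enough to speak up;
        # here we emit a note every 30 frames as a placeholder.
        if len(self.memory) % 30 == 0:
            return f"[t={len(self.memory)} frames] activity update"
        return None

def run_stream(frame_source, model):
    for frame in frame_source:             # frames arrive one at a time
        model.ingest(frame)
        reply = model.maybe_respond()      # proactive: no user query needed
        if reply:
            print(reply)

run_stream(range(90), StreamingVideoLM())  # ints as dummy "frames"
```

The key design point is the inversion of control: instead of a one-shot query over a finished video, the model is driven by the frame stream and owns the decision of when to respond.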
Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality. Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.