Starred repositories
✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Qwen2-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
The repo provides information about the KeSpeech dataset.
A generative world for general-purpose robotics & embodied AI learning.
HunyuanVideo: A Systematic Framework For Large Video Generation Model
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
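⚡ A high-performance asynchronous API for automatic speech recognition (ASR) and translation. There is no need to purchase the Whisper API: inference runs on locally hosted Whisper models, with multi-GPU concurrency and a design aimed at distributed deployment. It also includes built-in crawlers for social media platforms such as TikTok and Douyin, enabling seamless media processing across multiple social platforms and providing a powerful, scalable solution for automated processing of media content data.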
Let's build better datasets, together!
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
Ongoing research training transformer models at scale
Convert any PDF into a podcast episode!
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
A high-throughput and memory-efficient inference and serving engine for LLMs
o1-engineer is a command-line tool designed to assist developers in managing and interacting with their projects efficiently. Leveraging the power of OpenAI's API, this tool provides functionalitie…
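A guide for prompt engineers, derived from the English version, with an added section on AIGC prompts; translated and updated to lower the learning barrier for students.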
A programming framework for agentic AI 🤖 PyPi: autogen-agentchat Discord: https://aka.ms/autogen-discord Office Hour: https://aka.ms/autogen-officehour
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
Speech, Language, Audio, Music Processing with Large Language Model
Official implementation of the paper "Acoustic Music Understanding Model with Large-Scale Self-supervised Training".
High-speed downloads from a mirror site using HuggingFace's official download tool.