fix(deps): update dependency transformers to v5#317
Open

dreadnode-renovate-bot[bot] wants to merge 1 commit into `main`
Conversation
Force-pushed from 157a706 to b5dd341, then from b5dd341 to 20d4b24.

| datasource | package | from | to |
| ---------- | ------------ | ------ | ----- |
| pypi | transformers | 4.57.1 | 5.3.0 |
This PR contains the following updates:
`>=4.41.0,<5.0.0` → `>=5.3.0,<5.4.0`

Release Notes
huggingface/transformers (transformers)
v5.3.0: EuroBERT, VibeVoice ASR, TimesFM 2.5, PP-DocLayoutV2, OlmoHybrid, ModernVBert, Higgs Audio V2 (Compare Source)
New Model additions
EuroBERT
EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.
Links: Documentation | Paper | Blog Post
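The key architectural distinction above (a Llama-like stack, but with bidirectional rather than causal attention) can be made concrete with a small mask-construction sketch. This is plain Python for illustration only, not EuroBERT's actual implementation:

```python
def attention_mask(seq_len: int, bidirectional: bool) -> list[list[bool]]:
    # mask[i][j] is True when position i may attend to position j.
    # Causal (decoder-style, e.g. Llama): only positions j <= i are visible.
    # Bidirectional (encoder-style, e.g. EuroBERT): every position is visible.
    return [[bidirectional or j <= i for j in range(seq_len)]
            for i in range(seq_len)]

causal = attention_mask(4, bidirectional=False)
full = attention_mask(4, bidirectional=True)
print(causal[0][3])  # False: a causal token cannot see the future
print(full[0][3])    # True: an encoder token can
```

Bidirectional attention is what lets an encoder build representations from full left and right context, which is why it suits embedding and retrieval workloads rather than autoregressive generation.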
VibeVoice ASR
VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.
Links: Documentation | Paper
TimesFM2.5
TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.
Links: Documentation | Paper
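To make "input patching" concrete: the series is split into fixed-length patches that the decoder attends over, rather than attending to individual timesteps. A minimal sketch, where the patch length and zero-padding convention are illustrative assumptions rather than TimesFM's exact preprocessing:

```python
def to_patches(series: list[float], patch_len: int) -> list[list[float]]:
    # Zero-pad the tail so the series divides evenly, then chunk it
    # into non-overlapping patches of length `patch_len`.
    pad = (-len(series)) % patch_len
    padded = series + [0.0] * pad
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]

print(to_patches([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], 3))
# [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 0.0, 0.0]]
```

Patching shortens the attention sequence by a factor of the patch length, which is one reason patch-based forecasters scale to long histories.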
PP-DocLayoutV2
PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.
Links: Documentation
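The reading-order task that the pointer network solves can be approximated by a naive baseline: sort detected boxes top-to-bottom, then left-to-right. This heuristic is not the model's learned pointer network; it only illustrates what "ordering layout elements into a reading sequence" means:

```python
def naive_reading_order(boxes: list[tuple[float, float, str]]) -> list[str]:
    # Each box is (x, y, label) with y growing downward.
    # Sort by row first, then by column -- a crude stand-in for the
    # learned pointer network that PP-DocLayoutV2 actually uses.
    return [label for _, _, label in sorted(boxes, key=lambda b: (b[1], b[0]))]

boxes = [(50.0, 200.0, "paragraph"), (50.0, 10.0, "title"), (300.0, 200.0, "figure")]
print(naive_reading_order(boxes))  # ['title', 'paragraph', 'figure']
```

A learned model is needed precisely because this heuristic breaks on multi-column layouts, sidebars, and floating figures.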
OlmoHybrid
OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with linear attention layers using the Gated Deltanet. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both KV cache for attention layers and recurrent state for linear attention layers.
Links: Documentation
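The "custom cache system" follows from each layer type needing different state: full-attention layers keep a growing KV cache, while linear-attention layers keep a fixed-size recurrent state. A sketch of such a per-layer schedule, where the 1-in-4 ratio is a made-up example and not OLMo Hybrid's actual layout:

```python
def layer_cache_types(num_layers: int, full_attn_every: int = 4) -> list[str]:
    # Hypothetical interleaving: every `full_attn_every`-th layer uses full
    # attention (KV cache); the rest use linear attention (recurrent state).
    return ["kv" if (i + 1) % full_attn_every == 0 else "recurrent"
            for i in range(num_layers)]

print(layer_cache_types(8))
# ['recurrent', 'recurrent', 'recurrent', 'kv',
#  'recurrent', 'recurrent', 'recurrent', 'kv']
```

The efficiency win comes from the recurrent layers: their state stays constant-size regardless of sequence length, so only the sparse full-attention layers pay the growing-cache cost.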
ModernVBert
ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.
Links: Documentation | Paper
ColModernVBert
ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.
Links: Documentation | Paper
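The ColPali-style scoring mentioned above is late interaction: every query-token vector is compared against every document-patch vector, the best match per query token is kept, and those maxima are summed. A dependency-free sketch of that MaxSim score:

```python
def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    # Late-interaction relevance: sum over query vectors of the maximum
    # dot product against any document vector (ColBERT/ColPali-style scoring).
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 0.5]]
print(maxsim_score(query, doc))  # 1.5
```

Keeping per-token vectors instead of one pooled embedding is what makes the multi-vector approach strong for visual documents, at the cost of larger indexes.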
Higgs Audio V2
Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.
Links: Documentation
Higgs Audio V2 Tokenizer
The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.
Links: Documentation
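The practical effect of the 25 fps frame rate is a shorter token sequence per second of audio, which directly shrinks the context an audio language model must handle. Back-of-the-envelope arithmetic:

```python
import math

def num_frames(duration_s: float, frame_rate_hz: float) -> int:
    # Frames (token positions) per clip at a given tokenizer frame rate,
    # rounding up for any partial final frame.
    return math.ceil(duration_s * frame_rate_hz)

# A 60-second clip at Higgs Audio V2's 25 fps vs. a hypothetical 50 fps baseline:
print(num_frames(60, 25))  # 1500 frames
print(num_frames(60, 50))  # 3000 frames -- twice the sequence length
```

Halving the frame rate halves both training and inference sequence lengths for downstream audio LMs, provided audio quality holds.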
Breaking changes
- Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
- The `Ernie4.5 VL MoE` model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code. ([Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299) by @vasqu)
- Several pipeline tasks have been removed or updated in the V5 cleanup (including `question-answering`, `visual-question-answering`, and `image-to-image`), requiring users to migrate to the replacement pipelines or updated task names.
- 3D position IDs for vision-language models have been unified under a common interface (sourced from `qwen2-vl`), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.
- 🚨 Tokenizer x vLLM fixes 🚨: Unigram tokenizers were missing the `spm` precompiled charsmap support. We ran an overall v4 vs v5 regression test and fixed what we had missed.
Generation
Generation input preparation was significantly refactored to stop relying on `cache_position` and instead pass pre-sliced `input_ids`/`inputs_embeds` directly to `prepare_inputs_for_generation`, simplifying the generation loop and laying groundwork for broader `cache_position` removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.

- `prepare_inputs_for_generation` (#44226) by @Cyrilvallez in [#44226]
- `cache_position` to prepare inputs (#44130) by @Cyrilvallez in [#44130]

Tokenization
Several tokenization bugs were fixed in this release, including resolving an `AttributeError` in `MLukeTokenizer` caused by the v5 rename of `additional_special_tokens`, correcting the Fuyu tokenizer class mapping, fixing `LayoutXLM` tokenization test failures from the slow tokenizer removal refactor, and adding `olmo_hybrid` to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.

Kernels
Fixed several kernel-related issues, including a security vulnerability; corrected Mamba kernel loading to handle incompatible import structures; ensured Liger Kernel is properly enabled during hyperparameter search; and expanded Flash Attention to support multiple compatible implementations.

- [Mamba] Fix kernel loading (#44176) by @vasqu in [#44176]
- [Flash Attn] Enable compatible implementations (#44177) by @vasqu in [#44177]

Quantization
This release adds several new quantization backends and fixes, including MLX quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using `reverse_op`.

Vision
Fixed backward compatibility for image processors loaded from older remote code that lack `valid_kwargs` definitions, and resolved test failures in AMD ROCm CI by adding the missing `timm` dependency to the Docker image.

- `from_dict` backward compatibility with old remote code (#44245) by @yonigozlan in [#44245]

Bugfixes and improvements
- `speaking_rate` as an optional forward argument (#43283) by @gau-nernst in [#43283]
- `ProcessingKwargs`, `ImagesKwargs` etc. to docs (#44269) by @yonigozlan in [#44269]
- `has_similar_generate_outputs` assertions (#44166) by @tarekziade in [#44166]
- `TokenizersBackend` for Olmo3 to preserve custom `pre_tokenizer` (#44294) by @mario-sanz in [#44294]
- [Modular] Fix file type regression (#44283) by @vasqu in [#44283]
- `Trainer` class docs (`compute_loss` & `hyperparameter_search`) (#44268) by @ethanknights in [#44268]
- [fix] Set input_modalities on various architectures that aren't just text (#44078) by @tomaarsen in [#44078]
- `VersionComparison.from_string` return type mismatch (#43709) by @tarekziade in [#43709]
- `AnyToAnyPipeline.__call__` docstring (#44229) by @alvarobartt in [#44229]
- `test_generate_with_and_without_position_ids` in GLM ORC (#44173) by @tarekziade in [#44173]
- `Seq2SeqTrainingArguments` documentation (#35258) by @qgallouedec in [#35258]
- `__setitem__` on `ModelOutput` even if the parameter was previously `None` (#44080) by @tomaarsen in [#44080]
- [simple] Fix up `__repr__` whitespace/brackets (#44048) by @tomaarsen in [#44048]
- [chore] Fix incorrect forward type hint for Gemma3n (#44051) by @tomaarsen in [#44051]
- `get_audio_features` (#44040) by @zucchini-nlp in [#44040]
- `Kosmos2ModelTest` test (#44061) by @tarekziade in [#44061]
- `grouped_mm` fallback (#44043) by @IlyasMoutawwakil in [#44043]

Significant community contributions
The following contributors have made significant changes to the library over the last release:
- `has_similar_generate_outputs` assertions (#44166)
- `VersionComparison.from_string` return type mismatch (#43709)
- `test_generate_with_and_without_position_ids` in GLM ORC (#44173)
- `Kosmos2ModelTest` test (#44061)
- [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299)
- [Modular] Fix file type regression (#44283)
- [Mamba] Fix kernel loading (#44176)
- [Flash Attn] Enable compatible implementations (#44177)

v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer (Compare Source)
New Model additions
VoxtralRealtime
VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.
The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
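Processing "audio in chunks as they arrive" can be sketched as a generator feeding fixed-size windows to an incremental decoder. The chunk size and data shapes here are illustrative, not Voxtral's actual API:

```python
def stream_chunks(samples: list[float], chunk_size: int):
    # Yield fixed-size audio chunks in arrival order, the way a streaming
    # ASR front-end would consume them; the final chunk may be short.
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

chunks = list(stream_chunks([0.0] * 10, chunk_size=4))
print([len(c) for c in chunks])  # [4, 4, 2]
```

The latency benefit comes from decoding each chunk as soon as it arrives instead of waiting for the full recording, which is why the model relies on causal convolutions and padding caches across chunk boundaries.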
GLM-5 - GlmMoeDsa
The zAI team launches GLM-5 and introduces it as follows:
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Renovate Bot.