Conversation
Summary of Changes

This pull request implements support for a combined Expert Parallelism and Fully Sharded Data Parallel (EP+FSDP) strategy, targeting Mixture-of-Experts (MoE) models. The changes enable more efficient distributed training by sharding experts across a dedicated FSDP mesh while preserving the existing expert parallelism. This involves updates to expert sharding, integration into the FSDP wrapping mechanism, and improvements to gradient clipping for complex distributed tensor configurations.
Code Review
This pull request introduces support for EP+FSDP, a combination of expert parallelism and fully sharded data parallelism. The changes are spread across model parallelism logic, strategy definitions, and utilities for device mesh and gradient clipping. The implementation provides a new method for sharding and executing experts under this combined parallelism scheme. My review highlights two main concerns: a potential dead code path in the expert execution logic which could lead to confusion, and a restrictive assumption in a device mesh utility function that might limit its use with more complex distributed configurations.
/gemini review
Code Review
This pull request introduces support for a new parallel strategy, ep_fsdp, which combines Expert Parallelism (EP) with Fully Sharded Data Parallelism (FSDP). This is a significant and complex feature. The implementation is well-structured, involving a major refactoring of the expert parallelism logic, a new FSDP wrapping strategy, and custom EP-aware gradient clipping. The changes are accompanied by a new cookbook example and a rigorous precision test, which is excellent.
The core of the new design is the introduction of a separate ep_fsdp_device_mesh to manage EP and FSDP on experts, decoupling it from the main device mesh. This allows for flexible parallel configurations. The expert communication logic has been refactored into a new ep_utils.py file, improving modularity. The NativeFSDPStrategy is now much more sophisticated, applying different sharding strategies and mixed-precision policies to expert and non-expert layers.
Overall, this is a high-quality contribution that adds a powerful new capability. My feedback is minor, focusing on improving a type hint for clarity in a new utility file.
```python
def tokens_post_all2all(
    expert_outputs: torch.Tensor,
    routing_weights: torch.Tensor,
    selected_experts: int,
```
```python
self._enable_expert_parallel = self._should_enable_expert_parallel(self._expert_parallel_config,
                                                                   self.device_mesh)
self._expert_parallel_applied = False
# Store ep_size for later use (EP mesh construction, grad clip, etc.)
```
Could this logic be encapsulated inside NativeFSDPStrategy? It doesn't appear to be used anywhere else.
```python
fsdp_config=self._fsdp_config,
device_mesh=self.device_mesh,
enable_ep=self._enable_expert_parallel,
ep_fsdp_device_mesh=ep_fsdp_mesh,
```
```python
    return None
world_size = self.world_size
assert world_size % ep_size == 0, (f'world_size ({world_size}) must be divisible by ep_size ({ep_size})')
ep_fsdp_size = world_size // ep_size
```
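The FSDP degree for the experts is simply the quotient of the world size by the EP degree. A minimal standalone sketch of that derivation (function name hypothetical, mirroring the assert in the diff above):

```python
def infer_ep_fsdp_size(world_size: int, ep_size: int) -> int:
    # Hypothetical helper: the experts' FSDP degree is whatever remains
    # of the world size after the EP dimension is carved out.
    assert world_size % ep_size == 0, (
        f'world_size ({world_size}) must be divisible by ep_size ({ep_size})')
    return world_size // ep_size
```

For example, `infer_ep_fsdp_size(16, 8)` gives 2 (each expert shard is further sharded over 2 ranks), while `infer_ep_fsdp_size(8, 8)` gives 1 (pure EP, no extra sharding inside the expert group).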
If ep_fsdp_size can be computed here, does it still need to be passed in from outside?

Put differently, how is it decided here that FSDP should be enabled inside the EP group?
```python
mesh = (
    torch.arange(math.prod((ep_size, ep_fsdp_size)), dtype=torch.int).view(ep_fsdp_size,
                                                                           ep_size).transpose(0, 1))
return torch.distributed.DeviceMesh(self.device_type, mesh, mesh_dim_names=('ep', 'ep_fsdp'))
```
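To see which ranks the resulting ('ep', 'ep_fsdp') mesh groups together, here is a pure-Python sketch of the same arange/view/transpose layout (no torch required; names hypothetical):

```python
def build_ep_mesh_layout(world_size: int, ep_size: int):
    """Mirror torch.arange(world).view(ep_fsdp, ep).transpose(0, 1)
    with nested lists: entry [i][j] is the global rank at
    ep-index i, ep_fsdp-index j."""
    ep_fsdp_size = world_size // ep_size
    return [[j * ep_size + i for j in range(ep_fsdp_size)] for i in range(ep_size)]

layout = build_ep_mesh_layout(world_size=8, ep_size=4)
# 'ep' groups vary along dim 0 (fixed column): blocks of consecutive ranks,
# e.g. [0, 1, 2, 3] and [4, 5, 6, 7].
ep_groups = [[layout[i][j] for i in range(4)] for j in range(2)]
# 'ep_fsdp' groups vary along dim 1 (fixed row): ranks strided by ep_size,
# e.g. [0, 4], [1, 5], ...
ep_fsdp_groups = [layout[i] for i in range(4)]
```

So the transpose makes 'ep' the leading mesh dimension: EP communication stays among consecutive ranks (typically intra-node), while the expert FSDP shards stride across EP groups.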
to_torch_device_mesh already exists; consider reusing it?
```python
self.mesh = np.array(self.mesh)
```
```diff
-valid_dim_names = {'dp', 'fsdp', 'tp', 'pp', 'cp', 'ep'}
+valid_dim_names = {'dp', 'fsdp', 'tp', 'pp', 'cp', 'ep', 'ep_fsdp'}
```
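If the dim-name whitelist keeps growing with each new parallel strategy, a small validator keeps the failure mode explicit. A sketch, assuming the set in the diff above is the authoritative list (helper name hypothetical):

```python
VALID_DIM_NAMES = {'dp', 'fsdp', 'tp', 'pp', 'cp', 'ep', 'ep_fsdp'}

def validate_dim_names(names):
    # Hypothetical helper: reject any mesh dim name outside the whitelist
    # instead of failing later with an opaque mesh-construction error.
    unknown = [n for n in names if n not in VALID_DIM_NAMES]
    if unknown:
        raise ValueError(f'unsupported mesh dim names: {unknown}')
    return list(names)
```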
Consider using from_sizes; then this line shouldn't need to change.
```python
# EP: reduce over ep_fsdp_group, then ep_group
ep_val = _local_norm_stat(ep_params, norm_type)
if ep_fsdp_group is not None:
    op = dist.ReduceOp.MAX if math.isinf(norm_type) else dist.ReduceOp.SUM
```
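The two-stage reduction (over the ep_fsdp group first, then the ep group) works because both statistics compose across groups: MAX for the inf-norm, and the sum of p-th powers for a p-norm. A single-process sketch of the combining rule (function name hypothetical, simulating what the all_reduce computes):

```python
import math

def combine_group_norms(local_norms, norm_type):
    # Emulates an all_reduce of per-rank norm statistics within one group:
    # MAX for the inf-norm; otherwise sum the p-th powers and take the root.
    if math.isinf(norm_type):
        return max(local_norms)
    return sum(n ** norm_type for n in local_norms) ** (1.0 / norm_type)
```

For example, `combine_group_norms([3.0, 4.0], 2)` gives 5.0, and combining group results is associative: reducing [1, 2] and [2, 4] separately and then combining the two results equals reducing [1, 2, 2, 4] in one step. That associativity is exactly why reducing over ep_fsdp_group and then ep_group yields the correct global norm.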
With FSDP2, is this all_reduce actually unnecessary?
```diff
 total_norm_tensor = torch.tensor(local_norm, device=reduce_device, dtype=torch.float32)
 if dist.is_initialized():
-    dist.all_reduce(total_norm_tensor, op=dist.ReduceOp.MAX, group=group)
+    dist.all_reduce(total_norm_tensor, op=dist.ReduceOp.MAX, group=reduce_group)
```
PR type
PR information
Support ep_fsdp, a combined Expert Parallelism + FSDP strategy for MoE models.
Experiment results
Env&Config
NPU*8, Qwen3-30B-A3B, GBS=16, Grad_acc=4, ep_size=8, fsdp_size=8
Loss curve comparison: ep_fsdp vs. pure fsdp
Performance comparison