
[feat]support ep_fsdp#71

Open
kevssim wants to merge 20 commits into modelscope:main from kevssim:ep_fsdp

Conversation


kevssim (Collaborator) commented Feb 24, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

support ep_fsdp.

Experiment results

Env & Config

NPU*8, Qwen3-30B-A3B, GBS=16, Grad_acc=4, ep_size=8, fsdp_size=8

Loss curve comparison: ep_fsdp vs. pure fsdp

(loss-curve figure omitted from this text capture)

Performance comparison

| Configuration | Speed (iter/s) | Memory (MB) | Speedup | Memory Reduction |
| --- | --- | --- | --- | --- |
| ep_fsdp | 0.091 | 52488 | 1.40x | -8.6% |
| pure fsdp | 0.065 | 57403 | 1.00x | - |
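The derived columns follow directly from the raw measurements; a quick sanity check (values taken from the table above):

```python
# Sanity-check the derived columns of the performance table above.
ep_fsdp_speed, pure_fsdp_speed = 0.091, 0.065  # iter/s
ep_fsdp_mem, pure_fsdp_mem = 52488, 57403      # MB

speedup = ep_fsdp_speed / pure_fsdp_speed                        # vs. pure fsdp
mem_change_pct = (ep_fsdp_mem - pure_fsdp_mem) / pure_fsdp_mem * 100

print(f"speedup: {speedup:.2f}x")               # speedup: 1.40x
print(f"memory change: {mem_change_pct:.1f}%")  # memory change: -8.6%
```

So ep_fsdp is 1.40x faster while using about 8.6% less memory than pure fsdp in this setup.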

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @kevssim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements support for a combined Expert Parallelism and Fully Sharded Data Parallel (EP+FSDP) strategy, specifically targeting Mixture-of-Experts (MoE) models. The changes enable more efficient distributed training by allowing experts to be sharded across a dedicated FSDP mesh while maintaining existing expert parallelism. This involves updates to expert sharding, integration into the FSDP wrapping mechanism, and improvements to gradient clipping for complex distributed tensor configurations.

Highlights

  • Introduced EP+FSDP Support for MoE Models: Enabled a new distributed training strategy that combines Expert Parallelism (EP) with Fully Sharded Data Parallel (FSDP) for Mixture-of-Experts (MoE) models.
  • Enhanced Expert Sharding Logic: Modified the expert_parallel.py module to conditionally handle expert execution and sharding based on whether EP+FSDP is enabled, including a new batch processing function for this mode.
  • Integrated EP+FSDP into FSDP Strategy: Updated the native_fsdp.py module to detect and apply EP+FSDP, including building a dedicated ep_fsdp_mesh and selectively sharding expert blocks using FSDP.
  • Improved Gradient Clipping for Distributed Tensors: Refined the grad_clip.py utility to correctly handle gradient clipping in scenarios involving mixed DTensor meshes, ensuring proper reduction across different device meshes.
  • Extended DeviceMesh Functionality: Added new methods to DeviceMesh in platform.py to facilitate the detection of implicit EP+FSDP mode and to retrieve ranks for specific dimensions, crucial for constructing the ep_fsdp_mesh.


Changelog
  • src/twinkle/metric/train_metric.py
    • Updated f-string formatting for learning rate and total time elapse.
  • src/twinkle/model/transformers/moe/expert_parallel.py
    • Added ep_fsdp_enabled flag and passed it to shard_experts.
    • Introduced a check for nn.ModuleList experts not supported with EP+EP_FSDP.
    • Stored _ep_fsdp_enabled on the block.
    • Implemented conditional expert execution in forward to use _run_experts_ep_fsdp_batch for EP+EP_FSDP.
    • Modified _run_expert to directly call experts.forward when EP+EP_FSDP is enabled.
    • Added _run_experts_ep_fsdp_batch function for batch processing of experts in EP+EP_FSDP mode.
  • src/twinkle/model/transformers/strategy/native_fsdp.py
    • Added _is_ep_fsdp_mode_enabled to check for EP+FSDP.
    • Introduced _build_ep_fsdp_mesh to create a dedicated mesh for EP+FSDP.
    • Added _ensure_ep_fsdp_supported to validate expert types for EP+FSDP.
    • Implemented _maybe_shard_ep_expert_blocks to apply FSDP sharding to expert blocks on the ep_fsdp_mesh.
    • Integrated EP+FSDP setup into the wrap_model method.
  • src/twinkle/utils/grad_clip.py
    • Modified normalize_and_clip_grad_norm to detect and handle has_mixed_dtensor_mesh.
    • Adjusted _local_grad to set reduce_group to None for mixed DTensor meshes, forcing world reduction over local shards.
  • src/twinkle/utils/platform.py
    • Added get_ranks_for_dims method to retrieve ranks for specified dimensions.
    • Implemented is_implicit_ep_fsdp_enabled to check if implicit EP+FSDP mode is active based on world sizes.
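The dedicated ep_fsdp mesh is built by reshaping the flat rank list (see the arange/view/transpose snippet quoted in the review below). A minimal pure-Python sketch of the resulting rank layout; the sizes here are illustrative, not the PR's test config:

```python
# Sketch of the (ep, ep_fsdp) rank layout; illustrative, not the PR's code.
# Mirrors torch.arange(world_size).view(ep_fsdp_size, ep_size).transpose(0, 1).
def ep_fsdp_layout(world_size: int, ep_size: int) -> list[list[int]]:
    assert world_size % ep_size == 0, 'world_size must be divisible by ep_size'
    ep_fsdp_size = world_size // ep_size
    # Row e lists the ranks that together FSDP-shard the experts owned by EP rank e.
    return [[f * ep_size + e for f in range(ep_fsdp_size)] for e in range(ep_size)]

mesh = ep_fsdp_layout(world_size=8, ep_size=4)
print(mesh)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each row is an ep_fsdp group (strided ranks) and each column is an EP group (contiguous ranks), matching mesh_dim_names=('ep', 'ep_fsdp').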

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces support for EP+FSDP, a combination of expert parallelism and fully sharded data parallelism. The changes are spread across model parallelism logic, strategy definitions, and utilities for device mesh and gradient clipping. The implementation provides a new method for sharding and executing experts under this combined parallelism scheme. My review highlights two main concerns: a potential dead code path in the expert execution logic which could lead to confusion, and a restrictive assumption in a device mesh utility function that might limit its use with more complex distributed configurations.

kevssim marked this pull request as ready for review on March 10, 2026 09:14

kevssim commented Mar 10, 2026

/gemini review

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces support for a new parallel strategy, ep_fsdp, which combines Expert Parallelism (EP) with Fully Sharded Data Parallelism (FSDP). This is a significant and complex feature. The implementation is well-structured, involving a major refactoring of the expert parallelism logic, a new FSDP wrapping strategy, and custom EP-aware gradient clipping. The changes are accompanied by a new cookbook example and a rigorous precision test, which is excellent.

The core of the new design is the introduction of a separate ep_fsdp_device_mesh to manage EP and FSDP on experts, decoupling it from the main device mesh. This allows for flexible parallel configurations. The expert communication logic has been refactored into a new ep_utils.py file, improving modularity. The NativeFSDPStrategy is now much more sophisticated, applying different sharding strategies and mixed-precision policies to expert and non-expert layers.

Overall, this is a high-quality contribution that adds a powerful new capability. My feedback is minor, focusing on improving a type hint for clarity in a new utility file.

```python
def tokens_post_all2all(
    expert_outputs: torch.Tensor,
    routing_weights: torch.Tensor,
    selected_experts: int,
```

@gemini-code-assist bot commented (severity: medium):

The type hint for selected_experts is int, but it is used as a tensor in generate_weights_idx. Based on its usage and where it's produced (from torch.topk), the type should be torch.Tensor.

Suggested change:

```diff
-    selected_experts: int,
+    selected_experts: torch.Tensor,
```

```python
self._enable_expert_parallel = self._should_enable_expert_parallel(self._expert_parallel_config,
                                                                   self.device_mesh)
self._expert_parallel_applied = False
# Store ep_size for later use (EP mesh construction, grad clip, etc.)
```

Collaborator comment:

Can this logic be encapsulated inside NativeFSDPStrategy? It doesn't seem to be used anywhere else.

```python
fsdp_config=self._fsdp_config,
device_mesh=self.device_mesh,
enable_ep=self._enable_expert_parallel,
ep_fsdp_device_mesh=ep_fsdp_mesh,
```

Collaborator comment:

Could the ep_fsdp_device_mesh be constructed internally instead of being passed in here?

```python
return None
world_size = self.world_size
assert world_size % ep_size == 0, (f'world_size ({world_size}) must be divisible by ep_size ({ep_size})')
ep_fsdp_size = world_size // ep_size
```

Collaborator comment:

If ep_fsdp_size can be computed here, does it still need to be passed in from outside?

Collaborator comment:

Put differently, how do we determine here that FSDP needs to be enabled inside EP?

```python
mesh = (
    torch.arange(math.prod((ep_size, ep_fsdp_size)), dtype=torch.int).view(ep_fsdp_size,
                                                                           ep_size).transpose(0, 1))
return torch.distributed.DeviceMesh(self.device_type, mesh, mesh_dim_names=('ep', 'ep_fsdp'))
```

Collaborator comment:

to_torch_device_mesh already exists; consider reusing it?

```diff
 self.mesh = np.array(self.mesh)

-valid_dim_names = {'dp', 'fsdp', 'tp', 'pp', 'cp', 'ep'}
+valid_dim_names = {'dp', 'fsdp', 'tp', 'pp', 'cp', 'ep', 'ep_fsdp'}
```

Collaborator comment:

Consider using from_sizes; then this line should not need to change.

Collaborator comment:

See the from_sizes code for reference.

```python
# EP: reduce over ep_fsdp_group, then ep_group
ep_val = _local_norm_stat(ep_params, norm_type)
if ep_fsdp_group is not None:
    op = dist.ReduceOp.MAX if math.isinf(norm_type) else dist.ReduceOp.SUM
```

Collaborator comment:

Doesn't FSDP2 make the all_reduce unnecessary here?
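For context, the two-stage reduction quoted above composes the same global norm as a single flat reduction. A pure-Python sketch of a 2-norm reduced first within each ep_fsdp group and then across the ep group; the group layout and numbers are illustrative:

```python
import math

# Sketch: hierarchical 2-norm reduction over ep_fsdp then ep groups.
# Each entry is one rank's local sum of squared gradient elements,
# laid out as local_sq[ep_rank][ep_fsdp_rank] (illustrative 2x2 layout).
local_sq = [[1.0, 4.0],    # ep rank 0: its two ep_fsdp shards
            [9.0, 16.0]]   # ep rank 1

# Stage 1: all-reduce(SUM) inside each ep_fsdp group.
per_ep = [sum(shards) for shards in local_sq]   # [5.0, 25.0]
# Stage 2: all-reduce(SUM) across the ep group.
total_sq = sum(per_ep)                          # 30.0
grad_norm = math.sqrt(total_sq)

# Same result as one flat reduction over all ranks:
assert grad_norm == math.sqrt(sum(sum(s) for s in local_sq))
print(round(grad_norm, 4))  # 5.4772
```

For the infinity norm the same structure applies with MAX in place of SUM at both stages, which is why the op is picked based on norm_type in the snippet above.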

```diff
 total_norm_tensor = torch.tensor(local_norm, device=reduce_device, dtype=torch.float32)
 if dist.is_initialized():
-    dist.all_reduce(total_norm_tensor, op=dist.ReduceOp.MAX, group=group)
+    dist.all_reduce(total_norm_tensor, op=dist.ReduceOp.MAX, group=reduce_group)
```

Collaborator comment:

Same question as above: is this reduce needed?
