[SP] add SP deny list instead of allow #7887
kashif wants to merge 20 commits into deepspeedai:master from
Conversation
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
tohtana
left a comment
Hi @kashif,
Thank you for opening this PR! I think supporting HF hub kernels is a significant update.
Regarding the approach: we check whether core_attn_implementation is in ALL_ATTENTION_FUNCTIONS, but HF hub kernels like kernels-community/flash-attn2 are not in that list. So HF hub kernels still won't be available with this fix.
We probably need to do the proper registration steps:
- Reject known-bad impls explicitly: eager, flex_attention, and probably paged|eager.
- If core_attn_implementation is an HF hub kernel string, call the HF registration path first (using lazy_import_flash_attention(…)).
- Then read core_attn_function = ALL_ATTENTION_FUNCTIONS[core_attn_implementation].
- Build uattn from that original function.
- Replace that key with uattn_wrapper.
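The steps above could be sketched roughly as follows. This is illustrative only: ALL_ATTENTION_FUNCTIONS is modeled as a plain dict (in HF Transformers it is the real attention registry), and register_hub_kernel stands in for the actual HF hub registration path (e.g. lazy_import_flash_attention); all function names here are hypothetical.

```python
# Minimal sketch of the registration steps above, with the HF registry
# modeled as a plain dict; names are illustrative, not DeepSpeed's API.
ALL_ATTENTION_FUNCTIONS = {"sdpa": lambda *args, **kwargs: "sdpa_out"}

DENY_LIST = {"eager", "flex_attention", "paged|eager"}


def register_ulysses_attn(core_attn_implementation, register_hub_kernel):
    # 1. reject known-bad implementations explicitly
    if core_attn_implementation in DENY_LIST:
        raise ValueError(f"{core_attn_implementation} is not supported by Ulysses SP")
    # 2. HF hub kernel strings like "kernels-community/flash-attn2" are not
    #    pre-registered, so run the HF registration path first
    if core_attn_implementation not in ALL_ATTENTION_FUNCTIONS:
        register_hub_kernel(core_attn_implementation)
    # 3. read the original function from the registry
    core_attn_function = ALL_ATTENTION_FUNCTIONS[core_attn_implementation]

    # 4. build the Ulysses wrapper from that original function
    def uattn_wrapper(*args, **kwargs):
        # a real wrapper would do the Ulysses all-to-all around this call
        return core_attn_function(*args, **kwargs)

    # 5. replace the registry key with the wrapper
    ALL_ATTENTION_FUNCTIONS[core_attn_implementation] = uattn_wrapper
    return uattn_wrapper
```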
Does it make sense to you?
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
thanks @tohtana I have tried to fix all the issues raised; could you kindly check again?
We actually don't know if flex_attention is bad, we just haven't tried it out. Do you have resources to try it out, Kashif? Same for the others on the list. That's why we started with an approve list rather than a deny list. The only reason eager is denied is that it requires a 4D attention_mask, which is a bad idea for long sequences. BTW, SDPA is silently broken with packed samples: when there is no attn mask, it ignores pos ids and attends to the whole sequence instead. Expect bad results. Not sure how to flag that to users - probably we need to inspect pos ids, see if they reset at least once, and disallow sdpa then.
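The "see if they reset at least once" check suggested above could be sketched like this; position_ids are plain nested lists for illustration, and the function name is hypothetical (the real check would operate on tensors):

```python
def has_packed_samples(position_ids):
    """Detect packed samples via position_ids resets.

    Within a single sample, positions strictly increase; any non-increase
    (e.g. ... 510, 511, 0, 1 ...) marks a packed sample boundary.
    """
    for seq in position_ids:  # one list of position ids per batch row
        for prev, cur in zip(seq, seq[1:]):
            if cur < prev:
                return True
    return False
```

If this returns True and no attention mask is provided, sdpa/eager could then be disallowed rather than silently attending across sample boundaries.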
Hi @kashif, I also think Stas's comment makes sense. Can you try implementing such a validation?
sure @tohtana i can check
to make things more exact - it's packed samples + pos ids + 4D
oh, Kashif, I'm being told
I ran some experiments comparing flash_attention_2, sdpa, and flex_attention with SP=4 on Qwen3-4B (GQA: 32 Q
Without SP (1 GPU baseline): flash_attention_2 and sdpa produce identical losses, confirming the backends are
With SP=4 (4 GPUs): sdpa and flex_attention match each other, but both diverge significantly from
@stas00 any ideas on what flash_attention_2 might be doing differently after the all-to-all that
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
ok @stas00 I now generate position_ids if missing from the batch, build a causal BlockMask for flex_attention, and do a one-time validation for packed samples + sdpa/eager. Now the outputs are matching:
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Thank you for running those quality comparison experiments, Kashif. I'm a bit unclear about your last "success" comment - what was missing to make FA2 match? Are you saying the mismatch was from missing position_ids? But we already said that SDPA (and now most likely FlexAttention) has trouble with no-attn-mask / yes-pos-id and will ignore packed samples. FA2 on the other hand does the right thing here. And it's great to hear Flex Attention works as well with Ulysses, so we could add it to the allow list.
if has_packed_samples and self.core_attn_implementation in ("sdpa", "eager"):
    raise ValueError(
heh, I thought we were discussing that it's HF Transformers that has to do that, not Ulysses SP. It affects all users regardless of whether they use Ulysses or not. Unless HF Transformers disallows not providing attn-mask with sdpa/eager, which I don't think is the case.
agree, removed from DeepSpeed side
So, FA2 was the one producing correct results, while SDPA/flex were wrong. Here's what was happening:
When FA2 "accidentally" handles this correctly
SDPA with
The fix: generate
With this fix, all three backends match within numerical precision:
For
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
great explanations, Kashif - thank you!
Thank you, Kashif
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
@stas00, regarding point 2, we added
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Thank you very much, Kashif. Do you think all this amazing tooling you added should live here and not in HF Transformers?
checking
So some SP-specific things tied to the all-to-all make sense to be here...
Agree that in Transformers:
On the TRL side:
Thank you for the detailed summary, Kashif. I agree with everything, except:
I think it should assert. Warnings don't work, and allowing invalid training can be so, so costly to the user who missed the warning in the sea of warnings. I wonder how many people will discover their model has been mistrained when they had no clue that was the case, other than by getting bad outcomes.
Please let us know when things are ready for the final review, Kashif.
thanks @stas00 yes we are asserting the
yes ready for review, thanks!
I meant inside transformers. Currently transformers may provide a disservice to users if they use packed samples w/o attention mask w/ sdpa/eager - or is it the case that transformers enforces a 4D attention mask?
stas00
left a comment
Kashif,
overall looks great - added a few suggestions
- do you think we should discuss the different supported attn types in the tutorial as well?
if so let's add a brief section there?
- also we can now probably test fa4 and add it to the list - fa4 support has been merged in transformers a few days ago.
f"{core_attn_implementation} attn_implementation isn't currently supported by Ulysses sequence"
f" parallelism. Set core_attn_implementation arg to one of {supported_attn_implementation}.")
f" parallelism because it requires a 4D attention_mask (O(n²) memory)."
f" Use 'flash_attention_2', 'flash_attention_3', 'flex_attention', 'sdpa',"
Should we future-proof this for FA and say any official flash attention version?
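One way to future-proof the check, as a sketch: accept any flash_attention_N via a pattern instead of hardcoding versions. The helper name and the static set are hypothetical, not DeepSpeed's actual code:

```python
import re

# statically supported non-FA implementations (illustrative set)
SUPPORTED_STATIC = {"flex_attention", "sdpa"}


def is_supported_attn_implementation(impl):
    # accept any official flash_attention_<N> (2, 3, 4, ...) without
    # hardcoding each version in the error message / allow list
    if re.fullmatch(r"flash_attention_\d+", impl):
        return True
    return impl in SUPPORTED_STATIC
```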
- Use lazy imports for BlockMask/create_block_mask instead of storing on instance attributes, fixing multiprocessing pickle errors in tests
- Future-proof error message for unsupported attn implementations
- Add TestUlyssesSPHFFlexAttention test class with non_daemonic_procs and a model with head_dim >= 16 (flex_attention compiled kernel requirement)

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…rward()
- Move `attention_mask = self._flex_block_mask_cached` inside the `isinstance(attention_mask, BlockMask)` guard to prevent stale cache from leaking when attention_mask is not a BlockMask
- Add warning_once in forward() when position_ids are missing, so users who bypass UlyssesSPDataLoaderAdapter are alerted to potential incorrect causal masking

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
the test is failing... i am investigating
- Replace Felladrin/Llama-160M-Chat-v1 with a locally-created LlamaConfig (head_dim=16, 2 layers, 2 heads) to match DeepSpeed's convention of using tiny models in tests, avoiding external model downloads
- Remove _compile=True from create_block_mask; it caused gradient explosion in the backward pass through torch.compile
- Set random seed for reproducible model initialization
- Use torch_assert_close for loss (flex_attention + torch.compile introduces tiny numerical differences vs exact match)
- Parametrize over zero_stage [2, 3] matching existing test convention

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
ok tests passing now
Compiling create_block_mask (via _compile=True or torch.compile) inside the model forward causes gradient explosion when the resulting BlockMask is used with flex_attention's own torch.compile. The nested compilation contexts conflict in the backward pass. Since the BlockMask is already cached and only rebuilt when dimensions change, the creation cost is negligible without compilation. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
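The rebuild-only-on-dimension-change caching described in this commit message could be sketched as follows; the class and attribute names are illustrative, not DeepSpeed's actual implementation:

```python
class BlockMaskCache:
    """Cache a BlockMask, rebuilding only when sequence dims change.

    Since creation is cheap without _compile=True, rebuilding on a
    dimension change costs little, and caching avoids the nested
    torch.compile contexts that conflicted in the backward pass.
    """

    def __init__(self):
        self._key = None
        self._mask = None

    def get(self, q_len, kv_len, build_fn):
        key = (q_len, kv_len)
        if key != self._key:
            # dims changed: rebuild the mask (e.g. via create_block_mask)
            self._key = key
            self._mask = build_fn(q_len, kv_len)
        return self._mask
```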
position_ids are required for Ulysses SP — without them each rank generates local [0..chunk_len-1] positions which break causal masking after the all_gather. A warning is useless since training silently produces wrong results. Make it a hard assert with actionable guidance. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
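The failure mode described above can be made concrete with a small sketch: each SP rank must use positions offset by its chunk start, not a local [0..chunk_len-1] range. The helper name is hypothetical:

```python
def global_position_ids(sp_rank, chunk_len):
    # Each SP rank holds one contiguous chunk of the full sequence, so its
    # positions must start at sp_rank * chunk_len. Local [0..chunk_len-1]
    # positions on every rank would break causal masking after the all_gather.
    start = sp_rank * chunk_len
    return list(range(start, start + chunk_len))
```

For a 16-token sequence split over SP=4, rank 1 gets positions [4, 5, 6, 7] rather than [0, 1, 2, 3], which is why the hard assert (instead of a warning) matters: local positions train silently but wrongly.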
this way one can register kernels-based flash-attn as well with SP