[fix] Fix cuda ipc weight sync after #1271 #1292

Merged
erictang000 merged 1 commit into NovaSky-AI:main from erictang000:fix_cuda_ipc_sync on Mar 7, 2026

@erictang000 (Collaborator) commented Mar 7, 2026

Fixes the following error, encountered after #1271 when running colocated training with CUDA IPC weight sync:

                                  ^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(TypeError): ray::FSDPPolicyWorkerBase.broadcast_to_inference_engines() (pid=238850, ip=10.0.135.49, actor_id=17e6181c4a6c57b58e04aee108000000, repr=<skyrl.backends.skyrl_train.workers.fsdp.fsdp_worker.FSDPPolicyWorkerBase object at 0x7d820dd11df0>)
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-03-06_17-46-17_343270_6070/runtime_resources/working_dir_files/_ray_pkg_3a51306235ee6269/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py", line 256, in broadcast_to_inference_engines
    await self._weight_transfer_sender.send_chunks(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: CudaIpcWeightTransferSender.send_chunks() got an unexpected keyword argument 'weight_metadata'
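
The failure mode in the traceback can be reproduced in isolation. The sketch below is not the SkyRL code: `WeightTransferSender` and `StaleCudaIpcSender` are hypothetical stand-ins, and the shape of the base signature is an assumption. It only shows that a caller passing `weight_metadata` by keyword to an override that lacks that parameter raises the same kind of `TypeError`.

```python
import asyncio


class WeightTransferSender:
    """Hypothetical base interface (assumed shape, not the real SkyRL class)."""

    async def send_chunks(self, chunks, weight_metadata=None):
        raise NotImplementedError


class StaleCudaIpcSender(WeightTransferSender):
    """Hypothetical override that was never updated with the new keyword."""

    async def send_chunks(self, chunks):
        return len(list(chunks))


async def broadcast(sender):
    # The caller passes weight_metadata by keyword, as the traceback suggests.
    return await sender.send_chunks(
        chunks=[b"a", b"b"], weight_metadata={"names": ["w"]}
    )


def reproduce():
    """Return the TypeError message, or None if the call succeeds."""
    try:
        asyncio.run(broadcast(StaleCudaIpcSender()))
    except TypeError as exc:
        return str(exc)
    return None


message = reproduce()
print(message)
```

The error is raised when the arguments are bound at call time, before any coroutine runs, which is why it surfaces at the `await` line in the caller.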


@erictang000 merged commit e9e5fb2 into NovaSky-AI:main on Mar 7, 2026
2 of 4 checks passed
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a TypeError that occurs during CUDA IPC weight synchronization by updating the send_chunks method signature in CudaIpcWeightTransferSender. The new signature includes the weight_metadata keyword argument, which aligns it with the base class interface and resolves the error. The method's docstring is also updated to document the new argument.
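
As a rough illustration of the fix described in this review, the sketch below aligns a subclass signature with its base and adds a small check that reports base parameters missing from an override. Everything here is hypothetical: `WeightTransferSender`, `CudaIpcSenderSketch`, and `missing_params` are illustrative names, and treating `weight_metadata` as optional is an assumption rather than something the PR states.

```python
import asyncio
import inspect


class WeightTransferSender:
    """Hypothetical base interface (assumed shape, not the real SkyRL class)."""

    async def send_chunks(self, chunks, weight_metadata=None):
        raise NotImplementedError


class CudaIpcSenderSketch(WeightTransferSender):
    async def send_chunks(self, chunks, weight_metadata=None):
        """Send weight chunks.

        Args:
            chunks: iterable of weight chunks to send.
            weight_metadata: optional metadata; accepted to match the base
                interface and ignored in this sketch.
        """
        sent = 0
        for _ in chunks:
            sent += 1  # A real sender would transfer the chunk here.
        return sent


def missing_params(base, sub, method):
    """List parameters of base.method that sub.method does not accept."""
    base_params = inspect.signature(getattr(base, method)).parameters
    sub_params = inspect.signature(getattr(sub, method)).parameters
    # An override that takes **kwargs accepts any keyword argument.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in sub_params.values()):
        return []
    return [name for name in base_params if name not in sub_params]


gaps = missing_params(WeightTransferSender, CudaIpcSenderSketch, "send_chunks")
sent = asyncio.run(
    CudaIpcSenderSketch().send_chunks(chunks=[1, 2, 3], weight_metadata={})
)
```

A check like `missing_params` could run in a unit test over every sender subclass, so a later change to the base signature fails in CI instead of in a colocated training run.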

@devin-ai-integration bot (Contributor) left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.
