SFT example notebook references inaccessible S3 dataset URI #5627

@manuwaik

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
The SFT finetuning example notebook hardcodes an S3 URI (s3://mc-flows-sdk-testing/...) in a bucket that external users cannot read. Anyone who runs the notebook as written gets a 403 Forbidden error from the S3 HeadObject call as soon as DataSet.create tries to download the file to register the dataset.

The notebook should either use a publicly accessible dataset or clearly instruct users to substitute their own, with a link to documentation of the required dataset format.

To reproduce
Run the following cell from sft_finetuning_example_notebook_pysdk_prod_v3.ipynb as-is:

from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="demo-1",
    source="s3://mc-flows-sdk-testing/input_data/sft/sample_data_256_final.jsonl"
)

Expected behavior
The example notebook should work out of the box, or clearly guide users to supply their own dataset with instructions on the required format.
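
One way the notebook could guide users, sketched here as a suggestion rather than a proposed patch: replace the hardcoded URI with an obvious placeholder and fail early with a readable message if it has not been changed. The helper name `resolve_dataset_source` and the `PLACEHOLDER` value are hypothetical and not part of the SDK; the only thing taken from the notebook is its own comment that `source` can be a local file path or an S3 URL.

```python
import os

# Hypothetical placeholder the notebook could ship instead of a private bucket URI.
PLACEHOLDER = "s3://<your-bucket>/<path-to>/train.jsonl"


def resolve_dataset_source(source):
    """Validate a user-supplied dataset source before registering it.

    Accepts an S3 URI or a local file path, as the notebook comment describes.
    Raises with a readable message instead of a deep S3 traceback.
    """
    source = (source or "").strip()
    if not source or "<" in source:
        raise ValueError(
            "Set `source` to your own dataset (S3 URI or local file path); "
            "the placeholder value has not been replaced."
        )
    if source.startswith("s3://"):
        return source  # access to the object is not checked here
    if not os.path.isfile(source):
        raise FileNotFoundError(f"Local dataset file not found: {source}")
    return source
```

In the notebook cell this would wrap the existing argument, for example `DataSet.create(name="demo-1", source=resolve_dataset_source(my_source))`, where `my_source` is whatever the user provides.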

Screenshots or logs

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:9                                                                                    │
│                                                                                                  │
│    6 # Register dataset in SageMaker AI Registry                                                 │
│    7 # This creates a versioned dataset that can be referenced by ARN                            │
│    8 # Provide a source (it can be local file path or S3 URL)                                    │
│ ❱  9 dataset = DataSet.create(                                                                   │
│   10 │   name="demo-1",                                                                          │
│   11 │   source="s3://mc-flows-sdk-testing/input_data/sft/sample_data_256_final.jsonl"           │
│   12 )                                                                                           │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/core/telemetry/telemetry_logging.py:172 in wrapper │
│ ❱ 172 │   │   │   │   │   │   raise caught_ex                                                    │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/core/telemetry/telemetry_logging.py:143 in wrapper │
│ ❱ 143 │   │   │   │   │   response = func(*args, **kwargs)                                       │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/ai_registry/dataset.py:283 in create               │
│   280 │   │   │   │   local_path = tmp_file.name                                                 │
│   281 │   │                                                                                      │
│   282 │   │   │   try:                                                                           │
│ ❱ 283 │   │   │   │   AIRHub.download_from_s3(source, local_path)                                │
│   284 │   │   │   │   cls._validate_dataset_format(local_path)                                   │
│   285 │   │   │   finally:                                                                       │
│   286 │   │   │   │   if os.path.exists(local_path):                                             │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/core/telemetry/telemetry_logging.py:180 in wrapper │
│ ❱ 180 │   │   │   │   return func(*args, **kwargs)                                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/ai_registry/air_hub.py:290 in download_from_s3     │
│   287 │   │   parsed = urlparse(s3_uri)                                                          │
│   288 │   │   bucket = parsed.netloc                                                             │
│   289 │   │   key = parsed.path.lstrip("/")                                                      │
│ ❱ 290 │   │   AIRHub._s3_client.download_file(bucket, key, local_path)                           │
│   291                                                                                            │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/context.py:123 in wrapper                           │
│ ❱ 123 │   │   │   │   return func(*args, **kwargs)                                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/boto3/s3/inject.py:223 in download_file                      │
│   222 │   with S3Transfer(self, Config) as transfer:                                             │
│ ❱ 223 │   │   return transfer.download_file(                                                     │
│   224 │   │   │   bucket=Bucket,                                                                 │
│   225 │   │   │   key=Key,                                                                       │
│   226 │   │   │   filename=Filename,                                                             │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/boto3/s3/transfer.py:484 in download_file                    │
│ ❱ 484 │   │   │   future.result()                                                                │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/futures.py:111 in result                          │
│ ❱ 111 │   │   │   return self._coordinator.result()                                              │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/futures.py:287 in result                          │
│   286 │   │   if self._exception:                                                                │
│ ❱ 287 │   │   │   raise self._exception                                                          │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/tasks.py:272 in _main                             │
│ ❱ 272 │   │   │   self._submit(transfer_future=transfer_future, **kwargs)                        │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/download.py:359 in _submit                        │
│   356 │   │   │   transfer_future.meta.size is None                                              │
│   357 │   │   │   or transfer_future.meta.etag is None                                           │
│   358 │   │   ):                                                                                 │
│ ❱ 359 │   │   │   response = client.head_object(                                                 │
│   360 │   │   │   │   Bucket=transfer_future.meta.call_args.bucket,                              │
│   361 │   │   │   │   Key=transfer_future.meta.call_args.key,                                    │
│   362 │   │   │   │   **transfer_future.meta.call_args.extra_args,                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/client.py:602 in _api_call                          │
│ ❱ 602 │   │   │   return self._make_api_call(operation_name, kwargs)                             │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/context.py:123 in wrapper                           │
│ ❱ 123 │   │   │   │   return func(*args, **kwargs)                                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/client.py:1078 in _make_api_call                    │
│   1075 │   │   │   │   'error_code_override'                                                     │
│   1076 │   │   │   ) or error_info.get("Code")                                                   │
│   1077 │   │   │   error_class = self.exceptions.from_code(error_code)                           │
│ ❱ 1078 │   │   │   raise error_class(parsed_response, operation_name)                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
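
As a possible workaround while the notebook is unchanged, a small preflight check can turn this traceback into a one-line message. The sketch below is an assumption about how a user might do that, not SDK code: `parse_s3_uri` and `check_s3_access` are hypothetical helpers. The URI parsing mirrors the `urlparse` logic visible in the air_hub.py frame above, and the check calls `head_object`, the same operation that fails in the log. The client is passed in, so any object that behaves like `boto3.client("s3")` works.

```python
from urllib.parse import urlparse


def parse_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def check_s3_access(s3_client, s3_uri):
    """Return None if the object can be read, else a short error message.

    s3_client is assumed to behave like boto3.client("s3").
    """
    bucket, key = parse_s3_uri(s3_uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
    except Exception as exc:
        # botocore's ClientError exposes the service error in a .response dict.
        response = getattr(exc, "response", None)
        if not isinstance(response, dict):
            raise  # not an AWS service error; let it propagate unchanged
        code = response.get("Error", {}).get("Code", "unknown")
        return (
            f"Cannot read {s3_uri} (error code {code}). "
            "Use a dataset in a bucket your credentials can access."
        )
    return None
```

Calling `check_s3_access(boto3.client("s3"), source)` before `DataSet.create(...)` would report the 403 for the URI in this issue without the full stack trace.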

System information

  • SageMaker Python SDK version: SageMaker 3.5.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): SFTTrainer
  • Framework version: N/A
  • Python version: 3.11
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
Affected file: v3-examples/model-customization-examples/sft_finetuning_example_notebook_pysdk_prod_v3.ipynb
