Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script #1374

Draft
cjac wants to merge 15 commits into GoogleCloudDataproc:main from LLC-Technologies-Collier:gpu-202601

cjac (Contributor) commented Jan 23, 2026

GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability

This update improves the install_gpu_driver.sh script and its accompanying documentation. It focuses on robust support for environments that combine Secure Boot with HTTP/S proxies, and on overall reliability and maintainability.

1. gpu/README.md:

  • Comprehensive Documentation for Secure Boot & Proxy:
    • Added a major section: "Building Custom Images with Secure Boot and Proxy Support". This details the end-to-end process using the GoogleCloudDataproc/custom-images repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
    • Added a major section: "Launching a Cluster with the Secure Boot Custom Image". This explains how to use the custom-built images to launch Dataproc clusters with --shielded-secure-boot. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the GoogleCloudDataproc/cloud-dataproc repository for VPC, subnet, and proxy configuration.
    • Added verification steps for checking driver status, module signatures, and system logs on the cluster nodes.
  • Enhanced Proxy Metadata: Clarified and expanded descriptions for proxy-related metadata: http-proxy, https-proxy, proxy-uri, no-proxy, and http-proxy-pem-uri.
  • New Section: "Enhanced Proxy Support": Explicitly outlines the script's capabilities in proxied environments, including custom CA certificate handling, automatic tool configuration (curl, apt, dnf, gpg, Java), and bypass mechanisms.
  • Troubleshooting: Added specific points for debugging network and proxy issues.
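
For reviewers who want to try the verification steps by hand, here is a minimal sketch of what such checks might look like on a cluster node. It is not taken from the README. The function names, the default module name `nvidia`, and the choice of `dmesg` as the log source are assumptions made for illustration. Each check prints a message instead of failing when a tool is missing.

```shell
#!/usr/bin/env bash
# Hypothetical node-side checks: driver status, module signature, recent logs.
set -euo pipefail

# Read modinfo-style text on stdin and print the first "signer:" value, if any.
signer_from_modinfo() {
  awk '/^signer:/ && !seen { sub(/^signer:[ \t]*/, ""); print; seen = 1 }'
}

# Driver status: nvidia-smi exits non-zero when it cannot talk to the driver.
check_driver_status() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed; the driver may not be loaded"
  else
    echo "nvidia-smi not found"
  fi
}

# Module signature: a signed module reports a non-empty signer field.
check_module_signature() {
  local module="${1:-nvidia}"   # module name is an assumption
  local signer
  signer="$( (modinfo "${module}" 2>/dev/null || true) | signer_from_modinfo)"
  if [[ -n "${signer}" ]]; then
    echo "${module}: signed by ${signer}"
  else
    echo "${module}: no signer field (unsigned, missing, or modinfo unavailable)"
  fi
}

# System logs: show the last few kernel messages that mention the driver.
check_recent_logs() {
  local lines
  lines="$(dmesg 2>/dev/null | grep -iE 'nvidia|nvrm' | tail -n 20 || true)"
  if [[ -n "${lines}" ]]; then
    echo "${lines}"
  else
    echo "no NVIDIA kernel messages found (or dmesg is not readable)"
  fi
}

check_driver_status
check_module_signature nvidia
check_recent_logs
```

On a node where Secure Boot is enforced, an empty signer field would be the first thing to investigate, since an unsigned module will not load.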

2. gpu/install_gpu_driver.sh:

  • Robust Proxy Handling (set_proxy):
    • Completely revamped to handle http-proxy, https-proxy, and proxy-uri metadata, determining the correct proxy values for HTTP and HTTPS.
    • Dynamically sets HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables.
    • Updates /etc/environment with the current proxy settings.
    • Conditionally configures gcloud proxy settings only if the gcloud SDK version is 547.0.0 or greater.
    • Performs TCP and HTTP(S) connection tests to the proxy to validate setup.
    • Configures apt and dnf to use the proxy.
    • Ensures dirmngr or gnupg2-smime is installed and configures dirmngr.conf to use the HTTP proxy.
    • Installs custom proxy CA certificates from http-proxy-pem-uri into system, Java, and Conda trust stores. Switches to HTTPS for proxy communications when a CA cert is provided.
    • Includes comprehensive verification steps for the proxy and certificate setup.
  • Reliable GPG Key Importing (import_gpg_keys):
    • Introduced a new function import_gpg_keys to handle GPG key fetching and importing in a proxy-aware manner using curl over HTTPS, replacing direct gpg --recv-keys calls to keyservers.
    • This function supports fetching keys by URL or Key ID and is used throughout the script for repository setup (NVIDIA Container Toolkit, CUDA, Bigtop, Adoptium, Docker, Google Cloud, CRAN-R, MySQL).
  • Conda/Mamba Environment (install_pytorch):
    • Refined package list: numba, pytorch, tensorflow[and-cuda], rapids, pyspark, and cuda-version<=${CUDA_VERSION}. An explicit CUDA runtime spec (e.g., cudart_spec) is no longer added, which gives the solver more flexibility.
    • Uses Mamba preferentially, with a Conda fallback.
    • Implements cache/environment clearing logic based on install_gpu_driver-main and pytorch sentinels to allow forced refreshes.
    • Improved error handling for environment creation, with specific messages for Mamba failures in proxied environments.
  • NVIDIA Driver Handling:
    • set_driver_version: Uses curl -I for a more lightweight HEAD request to check URL validity.
    • build_driver_from_github: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
    • execute_github_driver_build: Refactored to accept tarball paths. Removed the popd so that pushd and popd stay balanced in the caller. Removed a debug echo of the sign-file exit code.
    • Added make -j$(nproc) to modules_install for parallelization.
    • Post-install verification loop checks modinfo for signer: to confirm modules are signed.
  • Lifecycle Improvements:
    • prepare_to_install: Moved curl_retry_args definition earlier.
    • install_nvidia_gpu_driver: Checks if nvidia module loads at the start and marks incomplete if not.
    • main: Added mark_complete install_gpu_driver-main at the end.
    • configure_dkms_certs: Always fetches keys from secret manager if PSN is set to ensure modulus_md5sum is available.
    • install_gpu_agent: Checks if METADATA_HTTP_PROXY_PEM_URI is non-empty before using it.
  • Secure Boot Check: Issues a warning instead of exiting when Secure Boot is enabled on Debian with an image version older than 2.2.
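
The conditional gcloud configuration described under set_proxy comes down to a version comparison. The sketch below shows one way to express that gate in bash; it is not the script's actual code. The helper names (`version_ge`, `gcloud_sdk_version`, `maybe_configure_gcloud_proxy`) are hypothetical, and the sketch assumes GNU `sort -V` is available and that `gcloud --version` prints a line of the form `Google Cloud SDK <version>`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: only touch gcloud proxy settings on a new enough SDK.
set -euo pipefail

MIN_GCLOUD_VERSION="547.0.0"   # threshold stated in the PR description

# True when $1 >= $2 under GNU sort's version ordering.
version_ge() {
  [[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" == "$2" ]]
}

# Print the installed SDK version, or nothing if it cannot be determined.
gcloud_sdk_version() {
  (gcloud --version 2>/dev/null || true) | awk '/^Google Cloud SDK/ { print $4 }'
}

# Set gcloud's proxy properties only when the SDK meets the minimum version.
# Defined for illustration; it is not called here because it changes gcloud config.
maybe_configure_gcloud_proxy() {
  local proxy_host="$1" proxy_port="$2" sdk_version
  sdk_version="$(gcloud_sdk_version)"
  if [[ -n "${sdk_version}" ]] && version_ge "${sdk_version}" "${MIN_GCLOUD_VERSION}"; then
    gcloud config set proxy/type http
    gcloud config set proxy/address "${proxy_host}"
    gcloud config set proxy/port "${proxy_port}"
  else
    echo "gcloud ${sdk_version:-unknown} is below ${MIN_GCLOUD_VERSION}; skipping proxy config"
  fi
}

# Demonstrate the comparison on a few sample versions.
for v in 546.9.9 547.0.0 548.1.0; do
  if version_ge "${v}" "${MIN_GCLOUD_VERSION}"; then
    echo "${v}: would configure gcloud proxy"
  else
    echo "${v}: would skip gcloud proxy"
  fi
done
```

Comparing with `sort -V` avoids the pitfalls of plain string comparison, where a version such as 1000.0.0 would sort before 547.0.0.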
