Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script #1374
Draft
cjac wants to merge 15 commits into GoogleCloudDataproc:main
GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability
This large update improves the `install_gpu_driver.sh` script and its accompanying documentation. It focuses on robust support for complex environments involving Secure Boot and HTTP/S proxies, and on overall reliability and maintainability.

1. `gpu/README.md`:
   - Describes using the `GoogleCloudDataproc/custom-images` repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
   - Covers `--shielded-secure-boot`. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the `GoogleCloudDataproc/cloud-dataproc` repository for VPC, subnet, and proxy configuration.
   - Documents the proxy metadata `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri`.

2. `gpu/install_gpu_driver.sh`:
   - `set_proxy`:
     - Reads the `http-proxy`, `https-proxy`, and `proxy-uri` metadata, determining the correct proxy values for HTTP and HTTPS.
     - Sets the `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` environment variables.
     - Updates `/etc/environment` with the current proxy settings.
     - Configures `gcloud` proxy settings only if the gcloud SDK version is 547.0.0 or greater.
     - Configures `apt` and `dnf` to use the proxy.
     - Ensures `dirmngr` or `gnupg2-smime` is installed and configures `dirmngr.conf` to use the HTTP proxy.
     - Installs the CA certificate from `http-proxy-pem-uri` into system, Java, and Conda trust stores. Switches to HTTPS for proxy communications when a CA cert is provided.
   - `import_gpg_keys`: Adds `import_gpg_keys` to handle GPG key fetching and importing in a proxy-aware manner using `curl` over HTTPS, replacing direct `gpg --recv-keys` calls to keyservers.
   - `install_pytorch`:
     - The package list is `numba`, `pytorch`, `tensorflow[and-cuda]`, `rapids`, `pyspark`, and `cuda-version<=${CUDA_VERSION}`. An explicit CUDA runtime (e.g., `cudart_spec`) is no longer added, allowing the solver more flexibility.
     - Handling of the `install_gpu_driver-main` and `pytorch` sentinels allows forced refreshes.
   - `set_driver_version`: Uses `curl -I` for a more lightweight HEAD request to check URL validity.
   - `build_driver_from_github`: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
   - `execute_github_driver_build`: Refactored to accept tarball paths. `popd` removed to balance `pushd` in caller. Removed a debug echo of the `sign-file` exit code.
   - Adds `make -j$(nproc)` to `modules_install` for parallelization.
   - Checks `modinfo` for `signer:` to confirm modules are signed.
   - `prepare_to_install`: Moved the `curl_retry_args` definition earlier.
   - `install_nvidia_gpu_driver`: Checks whether the `nvidia` module loads at the start and marks the install incomplete if not.
   - `main`: Added `mark_complete install_gpu_driver-main` at the end.
   - `configure_dkms_certs`: Always fetches keys from Secret Manager if `PSN` is set, to ensure `modulus_md5sum` is available.
   - `install_gpu_agent`: Checks that `METADATA_HTTP_PROXY_PEM_URI` is non-empty before using it.
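
To make three of the checks listed above more concrete (the gcloud SDK version gate, choosing proxy values from metadata, and confirming a module is signed via `modinfo`), here is a minimal sketch. It is not taken from `install_gpu_driver.sh`: the function names `version_ge`, `resolve_proxies`, and `has_signer` are hypothetical, and the precedence shown (a scheme-specific value wins, with `proxy-uri` as the fallback for both) is an assumption rather than something this description confirms.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; names and precedence rules are assumptions,
# not the actual implementation in install_gpu_driver.sh.

# Succeeds when version $1 >= version $2, using sort's version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Choose proxy values for HTTP and HTTPS from three optional inputs:
#   $1 = http-proxy, $2 = https-proxy, $3 = proxy-uri (any may be empty).
# Assumed precedence: the scheme-specific value wins; proxy-uri is the fallback.
# Prints "<http value> <https value>", with "none" where nothing is set.
resolve_proxies() {
  local http_md="$1"
  local https_md="$2"
  local uri_md="$3"
  local http_val="${http_md:-$uri_md}"
  local https_val="${https_md:-$uri_md}"
  printf '%s %s\n' "${http_val:-none}" "${https_val:-none}"
}

# Reads modinfo-style output on stdin and succeeds if a non-empty
# "signer:" field is present.
has_signer() {
  grep -Eq '^signer:[[:space:]]*[^[:space:]]'
}
```

A caller might gate the gcloud configuration with something like `version_ge "${installed_version}" 547.0.0` (where `installed_version` is a placeholder for however the SDK version is obtained), and check a built module with `modinfo nvidia | has_signer`.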