Feat: Enhance GPU Tests, Docs & Private Cluster Setup #181

cjac wants to merge 9 commits into GoogleCloudDataproc:main from
Conversation
Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly upgrades the Dataproc environment reproduction scripts by introducing a robust framework for GPU testing, enhancing the clarity and completeness of the documentation, and improving the reliability and idempotency of private cluster setups utilizing Secure Web Proxy. These changes aim to streamline development, testing, and deployment workflows for GPU-accelerated and secure Dataproc environments.

Highlights
Code Review
This pull request significantly enhances the Dataproc reproduction scripts by adding comprehensive GPU testing, overhauling the documentation for clarity, and improving the setup for private clusters with Secure Web Proxy. The changes are well-structured, particularly the move to versioned resources for SWP components, which improves idempotency. My review focuses on improving script robustness, maintainability, and clarity. Key suggestions include refactoring duplicated logic, ensuring IAM permissions are applied robustly, and cleaning up test scripts and documentation examples for better readability and correctness. Overall, this is a solid set of improvements.
```diff
     --master-accelerator "type=${M_ACCELERATOR_TYPE}" \
     --worker-accelerator "type=${PRIMARY_ACCELERATOR_TYPE}" \
     --secondary-worker-accelerator "type=${SECONDARY_ACCELERATOR_TYPE}" \
-    --master-machine-type "${MASTER_MACHINE_TYPE}" \
+    --master-machine-type "${M_MACHINE_TYPE}" \
```
While these variable renames improve consistency, this file (shared-functions.sh) appears to contain a significant amount of logic that is either duplicated or superseded by the functions in the more specific lib/ subdirectories (e.g., lib/dataproc/cluster.sh). For instance, it defines its own create_dpgce_cluster. To improve maintainability and reduce confusion, this entire file should be evaluated for refactoring or deprecation, with its functionality fully migrated to the newer, more modular scripts.
gcloud/README.md
Outdated
```shell
# After building the image using scripts in custom-images/examples/secure-boot/

# In lib/dataproc/cluster.sh
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
```
The example for CLUSTER_IMAGE contains a hardcoded project ID (cjac-2025-01). For better consistency and usability, it would be clearer to use a placeholder like YOUR_PROJECT, similar to the example on line 110.
```diff
-CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
+CLUSTER_IMAGE="projects/YOUR_PROJECT/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
```
```diff
 if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
-  master_num="$1"
-  echo "master_num: $master_num"
-  MASTER_HOSTNAME="${MASTER_HOSTNAME}-${master_num}"
+  m_num="$1"
+  echo "m_num: $m_num"
+  M_HOSTNAME="${M_HOSTNAME}-${m_num}"
   shift 1
 else
-  MASTER_HOSTNAME="${MASTER_HOSTNAME}"
+  M_HOSTNAME="${M_HOSTNAME}"
 fi
```
The logic to detect the node index can be improved for readability and robustness. The perl command can be replaced with a more idiomatic and native bash regular expression. Additionally, the else block is redundant and can be removed. While the ambiguity of a numeric filename vs. an index remains, these changes make the script cleaner.
```diff
-if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
-  m_num="$1"
-  echo "m_num: $m_num"
-  M_HOSTNAME="${M_HOSTNAME}-${m_num}"
-  shift 1
-else
-  M_HOSTNAME="${M_HOSTNAME}"
-fi
+if [[ "$1" =~ ^[0-9]+$ ]]; then
+  m_num="$1"
+  echo "m_num: $m_num"
+  M_HOSTNAME="${M_HOSTNAME}-${m_num}"
+  shift 1
+fi
```
there is no condition for setting the M_HOSTNAME value when the regex doesn't match.
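For context on this thread, here is a minimal, self-contained sketch (hypothetical hostname values, not the PR's actual defaults) showing that when the regex does not match, the variable simply keeps its prior value, so no else branch is required:

```shell
#!/usr/bin/env bash
# Hypothetical default; in the real script M_HOSTNAME comes from lib/env.sh.
M_HOSTNAME="example-cluster-m"

parse_node_index() {
  # Append "-<n>" only when the first argument is a bare integer.
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    M_HOSTNAME="${M_HOSTNAME}-$1"
    shift
  fi
  # No else branch needed: M_HOSTNAME keeps its prior value.
}

parse_node_index 0
echo "${M_HOSTNAME}"   # -> example-cluster-m-0

M_HOSTNAME="example-cluster-m"
parse_node_index some-file.txt
echo "${M_HOSTNAME}"   # -> example-cluster-m
```

The remaining ambiguity noted in the review (a filename that happens to be all digits would be consumed as an index) is unchanged by this sketch.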
gcloud/lib/dataproc/cluster.sh
Outdated
```shell
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20251108-180659-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-002929-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-tf"
```
This section contains a large number of commented-out image URIs. This makes the active configuration difficult to parse and increases maintenance overhead. To improve clarity, consider moving these examples to a separate documentation file (like docs/dataproc_cluster_examples.md) or removing them if they are for temporary development purposes.
Let's mark these as to be removed before release. Until then, they are helping me to recall which images function properly.
```python
is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))

#is_cuda_gpu_min_3 = tf.config.list_physical_devices('GPU',True, (3,0))
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3,0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))
```
let's see if we can verify that both the deprecated code and the code that replaces it work. That seems like a reasonable way to do this for now, right? The next release we can change the default provider and allow the use of the old provider with an argument or something, but eventually the plan is to remove that code from the codebase because it has been fully replaced in all of its use cases with another chunk of code.
```shell
echo gcloud dataproc jobs submit pyspark \
  --properties="spark:spark.executor.resource.gpu.amount=1" \
  --properties="spark:spark.task.resource.gpu.amount=1" \
  --properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  --cluster=${CLUSTER_NAME} \
  --region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
echo gcloud dataproc jobs submit spark \
  --project "${PROJECT_ID}" \
  --cluster="${CLUSTER_NAME}" \
  --region "${REGION}" \
  --jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
  --class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
  --properties \
  "spark.driver.resource.gpu.amount=1,"\
  "spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
  "spark.executor.resource.gpu.amount=1,"\
  "spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"
```
These gcloud dataproc jobs submit commands are only printed to the console via echo, not executed. This can be misleading for anyone running the script. If they are intended to be examples, they should be commented out with an explanation. If they are meant to be executed, the echo should be removed.
```shell
# Example of submitting a PySpark job with GPU resources:
# gcloud dataproc jobs submit pyspark \
#   --properties="spark:spark.executor.resource.gpu.amount=1" \
#   --properties="spark:spark.task.resource.gpu.amount=1" \
#   --properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
#   --cluster=${CLUSTER_NAME} \
#   --region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"

# Example of submitting a Spark job with GPU resources for driver and executor:
# gcloud dataproc jobs submit spark \
#   --project "${PROJECT_ID}" \
#   --cluster="${CLUSTER_NAME}" \
#   --region "${REGION}" \
#   --jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
#   --class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
#   --properties \
#   "spark.driver.resource.gpu.amount=1,"\
#   "spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
#   "spark.executor.resource.gpu.amount=1,"\
#   "spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"
```
yeah, I haven't verified whether they work or not. I'm busy and the tests that I am running are passing. I don't know if I want to add new ones right this VERY second.
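One middle-ground pattern, if both printing and executing these commands are wanted later, is a dry-run guard. This is a sketch with a hypothetical `DRY_RUN` variable and `run` wrapper, not code from this PR; the placeholder defaults exist only to make the example self-contained:

```shell
#!/usr/bin/env bash
# Hypothetical DRY_RUN guard: when DRY_RUN=1 the wrapper prints the command
# instead of executing it; otherwise it runs the command as-is.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "DRY RUN: $*"
  else
    "$@"
  fi
}

# Placeholder values so the sketch runs standalone.
DRY_RUN=1
run gcloud dataproc jobs submit pyspark \
  --cluster="${CLUSTER_NAME:-example-cluster}" \
  --region="${REGION:-us-west4}" "gs://${BUCKET:-example-bucket}/test.py"
```

With this shape, the untested invocations stay visible in test output without silently being skipped, and flipping `DRY_RUN=0` executes them once they are verified.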
Force-pushed 7887d15 to 7a5a629
Force-pushed 3a92f5a to 696dd13
Force-pushed 9cfe963 to 73be281
Force-pushed 85d00d7 to 059ff92
/gemini review
Code Review
The pull request introduces significant changes to the Dataproc on GCE (DPGCE) environment setup scripts, focusing on state management, declarative creation, and idempotency. The changes include updating the README, adding a new audit script (audit-dpgce), modifying the create script (create-dpgce) to use a declarative model, and updating various library functions. The reviewer identified several potential issues, including a command injection vulnerability in script-utils.sh, the lack of explicit resource deletion in destroy-dpgce, missing error handling in gce-proxy-setup.sh, and the need for more robust state checks in dataproc/cluster.sh. The reviewer also suggested adding checks for metadata length and directory existence in dataproc/cluster.sh and audit-dpgce, respectively, and ensuring that the ssh command is not executed with empty command arguments in ssh-m.
gcloud/lib/script-utils.sh
Outdated

```shell
# The command_to_run should be a gcloud command with --format=json
# that returns a JSON object if the resource exists and fails otherwise.
json_output=$(eval "${command_to_run}" 2>/dev/null)
```
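The review summary above flags this `eval` as a command-injection risk, and a later commit in this PR fixes it by passing the command and its arguments as separate array elements. A minimal sketch of that array-based shape (the function name `_check_exists_argv` is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: accept the command as separate arguments instead of one string.
# "$@" expands each argument as a distinct word, so the shell never re-parses
# user-controlled text -- no eval, no injection surface.
_check_exists_argv() {
  local json_output
  if json_output=$("$@" 2>/dev/null) && [[ -n "${json_output}" ]]; then
    echo "${json_output}"
    return 0
  fi
  return 1
}

# Usage (arguments passed as an array, not a flat string):
# _check_exists_argv gcloud compute networks describe "${NETWORK}" \
#   --project="${PROJECT_ID}" --format='json(name,selfLink)'
```

Callers that previously built one quoted string would instead build a bash array and expand it as `"${cmd[@]}"`, which is the approach the later "separate array elements" fix describes.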
gcloud/bin/destroy-dpgce
Outdated

```shell
ROUTES_RAW=$(get_state "routes")
if [[ "${ROUTES_RAW}" != "null" && "${ROUTES_RAW}" != "[]" ]]; then
  mapfile -t route_names < <(echo "${ROUTES_RAW}" | jq -r '.[].name')
```
gcloud/init/gce-proxy-setup.sh
Outdated

```shell
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
  echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
```

The nc command lacks error handling. If the connection fails, the script should exit to prevent further execution with a broken proxy setup.

```diff
 if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
   echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
+  exit 1
+else
```
gcloud/bin/create-dpgce
Outdated

```shell
# TODO: Check if NAT is already on the router
echo "INFO: NAT Egress enabled, ensuring NAT on existing router..."
add_nat_to_router
```
gcloud/lib/dataproc/cluster.sh
Outdated

```shell
export -f delete_dpgce_cluster

function exists_dataproc_cluster_vms() {
  _check_exists "gcloud compute instances list --project='${PROJECT_ID}' --filter='labels.goog-dataproc-cluster-name=${CLUSTER_NAME}' --format='json(name,zone,status)'" | jq 'if . == [] then null else . end'
```
```diff
 fi
 print_status "  Attempting to delete CA Pool ${short_pool_name}..."
-if ! run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources; then overall_status="Fail"; report_result "Fail"; else report_result "Deleted"; fi
+run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources || true
```

The script attempts to delete the CA pool, but doesn't check for errors. This could lead to a situation where the script continues even if the CA pool deletion fails.

```diff
-run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources || true
+if ! run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources; then overall_status="Fail"; report_result "Fail"; else report_result "Deleted"; fi
```
gcloud/lib/swp/policy.sh
Outdated

```shell
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
  --gateway-security-policy="${policy_name}" \
  --location="${region}" \
  --project="${project_id}" \
  --quiet || true
```

The script deletes the rule and policy, but doesn't check for errors after each operation. This could lead to a situation where the script continues even if the rule or policy deletion fails.

```diff
 run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
   --gateway-security-policy="${policy_name}" \
   --location="${region}" \
   --project="${project_id}" \
   --quiet || true
+if run_gcloud "${policy_log}" gcloud network-security gateway-security-policies delete "${policy_name}" \
```
gcloud/lib/swp/policy.sh
Outdated

```shell
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
  --gateway-security-policy="${policy_name}" \
  --location="${region}" \
  --project="${project_id}" \
  --quiet || true
```
gcloud/bin/audit-dpgce
Outdated

```shell
(
  source "${GCLOUD_DIR}/${source_file}"
  "${func_name}" "$@" > "${cmd_out}" 2>/dev/null
```
gcloud/lib/script-utils.sh
Outdated

```shell
# The command_to_run should be a gcloud command with --format=json
# that returns a JSON object if the resource exists and fails otherwise.
json_output=$(eval "${command_to_run}" 2>/dev/null)
```
Dataproc Lifecycle Script Overhaul: From Brittle to Declarative

This series of changes represents a fundamental refactoring of the lifecycle scripts.

Initial State & Problems:

The Evolution:

Current State: The scripts are now much more robust, predictable, and easier to use. The declarative approach in

Key Improvements for Review:

This narrative provides context for the significant architectural changes and the iterative debugging and refinement process that led to the current state of the scripts.
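The plan-based flow described in this narrative can be sketched roughly as follows. All identifiers here are illustrative, not the PR's actual ones; the real scripts read state from an SQLite cache via `get_state`/`update_state` and execute `gcloud` commands instead of echoing plan steps:

```shell
#!/usr/bin/env bash
# Rough sketch of a declarative create flow: compare cached state with
# intent, emit a plan, then execute it in dependency order.

# "null" mirrors the state cache's marker for an absent resource.
declare -A current=( [network]="null" [subnet]="null" [cluster]="null" )
declare -A desired=( [network]="present" [subnet]="present" [cluster]="present" )

plan=()
# Creation order matters: network before subnet before cluster.
for resource in network subnet cluster; do
  if [[ "${current[$resource]}" == "null" && "${desired[$resource]}" == "present" ]]; then
    plan+=("create_${resource}")
  fi
done

for step in "${plan[@]}"; do
  echo "PLAN: ${step}"
  # The real script would invoke "${step}" here (e.g. create_network)
  # and write the result back to the state cache on success.
done
```

Because the plan is derived from current-vs-desired state rather than hardcoded, re-running the script on a partially built environment only creates what is missing, which is the idempotency property the narrative describes.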
/gemini review
Code Review
This pull request introduces a major refactoring of the Dataproc environment scripts, moving towards a declarative, state-driven model using an SQLite database. It also adds significant new functionality, including GPU testing scripts, enhanced documentation, and improved setup for private clusters with Secure Web Proxy. The changes are extensive and generally well-executed, greatly improving the structure and robustness of the scripts. However, I've identified several critical and high-severity issues, primarily related to new or refactored scripts calling non-existent functions or files, broken CI configurations, and bugs in test scripts. Addressing these issues is crucial for the stability and correctness of the new implementation.
```diff
@@ -0,0 +1 @@
+#!/bin/bashsource lib/env.shgcloud network-services gateways describe ${SWP_INSTANCE_NAME} --location ${REGION}gcloud network-security gateway-security-policies rules export allow-all-rule \ --gateway-security-policy=${SWP_POLICY_NAME} \ --location ${REGION}#gcloud logging read \# "resource.type=\"networkservices.googleapis.com/Gateway\" \# resource.labels.location=\"${REGION}\" \# resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \# --project="${PROJECT_ID}" \# --limit=100 \# --format=json # Or text, defaultgcloud logging read \ "resource.type=\"networkservices.googleapis.com/Gateway\" \ resource.labels.location=\"${REGION}\" \ resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \ --project="${PROJECT_ID}" \ --limit=500 \ --freshness=1d # Look back over the last day# Delete the old test if it existsgcloud network-management connectivity-tests delete swp-test --quiet# Create a new onegcloud network-management connectivity-tests create swp-test \ --source-instance=projects/${PROJECT_ID}/zones/${ZONE}/instances/${CLUSTER_NAME}-m \ --destination-ip-address=10.43.79.245 \ --destination-port=3128 \ --protocol=TCP \ --source-network=projects/${PROJECT_ID}/global/networks/${NETWORK}# Wait and describesleep 5# After a moment, get the results:gcloud network-management connectivity-tests describe swp-test
\ No newline at end of file
```
This file appears to be a scratchpad of commands rather than a valid script. It's missing a shebang, and all commands are concatenated on a single line. This will fail to execute correctly. Please format this as a proper bash script with a shebang and each command on a new line.
```shell
#!/bin/bash
source lib/env.sh
gcloud network-services gateways describe ${SWP_INSTANCE_NAME} --location ${REGION}
gcloud network-security gateway-security-policies rules export allow-all-rule \
  --gateway-security-policy=${SWP_POLICY_NAME} \
  --location ${REGION}
#gcloud logging read \
#  "resource.type=\"networkservices.googleapis.com/Gateway\" \
#   resource.labels.location=\"${REGION}\" \
#   resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \
#  --project="${PROJECT_ID}" \
#  --limit=100 \
#  --format=json # Or text, default
gcloud logging read \
  "resource.type=\"networkservices.googleapis.com/Gateway\" \
   resource.labels.location=\"${REGION}\" \
   resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \
  --project="${PROJECT_ID}" \
  --limit=500 \
  --freshness=1d # Look back over the last day
# Delete the old test if it exists
gcloud network-management connectivity-tests delete swp-test --quiet
# Create a new one
gcloud network-management connectivity-tests create swp-test \
  --source-instance=projects/${PROJECT_ID}/zones/${ZONE}/instances/${CLUSTER_NAME}-m \
  --destination-ip-address=10.43.79.245 \
  --destination-port=3128 \
  --protocol=TCP \
  --source-network=projects/${PROJECT_ID}/global/networks/${NETWORK}
# Wait and describe
sleep 5
# After a moment, get the results:
gcloud network-management connectivity-tests describe swp-test
```
```shell
echo "DEBUG: set_proxy: Testing external site access via proxy..."
local test_url="https://www.google.com"
if curl -vL ${curl_retry_args} -o /dev/null "${test_url}"; then
```
```yaml
- 'PROJECT_ID=cjac-2025-01' # Use CI_PROJECT_ID here
- 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
substitutions:
  _CI_PROJECT_ID: "cjac-2025-01"
```
The PROJECT_ID and _CI_PROJECT_ID are hardcoded in this file. This makes the Cloud Build configuration specific to the cjac-2025-01 project and not reusable. You should use Cloud Build substitutions (e.g., $_CI_PROJECT_ID or $PROJECT_ID) to make this configuration generic, as was done in the setup-cicd.sh script.
```yaml
- 'PROJECT_ID=${_CI_PROJECT_ID}' # Use CI_PROJECT_ID here
- 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
substitutions:
  _CI_PROJECT_ID: "${PROJECT_ID}"
```
```yaml
- name: 'gcr.io/cloud-builders/gcloud'
  id: 'Test Standard Create'
  entrypoint: 'bash'
  args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']
```
The generated cloudbuild.yaml file attempts to call audit-dpgce-create, which has been removed in this pull request as part of the refactoring to a single audit-dpgce script. This will cause the CI build to fail. The audit step should be updated to use the new universal audit script, bin/audit-dpgce. This issue is present for all test steps in the generated file.
```diff
-  args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']
+  args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce']
```
```yaml
- name: 'gcr.io/cloud-builders/gcloud'
  id: 'Test Standard Create'
  entrypoint: 'bash'
  args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']
```
This Cloud Build step calls audit-dpgce-create, but this script has been removed in this PR. The build will fail. You should update this to use the new universal bin/audit-dpgce script. This applies to all other audit steps in this file as well.
```yaml
  args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce']
```
```shell
if echo "${PROJECT_ID_RAW}" | /usr/bin/jq empty > /dev/null 2>&1; then
  PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | /usr/bin/jq -r '.projectId // "N/A"')
```
The path to jq is hardcoded as /usr/bin/jq. This is not portable and may fail on systems where jq is installed in a different location (e.g., /usr/local/bin). It's better to rely on the PATH environment variable and just call jq directly, as it's listed as a prerequisite in the README.
```diff
-if echo "${PROJECT_ID_RAW}" | /usr/bin/jq empty > /dev/null 2>&1; then
-  PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | /usr/bin/jq -r '.projectId // "N/A"')
+if echo "${PROJECT_ID_RAW}" | jq empty > /dev/null 2>&1; then
+  PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | jq -r '.projectId // "N/A"')
```
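A complementary pattern, given that the README lists these tools as prerequisites: fail fast at startup if a required binary is missing from PATH, then call it bare everywhere. This is a sketch (the `require_tools` helper is hypothetical, not part of the PR):

```shell
#!/usr/bin/env bash
# Check that each named tool resolves on PATH; report every missing one
# before returning failure. The tool names passed below are the
# prerequisites this PR mentions (jq, sqlite3, gcloud).
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "ERROR: required tool '${tool}' not found on PATH" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage at script start:
# require_tools jq sqlite3 gcloud || exit 1
```

`command -v` is POSIX and respects PATH, so the scripts keep working wherever the tools are installed (`/usr/bin`, `/usr/local/bin`, Homebrew prefixes, etc.).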
gcloud/.gitignore
Outdated

```text
env.cpan.json

# Temporary files and logs
/tmp/
```
The entry /tmp/ will only ignore a tmp directory at the root of the gcloud directory. The scripts, however, seem to use the absolute path /tmp/dataproc-repro/... for logs and temporary files, which is outside the repository and cannot be ignored by .gitignore. If the intention is to ignore a local tmp directory for other purposes, consider changing this to tmp/ to avoid ambiguity with the system's root /tmp directory.
```diff
@@ -0,0 +1,188 @@
+#!/bin/bash
+function exists_firewall() {
+  # This is a basic check. A more robust version might check for a list of rules.
+  _check_exists "gcloud compute firewall-rules describe '${FIREWALL}-in-ssh' --project='${PROJECT_ID}' --format='json(name,selfLink)'"
+}
```
```diff
 export -f create_managed_certificate

-function _delete_swp_ca_resources() {
+function delete_managed_certificate() {
```
The refactored delete_managed_certificate function appears to have lost the conditional logic based on the --force flag that existed in the previous implementation. The old behavior was to preserve versioned CA components unless --force was specified. Now, it seems all versioned components are deleted unconditionally. If preserving these resources by default is still desired, the --force check should be reintroduced.
Force-pushed 656ea83 to 059ff92
- Replaced separate create/destroy/audit scripts for standard and private clusters with unified scripts in bin/ (e.g., create-dpgce, destroy-dpgce, audit-dpgce).
- Moved common functions to lib/script-utils.sh.
- Updated lib/env.sh with new variables and logic.
- Added new utility scripts like connectivity-test and debug-init-action.

- Added cluster creation logic for custom and private cluster types in lib/dataproc/.
- Updated various library functions in lib/ to support the new unified scripts.
- Added new scripts for CI/CD setup (bin/setup-cicd.sh) and GitHub triggers (bin/create-github-trigger.sh).

- Updated README.md to reflect the new script organization.
- Enhanced .gitignore.
- Added env.json and updated env.json.sample.
- Added cloudbuild.yaml for CI/CD.
- Updated init/gce-proxy-setup.sh.
- Added test scripts in t/.
- Added project findings and work narratives.

Implements a robust audit system for the DPGCE environment:
- Replaces the state.json file with an SQLite database (state.db) for atomic and concurrent-safe state management.
- Refactors `bin/audit-dpgce` to perform checks for all resources concurrently, writing results to temporary files.
- Updates the SQLite database sequentially from temp files to prevent locking issues.
- Adds `exists_*` functions to numerous lib scripts to support the audit.
- Ensures all necessary functions are exported from their respective lib files.
- Introduces `report_audit_status` for clearer [Exists]/[Not Found] reporting.
- Updates README.md with `sqlite3` prerequisite.
- Fixes various bugs in the audit script and helper functions.

This provides a much faster, more reliable, and comprehensive audit of the environment.
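The SQLite-backed state cache this commit describes can be approximated as below. The `state` table and `value` column follow the schema named in a later fix to script-utils.sh; the DB path and exact function bodies are sketches, not the PR's code:

```shell
#!/usr/bin/env bash
# Sketch of get_state/update_state over sqlite3: a 'state' table keyed by
# resource name. An SQL upsert makes update_state idempotent and atomic.
# NOTE: values are interpolated into SQL for brevity; real code should
# escape embedded single quotes or use parameterized access.
STATE_DB="${STATE_DB:-/tmp/dataproc-repro-state.db}"

init_state_db() {
  sqlite3 "${STATE_DB}" \
    "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT);"
}

update_state() {  # update_state <key> <json-or-null>
  sqlite3 "${STATE_DB}" \
    "INSERT INTO state (key, value) VALUES ('$1', '$2')
     ON CONFLICT(key) DO UPDATE SET value=excluded.value;"
}

get_state() {     # get_state <key>
  sqlite3 "${STATE_DB}" "SELECT value FROM state WHERE key='$1';"
}

init_state_db
update_state "network" '{"name":"example-net"}'
get_state "network"
```

Unlike a JSON file rewritten wholesale, each upsert here touches one row under SQLite's own locking, which is what makes the concurrent audit writers in this commit safe when serialized through the temp-file step.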
This commit introduces a declarative, plan-based approach to the `create-dpgce` script and significantly improves the robustness and completeness of the `destroy-dpgce` script.
**`create-dpgce` Refactoring:**
* Replaced the initial audit with loading state from the SQLite cache (`init_state_db`).
* Added `display_status` function to show the current state of resources.
* Implemented plan generation by comparing the current state with the intended state based on flags.
* Added a plan execution block that creates resources in the correct order based on the generated plan.
* Removed redundant `check_project` and `check_billing` calls.
* Ensured all state checks use `get_state`.
**`destroy-dpgce` Enhancements:**
* Integrated `get_state` for all resource existence checks, replacing old `jq` logic.
* Added `delete_all_network_vms` to ensure all VMs in the network are deleted before network teardown.
* Included more library scripts to cover a wider range of resources for deletion (PHS, GKE, DBs, SWP, etc.).
* Refined route deletion to skip default internet and local routes.
* Added cache updates (`update_state "resource" "null"`) after successful deletions for many resources.
**Library Updates (`lib/`):**
* Added `refresh_resource_state` or `update_state` calls to CUD functions in:
* `dataproc/autoscaling.sh`
* `dataproc/cluster.sh`
* `dataproc/private-cluster.sh`
* `gcp/gcr.sh`
* `gcp/gcs.sh`
* `gcp/iam.sh`
* `gcp/kms.sh`
* `network/network.sh`
* `network/peering.sh`
* `network/router.sh`
* `network/routes.sh`
* `network/subnet.sh`
* Added `exists_dataproc_cluster_vms` to `lib/dataproc/cluster.sh`.
* Fixed `create_subnet` to accept parameters for name, key, and range.
* Corrected line continuation in `create_vpc_network` in `lib/network/network.sh`.
**Other Changes:**
* Fixed `recreate-cluster.sh` to correctly call `gcloud dataproc clusters describe` for displaying cluster details.
* Updated `env.json` to use `us-west4-a`.
* Cleaned up `lib/gcp/misc.sh`.
These changes make the scripts more reliable, idempotent, and easier to understand.
- Ensure all calls to _check_exists pass command and arguments as separate array elements, not a single string. This fixes the audit not detecting some existing resources.
- Refactor refresh_resource_state to handle array arguments.
- Add exit 1 to init/gce-proxy-setup.sh if proxy connection test fails.
- Improve stderr logging in bin/audit-dpgce background checks.
- Fail script in lib/swp/policy.sh if security rule deletion fails.
- Add clarifying comments in bin/destroy-dpgce.
- Fix relative paths in lib/env.sh and lib/dataproc/private-cluster.sh.
- Run audit at the start of create-dpgce.
- Add NAT pre-delete back to lib/network/router.sh.
Force-pushed 059ff92 to 729999a
- Switched to client-side jq filtering for default route checks in lib/network/routes.sh due to gcloud filter unreliability.
- Made create-dpgce idempotent for network infrastructure, ensuring components like NAT and default routes are created if missing, even if the Dataproc cluster already exists.
- create-dpgce no longer errors out if a cluster exists with a different configuration; it only creates the cluster if missing.
- Added Cloud NAT checks to the audit-dpgce script and report.
- Corrected GCLOUD_DIR path resolution in lib/env.sh to be self-contained and robust.
- Fixed bin/ssh-m to correctly set GCLOUD_DIR.
- Simplified SCRIPT_DIR assignment in all bin/ scripts.
- Added gcloud/.gitignore to exclude env.json.
- Removed env.json from git history using filter-branch.
- Rebased onto upstream/main.
Force-pushed 729999a to 8ce83a7
- Reworked `bin/destroy-dpgce` to run an audit first, then generate a teardown plan based on discovered resources.
- Added a context-aware status report to `destroy-dpgce` to display the state of resources found by the audit, respecting intent flags like --nat-egress, --custom, etc.
- Updated `destroy-dpgce` to accept the same intent flags as `create-dpgce` to tailor the status report.
- Made plan generation for network VMs and custom routes conditional on their detection in the audit.
- Added `get_any_network_vms_state` to `lib/gcp/misc.sh` and an `anyNetworkVms` check to `bin/audit-dpgce` to support more accurate teardown planning for any VM in the network.
- Ensured `destroy-dpgce` uses a unique TIMESTAMP for each run to isolate state.
- Updated `README.md` to reflect these significant changes.
- Re-instate unique TIMESTAMP generation per run for create-dpgce
and destroy-dpgce to ensure state isolation.
- Corrected SQLite schema usage in script-utils.sh to use the
'state' table and 'value' column consistently.
- Refactored Service Account management in lib/gcp/iam.sh:
- Added auditing for SA IAM bindings (serviceAccountBindings)
and readiness (serviceAccountRolesReady).
- create_service_account now verifies all required roles are bound.
- delete_service_account now properly removes role bindings
before deleting the SA.
- Fixed gcloud command execution within run_gcloud by using arrays
to handle multi-line arguments correctly.
- Enhanced run_gcloud to suppress output for expected errors like
IAM binding not found.
- Optimized create_subnet in lib/network/subnet.sh to avoid
refetching details after creation.
- Removed various debug messages.
/gemini review
This set of changes introduces significant enhancements to the Dataproc environment reproduction scripts, focusing on documentation, private cluster setups with Secure Web Proxy (SWP), GPU testing, and code clarity.
Key Changes:
README.md Overhaul:
- env.json variables.
- custom-images repository.
- create-*, destroy-*, and recreate-* scripts.

New Test Scripts (t/ directory):
- t/scripts/verify_tensorflow.py: A Python script to check TensorFlow GPU detection.
- t/scripts/verify_torch.py: A Python script to check PyTorch CUDA availability and GPU details.
- t/spark-gpu-test.sh: A new test runner script that copies the t/scripts directory to the cluster's -m node using bin/scp-m.

Script Modifications:
- bin/destroy-dpgce-private: Adjusted the order of resource deletion for the SWP teardown.
- bin/recreate-dpgce: Enabled the gsutil cp command to ensure initialization actions are always updated from the local init/ directory during cluster recreation.
- bin/scp-m: Updated usage text to reflect its purpose of copying to the -m node.
- bin/ssh-m: Renamed internal variables for clarity (e.g., M_HOSTNAME).
- lib/dataproc/cluster.sh: Renamed environment variables (MASTER_MACHINE_TYPE to M_MACHINE_TYPE, MASTER_ACCELERATOR_TYPE to M_ACCELERATOR_TYPE) for consistency. Includes many commented-out options, indicating active development and testing of various cluster configurations (proxy, image versions, secure boot).
- lib/env.sh: Updated to use the new M_-prefixed variables and added many new variables to support the private cluster and SWP configurations.
- lib/shared-functions.sh: Adapted to the variable renames in env.sh.
- lib/swp/certs.sh: Enhanced to support versioned certificates and sentinels, making SWP component management more robust and idempotent per run.
- lib/swp/policy.sh: Added export for SWP_POLICY_URI_PARTIAL.

In essence, this change adds comprehensive GPU validation, improves documentation, refines the private cluster setup, and makes various scripts more robust.