
Feat: Enhance GPU Tests, Docs & Private Cluster Setup #181

Open
cjac wants to merge 9 commits into GoogleCloudDataproc:main from LLC-Technologies-Collier:proxy-sync-2026-01

Conversation

@cjac
Contributor

@cjac cjac commented Jan 27, 2026

This set of changes introduces significant enhancements to the Dataproc environment reproduction scripts, focusing on documentation, private cluster setups with Secure Web Proxy (SWP), GPU testing, and code clarity.

Key Changes:

  1. README.md Overhaul:

    • The README has been substantially restructured and expanded to improve clarity and usability.
    • A "Prerequisites" section was added.
    • The "Setup" section now provides more detail on env.json variables.
    • The "Scripts" section is now better organized into "Environment Management," "Audit Scripts," and "Utility Scripts," with more descriptive entries for each script, including the new test scripts.
    • The "Using Custom Images" section was added to guide users on building specialized images, particularly for Secure Boot, pointing to the custom-images repository.
    • A "Script Details" section provides deeper explanations of the create-*, destroy-*, and recreate-* scripts.
    • Language has been updated to use "-m node" instead of potentially problematic terms.
  2. New Test Scripts (t/ directory):

    • t/scripts/verify_tensorflow.py: A Python script to check TensorFlow GPU detection.
    • t/scripts/verify_torch.py: A Python script to check PyTorch CUDA availability and GPU details.
    • t/spark-gpu-test.sh: A new test runner script that:
      • Copies the t/scripts directory to the cluster's -m node using bin/scp-m.
      • Executes the Python GPU verification scripts on the -m node via SSH.
      • Runs a series of Spark jobs (SparkPi and JavaIndexToStringExample) with various GPU-related properties to test GPU allocation and functionality within YARN.
  3. Script Modifications:

    • bin/destroy-dpgce-private: Adjusted the order of resource deletion for the SWP teardown.
    • bin/recreate-dpgce: Enabled the gsutil cp command to ensure initialization actions are always updated from the local init/ directory during cluster recreation.
    • bin/scp-m: Updated usage text to reflect its purpose of copying to the -m node.
    • bin/ssh-m: Renamed internal variables for clarity (e.g., M_HOSTNAME).
    • lib/dataproc/cluster.sh: Renamed environment variables (MASTER_MACHINE_TYPE to M_MACHINE_TYPE, MASTER_ACCELERATOR_TYPE to M_ACCELERATOR_TYPE) for consistency. Includes many commented-out options, indicating active development and testing of various cluster configurations (proxy, image versions, secure boot).
    • lib/env.sh: Updated to use the new M_ prefixed variables and added many new variables to support the private cluster and SWP configurations.
    • lib/shared-functions.sh: Adapted to the variable renames in env.sh.
    • lib/swp/certs.sh: Enhanced to support versioned certificates and sentinels, making SWP component management more robust and idempotent per run.
    • lib/swp/policy.sh: Added export for SWP_POLICY_URI_PARTIAL.

In essence, this change adds comprehensive GPU validation, improves documentation, refines the private cluster setup, and makes various scripts more robust.
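
The versioned-certificate and sentinel behavior described for lib/swp/certs.sh can be sketched roughly as follows. This is a hedged illustration only: the function name, sentinel path, and version variable are assumptions, not the PR's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of per-run versioned sentinels: a resource tagged with a
# version is provisioned once, and re-runs for the same version are no-ops.
provision_certs() {
  local version="$1" workdir="$2"
  local sentinel="${workdir}/swp-cert.${version}.done"
  if [[ -f "${sentinel}" ]]; then
    echo "certs ${version} already provisioned; skipping"
  else
    echo "provisioning certs ${version}"   # real cert-creation work would go here
    touch "${sentinel}"
  fi
}

demo_dir="$(mktemp -d)"
provision_certs v1 "${demo_dir}"   # does the work
provision_certs v1 "${demo_dir}"   # idempotent: skips
provision_certs v2 "${demo_dir}"   # new version: works again
```

On a second invocation with the same version, the function becomes a no-op, which is the per-run idempotency property the description claims.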

@gemini-code-assist

Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the Dataproc environment reproduction scripts by introducing a robust framework for GPU testing, enhancing the clarity and completeness of the documentation, and improving the reliability and idempotency of private cluster setups utilizing Secure Web Proxy. These changes aim to streamline development, testing, and deployment workflows for GPU-accelerated and secure Dataproc environments.

Highlights

  • Enhanced GPU Testing Framework: Introduced new Python scripts (verify_tensorflow.py, verify_torch.py) to check GPU detection and CUDA availability, along with a comprehensive spark-gpu-test.sh script to run Spark jobs with various GPU configurations on YARN.
  • Improved Documentation: The README.md has been significantly expanded and reorganized, adding sections for prerequisites, detailed env.json variables, structured script categories, guidance on using custom images (especially for Secure Boot), and in-depth explanations of core create-*, destroy-*, and recreate-* scripts.
  • Robust Private Cluster Setup (SWP): Modifications to lib/swp/certs.sh introduce versioned certificates and sentinels for Secure Web Proxy (SWP) components, making their creation and deletion more robust and idempotent. The destroy-dpgce-private script now handles SWP teardown more effectively by adjusting the order of resource deletion.
  • Script Refinements and Consistency: Key scripts like bin/scp-m, bin/ssh-m, lib/dataproc/cluster.sh, and lib/env.sh have undergone internal variable renames (e.g., MASTER_MACHINE_TYPE to M_MACHINE_TYPE) for better consistency. bin/recreate-dpgce now reliably updates initialization actions to GCS.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances the Dataproc reproduction scripts by adding comprehensive GPU testing, overhauling the documentation for clarity, and improving the setup for private clusters with Secure Web Proxy. The changes are well-structured, particularly the move to versioned resources for SWP components, which improves idempotency. My review focuses on improving script robustness, maintainability, and clarity. Key suggestions include refactoring duplicated logic, ensuring IAM permissions are applied robustly, and cleaning up test scripts and documentation examples for better readability and correctness. Overall, this is a solid set of improvements.

Comment on lines +26 to +29
--master-accelerator "type=${M_ACCELERATOR_TYPE}" \
--worker-accelerator "type=${PRIMARY_ACCELERATOR_TYPE}" \
--secondary-worker-accelerator "type=${SECONDARY_ACCELERATOR_TYPE}" \
--master-machine-type "${MASTER_MACHINE_TYPE}" \
--master-machine-type "${M_MACHINE_TYPE}" \


high

While these variable renames improve consistency, this file (shared-functions.sh) appears to contain a significant amount of logic that is either duplicated or superseded by the functions in the more specific lib/ subdirectories (e.g., lib/dataproc/cluster.sh). For instance, it defines its own create_dpgce_cluster. To improve maintainability and reduce confusion, this entire file should be evaluated for refactoring or deprecation, with its functionality fully migrated to the newer, more modular scripts.

gcloud/README.md Outdated
# After building the image using scripts in custom-images/examples/secure-boot/

# In lib/dataproc/cluster.sh
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"


medium

The example for CLUSTER_IMAGE contains a hardcoded project ID (cjac-2025-01). For better consistency and usability, it would be clearer to use a placeholder like YOUR_PROJECT, similar to the example on line 110.

Suggested change
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
CLUSTER_IMAGE="projects/YOUR_PROJECT/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"

Comment on lines 26 to 33
if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
master_num="$1"
echo "master_num: $master_num"
MASTER_HOSTNAME="${MASTER_HOSTNAME}-${master_num}"
m_num="$1"
echo "m_num: $m_num"
M_HOSTNAME="${M_HOSTNAME}-${m_num}"
shift 1
else
MASTER_HOSTNAME="${MASTER_HOSTNAME}"
M_HOSTNAME="${M_HOSTNAME}"
fi


medium

The logic to detect the node index can be improved for readability and robustness. The perl command can be replaced with a more idiomatic and native bash regular expression. Additionally, the else block is redundant and can be removed. While the ambiguity of a numeric filename vs. an index remains, these changes make the script cleaner.

Suggested change
if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
master_num="$1"
echo "master_num: $master_num"
MASTER_HOSTNAME="${MASTER_HOSTNAME}-${master_num}"
m_num="$1"
echo "m_num: $m_num"
M_HOSTNAME="${M_HOSTNAME}-${m_num}"
shift 1
else
MASTER_HOSTNAME="${MASTER_HOSTNAME}"
M_HOSTNAME="${M_HOSTNAME}"
fi
if [[ "$1" =~ ^[0-9]+$ ]]; then
m_num="$1"
echo "m_num: $m_num"
M_HOSTNAME="${M_HOSTNAME}-${m_num}"
shift 1
fi

Contributor Author

There is no condition for setting the M_HOSTNAME value when the regex doesn't match.
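
As a minimal illustration of why no else branch is needed (the hostname value is made up for the demo): when the argument is not all digits, M_HOSTNAME simply retains its prior value.

```shell
#!/usr/bin/env bash
# Minimal sketch of the native-bash numeric test from the suggestion above.
# When $1 is not all digits, M_HOSTNAME is left exactly as it was, so no
# explicit else branch is required.
node_hostname() {
  local M_HOSTNAME="cluster-m"
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    M_HOSTNAME="${M_HOSTNAME}-$1"
  fi
  echo "${M_HOSTNAME}"
}

node_hostname 2          # prints cluster-m-2
node_hostname file.txt   # prints cluster-m
```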

Comment on lines +76 to +81
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20251108-180659-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-002929-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-tf"


medium

This section contains a large number of commented-out image URIs. This makes the active configuration difficult to parse and increases maintenance overhead. To improve clarity, consider moving these examples to a separate documentation file (like docs/dataproc_cluster_examples.md) or removing them if they are for temporary development purposes.

Contributor Author

Let's mark these as to be removed before release. Until then, they are helping me to recall which images function properly.

Comment on lines +14 to +19
is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))

#is_cuda_gpu_min_3 = tf.config.list_physical_devices('GPU',True, (3,0))
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3,0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))


medium

The function tf.test.is_gpu_available() is deprecated in TensorFlow 2.x and should be avoided. The check for available GPUs is already performed on line 10 using the recommended tf.config.list_physical_devices('GPU'). These lines can be removed to simplify and modernize the script.

Contributor Author

Let's see if we can verify that both the deprecated code and the code that replaces it work; that seems like a reasonable way to do this for now, right? In the next release we can change the default provider and allow the use of the old provider with an argument or something, but eventually the plan is to remove that code from the codebase, since it has been fully replaced in all of its use cases.

Comment on lines +24 to +42
echo gcloud dataproc jobs submit pyspark \
--properties="spark:spark.executor.resource.gpu.amount=1" \
--properties="spark:spark.task.resource.gpu.amount=1" \
--properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
--cluster=${CLUSTER_NAME} \
--region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
echo gcloud dataproc jobs submit spark \
--project "${PROJECT_ID}" \
--cluster="${CLUSTER_NAME}" \
--region "${REGION}" \
--jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
--class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
--properties \
"spark.driver.resource.gpu.amount=1,"\
"spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
"spark.executor.resource.gpu.amount=1,"\
"spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"


medium

These gcloud dataproc jobs submit commands are only printed to the console via echo, not executed. This can be misleading for anyone running the script. If they are intended to be examples, they should be commented out with an explanation. If they are meant to be executed, the echo should be removed.

# Example of submitting a PySpark job with GPU resources:
# gcloud dataproc jobs submit pyspark \
#   --properties="spark:spark.executor.resource.gpu.amount=1" \
#   --properties="spark:spark.task.resource.gpu.amount=1" \
#   --properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
#   --cluster=${CLUSTER_NAME} \
#   --region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
# Example of submitting a Spark job with GPU resources for driver and executor:
# gcloud dataproc jobs submit spark \
#   --project "${PROJECT_ID}" \
#   --cluster="${CLUSTER_NAME}" \
#   --region "${REGION}" \
#   --jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
#   --class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
#   --properties \
# "spark.driver.resource.gpu.amount=1,"\
# "spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
# "spark.executor.resource.gpu.amount=1,"\
# "spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"

Contributor Author

yeah, I haven't verified whether they work or not. I'm busy and the tests that I am running are passing. I don't know if I want to add new ones right this VERY second.
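
If the echoed commands are kept as examples, one hedged alternative is a dry-run wrapper so the same lines can be printed by default and executed on demand; the wrapper name and DRY_RUN variable below are assumptions, not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: with DRY_RUN=1 (the default here) commands are
# printed for inspection; with DRY_RUN=0 they are actually executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [[ "${DRY_RUN}" == "1" ]]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# With the default DRY_RUN=1 this only prints the command line:
run gcloud dataproc jobs submit spark --cluster "demo"
```

Running the script with DRY_RUN=0 in the environment would execute the wrapped commands instead of printing them.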

@cjac cjac force-pushed the proxy-sync-2026-01 branch from 7887d15 to 7a5a629 Compare January 28, 2026 02:10
@cjac cjac force-pushed the proxy-sync-2026-01 branch 3 times, most recently from 3a92f5a to 696dd13 Compare March 13, 2026 03:00
@cjac cjac force-pushed the proxy-sync-2026-01 branch from 9cfe963 to 73be281 Compare March 18, 2026 06:07

@cjac cjac force-pushed the proxy-sync-2026-01 branch 3 times, most recently from 85d00d7 to 059ff92 Compare March 18, 2026 06:26
@cjac cjac marked this pull request as ready for review March 18, 2026 06:26
@cjac
Contributor Author

cjac commented Mar 18, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces significant changes to the Dataproc on GCE (DPGCE) environment setup scripts, focusing on state management, declarative creation, and idempotency. The changes include updating the README, adding a new audit script (audit-dpgce), modifying the create script (create-dpgce) to use a declarative model, and updating various library functions. This review identifies several potential issues, including a command injection vulnerability in script-utils.sh, the lack of explicit resource deletion in destroy-dpgce, missing error handling in gce-proxy-setup.sh, and the need for more robust state checks in dataproc/cluster.sh. It also suggests adding checks for metadata length and directory existence in dataproc/cluster.sh and audit-dpgce, respectively, and ensuring that ssh is not invoked with empty command arguments in ssh-m.


# The command_to_run should be a gcloud command with --format=json
# that returns a JSON object if the resource exists and fails otherwise.
json_output=$(eval "${command_to_run}" 2>/dev/null)


critical

The command eval "${command_to_run}" 2>/dev/null is used to execute the gcloud command. However, this command can be vulnerable to command injection if the command_to_run variable contains malicious code. Consider using a safer way to execute the gcloud command, such as using an array.
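
A minimal sketch of the array-based alternative, assuming _check_exists currently receives the command as a single string: passing it as discrete arguments means the shell never re-parses untrusted text.

```shell
#!/usr/bin/env bash
# Hypothetical eval-free variant of _check_exists: the command is passed as
# discrete arguments and executed directly, so no string is ever re-parsed
# and shell metacharacters in the arguments stay literal.
_check_exists() {
  "$@" 2>/dev/null
}

# Illustrative usage (a real caller would pass a gcloud ... --format=json command):
# _check_exists gcloud compute networks describe "${NETWORK}" --format=json
_check_exists echo '{"name":"default"}'
```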


ROUTES_RAW=$(get_state "routes")
if [[ "${ROUTES_RAW}" != "null" && "${ROUTES_RAW}" != "[]" ]]; then
mapfile -t route_names < <(echo "${ROUTES_RAW}" | jq -r '.[].name')


high

It's generally safer to explicitly specify the resources being deleted (e.g., gcloud compute firewall-rules delete ...) rather than relying on a general-purpose function that might inadvertently delete other resources.

Comment on lines +176 to +177
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."


high

The nc command lacks error handling. If the connection fails, the script should exit to prevent further execution with a broken proxy setup.

Suggested change
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
exit 1
else

Comment on lines +181 to +183
# TODO: Check if NAT is already on the router
echo "INFO: NAT Egress enabled, ensuring NAT on existing router..."
add_nat_to_router


medium

This check is useful, but it might be more robust to check if NAT is enabled within the router's configuration rather than relying on the cloudRouter state alone.

export -f delete_dpgce_cluster

function exists_dataproc_cluster_vms() {
_check_exists "gcloud compute instances list --project='${PROJECT_ID}' --filter='labels.goog-dataproc-cluster-name=${CLUSTER_NAME}' --format='json(name,zone,status)'" | jq 'if . == [] then null else . end'


medium

The _check_exists function is used to determine if a resource exists. However, it only checks if the resource exists and does not check if the resource is in the correct state. Consider adding a check to ensure that the resource is in the correct state before proceeding.
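
What a state-aware variant of that check might look like, sketched in Python for brevity; the function name is an assumption, and the status strings mirror GCE instance states.

```python
# Hypothetical state-aware existence check: instead of treating any JSON
# result as "exists", also require every listed VM to be in the RUNNING state.
import json

def cluster_vms_running(gcloud_json: str) -> bool:
    """True only if the list is non-empty and every VM reports RUNNING."""
    vms = json.loads(gcloud_json) if gcloud_json else None
    if not vms:
        return False
    return all(vm.get("status") == "RUNNING" for vm in vms)

print(cluster_vms_running('[{"name": "c-m", "status": "RUNNING"}]'))   # True
print(cluster_vms_running('[{"name": "c-m", "status": "STOPPING"}]'))  # False
print(cluster_vms_running("[]"))                                       # False
```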

fi
print_status " Attempting to delete CA Pool ${short_pool_name}..."
if ! run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources; then overall_status="Fail"; report_result "Fail"; else report_result "Deleted"; fi
run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources || true


medium

The script attempts to delete the CA pool, but doesn't check for errors. This could lead to a situation where the script continues even if the CA pool deletion fails.

Suggested change
run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources || true
if ! run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources; then overall_status="Fail"; report_result "Fail"; else report_result "Deleted"; fi

Comment on lines +70 to +74
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true


medium

The script deletes the rule and policy, but doesn't check for errors after each operation. This could lead to a situation where the script continues even if the rule or policy deletion fails.

Suggested change
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true
if run_gcloud "${policy_log}" gcloud network-security gateway-security-policies delete "${policy_name}" \

Comment on lines +70 to +74
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true


medium

Consider adding a check to ensure that the rule is actually deleted before attempting to delete the policy. This can prevent issues if the rule deletion fails.
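
One way to express "confirm the rule is gone before deleting the policy" is a small poll-until-absent helper; this is a sketch, and the helper name and attempt budget are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-run a "describe"-style check until it fails
# (resource absent) or the attempt budget is exhausted.
wait_until_absent() {
  local -i attempts="$1"; shift
  local -i i
  for ((i = 0; i < attempts; i++)); do
    if ! "$@" >/dev/null 2>&1; then
      return 0   # the check failed, so the resource no longer exists
    fi
    sleep 1
  done
  return 1       # still present after all attempts
}

# Illustrative usage with a gcloud describe as the probe (not executed here):
# wait_until_absent 10 gcloud network-security gateway-security-policies \
#   rules describe "${rule_name}" --gateway-security-policy="${policy_name}"
```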


(
source "${GCLOUD_DIR}/${source_file}"
"${func_name}" "$@" > "${cmd_out}" 2>/dev/null


medium

Redirecting stderr to /dev/null hides potential errors from the sourced script. Consider logging stderr to a file for debugging purposes.


# The command_to_run should be a gcloud command with --format=json
# that returns a JSON object if the resource exists and fails otherwise.
json_output=$(eval "${command_to_run}" 2>/dev/null)


medium

The stderr is redirected to /dev/null, which means that any errors that occur during the execution of the command will not be logged. This can make it difficult to debug the script if something goes wrong. Consider logging stderr to a file for debugging purposes.

@cjac
Contributor Author

cjac commented Mar 18, 2026

Dataproc Lifecycle Script Overhaul: From Brittle to Declarative

This series of changes represents a fundamental refactoring of the cloud-dataproc/gcloud scripts, moving from a collection of fragmented, sometimes brittle scripts to a more unified, state-aware, and declarative system for managing Dataproc environments.

Initial State & Problems:

  • Multiple, single-purpose scripts for different creation scenarios (e.g., create-dpgce-custom, create-dpgce-private).
  • Reliance on sentinel files for state management, which proved fragile.
  • Audit scripts were separate and didn't always reflect the true state of GCP.
  • create scripts were not truly idempotent, often failing if the environment wasn't perfectly clean, requiring manual destroy calls.

The Evolution:

  1. Script Consolidation (2026-03-12): The first major step was unifying the various create-dpgce-* and destroy-dpgce-* scripts into single bin/create-dpgce and bin/destroy-dpgce scripts, using flags like --custom and --private to manage variations. Audit scripts were also unified, querying live GCP for the source of truth.

  2. Intelligent Image Selection (2026-03-12): Enhanced custom image creation to check for existing images before rebuilding.

  3. The Shift to state.json (2026-03-13): A key architectural change was made to use a central state.json file as the authoritative state, generated by bin/audit-dpgce. create and destroy scripts were modified to read from this file instead of querying GCP directly. This phase was difficult, marked by significant debugging of shell syntax and line-continuation errors.

  4. Achieving True Idempotency (2026-03-13): A critical flaw was identified: create-dpgce would fail if any resource existed. This was corrected to make the script truly idempotent, where it checks for each resource individually and only creates it if missing, allowing it to converge partially built environments. This breakthrough led to the first fully successful automated cluster creation.

  5. Validation & Cleanup (2026-03-13): End-to-end validation, including GPU driver installation on a Secure Boot custom image, was successful. Efforts were made to clean up the Git history and identify lingering issues like the outdated README.

  6. Networking Nightmares (2026-03-13 - 2026-03-17): Extensive time was spent debugging network connectivity issues that prevented cluster registration. This was eventually traced to VMs in us-west4-a receiving /32 netmasks.

  7. Transition to SQLite & Declarative Planning (2026-03-18):

    • To address state management robustness, state.json was replaced with an SQLite database (state.db), providing atomic updates.
    • bin/audit-dpgce was overhauled to run checks concurrently and populate the SQLite DB.
    • bin/create-dpgce was transformed into a fully declarative, plan-based system:
      • It reads the current state from state.db.
      • Determines intended resources based on flags.
      • Generates and displays a PLAN of actions (CREATE, RECREATE).
      • Executes the plan, including logic to recreate the cluster if the image type doesn't match the flags.
    • bin/destroy-dpgce was enhanced to use the SQLite DB and ensure complete network teardown by deleting the default-internet- route.
  8. Stability & Feature Additions (2026-03-18):

    • Added --nat-egress and placeholder --swp-egress flags.
    • Made NAT creation idempotent in lib/network/router.sh.
    • Added --quiet to autoscaling policy import to prevent hangs.
    • Successfully tested create-dpgce --custom --nat-egress.
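
The SQLite-backed state store from step 7 can be sketched as below; the table schema and function name are illustrative assumptions, not the PR's actual state.db layout.

```python
# Hypothetical sketch of the state.db idea: an SQLite table keyed by resource
# name, updated atomically inside a transaction so a crashed audit run can
# never leave a half-written state file behind.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE state (resource TEXT PRIMARY KEY, status TEXT)")

def record_state(resource: str, status: str) -> None:
    with db:  # the context manager commits atomically, rolls back on error
        db.execute(
            "INSERT INTO state VALUES (?, ?) "
            "ON CONFLICT(resource) DO UPDATE SET status = excluded.status",
            (resource, status),
        )

record_state("cluster", "ABSENT")
record_state("cluster", "RUNNING")  # upsert replaces the earlier row
print(db.execute("SELECT status FROM state WHERE resource = 'cluster'").fetchone()[0])
# → RUNNING
```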

Current State:

The scripts are now much more robust, predictable, and easier to use. The declarative approach in create-dpgce, backed by the SQLite cache, allows for reliable environment provisioning and convergence. Key functionalities like custom image and NAT egress are working.

Key Improvements for Review:

  • Unified Scripts: Fewer scripts to manage.
  • SQLite State: Robust and atomic state caching.
  • Declarative Creation: Clear planning and execution in create-dpgce.
  • Idempotency: Scripts can be re-run safely.
  • Improved Teardown: More reliable network and resource deletion.

This narrative provides context for the significant architectural changes and the iterative debugging and refinement process that led to the current state of the scripts.
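
The plan phase described above can be sketched as a diff between desired and audited state; the data shapes and action names below are illustrative, not the actual create-dpgce implementation.

```python
# Hypothetical sketch of the plan-based model: diff the desired resources
# against the audited state and emit CREATE/RECREATE actions for execution.
def build_plan(desired: dict, actual: dict) -> list:
    plan = []
    for name, spec in desired.items():
        current = actual.get(name)
        if current is None:
            plan.append(("CREATE", name))
        elif current != spec:
            plan.append(("RECREATE", name))  # e.g. image type mismatch
    return plan

actual = {"network": "default", "cluster": "standard-image"}
desired = {"network": "default", "cluster": "secure-boot-image", "router": "nat"}
print(build_plan(desired, actual))
# → [('RECREATE', 'cluster'), ('CREATE', 'router')]
```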

@cjac
Contributor Author

cjac commented Mar 18, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a major refactoring of the Dataproc environment scripts, moving towards a declarative, state-driven model using an SQLite database. It also adds significant new functionality, including GPU testing scripts, enhanced documentation, and improved setup for private clusters with Secure Web Proxy. The changes are extensive and generally well-executed, greatly improving the structure and robustness of the scripts. However, I've identified several critical and high-severity issues, primarily related to new or refactored scripts calling non-existent functions or files, broken CI configurations, and bugs in test scripts. Addressing these issues is crucial for the stability and correctness of the new implementation.

@@ -0,0 +1 @@
#!/bin/bashsource lib/env.shgcloud network-services gateways describe ${SWP_INSTANCE_NAME} --location ${REGION}gcloud network-security gateway-security-policies rules export allow-all-rule \ --gateway-security-policy=${SWP_POLICY_NAME} \ --location ${REGION}#gcloud logging read \# "resource.type=\"networkservices.googleapis.com/Gateway\" \# resource.labels.location=\"${REGION}\" \# resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \# --project="${PROJECT_ID}" \# --limit=100 \# --format=json # Or text, defaultgcloud logging read \ "resource.type=\"networkservices.googleapis.com/Gateway\" \ resource.labels.location=\"${REGION}\" \ resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \ --project="${PROJECT_ID}" \ --limit=500 \ --freshness=1d # Look back over the last day# Delete the old test if it existsgcloud network-management connectivity-tests delete swp-test --quiet# Create a new onegcloud network-management connectivity-tests create swp-test \ --source-instance=projects/${PROJECT_ID}/zones/${ZONE}/instances/${CLUSTER_NAME}-m \ --destination-ip-address=10.43.79.245 \ --destination-port=3128 \ --protocol=TCP \ --source-network=projects/${PROJECT_ID}/global/networks/${NETWORK}# Wait and describesleep 5# After a moment, get the results:gcloud network-management connectivity-tests describe swp-test No newline at end of file

critical

This file appears to be a scratchpad of commands rather than a valid script. It's missing a shebang, and all commands are concatenated on a single line. This will fail to execute correctly. Please format this as a proper bash script with a shebang and each command on a new line.

#!/bin/bash
source lib/env.sh
gcloud network-services gateways describe ${SWP_INSTANCE_NAME} --location ${REGION}
gcloud network-security gateway-security-policies rules export allow-all-rule \
  --gateway-security-policy=${SWP_POLICY_NAME} \
  --location ${REGION}

#gcloud logging read \
#  "resource.type=\"networkservices.googleapis.com/Gateway\" \
#  resource.labels.location=\"${REGION}\" \
#  resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \
#  --project="${PROJECT_ID}" \
#  --limit=100 \
#  --format=json # Or text, default

gcloud logging read \
  "resource.type=\"networkservices.googleapis.com/Gateway\" \
  resource.labels.location=\"${REGION}\" \
  resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \
  --project="${PROJECT_ID}" \
  --limit=500 \
  --freshness=1d # Look back over the last day

# Delete the old test if it exists
gcloud network-management connectivity-tests delete swp-test --quiet

# Create a new one
gcloud network-management connectivity-tests create swp-test \
  --source-instance=projects/${PROJECT_ID}/zones/${ZONE}/instances/${CLUSTER_NAME}-m \
  --destination-ip-address=10.43.79.245 \
  --destination-port=3128 \
  --protocol=TCP \
  --source-network=projects/${PROJECT_ID}/global/networks/${NETWORK}

# Wait and describe
sleep 5

# After a moment, get the results:
gcloud network-management connectivity-tests describe swp-test


echo "DEBUG: set_proxy: Testing external site access via proxy..."
local test_url="https://www.google.com"
if curl -vL ${curl_retry_args} -o /dev/null "${test_url}"; then

critical

The variable ${curl_retry_args} is used here but it is not defined anywhere in the script. This will cause the curl command to fail. You should define this variable with appropriate retry arguments, for example: local curl_retry_args="--retry-connrefused --retry 10 --retry-max-time 30".
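A minimal sketch of the fix, assuming typical retry flags (the flag values and the wrapper name are illustrative, not taken from the PR):

```shell
# Sketch: define the retry flags before use; values shown are assumptions.
set_proxy_test() {
  local curl_retry_args="--retry-connrefused --retry 10 --retry-max-time 30"
  local test_url="${1:-https://www.google.com}"
  echo "DEBUG: set_proxy: Testing external site access via proxy..."
  # Word-splitting of the flag string is intentional here:
  # shellcheck disable=SC2086
  if curl -sL ${curl_retry_args} -o /dev/null "${test_url}"; then
    echo "proxy test: OK"
  else
    echo "proxy test: FAILED" >&2
    return 1
  fi
}
```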

Comment on lines +77 to +80
- 'PROJECT_ID=cjac-2025-01' # Use CI_PROJECT_ID here
- 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
substitutions:
_CI_PROJECT_ID: "cjac-2025-01"

high

The PROJECT_ID and _CI_PROJECT_ID are hardcoded in this file. This makes the Cloud Build configuration specific to the cjac-2025-01 project and not reusable. You should use Cloud Build substitutions (e.g., $_CI_PROJECT_ID or $PROJECT_ID) to make this configuration generic, as was done in the setup-cicd.sh script.

  - 'PROJECT_ID=${_CI_PROJECT_ID}' # Use CI_PROJECT_ID here
  - 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
substitutions:
  _CI_PROJECT_ID: "${PROJECT_ID}"
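More generally, a Cloud Build config stays project-agnostic when the user substitution carries a default that can be overridden at submit time. A sketch (step contents are illustrative, not the PR's actual steps):

```yaml
# Sketch: _CI_PROJECT_ID gets a placeholder default here and is overridden
# per invocation, so the config is not tied to any one project.
substitutions:
  _CI_PROJECT_ID: "my-default-project"   # placeholder default
steps:
  - name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    env:
      - 'PROJECT_ID=${_CI_PROJECT_ID}'
      - 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
    args: ['-c', 'echo "Running against ${PROJECT_ID}"']
```

Submitted with, for example, `gcloud builds submit --substitutions=_CI_PROJECT_ID=some-project .`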

- name: 'gcr.io/cloud-builders/gcloud'
id: 'Test Standard Create'
entrypoint: 'bash'
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']

high

The generated cloudbuild.yaml file attempts to call audit-dpgce-create, which has been removed in this pull request as part of the refactoring to a single audit-dpgce script. This will cause the CI build to fail. The audit step should be updated to use the new universal audit script, bin/audit-dpgce. This issue is present for all test steps in the generated file.

Suggested change
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce']

- name: 'gcr.io/cloud-builders/gcloud'
id: 'Test Standard Create'
entrypoint: 'bash'
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']

high

This Cloud Build step calls audit-dpgce-create, but this script has been removed in this PR. The build will fail. You should update this to use the new universal bin/audit-dpgce script. This applies to all other audit steps in this file as well.

    args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce']

Comment on lines +130 to +131
if echo "${PROJECT_ID_RAW}" | /usr/bin/jq empty > /dev/null 2>&1; then
PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | /usr/bin/jq -r '.projectId // "N/A"')

medium

The path to jq is hardcoded as /usr/bin/jq. This is not portable and may fail on systems where jq is installed in a different location (e.g., /usr/local/bin). It's better to rely on the PATH environment variable and just call jq directly, as it's listed as a prerequisite in the README.

Suggested change
if echo "${PROJECT_ID_RAW}" | /usr/bin/jq empty > /dev/null 2>&1; then
PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | /usr/bin/jq -r '.projectId // "N/A"')
if echo "${PROJECT_ID_RAW}" | jq empty > /dev/null 2>&1; then
PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | jq -r '.projectId // "N/A"')
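One way to make the PATH reliance explicit is to fail fast when a prerequisite binary is missing; a sketch (the helper name is invented here, not part of the PR):

```shell
# Sketch: resolve prerequisites via PATH instead of hardcoding /usr/bin paths.
require_tool() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "prerequisite '$1' not found in PATH" >&2
    return 1
  }
}

# e.g. near the top of the script:
# require_tool jq || exit 1
```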

env.cpan.json

# Temporary files and logs
/tmp/

medium

The entry /tmp/ will only ignore a tmp directory at the root of the gcloud directory. The scripts, however, seem to use the absolute path /tmp/dataproc-repro/... for logs and temporary files, which is outside the repository and cannot be ignored by .gitignore. If the intention is to ignore a local tmp directory for other purposes, consider changing this to tmp/ to avoid ambiguity with the system's root /tmp directory.
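The distinction, in .gitignore pattern terms:

```
# A leading slash anchors the pattern to the directory containing .gitignore:
/tmp/   # matches only <repo-root>/tmp/ -- never the system /tmp
# Without it, the pattern matches at any depth:
tmp/    # matches any directory named tmp anywhere in the tree
```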

@@ -0,0 +1,188 @@
#!/bin/bash

medium

This new file cluster-custom.sh appears to be unused in the refactored create-dpgce script or any other part of the PR. If it's work-in-progress or dead code, consider removing it to avoid confusion and reduce maintenance overhead.

Comment on lines +5 to +8
function exists_firewall() {
# This is a basic check. A more robust version might check for a list of rules.
_check_exists "gcloud compute firewall-rules describe '${FIREWALL}-in-ssh' --project='${PROJECT_ID}' --format='json(name,selfLink)'"
}

medium

The function exists_firewall seems to be dead code. The main audit script (bin/audit-dpgce) uses the more generic exists_firewall_rule function from lib/swp/firewall.sh to check for firewall rules. To avoid confusion and code duplication, consider removing this unused function.

export -f create_managed_certificate

function _delete_swp_ca_resources() {
function delete_managed_certificate() {

medium

The refactored delete_managed_certificate function appears to have lost the conditional logic based on the --force flag that existed in the previous implementation. The old behavior was to preserve versioned CA components unless --force was specified. Now, it seems all versioned components are deleted unconditionally. If preserving these resources by default is still desired, the --force check should be reintroduced.
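A sketch of the gate the review describes (deletions are stubbed with echo; the real implementation calls gcloud, and the exact resources affected are inferred from the comment):

```shell
# Sketch: reintroduce the --force check so versioned CA components
# are preserved by default.
delete_managed_certificate() {
  local force="${1:-}"
  echo "deleting managed certificate"
  if [ "${force}" = "--force" ]; then
    echo "deleting versioned CA components"
  else
    echo "preserving versioned CA components (pass --force to delete them)"
  fi
}
```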

@cjac cjac force-pushed the proxy-sync-2026-01 branch from 656ea83 to 059ff92 Compare March 18, 2026 07:02
cjac added 6 commits March 19, 2026 02:39
- Replaced separate create/destroy/audit scripts for standard and private clusters with unified scripts in bin/ (e.g., create-dpgce, destroy-dpgce, audit-dpgce).
- Moved common functions to lib/script-utils.sh.
- Updated lib/env.sh with new variables and logic.
- Added new utility scripts like connectivity-test and debug-init-action.
- Added cluster creation logic for custom and private cluster types in lib/dataproc/.
- Updated various library functions in lib/ to support the new unified scripts.
- Added new scripts for CI/CD setup (bin/setup-cicd.sh) and GitHub triggers (bin/create-github-trigger.sh).
- Updated README.md to reflect the new script organization.
- Enhanced .gitignore.
- Added env.json and updated env.json.sample.
- Added cloudbuild.yaml for CI/CD.
- Updated init/gce-proxy-setup.sh.
- Added test scripts in t/.
- Added project findings and work narratives.
Implements a robust audit system for the DPGCE environment:

-   Replaces the state.json file with an SQLite database (state.db) for atomic and concurrent-safe state management.
-   Refactors `bin/audit-dpgce` to perform checks for all resources concurrently, writing results to temporary files.
-   Updates the SQLite database sequentially from temp files to prevent locking issues.
-   Adds `exists_*` functions to numerous lib scripts to support the audit.
-   Ensures all necessary functions are exported from their respective lib files.
-   Introduces `report_audit_status` for clearer [Exists]/[Not Found] reporting.
-   Updates README.md with `sqlite3` prerequisite.
-   Fixes various bugs in the audit script and helper functions.

This provides a much faster, more reliable, and comprehensive audit of the environment.
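The state-cache and reporting helpers described above can be sketched as follows (the schema and helper bodies are inferred from the commit notes, which mention a 'state' table with a 'value' column, not copied from the PR):

```shell
# Sketch, assuming a 'state' table with key/value columns.
STATE_DB="${STATE_DB:-state.db}"

init_state_db() {
  sqlite3 "${STATE_DB}" \
    "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT);"
}

update_state() {  # update_state <resource> <json-or-null>
  sqlite3 "${STATE_DB}" \
    "INSERT INTO state (key, value) VALUES ('$1', '$2')
     ON CONFLICT(key) DO UPDATE SET value = excluded.value;"
}

get_state() {     # get_state <resource>
  sqlite3 "${STATE_DB}" "SELECT value FROM state WHERE key = '$1';"
}

report_audit_status() {  # report_audit_status <resource>
  local value; value="$(get_state "$1")"
  if [ -n "${value}" ] && [ "${value}" != "null" ]; then
    printf '[Exists]    %s\n' "$1"
  else
    printf '[Not Found] %s\n' "$1"
  fi
}
```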
This commit introduces a declarative, plan-based approach to the `create-dpgce` script and significantly improves the robustness and completeness of the `destroy-dpgce` script.

**`create-dpgce` Refactoring:**

*   Replaced the initial audit with loading state from the SQLite cache (`init_state_db`).
*   Added `display_status` function to show the current state of resources.
*   Implemented plan generation by comparing the current state with the intended state based on flags.
*   Added a plan execution block that creates resources in the correct order based on the generated plan.
*   Removed redundant `check_project` and `check_billing` calls.
*   Ensured all state checks use `get_state`.

**`destroy-dpgce` Enhancements:**

*   Integrated `get_state` for all resource existence checks, replacing old `jq` logic.
*   Added `delete_all_network_vms` to ensure all VMs in the network are deleted before network teardown.
*   Included more library scripts to cover a wider range of resources for deletion (PHS, GKE, DBs, SWP, etc.).
*   Refined route deletion to skip default internet and local routes.
*   Added cache updates (`update_state "resource" "null"`) after successful deletions for many resources.

**Library Updates (`lib/`):**

*   Added `refresh_resource_state` or `update_state` calls to CUD functions in:
    *   `dataproc/autoscaling.sh`
    *   `dataproc/cluster.sh`
    *   `dataproc/private-cluster.sh`
    *   `gcp/gcr.sh`
    *   `gcp/gcs.sh`
    *   `gcp/iam.sh`
    *   `gcp/kms.sh`
    *   `network/network.sh`
    *   `network/peering.sh`
    *   `network/router.sh`
    *   `network/routes.sh`
    *   `network/subnet.sh`
*   Added `exists_dataproc_cluster_vms` to `lib/dataproc/cluster.sh`.
*   Fixed `create_subnet` to accept parameters for name, key, and range.
*   Corrected line continuation in `create_vpc_network` in `lib/network/network.sh`.

**Other Changes:**

*   Fixed `recreate-cluster.sh` to correctly call `gcloud dataproc clusters describe` for displaying cluster details.
*   Updated `env.json` to use `us-west4-a`.
*   Cleaned up `lib/gcp/misc.sh`.

These changes make the scripts more reliable, idempotent, and easier to understand.
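The plan-generation step can be sketched like this (resource names, ordering, and the stubbed state are illustrative; in the real script the state comes from the SQLite cache, not an array):

```shell
# get_state is stubbed with an associative array standing in for state.db.
declare -A STATE=( [network]='{"name":"net0"}' [subnet]='null' )
get_state() { printf '%s' "${STATE[$1]:-}"; }

generate_plan() {
  local resource plan=()
  # Dependency order: parents before children.
  for resource in network subnet router nat cluster; do
    case "$(get_state "${resource}")" in
      ""|null) plan+=("${resource}") ;;  # missing -> schedule for creation
    esac
  done
  printf '%s\n' "${plan[@]}"
}
```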
- Ensure all calls to _check_exists pass command and arguments
  as separate array elements, not a single string. This fixes
  the audit not detecting some existing resources.
- Refactor refresh_resource_state to handle array arguments.
- Add exit 1 to init/gce-proxy-setup.sh if proxy connection test fails.
- Improve stderr logging in bin/audit-dpgce background checks.
- Fail script in lib/swp/policy.sh if security rule deletion fails.
- Add clarifying comments in bin/destroy-dpgce.
- Fix relative paths in lib/env.sh and lib/dataproc/private-cluster.sh.
- Run audit at the start of create-dpgce.
- Add NAT pre-delete back to lib/network/router.sh.
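The array-argument convention that first fix enforces can be sketched as follows (the function body is illustrative; only the name `_check_exists` comes from the PR):

```shell
_check_exists() {
  # "$@" keeps every argument a distinct word; passing one big string
  # would be re-split by the shell and break quoted values.
  "$@" >/dev/null 2>&1
}

# Correct usage: command and arguments as separate array elements.
cmd=(grep -q "root" /etc/passwd)
_check_exists "${cmd[@]}" && echo "resource exists"
```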
@cjac cjac force-pushed the proxy-sync-2026-01 branch from 059ff92 to 729999a Compare March 19, 2026 02:40
- Switched to client-side jq filtering for default route checks in
  lib/network/routes.sh due to gcloud filter unreliability.
- Made create-dpgce idempotent for network infrastructure, ensuring
  components like NAT and default routes are created if missing,
  even if the Dataproc cluster already exists.
- create-dpgce no longer errors out if a cluster exists with a
  different configuration; it only creates the cluster if missing.
- Added Cloud NAT checks to the audit-dpgce script and report.
- Corrected GCLOUD_DIR path resolution in lib/env.sh to be
  self-contained and robust.
- Fixed bin/ssh-m to correctly set GCLOUD_DIR.
- Simplified SCRIPT_DIR assignment in all bin/ scripts.
- Added gcloud/.gitignore to exclude env.json.
- Removed env.json from git history using filter-branch.
- Rebased onto upstream/main.
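The client-side jq filter for default routes can be sketched as follows (field names follow the Compute Engine routes JSON schema; the PR's exact expression may differ):

```shell
# Reads route JSON on stdin (e.g. from `gcloud compute routes list --format=json`)
# and prints only the names of default internet routes.
list_default_internet_routes() {
  jq -r '.[]
         | select(.destRange == "0.0.0.0/0")
         | select((.nextHopGateway // "") | endswith("default-internet-gateway"))
         | .name'
}
```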
@cjac cjac force-pushed the proxy-sync-2026-01 branch from 729999a to 8ce83a7 Compare March 19, 2026 02:43
cjac added 2 commits March 19, 2026 04:01
- Reworked `bin/destroy-dpgce` to run an audit first, then generate
  a teardown plan based on discovered resources.
- Added a context-aware status report to `destroy-dpgce` to display
  the state of resources found by the audit, respecting intent flags
  like --nat-egress, --custom, etc.
- Updated `destroy-dpgce` to accept the same intent flags as
  `create-dpgce` to tailor the status report.
- Made plan generation for network VMs and custom routes conditional
  on their detection in the audit.
- Added `get_any_network_vms_state` to `lib/gcp/misc.sh` and
  an `anyNetworkVms` check to `bin/audit-dpgce` to support more
  accurate teardown planning for any VM in the network.
- Ensured `destroy-dpgce` uses a unique TIMESTAMP for each run
  to isolate state.
- Updated `README.md` to reflect these significant changes.
- Re-instate unique TIMESTAMP generation per run for create-dpgce
  and destroy-dpgce to ensure state isolation.
- Corrected SQLite schema usage in script-utils.sh to use the
  'state' table and 'value' column consistently.
- Refactored Service Account management in lib/gcp/iam.sh:
  - Added auditing for SA IAM bindings (serviceAccountBindings)
    and readiness (serviceAccountRolesReady).
  - create_service_account now verifies all required roles are bound.
  - delete_service_account now properly removes role bindings
    before deleting the SA.
  - Fixed gcloud command execution within run_gcloud by using arrays
    to handle multi-line arguments correctly.
- Enhanced run_gcloud to suppress output for expected errors like
  IAM binding not found.
- Optimized create_subnet in lib/network/subnet.sh to avoid
  refetching details after creation.
- Removed various debug messages.
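The expected-error suppression described for `run_gcloud` can be sketched like this (the wrapper name and regex parameter are illustrative, not the PR's actual signature):

```shell
run_tolerating() {  # run_tolerating <expected-error-regex> <command...>
  local expected="$1"; shift
  local err
  if ! err="$("$@" 2>&1)"; then
    if printf '%s' "${err}" | grep -qE "${expected}"; then
      return 0                      # expected failure (e.g. binding not found)
    fi
    printf '%s\n' "${err}" >&2      # unexpected failure: surface it
    return 1
  fi
}
```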
@cjac cjac requested a review from Abinash-Sharma March 20, 2026 07:13
@cjac
Contributor Author

cjac commented Mar 20, 2026

/gemini review
