
Feat: Enhance GPU Tests, Docs & Private Cluster Setup #181

Open
cjac wants to merge 9 commits into GoogleCloudDataproc:main from LLC-Technologies-Collier:proxy-sync-2026-01

Conversation

@cjac
Contributor

@cjac cjac commented Jan 27, 2026

This set of changes introduces significant enhancements to the Dataproc environment reproduction scripts, focusing on documentation, private cluster setups with Secure Web Proxy (SWP), GPU testing, and code clarity.

Key Changes:

  1. README.md Overhaul:

    • The README has been substantially restructured and expanded to improve clarity and usability.
    • A "Prerequisites" section was added.
    • The "Setup" section now provides more detail on env.json variables.
    • The "Scripts" section is now better organized into "Environment Management," "Audit Scripts," and "Utility Scripts," with more descriptive entries for each script, including the new test scripts.
    • The "Using Custom Images" section was added to guide users on building specialized images, particularly for Secure Boot, pointing to the custom-images repository.
    • A "Script Details" section provides deeper explanations of the create-*, destroy-*, and recreate-* scripts.
    • Language has been updated to use "-m node" instead of potentially problematic terms.
  2. New Test Scripts (t/ directory):

    • t/scripts/verify_tensorflow.py: A Python script to check TensorFlow GPU detection.
    • t/scripts/verify_torch.py: A Python script to check PyTorch CUDA availability and GPU details.
    • t/spark-gpu-test.sh: A new test runner script that:
      • Copies the t/scripts directory to the cluster's -m node using bin/scp-m.
      • Executes the Python GPU verification scripts on the -m node via SSH.
      • Runs a series of Spark jobs (SparkPi and JavaIndexToStringExample) with various GPU-related properties to test GPU allocation and functionality within YARN.
  3. Script Modifications:

    • bin/destroy-dpgce-private: Adjusted the order of resource deletion for the SWP teardown.
    • bin/recreate-dpgce: Enabled the gsutil cp command to ensure initialization actions are always updated from the local init/ directory during cluster recreation.
    • bin/scp-m: Updated usage text to reflect its purpose of copying to the -m node.
    • bin/ssh-m: Renamed internal variables for clarity (e.g., M_HOSTNAME).
    • lib/dataproc/cluster.sh: Renamed environment variables (MASTER_MACHINE_TYPE to M_MACHINE_TYPE, MASTER_ACCELERATOR_TYPE to M_ACCELERATOR_TYPE) for consistency. Includes many commented-out options, indicating active development and testing of various cluster configurations (proxy, image versions, secure boot).
    • lib/env.sh: Updated to use the new M_ prefixed variables and added many new variables to support the private cluster and SWP configurations.
    • lib/shared-functions.sh: Adapted to the variable renames in env.sh.
    • lib/swp/certs.sh: Enhanced to support versioned certificates and sentinels, making SWP component management more robust and idempotent per run.
    • lib/swp/policy.sh: Added export for SWP_POLICY_URI_PARTIAL.

In essence, this change adds comprehensive GPU validation, improves documentation, refines the private cluster setup, and makes various scripts more robust.
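
The versioned-certificate and sentinel behavior described for lib/swp/certs.sh can be sketched roughly as follows. This is a hedged illustration only: the function name, sentinel path, and version variable are assumptions, not the PR's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of per-run versioned sentinels: a resource tagged with a
# version is provisioned once, and re-runs for the same version are no-ops.
provision_certs() {
  local version="$1" workdir="$2"
  local sentinel="${workdir}/swp-cert.${version}.done"
  if [[ -f "${sentinel}" ]]; then
    echo "certs ${version} already provisioned; skipping"
  else
    echo "provisioning certs ${version}"   # real cert-creation work would go here
    touch "${sentinel}"
  fi
}

demo_dir="$(mktemp -d)"
provision_certs v1 "${demo_dir}"   # does the work
provision_certs v1 "${demo_dir}"   # idempotent: skips
provision_certs v2 "${demo_dir}"   # new version: works again
```

On a second invocation with the same version, the function becomes a no-op, which is the per-run idempotency property the description claims.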

@gemini-code-assist

Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the Dataproc environment reproduction scripts by introducing a robust framework for GPU testing, enhancing the clarity and completeness of the documentation, and improving the reliability and idempotency of private cluster setups utilizing Secure Web Proxy. These changes aim to streamline development, testing, and deployment workflows for GPU-accelerated and secure Dataproc environments.

Highlights

  • Enhanced GPU Testing Framework: Introduced new Python scripts (verify_tensorflow.py, verify_torch.py) to check GPU detection and CUDA availability, along with a comprehensive spark-gpu-test.sh script to run Spark jobs with various GPU configurations on YARN.
  • Improved Documentation: The README.md has been significantly expanded and reorganized, adding sections for prerequisites, detailed env.json variables, structured script categories, guidance on using custom images (especially for Secure Boot), and in-depth explanations of core create-*, destroy-*, and recreate-* scripts.
  • Robust Private Cluster Setup (SWP): Modifications to lib/swp/certs.sh introduce versioned certificates and sentinels for Secure Web Proxy (SWP) components, making their creation and deletion more robust and idempotent. The destroy-dpgce-private script now handles SWP teardown more effectively by adjusting the order of resource deletion.
  • Script Refinements and Consistency: Key scripts like bin/scp-m, bin/ssh-m, lib/dataproc/cluster.sh, and lib/env.sh have undergone internal variable renames (e.g., MASTER_MACHINE_TYPE to M_MACHINE_TYPE) for better consistency. bin/recreate-dpgce now reliably updates initialization actions to GCS.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances the Dataproc reproduction scripts by adding comprehensive GPU testing, overhauling the documentation for clarity, and improving the setup for private clusters with Secure Web Proxy. The changes are well-structured, particularly the move to versioned resources for SWP components, which improves idempotency. My review focuses on improving script robustness, maintainability, and clarity. Key suggestions include refactoring duplicated logic, ensuring IAM permissions are applied robustly, and cleaning up test scripts and documentation examples for better readability and correctness. Overall, this is a solid set of improvements.

Comment on lines +26 to +29
--master-accelerator "type=${M_ACCELERATOR_TYPE}" \
--worker-accelerator "type=${PRIMARY_ACCELERATOR_TYPE}" \
--secondary-worker-accelerator "type=${SECONDARY_ACCELERATOR_TYPE}" \
--master-machine-type "${MASTER_MACHINE_TYPE}" \
--master-machine-type "${M_MACHINE_TYPE}" \


high

While these variable renames improve consistency, this file (shared-functions.sh) appears to contain a significant amount of logic that is either duplicated or superseded by the functions in the more specific lib/ subdirectories (e.g., lib/dataproc/cluster.sh). For instance, it defines its own create_dpgce_cluster. To improve maintainability and reduce confusion, this entire file should be evaluated for refactoring or deprecation, with its functionality fully migrated to the newer, more modular scripts.

gcloud/README.md Outdated
# After building the image using scripts in custom-images/examples/secure-boot/

# In lib/dataproc/cluster.sh
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"


medium

The example for CLUSTER_IMAGE contains a hardcoded project ID (cjac-2025-01). For better consistency and usability, it would be clearer to use a placeholder like YOUR_PROJECT, similar to the example on line 110.

Suggested change
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
CLUSTER_IMAGE="projects/YOUR_PROJECT/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"

Comment on lines 26 to 33
if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
master_num="$1"
echo "master_num: $master_num"
MASTER_HOSTNAME="${MASTER_HOSTNAME}-${master_num}"
m_num="$1"
echo "m_num: $m_num"
M_HOSTNAME="${M_HOSTNAME}-${m_num}"
shift 1
else
MASTER_HOSTNAME="${MASTER_HOSTNAME}"
M_HOSTNAME="${M_HOSTNAME}"
fi


medium

The logic to detect the node index can be improved for readability and robustness. The perl command can be replaced with a more idiomatic and native bash regular expression. Additionally, the else block is redundant and can be removed. While the ambiguity of a numeric filename vs. an index remains, these changes make the script cleaner.

Suggested change
if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
master_num="$1"
echo "master_num: $master_num"
MASTER_HOSTNAME="${MASTER_HOSTNAME}-${master_num}"
m_num="$1"
echo "m_num: $m_num"
M_HOSTNAME="${M_HOSTNAME}-${m_num}"
shift 1
else
MASTER_HOSTNAME="${MASTER_HOSTNAME}"
M_HOSTNAME="${M_HOSTNAME}"
fi
if [[ "$1" =~ ^[0-9]+$ ]]; then
m_num="$1"
echo "m_num: $m_num"
M_HOSTNAME="${M_HOSTNAME}-${m_num}"
shift 1
fi

Contributor Author

There is no condition for setting the M_HOSTNAME value when the regex doesn't match.
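
As a minimal illustration of why no else branch is needed (the hostname value is made up for the demo): when the argument is not all digits, M_HOSTNAME simply retains its prior value.

```shell
#!/usr/bin/env bash
# Minimal sketch of the native-bash numeric test from the suggestion above.
# When $1 is not all digits, M_HOSTNAME is left exactly as it was, so no
# explicit else branch is required.
node_hostname() {
  local M_HOSTNAME="cluster-m"
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    M_HOSTNAME="${M_HOSTNAME}-$1"
  fi
  echo "${M_HOSTNAME}"
}

node_hostname 2          # prints cluster-m-2
node_hostname file.txt   # prints cluster-m
```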

Comment on lines +76 to +81
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20251108-180659-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-002929-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-tf"


medium

This section contains a large number of commented-out image URIs. This makes the active configuration difficult to parse and increases maintenance overhead. To improve clarity, consider moving these examples to a separate documentation file (like docs/dataproc_cluster_examples.md) or removing them if they are for temporary development purposes.

Contributor Author

Let's mark these as to be removed before release. Until then, they are helping me to recall which images function properly.

Comment on lines +14 to +19
is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))

#is_cuda_gpu_min_3 = tf.config.list_physical_devices('GPU',True, (3,0))
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3,0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))


medium

The function tf.test.is_gpu_available() is deprecated in TensorFlow 2.x and should be avoided. The check for available GPUs is already performed on line 10 using the recommended tf.config.list_physical_devices('GPU'). These lines can be removed to simplify and modernize the script.

Contributor Author

Let's see if we can verify that both the deprecated code and the code that replaces it work; that seems like a reasonable way to do this for now, right? In the next release we can change the default provider and allow the use of the old provider with an argument or something, but eventually the plan is to remove that code from the codebase, since it has been fully replaced in all of its use cases.

Comment on lines +24 to +42
echo gcloud dataproc jobs submit pyspark \
--properties="spark:spark.executor.resource.gpu.amount=1" \
--properties="spark:spark.task.resource.gpu.amount=1" \
--properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
--cluster=${CLUSTER_NAME} \
--region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
echo gcloud dataproc jobs submit spark \
--project "${PROJECT_ID}" \
--cluster="${CLUSTER_NAME}" \
--region "${REGION}" \
--jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
--class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
--properties \
"spark.driver.resource.gpu.amount=1,"\
"spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
"spark.executor.resource.gpu.amount=1,"\
"spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"


medium

These gcloud dataproc jobs submit commands are only printed to the console via echo, not executed. This can be misleading for anyone running the script. If they are intended to be examples, they should be commented out with an explanation. If they are meant to be executed, the echo should be removed.

# Example of submitting a PySpark job with GPU resources:
# gcloud dataproc jobs submit pyspark \
#   --properties="spark:spark.executor.resource.gpu.amount=1" \
#   --properties="spark:spark.task.resource.gpu.amount=1" \
#   --properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
#   --cluster=${CLUSTER_NAME} \
#   --region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
# Example of submitting a Spark job with GPU resources for driver and executor:
# gcloud dataproc jobs submit spark \
#   --project "${PROJECT_ID}" \
#   --cluster="${CLUSTER_NAME}" \
#   --region "${REGION}" \
#   --jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
#   --class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
#   --properties \
# "spark.driver.resource.gpu.amount=1,"\
# "spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
# "spark.executor.resource.gpu.amount=1,"\
# "spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"

Contributor Author

yeah, I haven't verified whether they work or not. I'm busy and the tests that I am running are passing. I don't know if I want to add new ones right this VERY second.
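
If the echoed commands are kept as examples, one hedged alternative is a dry-run wrapper so the same lines can be printed by default and executed on demand; the wrapper name and DRY_RUN variable below are assumptions, not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: with DRY_RUN=1 (the default here) commands are
# printed for inspection; with DRY_RUN=0 they are actually executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [[ "${DRY_RUN}" == "1" ]]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# With the default DRY_RUN=1 this only prints the command line:
run gcloud dataproc jobs submit spark --cluster "demo"
```

Running the script with DRY_RUN=0 in the environment would execute the wrapped commands instead of printing them.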

@cjac cjac force-pushed the proxy-sync-2026-01 branch from 7887d15 to 7a5a629 Compare January 28, 2026 02:10
@cjac cjac force-pushed the proxy-sync-2026-01 branch 3 times, most recently from 3a92f5a to 696dd13 Compare March 13, 2026 03:00
@cjac cjac force-pushed the proxy-sync-2026-01 branch from 9cfe963 to 73be281 Compare March 18, 2026 06:07

@cjac cjac force-pushed the proxy-sync-2026-01 branch 3 times, most recently from 85d00d7 to 059ff92 Compare March 18, 2026 06:26
@cjac cjac marked this pull request as ready for review March 18, 2026 06:26
@cjac
Contributor Author

cjac commented Mar 18, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces significant changes to the Dataproc on GCE (DPGCE) environment setup scripts, focusing on state management, declarative creation, and idempotency. The changes include updating the README, adding a new audit script (audit-dpgce), modifying the create script (create-dpgce) to use a declarative model, and updating various library functions. This review identifies several potential issues, including a command injection vulnerability in script-utils.sh, the lack of explicit resource deletion in destroy-dpgce, missing error handling in gce-proxy-setup.sh, and the need for more robust state checks in dataproc/cluster.sh. It also suggests adding checks for metadata length and directory existence in dataproc/cluster.sh and audit-dpgce, respectively, and ensuring that ssh is not invoked with empty command arguments in ssh-m.


# The command_to_run should be a gcloud command with --format=json
# that returns a JSON object if the resource exists and fails otherwise.
json_output=$(eval "${command_to_run}" 2>/dev/null)


critical

The command eval "${command_to_run}" 2>/dev/null is used to execute the gcloud command. However, this command can be vulnerable to command injection if the command_to_run variable contains malicious code. Consider using a safer way to execute the gcloud command, such as using an array.
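
A minimal sketch of the array-based alternative, assuming _check_exists currently receives the command as a single string: passing it as discrete arguments means the shell never re-parses untrusted text.

```shell
#!/usr/bin/env bash
# Hypothetical eval-free variant of _check_exists: the command is passed as
# discrete arguments and executed directly, so no string is ever re-parsed
# and shell metacharacters in the arguments stay literal.
_check_exists() {
  "$@" 2>/dev/null
}

# Illustrative usage (a real caller would pass a gcloud ... --format=json command):
# _check_exists gcloud compute networks describe "${NETWORK}" --format=json
_check_exists echo '{"name":"default"}'
```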


ROUTES_RAW=$(get_state "routes")
if [[ "${ROUTES_RAW}" != "null" && "${ROUTES_RAW}" != "[]" ]]; then
mapfile -t route_names < <(echo "${ROUTES_RAW}" | jq -r '.[].name')


high

It's generally safer to explicitly specify the resources being deleted (e.g., gcloud compute firewall-rules delete ...) rather than relying on a general-purpose function that might inadvertently delete other resources.

Comment on lines +176 to +177
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."


high

The nc command lacks error handling. If the connection fails, the script should exit to prevent further execution with a broken proxy setup.

Suggested change
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
exit 1
else

Comment on lines +181 to +183
# TODO: Check if NAT is already on the router
echo "INFO: NAT Egress enabled, ensuring NAT on existing router..."
add_nat_to_router


medium

This check is useful, but it might be more robust to check if NAT is enabled within the router's configuration rather than relying on the cloudRouter state alone.

export -f delete_dpgce_cluster

function exists_dataproc_cluster_vms() {
_check_exists "gcloud compute instances list --project='${PROJECT_ID}' --filter='labels.goog-dataproc-cluster-name=${CLUSTER_NAME}' --format='json(name,zone,status)'" | jq 'if . == [] then null else . end'


medium

The _check_exists function is used to determine if a resource exists. However, it only checks if the resource exists and does not check if the resource is in the correct state. Consider adding a check to ensure that the resource is in the correct state before proceeding.
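
What a state-aware variant of that check might look like, sketched in Python for brevity; the function name is an assumption, and the status strings mirror GCE instance states.

```python
# Hypothetical state-aware existence check: instead of treating any JSON
# result as "exists", also require every listed VM to be in the RUNNING state.
import json

def cluster_vms_running(gcloud_json: str) -> bool:
    """True only if the list is non-empty and every VM reports RUNNING."""
    vms = json.loads(gcloud_json) if gcloud_json else None
    if not vms:
        return False
    return all(vm.get("status") == "RUNNING" for vm in vms)

print(cluster_vms_running('[{"name": "c-m", "status": "RUNNING"}]'))   # True
print(cluster_vms_running('[{"name": "c-m", "status": "STOPPING"}]'))  # False
print(cluster_vms_running("[]"))                                       # False
```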

fi
print_status " Attempting to delete CA Pool ${short_pool_name}..."
if ! run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources; then overall_status="Fail"; report_result "Fail"; else report_result "Deleted"; fi
run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources || true


medium

The script attempts to delete the CA pool, but doesn't check for errors. This could lead to a situation where the script continues even if the CA pool deletion fails.

Suggested change
run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources || true
if ! run_gcloud "${log_file}" gcloud privateca pools delete "${short_pool_name}" --location="${region}" --project="${project_id}" --quiet --ignore-dependent-resources; then overall_status="Fail"; report_result "Fail"; else report_result "Deleted"; fi

Comment on lines +70 to +74
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true


medium

The script deletes the rule and policy, but doesn't check for errors after each operation. This could lead to a situation where the script continues even if the rule or policy deletion fails.

Suggested change
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true
if run_gcloud "${policy_log}" gcloud network-security gateway-security-policies delete "${policy_name}" \

Comment on lines +70 to +74
run_gcloud "${rule_log}" gcloud network-security gateway-security-policies rules delete "${rule_name}" \
--gateway-security-policy="${policy_name}" \
--location="${region}" \
--project="${project_id}" \
--quiet || true


medium

Consider adding a check to ensure that the rule is actually deleted before attempting to delete the policy. This can prevent issues if the rule deletion fails.
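
One way to express "confirm the rule is gone before deleting the policy" is a small poll-until-absent helper; this is a sketch, and the helper name and attempt budget are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-run a "describe"-style check until it fails
# (resource absent) or the attempt budget is exhausted.
wait_until_absent() {
  local -i attempts="$1"; shift
  local -i i
  for ((i = 0; i < attempts; i++)); do
    if ! "$@" >/dev/null 2>&1; then
      return 0   # the check failed, so the resource no longer exists
    fi
    sleep 1
  done
  return 1       # still present after all attempts
}

# Illustrative usage with a gcloud describe as the probe (not executed here):
# wait_until_absent 10 gcloud network-security gateway-security-policies \
#   rules describe "${rule_name}" --gateway-security-policy="${policy_name}"
```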


(
source "${GCLOUD_DIR}/${source_file}"
"${func_name}" "$@" > "${cmd_out}" 2>/dev/null


medium

Redirecting stderr to /dev/null hides potential errors from the sourced script. Consider logging stderr to a file for debugging purposes.


# The command_to_run should be a gcloud command with --format=json
# that returns a JSON object if the resource exists and fails otherwise.
json_output=$(eval "${command_to_run}" 2>/dev/null)


medium

The stderr is redirected to /dev/null, which means that any errors that occur during the execution of the command will not be logged. This can make it difficult to debug the script if something goes wrong. Consider logging stderr to a file for debugging purposes.

@cjac
Contributor Author

cjac commented Mar 18, 2026

Dataproc Lifecycle Script Overhaul: From Brittle to Declarative

This series of changes represents a fundamental refactoring of the cloud-dataproc/gcloud scripts, moving from a collection of fragmented, sometimes brittle scripts to a more unified, state-aware, and declarative system for managing Dataproc environments.

Initial State & Problems:

  • Multiple, single-purpose scripts for different creation scenarios (e.g., create-dpgce-custom, create-dpgce-private).
  • Reliance on sentinel files for state management, which proved fragile.
  • Audit scripts were separate and didn't always reflect the true state of GCP.
  • create scripts were not truly idempotent, often failing if the environment wasn't perfectly clean, requiring manual destroy calls.

The Evolution:

  1. Script Consolidation (2026-03-12): The first major step was unifying the various create-dpgce-* and destroy-dpgce-* scripts into single bin/create-dpgce and bin/destroy-dpgce scripts, using flags like --custom and --private to manage variations. Audit scripts were also unified, querying live GCP for the source of truth.

  2. Intelligent Image Selection (2026-03-12): Enhanced custom image creation to check for existing images before rebuilding.

  3. The Shift to state.json (2026-03-13): A key architectural change was made to use a central state.json file as the authoritative state, generated by bin/audit-dpgce. create and destroy scripts were modified to read from this file instead of querying GCP directly. This phase was difficult, marked by significant debugging of shell syntax and line-continuation errors.

  4. Achieving True Idempotency (2026-03-13): A critical flaw was identified: create-dpgce would fail if any resource existed. This was corrected to make the script truly idempotent, where it checks for each resource individually and only creates it if missing, allowing it to converge partially built environments. This breakthrough led to the first fully successful automated cluster creation.

  5. Validation & Cleanup (2026-03-13): End-to-end validation, including GPU driver installation on a Secure Boot custom image, was successful. Efforts were made to clean up the Git history and identify lingering issues like the outdated README.

  6. Networking Nightmares (2026-03-13 - 2026-03-17): Extensive time was spent debugging network connectivity issues that prevented cluster registration. This was eventually traced to VMs in us-west4-a receiving /32 netmasks.

  7. Transition to SQLite & Declarative Planning (2026-03-18):

    • To address state management robustness, state.json was replaced with an SQLite database (state.db), providing atomic updates.
    • bin/audit-dpgce was overhauled to run checks concurrently and populate the SQLite DB.
    • bin/create-dpgce was transformed into a fully declarative, plan-based system:
      • It reads the current state from state.db.
      • Determines intended resources based on flags.
      • Generates and displays a PLAN of actions (CREATE, RECREATE).
      • Executes the plan, including logic to recreate the cluster if the image type doesn't match the flags.
    • bin/destroy-dpgce was enhanced to use the SQLite DB and ensure complete network teardown by deleting the default-internet- route.
  8. Stability & Feature Additions (2026-03-18):

    • Added --nat-egress and placeholder --swp-egress flags.
    • Made NAT creation idempotent in lib/network/router.sh.
    • Added --quiet to autoscaling policy import to prevent hangs.
    • Successfully tested create-dpgce --custom --nat-egress.
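
The SQLite-backed state store from step 7 can be sketched as below; the table schema and function name are illustrative assumptions, not the PR's actual state.db layout.

```python
# Hypothetical sketch of the state.db idea: an SQLite table keyed by resource
# name, updated atomically inside a transaction so a crashed audit run can
# never leave a half-written state file behind.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE state (resource TEXT PRIMARY KEY, status TEXT)")

def record_state(resource: str, status: str) -> None:
    with db:  # the context manager commits atomically, rolls back on error
        db.execute(
            "INSERT INTO state VALUES (?, ?) "
            "ON CONFLICT(resource) DO UPDATE SET status = excluded.status",
            (resource, status),
        )

record_state("cluster", "ABSENT")
record_state("cluster", "RUNNING")  # upsert replaces the earlier row
print(db.execute("SELECT status FROM state WHERE resource = 'cluster'").fetchone()[0])
# → RUNNING
```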

Current State:

The scripts are now much more robust, predictable, and easier to use. The declarative approach in create-dpgce, backed by the SQLite cache, allows for reliable environment provisioning and convergence. Key functionalities like custom image and NAT egress are working.

Key Improvements for Review:

  • Unified Scripts: Fewer scripts to manage.
  • SQLite State: Robust and atomic state caching.
  • Declarative Creation: Clear planning and execution in create-dpgce.
  • Idempotency: Scripts can be re-run safely.
  • Improved Teardown: More reliable network and resource deletion.

This narrative provides context for the significant architectural changes and the iterative debugging and refinement process that led to the current state of the scripts.
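
The plan phase described above can be sketched as a diff between desired and audited state; the data shapes and action names below are illustrative, not the actual create-dpgce implementation.

```python
# Hypothetical sketch of the plan-based model: diff the desired resources
# against the audited state and emit CREATE/RECREATE actions for execution.
def build_plan(desired: dict, actual: dict) -> list:
    plan = []
    for name, spec in desired.items():
        current = actual.get(name)
        if current is None:
            plan.append(("CREATE", name))
        elif current != spec:
            plan.append(("RECREATE", name))  # e.g. image type mismatch
    return plan

actual = {"network": "default", "cluster": "standard-image"}
desired = {"network": "default", "cluster": "secure-boot-image", "router": "nat"}
print(build_plan(desired, actual))
# → [('RECREATE', 'cluster'), ('CREATE', 'router')]
```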

@cjac
Contributor Author

cjac commented Mar 18, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a major refactoring of the Dataproc environment scripts, moving towards a declarative, state-driven model using an SQLite database. It also adds significant new functionality, including GPU testing scripts, enhanced documentation, and improved setup for private clusters with Secure Web Proxy. The changes are extensive and generally well-executed, greatly improving the structure and robustness of the scripts. However, I've identified several critical and high-severity issues, primarily related to new or refactored scripts calling non-existent functions or files, broken CI configurations, and bugs in test scripts. Addressing these issues is crucial for the stability and correctness of the new implementation.

@@ -0,0 +1 @@
#!/bin/bashsource lib/env.shgcloud network-services gateways describe ${SWP_INSTANCE_NAME} --location ${REGION}gcloud network-security gateway-security-policies rules export allow-all-rule \ --gateway-security-policy=${SWP_POLICY_NAME} \ --location ${REGION}#gcloud logging read \# "resource.type=\"networkservices.googleapis.com/Gateway\" \# resource.labels.location=\"${REGION}\" \# resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \# --project="${PROJECT_ID}" \# --limit=100 \# --format=json # Or text, defaultgcloud logging read \ "resource.type=\"networkservices.googleapis.com/Gateway\" \ resource.labels.location=\"${REGION}\" \ resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \ --project="${PROJECT_ID}" \ --limit=500 \ --freshness=1d # Look back over the last day# Delete the old test if it existsgcloud network-management connectivity-tests delete swp-test --quiet# Create a new onegcloud network-management connectivity-tests create swp-test \ --source-instance=projects/${PROJECT_ID}/zones/${ZONE}/instances/${CLUSTER_NAME}-m \ --destination-ip-address=10.43.79.245 \ --destination-port=3128 \ --protocol=TCP \ --source-network=projects/${PROJECT_ID}/global/networks/${NETWORK}# Wait and describesleep 5# After a moment, get the results:gcloud network-management connectivity-tests describe swp-test No newline at end of file

critical

This file appears to be a scratchpad of commands rather than a valid script. It's missing a shebang, and all commands are concatenated on a single line. This will fail to execute correctly. Please format this as a proper bash script with a shebang and each command on a new line.

#!/bin/bash
source lib/env.sh
gcloud network-services gateways describe ${SWP_INSTANCE_NAME} --location ${REGION}
gcloud network-security gateway-security-policies rules export allow-all-rule \
  --gateway-security-policy=${SWP_POLICY_NAME} \
  --location ${REGION}

#gcloud logging read \
#  "resource.type=\"networkservices.googleapis.com/Gateway\" \
#  resource.labels.location=\"${REGION}\" \
#  resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \
#  --project="${PROJECT_ID}" \
#  --limit=100 \
#  --format=json # Or text, default

gcloud logging read \
  "resource.type=\"networkservices.googleapis.com/Gateway\" \
  resource.labels.location=\"${REGION}\" \
  resource.labels.gateway_id=\"${SWP_INSTANCE_NAME}\"" \
  --project="${PROJECT_ID}" \
  --limit=500 \
  --freshness=1d # Look back over the last day

# Delete the old test if it exists
gcloud network-management connectivity-tests delete swp-test --quiet

# Create a new one
gcloud network-management connectivity-tests create swp-test \
  --source-instance=projects/${PROJECT_ID}/zones/${ZONE}/instances/${CLUSTER_NAME}-m \
  --destination-ip-address=10.43.79.245 \
  --destination-port=3128 \
  --protocol=TCP \
  --source-network=projects/${PROJECT_ID}/global/networks/${NETWORK}

# Wait and describe
sleep 5

# After a moment, get the results:
gcloud network-management connectivity-tests describe swp-test


echo "DEBUG: set_proxy: Testing external site access via proxy..."
local test_url="https://www.google.com"
if curl -vL ${curl_retry_args} -o /dev/null "${test_url}"; then

critical

The variable ${curl_retry_args} is used here but it is not defined anywhere in the script. This will cause the curl command to fail. You should define this variable with appropriate retry arguments, for example: local curl_retry_args="--retry-connrefused --retry 10 --retry-max-time 30".
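A minimal sketch of the fix, assuming typical retry flags (the flag values and the wrapper name are illustrative, not taken from the PR):

```shell
# Sketch: define the retry flags before use; values shown are assumptions.
set_proxy_test() {
  local curl_retry_args="--retry-connrefused --retry 10 --retry-max-time 30"
  local test_url="${1:-https://www.google.com}"
  echo "DEBUG: set_proxy: Testing external site access via proxy..."
  # Word-splitting of the flag string is intentional here:
  # shellcheck disable=SC2086
  if curl -sL ${curl_retry_args} -o /dev/null "${test_url}"; then
    echo "proxy test: OK"
  else
    echo "proxy test: FAILED" >&2
    return 1
  fi
}
```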

Comment on lines +77 to +80
- 'PROJECT_ID=cjac-2025-01' # Use CI_PROJECT_ID here
- 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
substitutions:
_CI_PROJECT_ID: "cjac-2025-01"

high

The PROJECT_ID and _CI_PROJECT_ID are hardcoded in this file. This makes the Cloud Build configuration specific to the cjac-2025-01 project and not reusable. You should use Cloud Build substitutions (e.g., $_CI_PROJECT_ID or $PROJECT_ID) to make this configuration generic, as was done in the setup-cicd.sh script.

  - 'PROJECT_ID=${_CI_PROJECT_ID}' # Use CI_PROJECT_ID here
  - 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
substitutions:
  _CI_PROJECT_ID: "${PROJECT_ID}"
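More generally, a Cloud Build config stays project-agnostic when the user substitution carries a default that can be overridden at submit time. A sketch (step contents are illustrative, not the PR's actual steps):

```yaml
# Sketch: _CI_PROJECT_ID gets a placeholder default here and is overridden
# per invocation, so the config is not tied to any one project.
substitutions:
  _CI_PROJECT_ID: "my-default-project"   # placeholder default
steps:
  - name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    env:
      - 'PROJECT_ID=${_CI_PROJECT_ID}'
      - 'CLOUDSDK_CORE_DISABLE_PROMPTS=1'
    args: ['-c', 'echo "Running against ${PROJECT_ID}"']
```

Submitted with, for example, `gcloud builds submit --substitutions=_CI_PROJECT_ID=some-project .`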

- name: 'gcr.io/cloud-builders/gcloud'
id: 'Test Standard Create'
entrypoint: 'bash'
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']

high

The generated cloudbuild.yaml file attempts to call audit-dpgce-create, which has been removed in this pull request as part of the refactoring to a single audit-dpgce script. This will cause the CI build to fail. The audit step should be updated to use the new universal audit script, bin/audit-dpgce. This issue is present for all test steps in the generated file.

Suggested change
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce']

- name: 'gcr.io/cloud-builders/gcloud'
id: 'Test Standard Create'
entrypoint: 'bash'
args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce-create']

high

This Cloud Build step calls audit-dpgce-create, but this script has been removed in this PR. The build will fail. You should update this to use the new universal bin/audit-dpgce script. This applies to all other audit steps in this file as well.

    args: ['-c', 'source lib/env.sh && ./bin/create-dpgce && ./bin/audit-dpgce']

Comment on lines +130 to +131
if echo "${PROJECT_ID_RAW}" | /usr/bin/jq empty > /dev/null 2>&1; then
PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | /usr/bin/jq -r '.projectId // "N/A"')

medium

The path to jq is hardcoded as /usr/bin/jq. This is not portable and may fail on systems where jq is installed in a different location (e.g., /usr/local/bin). It's better to rely on the PATH environment variable and just call jq directly, as it's listed as a prerequisite in the README.

Suggested change
if echo "${PROJECT_ID_RAW}" | /usr/bin/jq empty > /dev/null 2>&1; then
PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | /usr/bin/jq -r '.projectId // "N/A"')
if echo "${PROJECT_ID_RAW}" | jq empty > /dev/null 2>&1; then
PROJECT_ID_VAL=$(echo "${PROJECT_ID_RAW}" | jq -r '.projectId // "N/A"')
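One way to make the PATH reliance explicit is to fail fast when a prerequisite binary is missing; a sketch (the helper name is invented here, not part of the PR):

```shell
# Sketch: resolve prerequisites via PATH instead of hardcoding /usr/bin paths.
require_tool() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "prerequisite '$1' not found in PATH" >&2
    return 1
  }
}

# e.g. near the top of the script:
# require_tool jq || exit 1
```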

env.cpan.json

# Temporary files and logs
/tmp/

medium

The entry /tmp/ will only ignore a tmp directory at the root of the gcloud directory. The scripts, however, seem to use the absolute path /tmp/dataproc-repro/... for logs and temporary files, which is outside the repository and cannot be ignored by .gitignore. If the intention is to ignore a local tmp directory for other purposes, consider changing this to tmp/ to avoid ambiguity with the system's root /tmp directory.
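The distinction, in .gitignore pattern terms:

```
# A leading slash anchors the pattern to the directory containing .gitignore:
/tmp/   # matches only <repo-root>/tmp/ -- never the system /tmp
# Without it, the pattern matches at any depth:
tmp/    # matches any directory named tmp anywhere in the tree
```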

@@ -0,0 +1,188 @@
#!/bin/bash

medium

This new file cluster-custom.sh appears to be unused in the refactored create-dpgce script or any other part of the PR. If it's work-in-progress or dead code, consider removing it to avoid confusion and reduce maintenance overhead.

Comment on lines +5 to +8
function exists_firewall() {
# This is a basic check. A more robust version might check for a list of rules.
_check_exists "gcloud compute firewall-rules describe '${FIREWALL}-in-ssh' --project='${PROJECT_ID}' --format='json(name,selfLink)'"
}

medium

The function exists_firewall seems to be dead code. The main audit script (bin/audit-dpgce) uses the more generic exists_firewall_rule function from lib/swp/firewall.sh to check for firewall rules. To avoid confusion and code duplication, consider removing this unused function.

export -f create_managed_certificate

function _delete_swp_ca_resources() {
function delete_managed_certificate() {

medium

The refactored delete_managed_certificate function appears to have lost the conditional logic based on the --force flag that existed in the previous implementation. The old behavior was to preserve versioned CA components unless --force was specified. Now, it seems all versioned components are deleted unconditionally. If preserving these resources by default is still desired, the --force check should be reintroduced.
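A sketch of the gate the review describes (deletions are stubbed with echo; the real implementation calls gcloud, and the exact resources affected are inferred from the comment):

```shell
# Sketch: reintroduce the --force check so versioned CA components
# are preserved by default.
delete_managed_certificate() {
  local force="${1:-}"
  echo "deleting managed certificate"
  if [ "${force}" = "--force" ]; then
    echo "deleting versioned CA components"
  else
    echo "preserving versioned CA components (pass --force to delete them)"
  fi
}
```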

@cjac cjac force-pushed the proxy-sync-2026-01 branch from 656ea83 to 059ff92 Compare March 18, 2026 07:02
cjac added 6 commits March 19, 2026 02:39
- Replaced separate create/destroy/audit scripts for standard and private clusters with unified scripts in bin/ (e.g., create-dpgce, destroy-dpgce, audit-dpgce).
- Moved common functions to lib/script-utils.sh.
- Updated lib/env.sh with new variables and logic.
- Added new utility scripts like connectivity-test and debug-init-action.
- Added cluster creation logic for custom and private cluster types in lib/dataproc/.
- Updated various library functions in lib/ to support the new unified scripts.
- Added new scripts for CI/CD setup (bin/setup-cicd.sh) and GitHub triggers (bin/create-github-trigger.sh).
- Updated README.md to reflect the new script organization.
- Enhanced .gitignore.
- Added env.json and updated env.json.sample.
- Added cloudbuild.yaml for CI/CD.
- Updated init/gce-proxy-setup.sh.
- Added test scripts in t/.
- Added project findings and work narratives.
Implements a robust audit system for the DPGCE environment:

-   Replaces the state.json file with an SQLite database (state.db) for atomic and concurrent-safe state management.
-   Refactors `bin/audit-dpgce` to perform checks for all resources concurrently, writing results to temporary files.
-   Updates the SQLite database sequentially from temp files to prevent locking issues.
-   Adds `exists_*` functions to numerous lib scripts to support the audit.
-   Ensures all necessary functions are exported from their respective lib files.
-   Introduces `report_audit_status` for clearer [Exists]/[Not Found] reporting.
-   Updates README.md with `sqlite3` prerequisite.
-   Fixes various bugs in the audit script and helper functions.

This provides a much faster, more reliable, and comprehensive audit of the environment.
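The state-cache and reporting helpers described above can be sketched as follows (the schema and helper bodies are inferred from the commit notes, which mention a 'state' table with a 'value' column, not copied from the PR):

```shell
# Sketch, assuming a 'state' table with key/value columns.
STATE_DB="${STATE_DB:-state.db}"

init_state_db() {
  sqlite3 "${STATE_DB}" \
    "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT);"
}

update_state() {  # update_state <resource> <json-or-null>
  sqlite3 "${STATE_DB}" \
    "INSERT INTO state (key, value) VALUES ('$1', '$2')
     ON CONFLICT(key) DO UPDATE SET value = excluded.value;"
}

get_state() {     # get_state <resource>
  sqlite3 "${STATE_DB}" "SELECT value FROM state WHERE key = '$1';"
}

report_audit_status() {  # report_audit_status <resource>
  local value; value="$(get_state "$1")"
  if [ -n "${value}" ] && [ "${value}" != "null" ]; then
    printf '[Exists]    %s\n' "$1"
  else
    printf '[Not Found] %s\n' "$1"
  fi
}
```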
This commit introduces a declarative, plan-based approach to the `create-dpgce` script and significantly improves the robustness and completeness of the `destroy-dpgce` script.

**`create-dpgce` Refactoring:**

*   Replaced the initial audit with loading state from the SQLite cache (`init_state_db`).
*   Added `display_status` function to show the current state of resources.
*   Implemented plan generation by comparing the current state with the intended state based on flags.
*   Added a plan execution block that creates resources in the correct order based on the generated plan.
*   Removed redundant `check_project` and `check_billing` calls.
*   Ensured all state checks use `get_state`.

**`destroy-dpgce` Enhancements:**

*   Integrated `get_state` for all resource existence checks, replacing old `jq` logic.
*   Added `delete_all_network_vms` to ensure all VMs in the network are deleted before network teardown.
*   Included more library scripts to cover a wider range of resources for deletion (PHS, GKE, DBs, SWP, etc.).
*   Refined route deletion to skip default internet and local routes.
*   Added cache updates (`update_state "resource" "null"`) after successful deletions for many resources.

**Library Updates (`lib/`):**

*   Added `refresh_resource_state` or `update_state` calls to CUD functions in:
    *   `dataproc/autoscaling.sh`
    *   `dataproc/cluster.sh`
    *   `dataproc/private-cluster.sh`
    *   `gcp/gcr.sh`
    *   `gcp/gcs.sh`
    *   `gcp/iam.sh`
    *   `gcp/kms.sh`
    *   `network/network.sh`
    *   `network/peering.sh`
    *   `network/router.sh`
    *   `network/routes.sh`
    *   `network/subnet.sh`
*   Added `exists_dataproc_cluster_vms` to `lib/dataproc/cluster.sh`.
*   Fixed `create_subnet` to accept parameters for name, key, and range.
*   Corrected line continuation in `create_vpc_network` in `lib/network/network.sh`.

**Other Changes:**

*   Fixed `recreate-cluster.sh` to correctly call `gcloud dataproc clusters describe` for displaying cluster details.
*   Updated `env.json` to use `us-west4-a`.
*   Cleaned up `lib/gcp/misc.sh`.

These changes make the scripts more reliable, idempotent, and easier to understand.
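The plan-generation step can be sketched like this (resource names, ordering, and the stubbed state are illustrative; in the real script the state comes from the SQLite cache, not an array):

```shell
# get_state is stubbed with an associative array standing in for state.db.
declare -A STATE=( [network]='{"name":"net0"}' [subnet]='null' )
get_state() { printf '%s' "${STATE[$1]:-}"; }

generate_plan() {
  local resource plan=()
  # Dependency order: parents before children.
  for resource in network subnet router nat cluster; do
    case "$(get_state "${resource}")" in
      ""|null) plan+=("${resource}") ;;  # missing -> schedule for creation
    esac
  done
  printf '%s\n' "${plan[@]}"
}
```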
- Ensure all calls to _check_exists pass command and arguments
  as separate array elements, not a single string. This fixes
  the audit not detecting some existing resources.
- Refactor refresh_resource_state to handle array arguments.
- Add exit 1 to init/gce-proxy-setup.sh if proxy connection test fails.
- Improve stderr logging in bin/audit-dpgce background checks.
- Fail script in lib/swp/policy.sh if security rule deletion fails.
- Add clarifying comments in bin/destroy-dpgce.
- Fix relative paths in lib/env.sh and lib/dataproc/private-cluster.sh.
- Run audit at the start of create-dpgce.
- Add NAT pre-delete back to lib/network/router.sh.
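The array-argument convention that first fix enforces can be sketched as follows (the function body is illustrative; only the name `_check_exists` comes from the PR):

```shell
_check_exists() {
  # "$@" keeps every argument a distinct word; passing one big string
  # would be re-split by the shell and break quoted values.
  "$@" >/dev/null 2>&1
}

# Correct usage: command and arguments as separate array elements.
cmd=(grep -q "root" /etc/passwd)
_check_exists "${cmd[@]}" && echo "resource exists"
```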
@cjac cjac force-pushed the proxy-sync-2026-01 branch from 059ff92 to 729999a Compare March 19, 2026 02:40
- Switched to client-side jq filtering for default route checks in
  lib/network/routes.sh due to gcloud filter unreliability.
- Made create-dpgce idempotent for network infrastructure, ensuring
  components like NAT and default routes are created if missing,
  even if the Dataproc cluster already exists.
- create-dpgce no longer errors out if a cluster exists with a
  different configuration; it only creates the cluster if missing.
- Added Cloud NAT checks to the audit-dpgce script and report.
- Corrected GCLOUD_DIR path resolution in lib/env.sh to be
  self-contained and robust.
- Fixed bin/ssh-m to correctly set GCLOUD_DIR.
- Simplified SCRIPT_DIR assignment in all bin/ scripts.
- Added gcloud/.gitignore to exclude env.json.
- Removed env.json from git history using filter-branch.
- Rebased onto upstream/main.
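The client-side jq filter for default routes can be sketched as follows (field names follow the Compute Engine routes JSON schema; the PR's exact expression may differ):

```shell
# Reads route JSON on stdin (e.g. from `gcloud compute routes list --format=json`)
# and prints only the names of default internet routes.
list_default_internet_routes() {
  jq -r '.[]
         | select(.destRange == "0.0.0.0/0")
         | select((.nextHopGateway // "") | endswith("default-internet-gateway"))
         | .name'
}
```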
@cjac cjac force-pushed the proxy-sync-2026-01 branch from 729999a to 8ce83a7 Compare March 19, 2026 02:43
cjac added 2 commits March 19, 2026 04:01
- Reworked `bin/destroy-dpgce` to run an audit first, then generate
  a teardown plan based on discovered resources.
- Added a context-aware status report to `destroy-dpgce` to display
  the state of resources found by the audit, respecting intent flags
  like --nat-egress, --custom, etc.
- Updated `destroy-dpgce` to accept the same intent flags as
  `create-dpgce` to tailor the status report.
- Made plan generation for network VMs and custom routes conditional
  on their detection in the audit.
- Added `get_any_network_vms_state` to `lib/gcp/misc.sh` and
  an `anyNetworkVms` check to `bin/audit-dpgce` to support more
  accurate teardown planning for any VM in the network.
- Ensured `destroy-dpgce` uses a unique TIMESTAMP for each run
  to isolate state.
- Updated `README.md` to reflect these significant changes.
- Re-instate unique TIMESTAMP generation per run for create-dpgce
  and destroy-dpgce to ensure state isolation.
- Corrected SQLite schema usage in script-utils.sh to use the
  'state' table and 'value' column consistently.
- Refactored Service Account management in lib/gcp/iam.sh:
  - Added auditing for SA IAM bindings (serviceAccountBindings)
    and readiness (serviceAccountRolesReady).
  - create_service_account now verifies all required roles are bound.
  - delete_service_account now properly removes role bindings
    before deleting the SA.
  - Fixed gcloud command execution within run_gcloud by using arrays
    to handle multi-line arguments correctly.
- Enhanced run_gcloud to suppress output for expected errors like
  IAM binding not found.
- Optimized create_subnet in lib/network/subnet.sh to avoid
  refetching details after creation.
- Removed various debug messages.
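The expected-error suppression described for `run_gcloud` can be sketched like this (the wrapper name and regex parameter are illustrative, not the PR's actual signature):

```shell
run_tolerating() {  # run_tolerating <expected-error-regex> <command...>
  local expected="$1"; shift
  local err
  if ! err="$("$@" 2>&1)"; then
    if printf '%s' "${err}" | grep -qE "${expected}"; then
      return 0                      # expected failure (e.g. binding not found)
    fi
    printf '%s\n' "${err}" >&2      # unexpected failure: surface it
    return 1
  fi
}
```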
@cjac cjac requested a review from Abinash-Sharma March 20, 2026 07:13
@cjac
Contributor Author

cjac commented Mar 20, 2026

/gemini review
