Creating a Data Engineering cluster - User Guide

Building X Machine Learning User Guide

Product Family: Building X
Content Language: English
Content Type: Technical Documentation > User Guide
Document No: A6V13951403_en--_a
Download ID: A6V13951403

Creating a Data Engineering cluster

This section describes how to create Data Engineering clusters.

In the left navigation pane, select Data Science & Engineering > Compute.

In the All-purpose compute (1) tab, select Create with Siemens_AllPurpose_ClusterPolicy (2) from the drop-down list.

A new cluster with a default cluster name appears.

Change the name (1) of the cluster according to your requirement.

Select Multi node (2).
Note: Using Single node is not recommended.

For Access mode (3), Shared is selected by default and is not editable.

In the Performance section, select the following options:

Select a Databricks runtime version (1). Choose between Standard and ML as required. The latest Databricks runtime version is selected by default.

For Worker type (2), select the required value from the drop-down list.

For Min workers (3) and Max workers (4), enter the required values. The minimum value for Min workers is 1 and the maximum value for Max workers is 20.

For the Driver type (5), either choose Same as worker or pick from the available types.

Select or deselect Enable autoscaling local storage (6) as required. The options Enable autoscaling and Terminate after 30 minutes of inactivity are preselected and cannot be modified.

In the Instance profile section, the meta-iam-role extension is preselected. It allows connectivity to the data lake.

In the Tags section, add tags as required by entering values in the Key and Value fields. See the Default tags table below for the default tags.

In the Advanced options section, the Enable credential passthrough for user-level data access option is preselected and cannot be edited. Keep the default configuration for all other settings under Advanced options.

Select Create Cluster to create the cluster.
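
The settings chosen in the steps above can also be written down as a cluster specification. The following Python sketch is illustrative only and is not part of the Building X user interface. The field names follow the public Databricks Clusters REST API as we understand it. The helper name build_cluster_spec, the runtime string and the node type in the usage example are hypothetical placeholders. The limits of 1 and 20 workers and the 30-minute auto-termination are taken from this guide.

```python
import json

MIN_WORKERS_LIMIT = 1    # lowest allowed value for Min workers (from this guide)
MAX_WORKERS_LIMIT = 20   # highest allowed value for Max workers (from this guide)
AUTO_TERMINATION_MIN = 30  # preselected in the UI and not editable


def build_cluster_spec(name, runtime, worker_type, min_workers, max_workers,
                       driver_type=None, autoscale_local_storage=True):
    """Return a dict describing a multi-node cluster (hypothetical helper)."""
    if not name:
        raise ValueError("cluster name must not be empty")
    if min_workers < MIN_WORKERS_LIMIT:
        raise ValueError(f"Min workers must be at least {MIN_WORKERS_LIMIT}")
    if max_workers > MAX_WORKERS_LIMIT:
        raise ValueError(f"Max workers must be at most {MAX_WORKERS_LIMIT}")
    if min_workers > max_workers:
        raise ValueError("Min workers must not exceed Max workers")
    return {
        "cluster_name": name,
        "spark_version": runtime,
        "node_type_id": worker_type,
        # "Same as worker" is modelled by falling back to the worker type.
        "driver_node_type_id": driver_type or worker_type,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": AUTO_TERMINATION_MIN,
        "enable_elastic_disk": autoscale_local_storage,
    }


if __name__ == "__main__":
    # Placeholder values; use the runtime and worker type offered in your workspace.
    spec = build_cluster_spec("my-de-cluster", "13.3.x-scala2.12", "i3.xlarge", 1, 4)
    print(json.dumps(spec, indent=2))
```

The validation checks mirror the limits that the cluster policy enforces in the user interface, so an invalid combination is rejected before any cluster is requested.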

About using cluster policies

Using a cluster policy has the following advantages:

It avoids unnecessary costs: an inactive cluster shuts down automatically after 30 minutes.
It prevents misconfigurations that could lead to high costs, such as selecting a large number of workers.
It makes costs trackable through the tags applied to each cluster.
It enforces the meta instance profile configuration, which is required for connectivity to the data lake.
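
To illustrate how a policy can constrain a cluster configuration, the following Python sketch checks a specification against a small set of rules. The actual content of Siemens_AllPurpose_ClusterPolicy is not shown in this guide, so the rule set below is an assumption that only encodes the limits stated here (30-minute auto-termination, 1 to 20 workers). The rule layout loosely resembles the Databricks policy definition format with fixed and range rules. The names POLICY, lookup and violations are hypothetical.

```python
# Assumed rule set; the real policy definition is not published in this guide.
POLICY = {
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "autoscale.min_workers": {"type": "range", "minValue": 1},
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
}


def lookup(spec, dotted_path):
    """Resolve a dotted path such as 'autoscale.max_workers' in a nested dict."""
    node = spec
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def violations(spec, policy=POLICY):
    """Return a list of human-readable policy violations (empty if compliant)."""
    problems = []
    for path, rule in policy.items():
        value = lookup(spec, path)
        if rule["type"] == "fixed":
            if value != rule["value"]:
                problems.append(f"{path} must be {rule['value']!r}, got {value!r}")
        elif rule["type"] == "range":
            if value is None:
                problems.append(f"{path} is missing")
            elif "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{path} must be >= {rule['minValue']}, got {value}")
            elif "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{path} must be <= {rule['maxValue']}, got {value}")
    return problems


if __name__ == "__main__":
    candidate = {
        "autotermination_minutes": 120,
        "autoscale": {"min_workers": 1, "max_workers": 50},
    }
    for problem in violations(candidate):
        print(problem)
```

In the user interface, the same effect is achieved by preselecting values and making them non-editable, rather than by reporting violations afterwards.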

Default tags

Key	Value
env	Prodint
teams	Dataengineering
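
As a small illustration of how the default tags combine with tags added in the Tags section, the following Python sketch merges both sets. The guide does not state what happens when a user-defined key clashes with a default key. The sketch assumes that the default value wins. The function name merge_tags and the example key project are hypothetical.

```python
# Default tags as listed in the table above.
DEFAULT_TAGS = {"env": "Prodint", "teams": "Dataengineering"}


def merge_tags(user_tags):
    """Combine user-defined tags with the default tags.

    Assumption: on a key clash the default tag overrides the user value.
    """
    merged = dict(user_tags)      # copy so the caller's dict is not modified
    merged.update(DEFAULT_TAGS)   # defaults are applied last and therefore win
    return merged


if __name__ == "__main__":
    print(merge_tags({"project": "forecasting"}))
```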