Tuesday, January 6, 2026

How IdP Groups Are Tied to Databricks Groups (Unity Catalog)

 


🔑 Key Principle (Read This First)

Databricks does NOT “map” IdP groups to Databricks groups manually.
The linkage happens through SCIM provisioning.

SCIM = the binding glue between IdP and Databricks.


1️⃣ High-Level Flow

Identity Provider (Okta / Azure AD)
        |
        |  SCIM Provisioning
        v
Databricks Account Console
        |
        |  Group sync
        v
Unity Catalog Authorization
  • IdP creates & owns the group

  • SCIM syncs it into Databricks

  • Unity Catalog grants privileges to the group

  • Databricks enforces access


2️⃣ Where Each Thing Is Defined

Item | Where It Lives
Groups (finance-readers) | IdP (Okta / Azure AD)
Group membership | IdP
Group sync | SCIM
Group visibility | Databricks Account Console
Data privileges | Unity Catalog (SQL)

3️⃣ Step-by-Step: Tie IdP Groups to Databricks


STEP 1: Create Groups in the IdP

Example: Azure AD / Okta

Create these groups:

  • finance-readers

  • finance-writers

  • finance-engineers

  • finance-data-owners

Add users and service principals only in the IdP.

📌 Databricks should never be the source of truth for identities.


STEP 2: Enable SCIM Provisioning in Databricks

In Databricks Account Console:

  1. Go to User Management

  2. Enable SCIM provisioning

  3. Generate SCIM Token

  4. Copy SCIM endpoint URL

📌 This is one-time setup.
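
Before handing the endpoint and token to the IdP team, the admin can sanity-check them with a direct SCIM call. A minimal sketch using the Python requests library; the URL and token are placeholders for the values copied in steps 3 and 4 above:

import requests

SCIM_URL = "<scim-endpoint-url>"    # placeholder: the endpoint copied from the Account Console
SCIM_TOKEN = "<scim-token>"         # placeholder: the generated SCIM token

# Standard SCIM 2.0 request: list the groups visible to this token.
resp = requests.get(
    f"{SCIM_URL}/Groups",
    headers={
        "Authorization": f"Bearer {SCIM_TOKEN}",
        "Accept": "application/scim+json",
    },
    timeout=30,
)
resp.raise_for_status()

for group in resp.json().get("Resources", []):
    print(group.get("displayName"))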


STEP 3: Configure SCIM in the IdP

Example: Azure AD

  • Add Databricks SCIM app

  • Configure:

    • SCIM endpoint

    • Bearer token

  • Assign:

    • Groups

    • Users

    • Service principals

Example: Okta

  • Enable SCIM provisioning

  • Assign groups to the Databricks app

  • Push groups & memberships

✔ Groups now auto-sync.


STEP 4: Verify Groups in Databricks

If you’re not an account admin, ask an admin to verify in:

Databricks Account Console → User Management → Groups

Or verify yourself using SQL:

SHOW GROUPS;

You should now see:

finance-readers
finance-writers
finance-engineers

These groups are:
✔ SCIM-managed
✔ Read-only in Databricks
✔ Governed by IdP
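
You can also confirm membership resolution from a notebook attached to a UC-enabled cluster or SQL warehouse (where spark is already defined). A minimal sketch; is_account_group_member is the built-in Databricks SQL membership function:

# List account-level groups visible to Unity Catalog.
spark.sql("SHOW GROUPS").show(truncate=False)

# Returns true if the current user belongs to the SCIM-synced group.
spark.sql(
    "SELECT is_account_group_member('finance-readers') AS in_finance_readers"
).show()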


STEP 5: Grant Unity Catalog Privileges to SCIM Groups

Now comes the binding to data.

GRANT USE CATALOG ON CATALOG finance TO `finance-readers`;
GRANT USE SCHEMA ON SCHEMA finance.gl TO `finance-readers`;
GRANT SELECT ON TABLE finance.gl.transactions TO `finance-readers`;

GRANT SELECT, MODIFY ON TABLE finance.gl.transactions TO `finance-writers`;

GRANT CREATE TABLE ON SCHEMA finance.gl TO `finance-engineers`;

🎯 This is where “roles” become real.
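
If you would rather apply these grants from a deployment notebook or job than paste them into the SQL editor, the same statements can be driven from Python. A minimal sketch (spark is the notebook session; the securables and groups are the ones above):

# GRANT statements are idempotent, so this can be re-run safely.
grants = [
    "GRANT USE CATALOG ON CATALOG finance TO `finance-readers`",
    "GRANT USE SCHEMA ON SCHEMA finance.gl TO `finance-readers`",
    "GRANT SELECT ON TABLE finance.gl.transactions TO `finance-readers`",
    "GRANT SELECT, MODIFY ON TABLE finance.gl.transactions TO `finance-writers`",
    "GRANT CREATE TABLE ON SCHEMA finance.gl TO `finance-engineers`",
]

for stmt in grants:
    spark.sql(stmt)
    print("applied:", stmt)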


4️⃣ How Membership Changes Are Enforced (Important)

Change | Where Done | Result
User added to group | IdP | Access granted automatically
User removed | IdP | Access revoked automatically
User terminated | IdP | Immediate loss of access
New user onboarded | IdP | Group membership applies

🚀 No Databricks admin action required.


5️⃣ Service Principals (ETL / Genie)

Same exact model.

In IdP:

  • Create service account / app registration

  • Add to group finance-etl-sp

SCIM:

  • Syncs service principal

Databricks:

GRANT MODIFY ON TABLE finance.gl.transactions TO `finance-etl-sp`;

Jobs and Genie now run securely.


6️⃣ How to Tell If a Group Is SCIM-Managed

In SQL:

DESCRIBE GROUP `finance-readers`;

You’ll see:

  • External ID

  • Read-only membership

📌 If it’s editable → it’s a local group (anti-pattern).


7️⃣ Common Mistakes (Avoid These 🚫)

❌ Manually creating groups in Databricks for prod
❌ Adding users directly in Databricks
❌ Granting privileges to individual users
❌ Using workspace-local groups
❌ Mixing SCIM and local groups


8️⃣ One-Screen Mental Model

IdP (truth) → SCIM → Databricks Groups → Unity Catalog Grants → Enforcement

Authorization Using Unity Catalog Security Model

Author: Harvinder Singh — Resident Solution Architect, Databricks
 


1. Overview

Unity Catalog (UC) is Databricks’ centralized governance and authorization layer for all data and AI assets across the Lakehouse.
It enforces fine-grained, secure access control on:

  • Catalogs, Schemas, Tables, Views

  • Volumes & Files

  • Functions & Models

  • Clusters, SQL Warehouses, External Locations

  • AI/ML assets, Feature Tables, Vector Indexes

  • Databricks Agents (Genie)

This document defines:

  1. Authorization architecture

  2. Role-based access control (RBAC) structure

  3. Identity and resource model

  4. Step-by-step implementation

  5. Best practices for enterprise deployment

  6. Operational processes (audit, lineage, monitoring)


2. Authorization Architecture

Unity Catalog Authorization operates at 4 layers:

2.1 Identity Layer

  • Users (human identities)

  • Service Principals (machine identities)

  • Groups (SCIM/SSO synced)

  • Workspace-local groups (limited usage)

Identity is managed through:
✔ SCIM Provisioning
✔ SSO (Okta, Azure AD, Ping, ADFS)
✔ Databricks Account Console


2.2 Resource Layer

Unity Catalog governs:

Resource Type | Examples
Metastore | Unified governance boundary
Catalog | finance, sales, engineering
Schema | sales.crm, finance.gl
Table / View | Managed or External Delta Tables
Volumes | Unstructured files
Functions | SQL/Python functions
Models | MLflow models
Vector Index | For RAG/AI
External Locations / Storage Credentials | S3/ADLS locations
Clean Rooms | Cross-organization sharing

2.3 Privileges (Authorization Rules)

Privileges define what an identity can perform on a resource (a quick way to inspect effective grants follows the lists below):

High-level privileges

  • USE CATALOG

  • USE SCHEMA

  • SELECT (read rows)

  • MODIFY (update, delete, merge)

  • CREATE TABLE

  • CREATE FUNCTION

  • EXECUTE (for models/functions/AI tasks)

Advanced privileges

  • READ FILES, WRITE FILES (Volumes)

  • BYPASS GOVERNANCE (Admin only)

  • APPLY TAG / MANAGE TAG

  • MANAGE GRANTS

  • OWNERSHIP
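
To check which of these privileges are actually in place on a securable, SHOW GRANTS lists the current grants. A minimal sketch against the finance objects used later in this document (spark is the notebook session):

# Grants made directly on the table.
spark.sql("SHOW GRANTS ON TABLE finance.gl.transactions").show(truncate=False)

# Grants a specific group holds directly on the schema.
spark.sql("SHOW GRANTS `finance-readers` ON SCHEMA finance.gl").show(truncate=False)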


2.4 Enforcement Engine

Unity Catalog enforces authorization:

  • At SQL execution time

  • At API call time

  • At Notebook execution time

  • Inside Genie / Agents

  • Through lineage and audit logs

  • Across all workspaces connected to the same metastore

Because UC is part of the Databricks control plane, enforcement is real-time and cannot be bypassed.


3. RBAC Design Patterns

Below are the best-practice authorization models.


3.1 Layered RBAC Structure

Administrative Roles

Role | Purpose
Account Admin | Controls accounts, workspaces, identity
Metastore Admin | Manages governance boundary
Data Steward | Applies tags, controls lineage
Data Owners | Own schemas/tables

Data Access Roles

Role | Privileges
Data Reader | SELECT on tables/views
Data Writer | SELECT, MODIFY, INSERT, DELETE
Data Engineer | CREATE TABLE, MODIFY
BI Analyst | SELECT + USE SCHEMA
ML Engineer | EXECUTE models + SELECT

Job / Service Roles

Identity | Use Case
Workflow Service Principal | ETL jobs
Dashboard Service Principal | Materialized view refresh
Genie Agent Principal | Agentic workflows

Each service principal receives only the minimum privileges needed.


4. Detailed Implementation Steps (Databricks)

This section walks through the exact steps to implement authorization in UC.


STEP 1: Enable Unity Catalog and Create a Metastore

  1. Log into Databricks Account Console

  2. Create a Metastore

  3. Assign root storage (S3/ADLS secure path)

  4. Assign Metastore to one or more workspaces

# Validate metastore assignment
databricks metastores get --metastore-id <ID>

STEP 2: Configure Identity Sync (Okta / Azure AD)

Enable SCIM provisioning

  • Sync users & groups to Databricks

  • Assign proper roles such as data-analysts, bi-users, etl-jobs

Validate Groups:

databricks groups list

STEP 3: Create Catalogs & Schemas and Assign Ownership

CREATE CATALOG finance;
CREATE SCHEMA finance.gl;
CREATE SCHEMA finance.ap;

-- Assign ownership to the Finance Data Owner group
-- (ownership is transferred with ALTER ... SET OWNER TO in Unity Catalog)
ALTER CATALOG finance SET OWNER TO `finance-data-owners`;

Catalog ownership allows delegated grants.


STEP 4: Define Access Roles

Example groups (from SCIM):

  • finance-readers

  • finance-writers

  • finance-engineers

  • etl-service-principals


STEP 5: Grant Privileges (RBAC Implementation)

Catalog Level

GRANT USE CATALOG ON CATALOG finance TO `finance-readers`;

Schema Level

GRANT USE SCHEMA ON SCHEMA finance.gl TO `finance-readers`;
GRANT CREATE TABLE ON SCHEMA finance.gl TO `finance-writers`;

Table Level

GRANT SELECT ON TABLE finance.gl.transactions TO `finance-readers`;
GRANT MODIFY ON TABLE finance.gl.transactions TO `finance-writers`;

Volume Level

GRANT READ FILES ON VOLUME finance.rawfiles TO `finance-readers`;

Function & Model Level

GRANT EXECUTE ON FUNCTION finance.gl.clean_data TO `etl-service-principals`;

STEP 6: Implement Data Masking / Row-Level Security (Optional)

Example: PII Masking

CREATE OR REPLACE VIEW finance.gl.transaction_masked AS
SELECT
  account_id,
  CASE
    -- built-in group membership check; replace with your own predicate if needed
    WHEN is_account_group_member('finance-data-owners') THEN ssn
    ELSE '***-**-****'
  END AS ssn_masked,
  amount
FROM finance.gl.transactions;

Then grant:

GRANT SELECT ON VIEW finance.gl.transaction_masked TO `finance-readers`;
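
Unity Catalog also supports column masks attached directly to the table, which avoids maintaining a separate masked view. This is an alternative to the view above, shown as a minimal sketch with a mask function name of my own choosing (mask_ssn); verify the column-mask syntax against your Databricks runtime before using it:

# A SQL UDF that reveals the value only to the data-owner group.
spark.sql("""
    CREATE OR REPLACE FUNCTION finance.gl.mask_ssn(ssn STRING)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('finance-data-owners') THEN ssn
        ELSE '***-**-****'
    END
""")

# Attach the mask to the column; readers then query the base table directly.
spark.sql("ALTER TABLE finance.gl.transactions ALTER COLUMN ssn SET MASK finance.gl.mask_ssn")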

STEP 7: Configure External Locations (Secure Access to S3/ADLS)

Create a storage credential (IAM role or SAS token):

CREATE STORAGE CREDENTIAL finance_sc WITH IAM_ROLE = 'arn:aws:iam::123456789012:role/finance-access';

Create external location:

CREATE EXTERNAL LOCATION finance_loc URL 's3://company-data/finance/' WITH CREDENTIAL finance_sc;

Grant access to engineers:

GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION finance_loc TO `finance-engineers`;

STEP 8: Configure Job/Workflow Authorization

Service principal access:

GRANT SELECT, MODIFY ON TABLE finance.gl.transactions TO `etl-service-principals`;

Workflow must run under the service principal identity.
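
One way to pin a workflow to that identity is the run_as setting in the job definition. A minimal sketch against the Jobs REST API using requests; the host, token, notebook path, cluster ID, and service principal application ID are placeholders, and the run_as field should be confirmed against the Jobs API version in your workspace:

import requests

HOST = "https://<workspace-url>"            # placeholder
TOKEN = "<api-token>"                       # placeholder: token of a user allowed to set run_as
SP_APPLICATION_ID = "<sp-application-id>"   # placeholder: the service principal's application ID

job_spec = {
    "name": "finance-gl-etl",
    # The job executes as the service principal, so the Unity Catalog grants
    # made to its group (etl-service-principals) decide what it can touch.
    "run_as": {"service_principal_name": SP_APPLICATION_ID},
    "tasks": [
        {
            "task_key": "load_transactions",
            "notebook_task": {"notebook_path": "/Repos/finance/etl/load_transactions"},  # placeholder
            "existing_cluster_id": "<cluster-id>",  # placeholder
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
    timeout=30,
)
resp.raise_for_status()
print("created job:", resp.json()["job_id"])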


STEP 9: Audit Logging, Lineage, and Monitoring

Databricks automatically logs:

  • Permission changes

  • Data accesses

  • Notebook executions

  • Model inferences

  • Workflow runs

  • Genie agent actions

Enable audit log delivery:

AWS → S3 bucket
Azure → Monitor / EventHub

Query audit logs:

SELECT * FROM system.access.audit WHERE user_name = 'john.doe@example.com';
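
A variation of the query above narrows the time window and shows what was done rather than just by whom; the column names follow the documented system.access.audit schema (event_time, service_name, action_name, user_identity), but confirm them in your workspace:

spark.sql("""
    SELECT event_time, service_name, action_name
    FROM system.access.audit
    WHERE user_identity.email = 'john.doe@example.com'
      AND event_time >= current_timestamp() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
""").show(truncate=False)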

Lineage Tracking

Unity Catalog automatically tracks lineage across:

  • SQL

  • Notebooks

  • Jobs

  • Delta Live Tables

  • ML pipelines

  • Databricks Agents

No extra configuration needed.


5. Operational Governance Model

Daily Operations

  • Assign users to appropriate SCIM groups

  • Keep least-privilege enforcement

  • Review lineage before modifying objects

Monthly

  • Permission review with data owners

  • Audit policy approvals

  • Schema evolution reviews

Quarterly

  • Access certification

  • Tagging verification (e.g., PII, Restricted)

  • AI Agent resource permission review


6. Best Practices

Access Control

✔ Use Groups, never direct user grants
✔ Grant only required privileges (principle of least privilege)
✔ Use views for masking sensitive columns
✔ Use Metastore-level admins sparingly
✔ Delegate schema/table ownership to data owners (not IT)


Storage

✔ Use External Locations with IAM roles
✔ One S3/ADLS path per domain
✔ Disable direct bucket access; enforce UC controls


Automation

✔ Use service principals for jobs, dashboards, and agents
✔ Do not run jobs as human users
✔ Enforce cluster policies that restrict external hosts


AI / Genie Authorization

Genie can only access:

  • Catalogs/Schemas/Tables the agent identity has rights to

  • Notebooks the identity can EXECUTE

  • Volumes the identity can READ FILES / WRITE FILES

No privilege escalation is possible.


7. Example End-to-End Setup Script

-- Create catalog
CREATE CATALOG sales;

-- Schema
CREATE SCHEMA sales.orders;

-- Ownership (transferred with ALTER ... SET OWNER TO in Unity Catalog)
ALTER CATALOG sales SET OWNER TO `sales-data-owner`;

-- Readers/Writers
GRANT USE CATALOG ON CATALOG sales TO `sales-readers`;
GRANT USE SCHEMA ON SCHEMA sales.orders TO `sales-readers`;
GRANT SELECT ON TABLE sales.orders.order_delta TO `sales-readers`;
GRANT MODIFY ON TABLE sales.orders.order_delta TO `sales-writers`;
GRANT CREATE TABLE ON SCHEMA sales.orders TO `sales-engineers`;

-- Service principal (ETL)
GRANT SELECT, MODIFY ON TABLE sales.orders.order_delta TO `etl-service-principal`;

8. Final Architecture Diagram (Text-Based)

+------------------------------+
|        Identity Layer        |
|  Users, Groups, SSO, SCIM    |
+---------------+--------------+
                |
                v
+-------------------------------------------+
|       Unity Catalog Authorization         |
|  Catalog → Schema → Table/View → Column   |
|  External Locations → Volumes → Models    |
+---------------------+---------------------+
                      |
                      v
+-------------------+   +--------------------+   +----------------------+
|  Analytics & SQL  |   |   ETL / ML Jobs    |   |   AI/Genie Agents    |
|    Warehouses     |   | Service Principals |   | Notebooks/Functions  |
+-------------------+   +--------------------+   +----------------------+

Friday, May 30, 2025

Amazon SageMaker Studio

Amazon SageMaker Studio is an integrated development environment (IDE) for machine learning that provides everything data scientists and developers need to build, train, and deploy ML models at scale in a unified web-based visual interface.


🔍 Core Capabilities of SageMaker Studio

Capability | Description
Unified Interface | One web interface for all stages: data prep, experimentation, training, tuning, deployment, and monitoring.
SageMaker Notebooks | Jupyter-based, one-click notebooks with elastic compute; kernel and instance lifecycle management.
SageMaker Data Wrangler | Visual UI to prepare data from various sources with transformations, joins, filters, and analysis.
SageMaker Autopilot | AutoML functionality to automatically build, train, and tune ML models.
SageMaker Experiments | Track, compare, and visualize ML experiments easily.
SageMaker Pipelines | CI/CD orchestration for ML workflows using pre-built or custom steps.
SageMaker Debugger & Profiler | Debug, optimize, and profile training jobs.
Model Registry | Centralized model catalog to register, manage, and version models for deployment.
Real-time & Batch Inference | Support for real-time endpoints or batch transform jobs.
SageMaker JumpStart | Access to pre-built models and solutions for common ML use cases.


👤 Customer Onboarding Use Case: Intelligent Identity Verification

🎯 Objective

Automate the onboarding of customers for a financial institution, including identity verification, fraud detection, and customer classification using ML.

🔧 Steps in the Workflow

  1. Document Upload (via S3 or App API)
    • Customer uploads a government-issued ID (e.g., passport) and proof of address.

  2. Data Ingestion & Preparation
    • S3 receives files → SageMaker Studio (via Data Wrangler) cleans and normalizes data.
    • OCR and image preprocessing are done using Amazon Textract or custom image processing.

  3. Model Training
    • SageMaker Studio Notebooks are used to build three models:
      • Fraud Detection Model (binary classification)
      • Document Authenticity Model (vision model using PyTorch/TensorFlow)
      • Customer Tier Classification (multi-class ML model)

  4. Model Orchestration
    • SageMaker Pipelines orchestrate preprocessing, training, evaluation, and registration.

  5. Model Deployment
    • Real-time inference endpoints are deployed using SageMaker Hosting Services.
    • Models are registered in the Model Registry with an approval workflow.

  6. Inference & Feedback Loop
    • API calls are made from the customer portal to SageMaker Endpoints (see the invocation sketch after this list).
    • Predictions drive automated or manual customer approvals.
    • Results are sent to Amazon EventBridge or SNS for notification or audit logging.
    • Feedback is ingested into the training dataset to improve model accuracy over time.
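
For the inference step in the workflow above, the customer portal (or a Lambda fronting the App API) would call the real-time endpoint through the SageMaker runtime. A minimal boto3 sketch; the endpoint name and feature payload are hypothetical:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical features produced by the OCR / preprocessing steps.
payload = {"doc_quality_score": 0.93, "face_match_score": 0.88, "address_verified": 1}

response = runtime.invoke_endpoint(
    EndpointName="onboarding-fraud-detector",   # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

prediction = json.loads(response["Body"].read())
print("fraud prediction:", prediction)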


🧩 SageMaker Studio Integrations

Component | Purpose
Amazon S3 | Data lake and model/artifact storage
AWS Glue / DataBrew | Data cataloging and ETL
Amazon Redshift | Structured analytics and model input
Amazon Athena | Serverless querying of S3 data
Amazon Textract/Comprehend | NLP/OCR support
Amazon ECR | Container storage for custom algorithms
AWS Lambda | Event-driven triggers and preprocessing
Amazon EventBridge/SNS/SQS | Eventing and pipeline notifications
AWS CloudWatch & CloudTrail | Monitoring, logging, auditing
AWS KMS & IAM | Security, encryption, and fine-grained access control
Lake Formation (optional) | Data lake governance and fine-grained data access
Step Functions | Workflow orchestration beyond ML pipelines


📡 Data Dissemination in SageMaker Studio

Data dissemination refers to how data flows through the ML lifecycle—from ingestion to preprocessing, modeling, inference, and feedback.

SageMaker Studio Dissemination Pipeline:

  1. Ingest data via S3, Redshift, Glue, or JDBC sources.
  2. Transform using Data Wrangler or custom notebook-based processing.
  3. Train/Validate using custom models or Autopilot.
  4. Store outputs in S3, Model Registry, or downstream DBs.
  5. Distribute predictions through REST APIs (real-time), batch outputs (to S3), or events (SNS/SQS).
  6. Feedback loop enabled via pipelines and ingestion of labeled results.


🆚 SageMaker Studio vs AWS Lake Formation

Feature | SageMaker Studio | AWS Lake Formation
Primary Purpose | ML development & ops | Secure data lake governance
Target Users | Data scientists, ML engineers | Data engineers, analysts, compliance teams
UI Capabilities | Jupyter-based, ML-centric IDE | Lake-centric access management
Data Access Control | IAM-based permissions | Fine-grained, column/row-level security
Workflow Capabilities | ML pipelines, experiments | Data ingestion, transformation, sharing
Integration | ML & AI tools (e.g., Textract, Comprehend) | Analytics tools (e.g., Athena, Redshift)
Security Focus | Notebook and model access, endpoint policies | Encryption, data lake permissions, audit
Data Dissemination | Orchestrated within ML pipelines | Governed through data lake policies
Ideal Use Case | Building, training, deploying models | Creating secure, centralized data lakes

Summary:

  1. Use SageMaker Studio when your goal is ML model development and operationalization.
  2. Use Lake Formation when your focus is centralized data governance, cross-account sharing, and secure access control for large datasets.


🚀 Conclusion

AWS SageMaker Studio empowers ML teams to work faster and more collaboratively by bringing together every piece of the ML lifecycle under one roof. From rapid prototyping to secure, scalable deployment, SageMaker Studio accelerates innovation. When combined with services like Lake Formation and Glue, it enables a secure, end-to-end AI/ML platform that can power modern, intelligent applications such as automated customer onboarding.

If your enterprise aims to bring AI into production with auditability, repeatability, and governance, SageMaker Studio is a foundational element in your AWS-based data science strategy.

Wednesday, May 7, 2025

Deploying Apache Tomcat and Running a WAR File on AWS ECS


Author: Harvinder Singh Saluja
Tags: #AWS #ECS #Tomcat #DevOps #Java #WARDeployment


Modern applications demand scalable and resilient infrastructure. Apache Tomcat, a popular Java servlet container, can be containerized and deployed on AWS ECS (Elastic Container Service) for high availability and manageability. In this blog, we walk through the end-to-end process of containerizing Tomcat with a custom WAR file and deploying it on AWS ECS using Fargate.


Objective

To deploy a Java .war file under Tomcat on AWS ECS Fargate and access the web application through an Application Load Balancer (ALB).


Prerequisites

  • AWS Account

  • Docker installed locally

  • AWS CLI configured

  • An existing .war file (e.g., myapp.war)

  • Basic understanding of ECS, Docker, and networking on AWS


Step 1: Create Dockerfile for Tomcat + WAR

Create a Dockerfile to extend the official Tomcat image and copy the WAR file into the webapps directory.

# Use official Tomcat base image
FROM tomcat:9.0

# Remove default ROOT webapp
RUN rm -rf /usr/local/tomcat/webapps/ROOT

# Copy custom WAR file
COPY myapp.war /usr/local/tomcat/webapps/ROOT.war

# Expose port
EXPOSE 8080

# Start Tomcat
CMD ["catalina.sh", "run"]

Place this Dockerfile alongside your myapp.war.


Step 2: Build and Push Docker Image to Amazon ECR

  1. Create ECR Repository

aws ecr create-repository --repository-name tomcat-myapp
  2. Authenticate Docker with ECR

aws ecr get-login-password | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com
  3. Build and Push Docker Image

docker build -t tomcat-myapp .
docker tag tomcat-myapp:latest <aws_account_id>.dkr.ecr.<region>.amazonaws.com/tomcat-myapp:latest
docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/tomcat-myapp:latest

Step 3: Setup ECS Cluster and Fargate Service

  1. Create ECS Cluster

aws ecs create-cluster --cluster-name tomcat-cluster
  2. Create Task Definition JSON

Example: task-def.json

{
  "family": "tomcat-task",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "tomcat-container",
      "image": "<aws_account_id>.dkr.ecr.<region>.amazonaws.com/tomcat-myapp:latest",
      "portMappings": [
        {
          "containerPort": 8080,
          "hostPort": 8080,
          "protocol": "tcp"
        }
      ],
      "essential": true
    }
  ],
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::<account_id>:role/ecsTaskExecutionRole"
}
  3. Register Task Definition

aws ecs register-task-definition --cli-input-json file://task-def.json
  4. Create Security Group & ALB

    • Create a security group allowing HTTP (port 80) and custom port 8080.

    • Create an Application Load Balancer with a target group pointing to port 8080.

  5. Run ECS Fargate Service

aws ecs create-service \
  --cluster tomcat-cluster \
  --service-name tomcat-service \
  --task-definition tomcat-task \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration '{
      "awsvpcConfiguration": {
          "subnets": ["subnet-xxxxxxx"],
          "securityGroups": ["sg-xxxxxxx"],
          "assignPublicIp": "ENABLED"
      }
  }' \
  --load-balancers '[
      {
          "targetGroupArn": "arn:aws:elasticloadbalancing:<region>:<account_id>:targetgroup/<target-group-name>",
          "containerName": "tomcat-container",
          "containerPort": 8080
      }
  ]'

Step 4: Access the Deployed App

Once the ECS service stabilizes, navigate to the DNS name of the ALB (e.g., http://<alb-dns-name>) to access your Java application running on Tomcat.
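
If you scripted the earlier steps, the ALB DNS name can be looked up with boto3 rather than the console. A minimal sketch, assuming a load balancer name of your choosing (shown as a placeholder) and AWS credentials/region already configured:

import urllib.request

import boto3

elbv2 = boto3.client("elbv2")

# Placeholder: the name given to the Application Load Balancer when it was created.
resp = elbv2.describe_load_balancers(Names=["tomcat-alb"])
dns_name = resp["LoadBalancers"][0]["DNSName"]
url = f"http://{dns_name}/"
print("App URL:", url)

# Simple smoke test: expect HTTP 200 once the target group reports healthy.
with urllib.request.urlopen(url, timeout=10) as r:
    print("HTTP status:", r.status)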


Troubleshooting Tips

  • WAR not deploying? Make sure it's named ROOT.war if you want it accessible directly at /.

  • Service unhealthy? Confirm security group rules allow traffic on port 8080.

  • Task failing? Check ECS task logs in CloudWatch.


The following CloudFormation Template (CFT) deploys Apache Tomcat on ECS Fargate using private subnets. In this version:

  • Tomcat runs in private subnets

  • Application Load Balancer (ALB) resides in public subnets

  • NAT Gateway is used to allow ECS tasks to access the internet (e.g., for downloading updates)

  • WAR file is pre-packaged into the Docker image

  • Load Balancer forwards traffic to ECS service running in private subnets


tomcat-on-ecs-private.yaml

AWSTemplateFormatVersion: '2010-09-09'
Description: Deploy Tomcat in ECS Fargate with WAR file under private subnets and public-facing ALB

Parameters:
  VpcCidr:
    Type: String
    Default: 10.0.0.0/16
  PublicSubnet1Cidr:
    Type: String
    Default: 10.0.1.0/24
  PublicSubnet2Cidr:
    Type: String
    Default: 10.0.2.0/24
  PrivateSubnet1Cidr:
    Type: String
    Default: 10.0.3.0/24
  PrivateSubnet2Cidr:
    Type: String
    Default: 10.0.4.0/24
  ImageUrl:
    Type: String
    Description: ECR image URL for the Tomcat + WAR image

Resources:

  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: !Ref VpcCidr
      EnableDnsSupport: true
      EnableDnsHostnames: true

  # Subnets
  PublicSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref PublicSubnet1Cidr
      AvailabilityZone: !Select [ 0, !GetAZs '' ]
      MapPublicIpOnLaunch: true

  PublicSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref PublicSubnet2Cidr
      AvailabilityZone: !Select [ 1, !GetAZs '' ]
      MapPublicIpOnLaunch: true

  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref PrivateSubnet1Cidr
      AvailabilityZone: !Select [ 0, !GetAZs '' ]

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref PrivateSubnet2Cidr
      AvailabilityZone: !Select [ 1, !GetAZs '' ]

  InternetGateway:
    Type: AWS::EC2::InternetGateway

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC

  PublicRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicRouteAssoc1:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet1
      RouteTableId: !Ref PublicRouteTable

  PublicRouteAssoc2:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet2
      RouteTableId: !Ref PublicRouteTable

  # NAT Gateway setup for private subnets
  EIP:
    Type: AWS::EC2::EIP

  NATGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt EIP.AllocationId
      SubnetId: !Ref PublicSubnet1

  PrivateRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC

  PrivateRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NATGateway

  PrivateRouteAssoc1:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PrivateSubnet1
      RouteTableId: !Ref PrivateRouteTable

  PrivateRouteAssoc2:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PrivateSubnet2
      RouteTableId: !Ref PrivateRouteTable

  # Security Group
  ECSSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow inbound traffic from ALB
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          SourceSecurityGroupId: !Ref ALBSecurityGroup

  ALBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP from internet
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  # ALB
  LoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Subnets: [!Ref PublicSubnet1, !Ref PublicSubnet2]
      SecurityGroups: [!Ref ALBSecurityGroup]
      Scheme: internet-facing
      Type: application

  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8080
      Protocol: HTTP
      VpcId: !Ref VPC
      TargetType: ip
      HealthCheckPath: /
      HealthCheckPort: 8080

  Listener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref TargetGroup
      LoadBalancerArn: !Ref LoadBalancer
      Port: 80
      Protocol: HTTP

  ECSCluster:
    Type: AWS::ECS::Cluster

  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: tomcat-task
      Cpu: 512
      Memory: 1024
      NetworkMode: awsvpc
      RequiresCompatibilities: [FARGATE]
      ExecutionRoleArn: !GetAtt ECSTaskExecutionRole.Arn
      ContainerDefinitions:
        - Name: tomcat-container
          Image: !Ref ImageUrl
          PortMappings:
            - ContainerPort: 8080
          Essential: true

  ECSService:
    Type: AWS::ECS::Service
    DependsOn: Listener
    Properties:
      Cluster: !Ref ECSCluster
      DesiredCount: 1
      LaunchType: FARGATE
      TaskDefinition: !Ref TaskDefinition
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          Subnets: [!Ref PrivateSubnet1, !Ref PrivateSubnet2]
          SecurityGroups: [!Ref ECSSecurityGroup]
      LoadBalancers:
        - TargetGroupArn: !Ref TargetGroup
          ContainerName: tomcat-container
          ContainerPort: 8080

Outputs:
  ALBDNS:
    Description: DNS of the Application Load Balancer
    Value: !GetAtt LoadBalancer.DNSName

Deploy Instructions

  1. Save this file as tomcat-on-ecs-private.yaml

  2. Deploy using AWS CLI:

aws cloudformation deploy \
  --template-file tomcat-on-ecs-private.yaml \
  --stack-name tomcat-private-ecs \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides ImageUrl=<your-ecr-image-url>
  3. Once stack creation is complete, access the application via the ALB DNS output.




Conclusion

You’ve now deployed a containerized Tomcat server running a WAR application to AWS ECS using Fargate. This setup abstracts away server management, allowing you to focus on your application logic while AWS handles the infrastructure.


 

Friday, May 2, 2025

Case study and tutorial for Amazon SageMaker Studio (Unified Experience)

This case study and tutorial for Amazon SageMaker Studio (Unified Experience) is designed to help enterprise teams, data scientists, and ML engineers understand its capabilities, features, and implementation through a real-world example.


🧠 Case Study: Predicting Loan Defaults Using Amazon SageMaker Studio (Unified Experience)

🏢 Client Profile

Company: Confidential
Industry: Financial Services
Objective: To build an end-to-end machine learning pipeline to predict loan default risks using Amazon SageMaker Studio Unified Experience.


🎯 Business Challenge

The company needed:

  • A collaborative, scalable, and secure ML environment

  • Model versioning and experimentation tracking

  • Integration with RDS, S3, and CI/CD workflows

  • Compliance with data governance and role-based access control (RBAC)


✅ Why Amazon SageMaker Studio (Unified Experience)?

  • Unified interface for data wrangling, experimentation, model building, deployment, and monitoring

  • Built-in JupyterLab & SageMaker JumpStart

  • MLOps integration with SageMaker Pipelines, Model Registry

  • Custom image support for enterprise tools like scikit-learn, PyTorch, TensorFlow

  • IAM-based access controls via SageMaker Domain


🛠️ Architecture Overview

              +-------------------------+
              |      Amazon S3          | <-- Raw Loan Data
              +-------------------------+
                         |
                         v
               +--------------------+
               | Amazon SageMaker   |
               |  Studio (Unified)  |
               +--------------------+
                  |     |      |
   +--------------+     |      +---------------------+
   |                    |                            |
Data Wrangler     SageMaker Pipelines        SageMaker Experiments
(Data Prep)       (ETL + Train + Deploy)      (Track Models & Metrics)
   |                    |                            |
   +--------------------+----------------------------+
                         |
                         v
               +---------------------------+
               |  SageMaker Model Registry |
               +---------------------------+
                         |
                         v
               +---------------------+
               | SageMaker Endpoints|
               +---------------------+
                         |
                         v
                +------------------+
                | Client App (UI)  |
                +------------------+

🧪 Step-by-Step Tutorial: ML Pipeline with SageMaker Studio

🔹 1. Set Up SageMaker Studio

  1. Go to the AWS Console → SageMaker → “SageMaker Domain” → Create Domain

  2. Use IAM authentication, enable default SageMaker Studio settings

  3. Create a User Profile with execution roles attached (e.g., AmazonSageMakerFullAccess, AmazonS3FullAccess, AmazonRDSReadOnlyAccess)


🔹 2. Launch SageMaker Studio

  1. Select the created user → “Launch Studio”

  2. Choose Kernel → Python 3 (Data Science)

  3. Start a new Jupyter notebook


🔹 3. Data Ingestion & Exploration

import boto3
import pandas as pd

# Load from S3
s3_bucket = 's3://trustfund-data/loan-defaults.csv'
df = pd.read_csv(s3_bucket)

# Quick stats
df.describe()
df['default'].value_counts()

🔹 4. Data Preparation with SageMaker Data Wrangler

  1. Open Data Wrangler from Studio UI

  2. Import S3 dataset → Profile the data

  3. Add transforms: handle nulls, encode categorical, normalize

  4. Export flow to SageMaker Pipeline (generates .flow and .pipeline.py)


🔹 5. Build Training Script (train.py)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd

df = pd.read_csv('loan-defaults.csv')
X = df.drop('default', axis=1)
y = df['default']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

joblib.dump(model, 'model.joblib')

🔹 6. Create and Run a SageMaker Pipeline

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.model_step import ModelStep  # ModelStep lives in its own module in SDK v2
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.sklearn.estimator import SKLearn

# Setup processor
sklearn_processor = SKLearnProcessor(
    framework_version='0.23-1',
    role='SageMakerRole',
    instance_type='ml.m5.xlarge',
    instance_count=1
)

# Define pipeline steps
step_process = ProcessingStep(...)
step_train = TrainingStep(...)
step_register = ModelStep(...)

pipeline = Pipeline(
    name="LoanDefaultPipeline",
    steps=[step_process, step_train, step_register]
)
pipeline.upsert(role_arn="SageMakerRole")
pipeline.start()

🔹 7. Deploy Model to Endpoint

from sagemaker.model import Model

model = Model(
    model_data='s3://.../model.tar.gz',
    role='SageMakerRole',
    entry_point='inference.py'
)

predictor = model.deploy(instance_type='ml.m5.large', initial_instance_count=1)

🔹 8. Monitor and Retrain

Use:

  • SageMaker Model Monitor for drift detection (a minimal baseline sketch follows this list)

  • SageMaker Pipelines to automate retraining on new data
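
A minimal baseline sketch for the Model Monitor item above, using the SageMaker Python SDK; the execution role and S3 paths are placeholders, and data capture must already be enabled on the endpoint:

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="SageMakerRole",          # placeholder execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Build baseline statistics and constraints from the training data;
# scheduled monitoring jobs compare captured endpoint traffic against them.
monitor.suggest_baseline(
    baseline_dataset="s3://<bucket>/loan-defaults/train.csv",         # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<bucket>/loan-defaults/monitoring-baseline",  # placeholder
)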


📊 Results

Metric | Value
AUC | 0.91
Accuracy | 88.4%
Training Time | ~3 minutes
Retrain Schedule | Weekly

🛡️ Security & Governance

  • IAM roles enforced per user profile

  • Audit trail via CloudTrail + SageMaker lineage tracking

  • Data encryption at rest and in transit (KMS)


🔚 Summary

Amazon SageMaker Studio Unified Experience empowers enterprises to:

  • Consolidate ML workflows in one secure UI

  • Integrate data prep, experimentation, model registry, and CI/CD

  • Boost productivity with reusable components

