Friday, October 18, 2024

Glue job in Python that connects to an on-premises Oracle database, creates a fixed-length file, and writes it to an S3 bucket

To write a Glue job in Python that connects to an on-premises Oracle database, creates a fixed-length file, and writes it to an S3 bucket, you need to:

  1. Set up a connection to Oracle Database using JDBC.
  2. Retrieve the data from the Oracle database.
  3. Format the data into a fixed-length format.
  4. Write the formatted data to an S3 bucket.

Here’s an outline of a Glue job script to achieve this:

Prerequisites:

  • Ensure that AWS Glue has network access to your on-premises Oracle database (usually via AWS Direct Connect or a site-to-site VPN).
  • Add the Oracle JDBC driver to your Glue job by uploading the jar to S3 and referencing it in the job (see the create_job sketch after this list).
  • Set up IAM roles and S3 permissions so the job can write to the target bucket.
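If you create the job programmatically, the driver jar, role, and connection can be wired in like this. A minimal boto3 sketch, assuming hypothetical names for the role, script location, jar path, and Glue connection (replace them with your own resources):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="oracle-to-s3-fixed-length",
    Role="GlueOracleToS3Role",  # IAM role with S3 read/write access (hypothetical name)
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/oracle_to_s3.py",
        "PythonVersion": "3",
    },
    # Glue connection that routes into the on-premises network (hypothetical name)
    Connections={"Connections": ["oracle-onprem-connection"]},
    DefaultArguments={
        # Puts the Oracle JDBC driver on the job's classpath
        "--extra-jars": "s3://your-bucket/jars/ojdbc8.jar",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)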

Python Glue Job Script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
import os

# Initialize Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'oracle_jdbc_url', 'oracle_username',
                                     'oracle_password', 's3_output_path'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Oracle Database connection parameters
jdbc_url = args['oracle_jdbc_url']
oracle_username = args['oracle_username']
oracle_password = args['oracle_password']

# S3 output path
s3_output_path = args['s3_output_path']

# Oracle query (modify this query as per your requirement)
query = "SELECT column1, column2, column3 FROM your_table"

# Fetching data from Oracle DB using JDBC.
# Note: Oracle rejects the AS keyword in table aliases,
# so the subquery alias is written without it.
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", f"({query}) data")
      .option("user", oracle_username)
      .option("password", oracle_password)
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

# Convert DataFrame rows to fixed-length records
def format_row(row):
    column1 = str(row['column1']).ljust(20)  # adjust the width as per requirement
    column2 = str(row['column2']).ljust(30)  # adjust the width as per requirement
    column3 = str(row['column3']).ljust(50)  # adjust the width as per requirement
    return column1 + column2 + column3

fixed_length_rdd = df.rdd.map(format_row)

# Write the fixed-length records to a local temporary file.
# collect() pulls every row to the driver, so this suits modest extracts.
output_file_path = "/tmp/output_fixed_length_file.txt"
with open(output_file_path, "w") as f:
    for line in fixed_length_rdd.collect():
        f.write(line + "\n")

# Uploading the file to S3
s3 = boto3.client('s3')
bucket_name = s3_output_path.replace("s3://", "").split("/")[0]
s3_key = "/".join(s3_output_path.replace("s3://", "").split("/")[1:])
s3.upload_file(output_file_path, bucket_name, s3_key)

# Cleanup
os.remove(output_file_path)

# Mark the job as complete
job.commit()
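One scaling note: the script collects all rows onto the Glue driver before writing, which keeps the output as a single file but will not scale to very large tables. For big extracts, writing the formatted RDD directly to S3 with Spark (for example, fixed_length_rdd.coalesce(1).saveAsTextFile(s3_output_path)) avoids the driver bottleneck, at the cost of Spark's part-file naming.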

Explanation:

  1. Oracle JDBC connection: The script connects to your Oracle database through the JDBC driver and retrieves data based on the query.
  2. Fixed-length formatting: Each row is converted into a fixed-length record by padding every column to a set width with the ljust() method (a short standalone example follows this list).
  3. File creation: The formatted data is written to a text file on the Glue driver's local disk.
  4. S3 upload: The file is uploaded to the specified S3 bucket using Boto3.
  5. Cleanup: Temporary files are removed after upload.
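To make the fixed-length formatting concrete, here is a tiny standalone illustration using the same widths (20/30/50); the sample values are made up:

# Standalone illustration of the fixed-length padding used above
row = {"column1": "1001", "column2": "Jane Smith", "column3": "jane@example.com"}

record = (str(row["column1"]).ljust(20)
          + str(row["column2"]).ljust(30)
          + str(row["column3"]).ljust(50))

print(len(record))   # always 100, regardless of the values' lengths
print(repr(record))  # padded record, trailing spaces preserved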

Glue Job Parameters:

You can pass the following arguments when you run the Glue job (a start_job_run sketch follows the list):

  • oracle_jdbc_url: The JDBC URL for your Oracle database (e.g., jdbc:oracle:thin:@your_host:1521:your_sid, or jdbc:oracle:thin:@//your_host:1521/your_service_name when connecting by service name).
  • oracle_username: Oracle database username.
  • oracle_password: Oracle database password.
  • s3_output_path: The S3 path where you want to store the fixed-length file (e.g., s3://your-bucket/path/to/file.txt).
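To launch a run with those arguments from code, a minimal boto3 sketch (the job name is a placeholder; Glue expects each argument key prefixed with --):

import boto3

glue = boto3.client("glue")

# Job name and argument values below are placeholders
response = glue.start_job_run(
    JobName="oracle-to-s3-fixed-length",
    Arguments={
        "--oracle_jdbc_url": "jdbc:oracle:thin:@//your_host:1521/your_service_name",
        "--oracle_username": "etl_user",
        "--oracle_password": "********",  # prefer AWS Secrets Manager over plain-text passwords
        "--s3_output_path": "s3://your-bucket/path/to/file.txt",
    },
)
print(response["JobRunId"])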

Use SSH Keys to Clone a Git Repository over SSH

  1. Generate a New SSH Key Pair

     ssh-keygen -t rsa -b 4096 -C "HSingh@MindTelligent.com"

     -t rsa specifies the type of key (...