Friday, October 18, 2024

Glue job in Python that connects to an on-premise Oracle database, creates a fixed-length file, and writes it to an S3 bucket

To write a Glue job in Python that connects to an on-premise Oracle database, creates a fixed-length file, and writes it to an S3 bucket, you would need to:

  1. Set up a connection to Oracle Database using JDBC.
  2. Retrieve the data from the Oracle database.
  3. Format the data into a fixed-length format.
  4. Write the formatted data to an S3 bucket.

Here’s an outline of a Glue job script to achieve this:

Prerequisites:

  • Ensure that AWS Glue has network access to your on-premise Oracle Database (usually via AWS Direct Connect or VPN).
  • Add the Oracle JDBC driver to your Glue job (by uploading it to S3 and referencing it in the job).
  • Set up IAM roles and S3 permissions to write to the bucket.

Python Glue Job Script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
import os

# Initialize Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'oracle_jdbc_url', 'oracle_username', 'oracle_password', 's3_output_path'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Oracle Database connection parameters
jdbc_url = args['oracle_jdbc_url']
oracle_username = args['oracle_username']
oracle_password = args['oracle_password']

# S3 output path
s3_output_path = args['s3_output_path']

# Oracle query (modify this query as per your requirement)
query = "SELECT column1, column2, column3 FROM your_table"

# Fetch data from Oracle using the JDBC driver
# (note: Oracle table aliases do not take the AS keyword)
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", f"({query}) data")
      .option("user", oracle_username)
      .option("password", oracle_password)
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

# Convert DataFrame rows to fixed-length records
def format_row(row):
    column1 = str(row['column1']).ljust(20)  # Adjust the lengths as per requirement
    column2 = str(row['column2']).ljust(30)
    column3 = str(row['column3']).ljust(50)
    return column1 + column2 + column3

fixed_length_rdd = df.rdd.map(format_row)

# Write the fixed-length records to a local file on the driver
output_file_path = "/tmp/output_fixed_length_file.txt"
with open(output_file_path, "w") as f:
    for line in fixed_length_rdd.collect():
        f.write(line + "\n")

# Upload the file to S3
s3 = boto3.client('s3')
bucket_name = s3_output_path.replace("s3://", "").split("/")[0]
s3_key = "/".join(s3_output_path.replace("s3://", "").split("/")[1:])
s3.upload_file(output_file_path, bucket_name, s3_key)

# Cleanup and mark the job as complete
os.remove(output_file_path)
job.commit()

Explanation:

  1. Oracle JDBC connection: The script connects to your Oracle Database using the JDBC driver and retrieves data based on the query.
  2. Fixed-length formatting: The data is converted into fixed-length format by adjusting the length of each column using the ljust() method.
  3. File creation: The formatted records are collected to the driver and written to a local text file. Note that collect() brings every row to the driver, so this pattern suits modest data volumes; for large result sets, write the RDD directly to S3 instead.
  4. S3 upload: The file is uploaded to the specified S3 bucket using Boto3.
  5. Cleanup: Temporary files are removed after upload.
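One subtlety worth flagging in step 2: ljust() pads short values but never truncates long ones, so an oversized value would silently break the record layout. A minimal sketch (the column values and widths here are illustrative, not from the script) that guarantees exact field widths:

```python
def fixed_width(value, width):
    # ljust pads short values; the trailing slice truncates long ones,
    # so every field is exactly `width` characters
    return str(value).ljust(width)[:width]

# Build one 100-character record (20 + 30 + 50) from sample values
record = (fixed_width("ACME", 20)
          + fixed_width("New York", 30)
          + fixed_width(1234.5, 50))
```

Whether to truncate or instead raise an error on over-long values depends on the downstream consumer of the fixed-length file.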

Glue Job Parameters:

You can pass the following arguments when you run the Glue job:

  • oracle_jdbc_url: The JDBC URL for your Oracle Database (e.g., jdbc:oracle:thin:@//your_host:1521/your_service_name for a service name, or jdbc:oracle:thin:@your_host:1521:your_sid for a SID).
  • oracle_username: Oracle database username.
  • oracle_password: Oracle database password.
  • s3_output_path: The S3 path where you want to store the fixed-length file (e.g., s3://your-bucket/path/to/file.txt).
