Friday, October 18, 2024

Glue job in Python that connects to an on-premises Oracle database, creates a fixed-length file, and writes it to an S3 bucket

To write a Glue job in Python that connects to an on-premises Oracle database, creates a fixed-length file, and writes it to an S3 bucket, you need to:

  1. Set up a connection to Oracle Database using JDBC.
  2. Retrieve the data from the Oracle database.
  3. Format the data into a fixed-length format.
  4. Write the formatted data to an S3 bucket.

Here’s an outline of a Glue job script to achieve this:

Prerequisites:

  • Ensure that AWS Glue has network access to your on-premises Oracle database (usually via AWS Direct Connect or a site-to-site VPN).
  • Add the Oracle JDBC driver to your Glue job by uploading the jar to S3 and referencing it in the job (see the create_job sketch after this list).
  • Set up IAM roles and S3 permissions so the job can write to the target bucket.
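If you create the job programmatically, the driver jar, role, and connection can be wired in like this. A minimal boto3 sketch, assuming hypothetical names for the role, script location, jar path, and Glue connection (replace them with your own resources):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="oracle-to-s3-fixed-length",
    Role="GlueOracleToS3Role",  # IAM role with S3 read/write access (hypothetical name)
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/oracle_to_s3.py",
        "PythonVersion": "3",
    },
    # Glue connection that routes into the on-premises network (hypothetical name)
    Connections={"Connections": ["oracle-onprem-connection"]},
    DefaultArguments={
        # Puts the Oracle JDBC driver on the job's classpath
        "--extra-jars": "s3://your-bucket/jars/ojdbc8.jar",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)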

Python Glue Job Script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
import os

# Initialize Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'oracle_jdbc_url', 'oracle_username',
                                     'oracle_password', 's3_output_path'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Oracle Database connection parameters
jdbc_url = args['oracle_jdbc_url']
oracle_username = args['oracle_username']
oracle_password = args['oracle_password']

# S3 output path
s3_output_path = args['s3_output_path']

# Oracle query (modify this query as per your requirement)
query = "SELECT column1, column2, column3 FROM your_table"

# Fetching data from Oracle DB using JDBC.
# Note: Oracle rejects the AS keyword in table aliases,
# so the subquery alias is written without it.
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", f"({query}) data")
      .option("user", oracle_username)
      .option("password", oracle_password)
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

# Convert DataFrame rows to fixed-length records
def format_row(row):
    column1 = str(row['column1']).ljust(20)  # adjust the width as per requirement
    column2 = str(row['column2']).ljust(30)  # adjust the width as per requirement
    column3 = str(row['column3']).ljust(50)  # adjust the width as per requirement
    return column1 + column2 + column3

fixed_length_rdd = df.rdd.map(format_row)

# Write the fixed-length records to a local temporary file.
# collect() pulls every row to the driver, so this suits modest extracts.
output_file_path = "/tmp/output_fixed_length_file.txt"
with open(output_file_path, "w") as f:
    for line in fixed_length_rdd.collect():
        f.write(line + "\n")

# Uploading the file to S3
s3 = boto3.client('s3')
bucket_name = s3_output_path.replace("s3://", "").split("/")[0]
s3_key = "/".join(s3_output_path.replace("s3://", "").split("/")[1:])
s3.upload_file(output_file_path, bucket_name, s3_key)

# Cleanup
os.remove(output_file_path)

# Mark the job as complete
job.commit()
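One scaling note: the script collects all rows onto the Glue driver before writing, which keeps the output as a single file but will not scale to very large tables. For big extracts, writing the formatted RDD directly to S3 with Spark (for example, fixed_length_rdd.coalesce(1).saveAsTextFile(s3_output_path)) avoids the driver bottleneck, at the cost of Spark's part-file naming.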

Explanation:

  1. Oracle JDBC connection: The script connects to your Oracle database through the JDBC driver and retrieves data based on the query.
  2. Fixed-length formatting: Each row is converted into a fixed-length record by padding every column to a set width with the ljust() method (a short standalone example follows this list).
  3. File creation: The formatted data is written to a text file on the Glue driver's local disk.
  4. S3 upload: The file is uploaded to the specified S3 bucket using Boto3.
  5. Cleanup: Temporary files are removed after upload.
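To make the fixed-length formatting concrete, here is a tiny standalone illustration using the same widths (20/30/50); the sample values are made up:

# Standalone illustration of the fixed-length padding used above
row = {"column1": "1001", "column2": "Jane Smith", "column3": "jane@example.com"}

record = (str(row["column1"]).ljust(20)
          + str(row["column2"]).ljust(30)
          + str(row["column3"]).ljust(50))

print(len(record))   # always 100, regardless of the values' lengths
print(repr(record))  # padded record, trailing spaces preserved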

Glue Job Parameters:

You can pass the following arguments when you run the Glue job (a start_job_run sketch follows the list):

  • oracle_jdbc_url: The JDBC URL for your Oracle database (e.g., jdbc:oracle:thin:@your_host:1521:your_sid, or jdbc:oracle:thin:@//your_host:1521/your_service_name when connecting by service name).
  • oracle_username: Oracle database username.
  • oracle_password: Oracle database password.
  • s3_output_path: The S3 path where you want to store the fixed-length file (e.g., s3://your-bucket/path/to/file.txt).
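To launch a run with those arguments from code, a minimal boto3 sketch (the job name is a placeholder; Glue expects each argument key prefixed with --):

import boto3

glue = boto3.client("glue")

# Job name and argument values below are placeholders
response = glue.start_job_run(
    JobName="oracle-to-s3-fixed-length",
    Arguments={
        "--oracle_jdbc_url": "jdbc:oracle:thin:@//your_host:1521/your_service_name",
        "--oracle_username": "etl_user",
        "--oracle_password": "********",  # prefer AWS Secrets Manager over plain-text passwords
        "--s3_output_path": "s3://your-bucket/path/to/file.txt",
    },
)
print(response["JobRunId"])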

Use SSH Keys to Clone a Git Repository over SSH

  1. Generate a New SSH Key Pair

     ssh-keygen -t rsa -b 4096 -C "HSingh@MindTelligent.com"

     -t rsa specifies the type of key (...