Developing and Testing ETL Pipelines for AWS Locally
Typically, ETL pipelines are developed and tested on real environments/clusters, which take time to set up and require maintenance. This article focuses on developing and testing ETL pipelines locally with the help of Docker and LocalStack. This solution gives the flexibility to test in a local environment without setting up any services on the cloud.
By Subhash Sreenivasachar, Software Engineer Technical Lead at Epsilon
Introduction
AWS plays a pivotal role in helping engineers and data scientists focus on building solutions and solving problems without worrying about setting up infrastructure. With serverless, pay-as-you-go pricing, AWS makes it easy to create services on the fly.
AWS Glue is widely used by data engineers to build serverless ETL pipelines, with PySpark being one of the common tech stacks used for development. However, despite the availability of these services, there are certain challenges that need to be addressed.
- Debugging code in the AWS environment, whether for an ETL script (PySpark) or any other service, is a challenge.
- Ongoing monitoring of AWS service usage is key to keeping the cost factor under control.
- AWS does offer a Dev Endpoint with all the Spark libraries installed, but considering the price, it is not viable for large development teams.
- Accessibility of AWS services may be limited for certain users.
Solution
Solutions for AWS can be developed and tested in a local environment without worrying about accessibility or cost. This article addresses two problems:
- Debugging PySpark code locally without using AWS dev endpoints.
- Interacting with AWS services locally.
Both problems can be solved with the use of Docker images.
- First, we do away with the need for a server in the AWS environment; instead, a Docker image running on the machine acts as the environment in which to execute the code.
AWS provides a sandbox image that can be used for PySpark scripts. The Docker image can be set up to execute PySpark code: https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/
- With a Docker machine available to execute the code, there is a need for a service like S3 to store (read/write) files while building an ETL pipeline.
Interactions with S3 can be replaced with LocalStack, which provides an easy-to-use test/mocking framework for developing cloud applications. It spins up a testing environment on your local machine that provides the same functionality and APIs as the real AWS cloud environment.
So far, the article deals with building an ETL pipeline using the services available. However, a similar approach can be adapted to any use case involving AWS services such as SNS, SQS, CloudFormation, Lambda functions, etc.
Approach
- Use docker containers as remote interpreter
- Run PySpark session on the containers
- Spin up S3 service locally using LocalStack
- Use PySpark code to read and write from S3 bucket running on LocalStack
Prerequisites
The following tools must be installed on your machine:
- Docker
- PyCharm Professional/ VisualStudio Code
Setup
- Download or pull the Docker images (docker pull <image name>)
- amazon/aws-glue-libs:glue_libs_1.0.0_image_01
- localstack/localstack
- Docker containers can be used as remote interpreters in PyCharm professional version.
Walkthrough
With Docker installed and the images pulled to your local machine, start configuring PyCharm to launch the containers.
- Create a docker-compose.yml file
- Create a DockerFile
https://gist.github.com/subhash-sreenivasachar/526221a4ede6053b1d576e666db8ec87#file-dockerfile
- Use a requirements file listing the packages to be installed
- Set up the Python remote interpreter
- Set up the Python interpreter using the docker-compose file.
- Select `glue-service` in the PyCharm Docker Compose settings.
- The docker-compose file creates and runs the containers for both images
- LocalStack runs on port 4566 by default, with the S3 service enabled on it
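The gist linked above carries the actual compose file; a minimal sketch with both services might look like the following. The volume mount and the `SERVICES` setting are illustrative assumptions, not the author's exact configuration, though the `glue-service` name and port 4566 match what the article describes.

```yaml
# Sketch of a docker-compose.yml for the two containers.
# Mounted paths and environment values are illustrative.
version: "3"
services:
  glue-service:
    image: amazon/aws-glue-libs:glue_libs_1.0.0_image_01
    volumes:
      - .:/home/glue_user/workspace   # project code mounted into the container
    command: tail -f /dev/null        # keep the container alive for PyCharm
  localstack:
    image: localstack/localstack
    ports:
      - "4566:4566"                   # LocalStack edge port (S3 and other APIs)
    environment:
      - SERVICES=s3                   # only the S3 service is needed here
```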
Code
- Required libraries to be imported
https://gist.github.com/subhash-sreenivasachar/526221a4ede6053b1d576e666db8ec87#file-imports
- Add a file to S3 bucket running on LocalStack
https://gist.github.com/subhash-sreenivasachar/526221a4ede6053b1d576e666db8ec87#file-add_to_bucket
http://host.docker.internal:4566 is the S3 endpoint running locally inside the Docker container.
- Set up a PySpark session to read from S3
- The PySpark session connects to S3 using the mock credentials provided
- You can read from S3 directly using the PySpark session created
https://gist.github.com/subhash-sreenivasachar/526221a4ede6053b1d576e666db8ec87#file-read_from_s3
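The exact session setup lives in the gist; a sketch of the idea, assuming Hadoop's s3a connector is used to point Spark at LocalStack, could look like this. The option names are standard s3a settings; the app name and helper functions are illustrative.

```python
# Sketch: a SparkSession wired to read from the S3 API that LocalStack
# exposes, using mock credentials as described above.
LOCALSTACK_ENDPOINT = "http://host.docker.internal:4566"

def s3a_options(endpoint: str = LOCALSTACK_ENDPOINT) -> dict:
    """Hadoop s3a settings that point Spark at LocalStack."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": "mock",
        "spark.hadoop.fs.s3a.secret.key": "mock",
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }

def read_csv_from_s3(bucket: str, key: str):
    """Create (or reuse) a session and read a CSV from the local S3."""
    from pyspark.sql import SparkSession  # needs pyspark installed
    builder = SparkSession.builder.appName("local-etl")
    for name, value in s3a_options().items():
        builder = builder.config(name, value)
    spark = builder.getOrCreate()
    return spark.read.csv(f"s3a://{bucket}/{key}", header=True)
```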
- Finally, it’s possible to write to S3 in any preferred format
https://gist.github.com/subhash-sreenivasachar/526221a4ede6053b1d576e666db8ec87#file-write_to_s3
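The write-back step in the gist can be sketched in the same spirit; the helper below is an assumed shape, not the author's exact code, and `overwrite` mode is one choice among several.

```python
# Sketch: write a PySpark DataFrame back to the LocalStack bucket as
# parquet. Assumes the session was configured with the s3a options
# shown earlier; bucket and prefix are illustrative.
def write_parquet_to_s3(df, bucket: str, prefix: str) -> str:
    """Write df as parquet under s3a://bucket/prefix; return that path."""
    path = f"s3a://{bucket}/{prefix}"
    df.write.mode("overwrite").parquet(path)
    return path
```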
Once the above-mentioned steps have been followed, you can create a dummy CSV file with mock data for testing, and you should be good to:
- Add file to S3 (which is running on LocalStack)
- Read from S3
- Write back to S3 as parquet
You should be able to run the .py file; a PySpark session will be created that can read from the S3 bucket running locally via the LocalStack API.
Additionally, you can check whether LocalStack is running at http://localhost:4566/health
LocalStack also provides the ability to run commands using the AWS CLI.
Conclusion
Using Docker and LocalStack provides a quick and easy way to run PySpark code, debug in containers, and write to an S3 bucket running locally, all without having to connect to any AWS service.
References:
- Glue endpoint: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html
- Docker: https://docs.docker.com/get-docker/
- PyCharm: https://www.jetbrains.com/pycharm/
- PyCharm Remote interpreter: https://www.jetbrains.com/help/pycharm/using-docker-compose-as-a-remote-interpreter.html
- LocalStack: https://localstack.cloud
Bio: Subhash Sreenivasachar is a Lead Software Engineer on the Epsilon Digital Experience team, building engineering solutions to solve data science problems, specifically personalization, and to help drive ROI for clients.
Source: https://www.kdnuggets.com/2021/08/development-testing-etl-pipelines-aws-locally.html