오픈소스/airflow

airflow with spark

pastime 2024. 12. 22. 02:14
728x90

목표 : 로컬환경에서 airflow 와 spark를 연동하여 사용

 

로컬에서 airflow와 spark를 연동해서 사용하기 위해서는 java가 필요하기 떄문에 base 이미지 부터 수정이 필요.

 

Dockerfile

FROM apache/airflow:2.10.4-python3.12


USER root

RUN apt-get update && \
    apt-get install -y openjdk-17-jdk && \
    apt-get install -y ant && \
    apt-get clean;

# Set JAVA_HOME environment variable
ENV JAVA_HOME /usr/lib/jvm/java-17-openjdk-amd64
ENV PATH $JAVA_HOME/bin:$PATH

USER airflow

RUN pip install apache-airflow apache-airflow-providers-apache-spark pyspark

 

 

docker-compose.yml

version: '3'

x-spark-common: &spark-common
  image: bitnami/spark:latest
  volumes:
    - ./jobs:/opt/bitnami/spark/jobs
  networks:
    - airflow-spark

x-airflow-common: &airflow-common
  build:
    context: .
    dockerfile: Dockerfile
  env_file:
    - airflow.env
  volumes:
    - ./jobs:/opt/airflow/jobs
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
  depends_on:
    - postgres
  networks:
    - airflow-spark

services:
  spark-master:
    <<: *spark-common
    command: bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "9090:8080"
      - "7077:7077"

  spark-worker:
    <<: *spark-common
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - spark-master
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_MASTER_URL: spark://spark-master:7077

  postgres:
    image: postgres:14.0
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    networks:
      - airflow-spark

  webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    depends_on:
      - scheduler

  scheduler:
    <<: *airflow-common
    command: bash -c "airflow db init && airflow db migrate && airflow users create --username admin --firstname admin --lastname admin --role Admin --email admin@gmail.com --password admin && airflow scheduler"

networks:
  airflow-spark:

 

airflow.env

AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW__WEBSERVER_BASE_URL=http://localhost:8080
AIRFLOW__WEBSERVER__SECRET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=

 

 

DAG 코드

 

    python_job = SparkSubmitOperator(
        task_id="python_job",
        conn_id='spark-conn',
        conf={
            'spark.master': 'spark://spark-master:7077',  # Spark Master 주소
        },
        application="jobs/python/wordcountjob.py",
    )

 

 

 

동작 확인

 

시행착오 정리

  • openjdk-17
    • 구글링에서 나온 결과물들은 대부분 openjdk-11 이였지만, 설치가 안되어 17로 업그레이드
      이에 따른 java-17-openjdk 로 변경
    • 윈도우 또는 맥에 따라 amd64가 아닌 arm 으로 변경해서 해결했다는 글도 존재
  • 이상하게 버전이 올라오면서 웹드라이버가 메모리를 많이 잡아먹기 시작하여, celery -> loacl executor로 변경

 

728x90