Data Orchestration Airflow?

Anyone who works with data has probably heard of data orchestration (batch tools) at least once. Airflow is the most popular of them, but its DAG system is unfamiliar and hard to handle when you first learn it. There is execute date, start date, clear, and more. If you do not understand the system well, a batch made up of important tasks can run at a different time than you expected, and an option you did not need can re-run a failed task for every interval from the point of failure up to the present (e.g. catchup).
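To make the catchup surprise concrete, here is a minimal sketch in plain Python. It is a simplified model of interval-based scheduling, not Airflow's scheduler code; `pending_runs` and its parameters are names made up for illustration. It assumes a run for an interval is only created once that interval has fully elapsed.

```python
from datetime import datetime, timedelta


def pending_runs(start_date, interval, now, catchup):
    """Return the logical dates a scheduler would create runs for.

    Simplified model (an assumption, not Airflow internals): the run
    covering [logical, logical + interval) exists only after the
    interval has ended.
    """
    runs = []
    logical = start_date
    while logical + interval <= now:
        runs.append(logical)
        logical += interval
    if not catchup:
        # Without catchup only the most recent finished interval runs.
        return runs[-1:]
    return runs


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    now = datetime(2024, 1, 5, 12)
    daily = timedelta(days=1)
    print(len(pending_runs(start, daily, now, catchup=True)))   # 4
    print(pending_runs(start, daily, now, catchup=False))       # only 2024-01-04
```

With a start date months in the past, the `catchup=True` list grows by one run per missed interval, which is the pile of unexpected re-runs described above. In a real DAG definition the corresponding argument is `catchup`.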

Airflow Fail

 

 Many people are won over by Airflow's UI, choose it, and then come to regret it on several points.

 

First: packaging (at build time). Airflow's package dependencies are not only a matter of the Airflow version. Each provider is an official Airflow package that supplies connections to other databases and convenience features, so when a provider expects a different version from the one I use, I have to force a package version down (e.g. pendulum < 3.0) to get a build. Doing that breaks the dependency set the provider guarantees, and problems follow.

 

For example, a user on an old Elasticsearch stack, OpenSearch (<2) or Elasticsearch (<7.1), is tied to old client packages (opensearch-py <= 1.3.7, elasticsearch <= 7.13.1). Installing a newer Airflow (>= 2.5) in that situation raises an error, because apache-airflow-providers-elasticsearch installs elasticsearch (>8, <9) by default. And if you stay on the old version and later have to upgrade Airflow to fix a bug, you run into yet another problem.
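The conflict can be checked by hand before a build. The sketch below is a simplified version comparison using the bounds quoted above; it is not pip's resolver and it ignores pre-release and post-release rules. `parse` and `satisfies` are hypothetical helper names.

```python
def parse(version):
    """Turn '7.13.1' into (7, 13, 1); trailing zeros are dropped so 8.0.0 == 8."""
    parts = [int(p) for p in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies(version, lower_exclusive, upper_exclusive):
    """True if lower_exclusive < version < upper_exclusive."""
    return parse(lower_exclusive) < parse(version) < parse(upper_exclusive)


if __name__ == "__main__":
    # Client version pinned by the old cluster vs. the provider's range (>8, <9).
    pinned = "7.13.1"
    print(satisfies(pinned, "8", "9"))    # False: the pin and the provider conflict
    print(satisfies("8.12.0", "8", "9"))  # True
```

Running a check like this against every pinned client library shows which providers cannot be installed alongside the current stack.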

 

Second: workflow version management. Workflow versions are not tracked, so there is no record of how the code changed. Airflow began as a job scheduler at Airbnb, and the DAG concept is built on execute, interval, and start rather than on a calendar, so about the only thing you can track is the number of runs. From a data analyst's point of view this is nice, because you can edit code freely and see the result right away, but from a developer's point of view it is close to a culture shock. One workaround is to build code CD with git-sync, but it is not a great one.
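As a small illustration of what tracking a workflow version could mean, here is a hypothetical helper that fingerprints a DAG file's source so a run can be logged together with the code that produced it. `dag_version` is a made-up name, not an Airflow API, and this is only a sketch of the idea, not a replacement for git-sync.

```python
import hashlib


def dag_version(source: str) -> str:
    """Return a short, stable fingerprint of a DAG file's source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    before = "schedule = '@daily'\n"
    after = "schedule = '@hourly'\n"
    # Identical source gives the same fingerprint; any edit changes it.
    print(dag_version(before) == dag_version(before))
    print(dag_version(before) == dag_version(after))
```

Writing the fingerprint into each run's log at least tells you whether two runs executed the same code.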

Third: a steep learning curve. When you first start with Airflow, the large number of Operators feels like good news. That lasts only a moment: 99% of the time they do not behave the way you want. Functions that overuse Python decorators are probably what is getting in your way. Ironically, rumor has it that the best way to use Airflow is to use as few of its features as possible.

 

 

 

So which orchestrator should you use?

Types of orchestration

Siloed, uncoordinated orchestrators / following organizational boundaries

Domain-Specific / Orchestration

https://medium.com/apache-airflow/mind-the-gap-seamless-data-and-ml-pipelines-with-airflow-and-metaflow-7e40213dd719

 

Mind the Gap: Seamless data and ML pipelines with Airflow and Metaflow

Valay Dave, Software Engineer, Outerbounds

medium.com

| Tool | GitHub stars (K) | Type | Docker support |
|---|---|---|---|
| Airflow | 35.7 | Centralized Orchestrator | apache/airflow |
| Luigi | 17.6 | Domain-Specific | spotify/luigi |
| Prefect | 15.5 | Centralized Orchestrator | prefecthq/prefect |
| Dagster | 10.9 | Centralized Orchestrator | docs |
| Kedro | 9.5 | Domain-Specific | kedro-docker |
| Cadence | 8 | Centralized Orchestrator | - |
| Mage AI | 7.5 | Centralized Orchestrator | mageai/mageai |
| Apache NiFi | 4.6 | Domain-Specific | apache/nifi |
| Maestro | 2.1 | Domain-Specific | None Official |
| Apache Oozie | 0.7 | Centralized Orchestrator | None Official |

 

  Judging by GitHub stars as of 2024/08/01, Airflow is the most popular tool, and it is also the one with the most active community. Airflow has plenty of drawbacks and inconveniences, but a data orchestration tool is something you adopt with a horizon of several years and do not swap out easily, so following the best-established option can be one right answer.

 

 At my company, someone asked whether Airflow's features were taking up too many resources. Starting from that question I searched through a number of tools and found that quite a lot of the batch programs large IT companies use internally have been open-sourced. Netflix's recently released Maestro is one of them, and it is drawing a lot of attention as a possible rival to Airflow.

 

https://news.ycombinator.com/item?id=41037745 

 

Maestro: Netflix's Workflow Orchestrator | Hacker News

 

news.ycombinator.com

 

 People need to realize that code is a liability

  One thing to consider when adopting a data orchestrator is the debt that builds up in code over time. Fewer and fewer people remain who can fix bugs in that code, so you should be careful when bringing open source into a product. If you do not know the code well, it is really important to keep community activity in mind (e.g. GitHub issues).

 

 For the sake of whoever sits in my seat someday, I hope you think it over several times...
