
A3C Lineage

Policy-Based

Value-based methods, which estimate Q-values, are tied to states and actions: they always have to work through trajectories (state-action-reward sequences) to learn. A policy-based method, on the other hand, learns an estimate of the policy itself rather than only a Q-value. What we ultimately want is for the agent to find a strategy that takes the right path, and a policy-based method reflects that goal more directly. A minimal sketch of learning a policy directly follows the list of advantages below.

 

Its advantages:

- Because the policy is learned directly, it is more stable (less sensitive to environment changes and noise).

- It learns a stochastic policy, balancing exploration and exploitation while converging toward π* (the optimal policy).

- It works well even in continuous action spaces.

- A variety of optimizers can be used.
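As a rough illustration of learning a policy directly, here is a minimal REINFORCE-style sketch. It assumes PyTorch and a discrete action space; the names PolicyNet and update are illustrative, not from any particular library.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Outputs a stochastic policy pi(a|s) over a discrete action space."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        # A Categorical distribution: sampling from it explores, its mode exploits
        return torch.distributions.Categorical(logits=self.net(obs))

policy = PolicyNet(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(observations, actions, returns):
    # REINFORCE loss: -E[log pi(a|s) * G_t]; stepping it is gradient ascent
    # on the expected return, i.e. the policy itself is what gets learned.
    dist = policy(observations)
    loss = -(dist.log_prob(actions) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The optimizer here is Adam, but as the last bullet notes, any gradient-based optimizer can be swapped in.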

 

Advantage Value

A(s, a) = Q(s, a) - V(s)

• A(s, a): the advantage of taking action a in state s

• Q(s, a): the Q-value of taking action a in state s

• Q-value: the expected return for a state-action pair

• V(s): the value of state s under the value function (the value function gives the expected return from the current state)

 

The reason for using the advantage: in the policy gradient, the advantage replaces the raw return, and because the subtracted baseline V(s) contributes zero to the gradient in expectation, the estimate stays unbiased while its variance is reduced.
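As a small sketch of how this is often estimated in practice, the one-step bootstrapped form A(s, a) ~ r + gamma * V(s') - V(s) can stand in for Q(s, a) - V(s). The helper names below are illustrative assumptions, written for PyTorch tensors.

```python
import torch

gamma = 0.99  # discount factor

def advantage(reward, value_s, value_next, done):
    # One-step bootstrapped estimate: A(s, a) ~ r + gamma * V(s') - V(s)
    q_estimate = reward + gamma * value_next * (1.0 - done)
    return q_estimate - value_s

def policy_gradient_loss(log_prob, adv):
    # The advantage only weights log pi(a|s); detach it so no gradient flows into V
    return -(log_prob * adv.detach()).mean()

# Toy numbers: an action that did better than the baseline gets a positive weight
adv = advantage(torch.tensor([1.0]), torch.tensor([0.5]),
                torch.tensor([0.4]), torch.tensor([0.0]))
```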

 

Actor-Critic

In actor-critic, the actor is let loose to encounter a wide variety of situations in the environment, and the gradients produced while learning the policy there are pushed as updates to the Global Network. To make sure only meaningful gradients are applied, the critic first evaluates the actor's behavior, and only then is the update performed.

 

As a result, the model carries two sets of gradients, one for each network: Actor_network and Critic_network.
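A minimal sketch of a model holding both parts, assuming PyTorch; the class name ActorCritic, the attribute names actor_network and critic_network, and the losses helper are illustrative.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """One model, two parts: the actor outputs pi(a|s), the critic outputs V(s)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.actor_network = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.critic_network = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs):
        dist = torch.distributions.Categorical(logits=self.actor_network(obs))
        value = self.critic_network(obs).squeeze(-1)
        return dist, value

def losses(dist, value, action, target_return):
    # The critic's evaluation (the advantage) gates the actor's update;
    # the critic itself regresses toward the observed return.
    adv = target_return - value
    actor_loss = -(dist.log_prob(action) * adv.detach()).mean()
    critic_loss = adv.pow(2).mean()
    return actor_loss, critic_loss
```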

 

 

A3C Global Network
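A rough sketch of the worker-to-global update described in the Actor-Critic section above, assuming PyTorch; the function push_gradients_to_global and the toy setup are illustrative assumptions, not the reference A3C implementation (which runs many such workers asynchronously in parallel).

```python
import torch
import torch.nn as nn

def push_gradients_to_global(local_model, global_model, global_optimizer, loss):
    """Apply one worker's gradients to the shared global network."""
    global_optimizer.zero_grad()
    local_model.zero_grad()
    loss.backward()  # gradients land in the local (worker) model
    for local_p, global_p in zip(local_model.parameters(),
                                 global_model.parameters()):
        global_p.grad = local_p.grad  # hand the worker's gradients to the global params
    global_optimizer.step()           # the global network takes the step
    local_model.load_state_dict(global_model.state_dict())  # worker re-syncs

# Toy usage with identical small networks standing in for the worker and global copies
global_net = nn.Linear(4, 2)
worker_net = nn.Linear(4, 2)
worker_net.load_state_dict(global_net.state_dict())
opt = torch.optim.Adam(global_net.parameters(), lr=1e-3)
loss = worker_net(torch.randn(8, 4)).pow(2).mean()  # placeholder loss
push_gradients_to_global(worker_net, global_net, opt, loss)
```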

 

Done.