
Adam

Adam: Adaptive moment estimation

Adam = RMSprop + Momentum
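To make the "RMSprop + Momentum" combination concrete, here is a minimal, self-contained NumPy sketch of a single Adam update (the function name `adam_step` and the quadratic toy objective are illustrative, not from any particular library; the hyperparameter defaults are the commonly used ones):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # momentum part: first moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSprop part: second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction for the
    v_hat = v / (1 - beta2 ** t)             # zero-initialized moments
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Illustrative use: minimize f(x) = x^2, whose gradient is 2x.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
```

The bias-correction terms matter early on: since `m` and `v` start at zero, the raw moving averages are biased toward zero for small `t`.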

Momentum: instead of inching toward the minimum one gradient step at a time, gradient descent with momentum accumulates a velocity from past gradients, letting it "skip" ahead and coast through shallow regions.
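A minimal sketch of that idea (plain Python; the function name and the quadratic toy objective are illustrative):

```python
def momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    # velocity is a decaying sum of past gradients; mu controls the decay.
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity

# Illustrative use: minimize f(x) = x^2 (gradient 2x) starting from x = 5.
theta, velocity = 5.0, 0.0
for _ in range(100):
    theta, velocity = momentum_step(theta, 2 * theta, velocity)
```

Because the velocity keeps pointing downhill across consecutive steps, the update grows along consistent gradient directions and dampens oscillations.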

Stochastic gradient descent(SGD)

[Image not included. Image Credit: CS231n]
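For comparison, plain SGD uses only the current (mini-batch) gradient, with no history at all. A minimal sketch, with an illustrative quadratic objective:

```python
def sgd_step(theta, grad, lr=0.01):
    # Vanilla (stochastic) gradient descent: step against the gradient.
    return theta - lr * grad

# Illustrative use: minimize f(x) = x^2 (gradient 2x) starting from x = 5.
theta = 5.0
for _ in range(100):
    theta = sgd_step(theta, 2 * theta)
```

In practice `grad` would be computed on a random mini-batch, which is what makes the method "stochastic".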

Adagrad

It makes big updates for infrequent parameters and small updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

The main benefit of Adagrad is that we donโ€™t need to tune the learning rate manually. Most implementations use a default value of 0.01 and leave it at that.
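The per-parameter scaling can be sketched as follows (NumPy; the function name is illustrative, and the 0.01 default matches the text above):

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.01, eps=1e-8):
    # cache accumulates squared gradients, so frequently updated parameters
    # get progressively smaller effective learning rates, while rarely
    # updated (sparse) parameters keep taking comparatively large steps.
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```

Note that `cache` only ever grows, so the effective step size shrinks monotonically, which is exactly the weakness discussed next.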

Disadvantage:

Its main weakness is that the learning rate only ever decreases: the squared gradients accumulate without bound, so the effective step size keeps decaying and updates eventually become vanishingly small.

 

AdaDelta

It is an extension of AdaGrad that removes its decaying-learning-rate problem by replacing the ever-growing sum of squared gradients with a decaying average.

Another advantage of AdaDelta is that we don't even need to set a default learning rate.
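A minimal sketch of one AdaDelta update (NumPy; function and argument names are illustrative, and `rho`/`eps` follow commonly used defaults). Note there is no learning-rate argument at all: the step size is derived from running averages of squared gradients and squared updates.

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_dx, rho=0.95, eps=1e-6):
    # Decaying average of squared gradients (fixes Adagrad's ever-growing cache).
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Step scaled by the ratio RMS(previous updates) / RMS(gradients),
    # so no external learning rate is needed.
    dx = -np.sqrt(avg_sq_dx + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_dx = rho * avg_sq_dx + (1 - rho) * dx ** 2
    return theta + dx, avg_sq_grad, avg_sq_dx
```

Because both accumulators are decaying averages rather than sums, old gradients are gradually forgotten and the effective step size can grow again when the loss surface changes.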

๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค