
LOSSGRAD: Automatic Learning Rate in Gradient Descent

Publication date: 2018

Schedae Informaticae, 2018, Volume 27, pp. 47 - 57

https://doi.org/10.4467/20838476SI.18.004.10409

Authors

Bartosz Wójcik
Institute of Mathematics, Faculty of Physics, Mathematics and Computer Science, Cracow University of Technology

Łukasz Maziarka
Faculty of Mathematics and Computer Science, Jagiellonian University, Krakow, Poland

Jacek Tabor
Faculty of Mathematics and Computer Science, Jagiellonian University, ul. Łojasiewicza 6, 30-348 Kraków, Poland
https://orcid.org/0000-0001-6652-7727


Abstract

In this paper, we propose a simple, fast, and easy-to-implement algorithm, LOSSGRAD (locally optimal step-size in gradient descent), which automatically adjusts the step-size in gradient descent during neural network training. Given a function f, a point x, and the gradient ∇_x f of f, we aim to find the step-size h which is (locally) optimal, i.e. satisfies:

h = arg min_{t ≥ 0} f(x - t∇_x f).

Making use of a quadratic approximation, we show that the step-size produced by the algorithm satisfies the above condition. We experimentally show that our method is insensitive to the choice of the initial learning rate while achieving results comparable to other methods.
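
To make the idea concrete, the following is a minimal sketch (in Python with NumPy) of a step-size rule in the spirit of the abstract: after a tentative step x - h∇_x f, the observed decrease of the loss is compared with the decrease predicted by a local model, and h is enlarged or shrunk multiplicatively. The function name lossgrad_like_step, the constant rho, and the exact update rule are assumptions made for illustration; they are not the paper's algorithm.

```python
import numpy as np


def lossgrad_like_step(f, grad_f, x, h, rho=0.9):
    """One gradient-descent step with a multiplicative step-size update.

    After the tentative step x - h * grad, the observed decrease of f is
    compared with the decrease predicted by a local first-order model;
    h is enlarged when the model is matched and shrunk otherwise. The
    constant rho and this exact rule are illustrative assumptions.
    """
    g = grad_f(x)
    f_x = f(x)
    x_new = x - h * g

    predicted_decrease = h * float(np.dot(g, g))  # model: f(x - h g) ≈ f(x) - h ||g||^2
    actual_decrease = f_x - f(x_new)

    if actual_decrease >= rho * predicted_decrease:
        h = h / rho           # decrease matched the model: try a larger step next time
    else:
        h = h * rho           # decrease fell short: shrink the step and redo it
        x_new = x - h * g

    return x_new, h


# Usage on a simple quadratic f(x) = 0.5 * x^T A x with gradient A x.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x, h = np.array([1.0, 1.0]), 1e-4  # deliberately poor initial step-size
for _ in range(200):
    x, h = lossgrad_like_step(f, grad_f, x, h)
print(f(x), h)  # the loss decreases while h adapts away from its bad initial value
```

On this toy quadratic the step-size recovers from a deliberately poor initial value, which mirrors the abstract's claim that the method is insensitive to the choice of the initial learning rate.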


Information

Information: Schedae Informaticae, 2018, Volume 27, pp. 47 - 57

Article type: Original article


Published at: 2018

Article status: Open

Licence: CC BY-NC-ND

Percentage share of authors:

Bartosz Wójcik (Author) - 33%
Łukasz Maziarka (Author) - 33%
Jacek Tabor (Author) - 34%

Article corrections:

-

Publication languages:

English

View count: 1855

Number of downloads: 1352
