Back

Quantum Training of a Classical DNN

"A quantum algorithm for training wide and deep classical neural networks"

Published by Alexander Zlokapa, Hartmut Neven, Seth Lloyd (MIT, Google), 22nd July 2021

arXiv:2107.09200
Machine learning
Quantum Training of a Classical DNN

Quantum Machine Learning (QML) techniques have been rigorously researched in the past few years, showing great promise in a number of applications, using various quantum algorithmic techniques and potential speedups as compared to its classical counterparts. Many of these techniques exploit the high dimensionality of the Hilbert space through kernel methods while most exponential speedups require fault-tolerant quantum computers with access to a quantum memory (QRAM).

While previous works have successfully established quantum analogues to common classical approaches like classification, clustering, regression, and other tasks in data analysis, there has been a lack of a clear demonstration of a quantum speedup for tasks on classical datasets for existing quantum neural networks proposals. While algorithms for quantum machine learning are largely based on methods from linear algebra, neural networks rely on nonlinearity to act as a universal approximator. Given the success of deep neural networks in classical machine learning, the use of quantum computers to efficiently train classical deep neural networks remains an interesting open question.

In this work, the authors show how one may exploit the benefit of deep neural networks via the Neural Tangent Kernel (NTK) formalism, i.e. increasing the number of hidden layers L. As the neural network is deepened, the NTK matrix becomes increasingly well-conditioned, improving the speed at which gradient descent trains the neural network. The authors propose a general framework for fully connected neural network architectures via a quantum algorithm to train a wide and deep neural network under an approximation of the NTK, estimating the trained neural network output with vanishing error as the training set size increases. Furthermore, the linear form of the NTK offers a general framework for neural network architectures similar to those used in state-of-the-art applications of deep learning. The work provides two different approximations: a sparsified NTK and a diagonal NTK approximation. The diagonal NTK is defined by setting all off-diagonal elements of the NTK to zero, while the sparsified NTK is defined by only permitting off-diagonal elements to be nonzero in any row or column. For both-approximations, the matrix element bounds of the NTK guarantee the convergence of the approximation and enable efficient gradient descent which highlights the correspondence between conditions for trainable classical neural networks and an efficient quantum algorithm.

The sparsified NTK approximation has a strictly tighter upper bound given by the Gershgorin circle theorem on the error compared to the diagonal NTK approximation, suggesting the superior performance of the former. To support this theoretical result, numerical demonstration has been done using the MNIST image dataset, more accurately considering the MNIST binary image classification task between pairs of digits. Two infinite-width neural network architectures are used: i) a feedforward fully-connected neural network and ii) an architecture based on the convolutional Myrtle network. In each case, the handwritten image dataset is projected onto the surface of a unit sphere, during the initial encoding into QRAM. The results demonstrate that even relatively common data distributions satisfy the necessary conditions for efficient quantum state preparation and quantum state measurement, fully realizing an exponential quantum speedup over gradient descent.

This work demonstrates a quantum algorithm to train classical neural networks in logarithmic time O(log n) and provides numerical evidence of such efficiency on the standard MNIST image dataset. As the neural tangent kernel offers a versatile framework across architectures such as convolutional neural networks, graph neural networks, and transformers, the approximation introduced by a sparsified or diagonal kernel in this work has the potential to extend to any chaotic kernel. Also, the increase of the depth of a chaotic kernel might give rise to the kernel structure key to well-conditioning and successful approximation in logarithmic time. This can potentially open new possibilities for improved machine learning methods to quantum computing beyond the scope of classical neural networks.