Yifan Jiang (yjiang1), Xiangguang Zheng (xiangguz)
The goal of this project is to implement an LSTM DSL (domain-specific language) that provides RNN researchers with a set of handy primitives for experimenting with different LSTM-like RNN structures, which are then scheduled and run efficiently on GPUs. To work toward scheduling LSTM networks dynamically, we started by statically optimizing the feedforward pass of the classic LSTM network using CUDA kernels and cuBLAS. Following the optimizations described in NVIDIA's technical blog [7], we achieved a ~776x speedup over a naive Python implementation (not our main focus) and a ~10x speedup over a naive CUDA LSTM implementation (both measured on GHC machines). We then designed a generalized, schedulable LSTM engine, aiming to get as close as possible to the performance achieved through static optimization. Due to time constraints, we shifted our focus to building an IR (intermediate representation) in C++ and CUDA, which acts as the backend engine for the DSL frontend in our original plan. This LSTM scheduler applies several of the same optimizations and is expected to achieve a speedup of roughly 4x over the naive CUDA implementation, i.e. about 0.8x the performance of the statically optimized LSTM.
Long short-term memory (LSTM) is a recurrent neural network architecture capable of learning long-term dependencies. It has proven to be very powerful for classifying, processing, and predicting time-series inputs (e.g., composing articles, translating text). The image below shows a classic LSTM cell and the operations involved in it:
Figure 1
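For reference, the operations inside the cell can be written as follows, where W denotes the input weight matrices and R the recurrent weight matrices (matching the Wx and Rh terms used below):

\begin{aligned}
i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + R_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_t + R_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + R_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}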
Our optimizations are inspired by NVIDIA's blog post on optimizing RNNs in cuDNN 5 [7].
The standard LSTM formulation multiplies the input by four separate weight matrices, one per gate. Instead of performing four separate matrix multiplications, we stack the four weight matrices into one and perform a single combined multiplication. Overall, this reduces the eight matrix multiplications per cell to two: one for the combined Wx and one for the combined Rh.
Figures 2 and 3 show the details of the combined matrix multiplication:
Figure 2
Figure 3
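As a concrete illustration, the combined Wx multiplication can be issued as a single cuBLAS GEMM. The sketch below assumes hidden size H, input size E, batch size N, and that the four weight matrices have already been packed into one 4H x E column-major matrix (the pre-transpose step described next); the function and variable names are illustrative, not our actual implementation.

#include <cublas_v2.h>

// One combined GEMM replaces the four per-gate multiplications.
// d_Wc:    packed (4H x E) weight matrix, column-major, on the device
// d_x:     (E x N) input batch, column-major, on the device
// d_gates: (4H x N) output buffer holding the pre-activations of i, f, g, o
void combinedInputGemm(cublasHandle_t handle, int H, int E, int N,
                       const float* d_Wc, const float* d_x, float* d_gates) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                4 * H, N, E,              // m, n, k
                &alpha,
                d_Wc, 4 * H,              // lda
                d_x, E,                   // ldb
                &beta,
                d_gates, 4 * H);          // ldc
}

The same call with E replaced by H covers the combined Rh multiplication.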
To perform the combined matrix multiplication correctly, we need to stack the four matrices vertically. However, cuBLAS stores matrices in column-major order, so vertically stacking the matrices requires that elements from the four matrices be interleaved in memory. Therefore, assuming the weights are generated in contiguous memory, we need a pre-transpose (re-packing) pass to obtain the vertically stacked combined matrix before the combined multiplication.
Figure 4
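One way to realize this reordering is a simple host-side packing pass. The sketch below assumes each of the four H x E per-gate weight matrices is stored contiguously in column-major order, in gate order i, f, g, o; it is illustrative rather than our exact implementation.

// Packs four column-major (H x E) matrices into one column-major (4H x E)
// matrix, so that column j of the result is [W_i(:,j); W_f(:,j); W_g(:,j); W_o(:,j)].
void packWeights(const float* W[4], float* Wc, int H, int E) {
    for (int j = 0; j < E; ++j) {            // every column
        for (int g = 0; g < 4; ++g) {        // every gate
            for (int r = 0; r < H; ++r) {    // the H rows of this gate
                Wc[j * 4 * H + g * H + r] = W[g][j * H + r];
            }
        }
    }
}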
Besides the matrix multiplications, the rest of the computation consists entirely of elementwise operations. We can therefore avoid the overhead of launching multiple kernels by fusing these operations into a single kernel.
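A sketch of such a fused kernel is shown below. It assumes the two combined GEMMs have already produced tmpWx = W*x and tmpRh = R*h as 4H x N column-major buffers in gate order i, f, g, o, with a single bias vector of length 4H; the names and layout are illustrative.

__device__ inline float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

// One thread per (hidden unit, batch element) pair; all pointwise LSTM math
// (gate activations, cell update, hidden output) is fused into one kernel.
__global__ void lstmElementwise(const float* tmpWx, const float* tmpRh,
                                const float* bias,          // length 4H
                                const float* cPrev,         // H x N
                                float* cOut, float* hOut,   // H x N
                                int H, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= H * N) return;
    int unit  = idx % H;                 // hidden unit
    int batch = idx / H;                 // batch element
    int base  = batch * 4 * H + unit;    // offset into the 4H x N gate buffers

    float i = sigmoidf(tmpWx[base]         + tmpRh[base]         + bias[unit]);
    float f = sigmoidf(tmpWx[base + H]     + tmpRh[base + H]     + bias[unit + H]);
    float g = tanhf   (tmpWx[base + 2 * H] + tmpRh[base + 2 * H] + bias[unit + 2 * H]);
    float o = sigmoidf(tmpWx[base + 3 * H] + tmpRh[base + 3 * H] + bias[unit + 3 * H]);

    float c = f * cPrev[idx] + i * g;    // cell state update
    cOut[idx] = c;
    hOut[idx] = o * tanhf(c);            // hidden output
}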
For the matrix multiplication between W and x, the dependency runs in the vertical direction: the input x at time t depends only on the previous layer at time t. Therefore, we can group the input matrices x of horizontally contiguous nodes (consecutive timesteps) into one larger multiplication to achieve better performance, as sketched after Figure 5.
Figure 5
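Concretely, if the inputs x_t, ..., x_{t+S-1} for S consecutive timesteps are laid out contiguously as an E x (N*S) column-major block, the Wx work for all S steps can be issued as one wider GEMM. This is a sketch under that layout assumption; the names are illustrative.

#include <cublas_v2.h>

// Multiplies the packed weights against S consecutive timesteps at once:
// d_xBlock is E x (N*S), d_gatesBlock is 4H x (N*S), both column-major.
void combinedInputGemmOverSteps(cublasHandle_t handle, int H, int E, int N, int S,
                                const float* d_Wc, const float* d_xBlock,
                                float* d_gatesBlock) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                4 * H, N * S, E,
                &alpha,
                d_Wc, 4 * H,
                d_xBlock, E,
                &beta,
                d_gatesBlock, 4 * H);
}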
There are two dependency relationships in an LSTM network: the input x at time t depends on the previous layer at time t (the node below), and the input h depends on the same layer at time t - 1 (the node to the left). Therefore, one way to achieve parallelism across multiple layers is to run the nodes on each diagonal at the same time, as shown in the figure below.
Figure 6
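A sketch of this wavefront schedule is shown below, using one CUDA stream per layer so that cells on the same anti-diagonal can be issued concurrently. The runCell callback stands in for the per-cell work (the two GEMMs plus the fused elementwise kernel); it is a placeholder, not our actual API.

#include <cuda_runtime.h>
#include <functional>

// Cells (layer l, time t) with the same value of l + t do not depend on each
// other, so each anti-diagonal ("wave") can run in parallel, one stream per layer.
void scheduleWavefront(int numLayers, int numSteps, cudaStream_t* streams,
                       const std::function<void(int, int, cudaStream_t)>& runCell) {
    for (int wave = 0; wave < numLayers + numSteps - 1; ++wave) {
        for (int l = 0; l < numLayers; ++l) {
            int t = wave - l;
            if (t < 0 || t >= numSteps) continue;
            runCell(l, t, streams[l]);       // issue this cell asynchronously
        }
        // Conservative sketch: finish the whole diagonal before the next one.
        cudaDeviceSynchronize();
    }
}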
Our experiments were run on the GHC machines (NVIDIA GTX 1080).
Figure 7
Comparison of the speedups of the different static optimizations, using network 1.
Figure 8
Figure 9
Figure 10
Our approach is inspired by TensorFlow's architecture: a DAG is used to represent the computation flow. Each node in the graph represents either a data point (weights/vectors) or an operator; a simplified sketch of this representation follows the figures below.
Figure 11
Figure 12
Figure 13
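A simplified C++ sketch of this graph representation is shown below; the field and class names are illustrative and differ from our actual implementation.

#include <memory>
#include <string>
#include <vector>

// Each node is either a data node (weights / vectors) or an operator node
// (e.g. a matrix multiply or the fused elementwise LSTM update). The edges
// record dependencies, which the scheduler uses to decide what can run
// concurrently and when buffers can be reused.
enum class NodeKind { Data, Op };

struct Node {
    NodeKind kind;
    std::string name;               // e.g. "W_combined", "gemm_Wx_l0_t3"
    std::vector<Node*> inputs;      // nodes this node consumes
    std::vector<Node*> outputs;     // nodes that consume this node
    float* deviceBuffer = nullptr;  // backing device storage for data nodes
};

struct Graph {
    std::vector<std::unique_ptr<Node>> nodes;

    Node* addNode(NodeKind kind, const std::string& name,
                  const std::vector<Node*>& inputs) {
        nodes.push_back(std::make_unique<Node>());
        Node* n = nodes.back().get();
        n->kind = kind;
        n->name = name;
        n->inputs = inputs;
        for (Node* in : inputs) in->outputs.push_back(n);
        return n;
    }
};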
Because we spent too much time on research and underestimated the complexity of this project, we do not yet have a working version of the dynamic scheduler. We were able to implement the pre-processing part but not the generation of CUDA code. However, based on the optimizations applied in the design of our dynamic scheduler, we predict the expected results shown in Figure 14.
Figure 14
Equal work was performed by both project members.
[1]: CS231n Lecture 10 - Recurrent Neural Networks: https://www.youtube.com/watch?v=iX5V1WpxxkY
[2]: Christopher Olah. Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[3]: Andrej Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[4]: Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A Critical Review of Recurrent Neural Networks for Sequence Learning.
[5]: Nvidia. cuBLAS toolkit documentation: http://docs.nvidia.com/cuda/cublas/#axzz4fJQIdbFQ
[6]: Nvidia. cuDNN User Manual: https://developer.nvidia.com/cudnn
[7]: Nvidia. Optimizing Recurrent Neural Networks in cuDNN 5: https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/
[8]: Engel, Jesse. “Optimizing RNNs with Differentiable Graphs.” Baidu Silicon Valley AI Lab