Efficient execution of deep learning models is an ongoing systems problem. It was particularly hard before the rich ecosystem of frameworks (such as Caffe and PyTorch) and libraries (such as cuDNN and DNNL) was established. This talk follows a path taken by the author in the pursuit of efficient execution of a non-conventional deep learning model on petabyte-scale datasets. We reflect on methods and systems that allow for great speedups of a seemingly naive computation: a nested set of loops. These methods both exploit hardware-specific (CPU or GPU) architectural features and apply alternative formulations of the computation itself (through FFT or Winograd transforms). Finally, we discuss where the current state of the art is headed and how additional gains in execution speed can be achieved.
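To make the abstract concrete: the "nested set of loops" is the direct form of a convolution, and the FFT transform mentioned above reformulates the same computation via the convolution theorem. The sketch below is a minimal illustration of that equivalence, not the speaker's actual code.

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Direct 2D cross-correlation (valid padding) as nested loops --
    the 'seemingly naive' computation the talk starts from."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for di in range(kh):
                for dj in range(kw):
                    out[i, j] += image[i + di, j + dj] * kernel[di, dj]
    return out

def conv2d_fft(image, kernel):
    """Same result via the frequency domain: pad both operands to the
    full-convolution size, multiply pointwise, invert, and crop the
    valid region. The kernel is flipped so this matches the
    cross-correlation computed above."""
    H, W = image.shape
    kh, kw = kernel.shape
    s = (H + kh - 1, W + kw - 1)
    F = np.fft.rfft2(image, s) * np.fft.rfft2(kernel[::-1, ::-1], s)
    full = np.fft.irfft2(F, s)
    return full[kh - 1:H, kw - 1:W]
```

For large kernels the FFT route replaces the O(kh·kw) inner loops per output element with O(log n) work amortized over the transform, which is one of the reformulation-based speedups the talk covers.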
Technical / Advanced
BSc in Physics, BSc in Electrical Engineering and Computer Science, MEng in Computer Science, and PhD in Computer Science, all from MIT.
Research in systems/machine learning at MIT and Princeton.
Engineering and consulting at Google and Facebook. Worked at the startups Kayak and NeuralMagic.
Currently at a big tech company working as a research scientist.