Custom layer in PyTorch

Ever wonder how to add a custom layer to PyTorch? Or feel that all the math behind neural networks is very slippery? Welcome to my world!


Some time ago, while trying to solve a hairy problem, I needed to add a customized transform to my network. Why? Basically, this custom layer would encode my assumptions about the problem, and encode them using very few parameters. That leaves fewer degrees of freedom for the network to explore, which should mean faster convergence, less hunger for data, and more stable learning. At least, that was my theory.

Ok, there are drawbacks when the assumptions are wrong, but we will not get into that.


Well, this is not hard! There are tutorials in PyTorch! It is enough to implement the forward() and backward() operations!
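The pattern looks like this; here is a minimal sketch with a toy y = x² op (my own example, not the article's layer):

```python
import torch

class Square(torch.autograd.Function):
    """Toy custom op: y = x^2, with a hand-written backward()."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # stash whatever backward() will need
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx = grad_output * 2x
        return grad_output * 2 * x

x = torch.tensor([3.0], requires_grad=True)
Square.apply(x).backward()
print(x.grad)   # tensor([6.])
```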

To implement backward() we have to write our own derivatives. And this is where, for me, things got hairier. Because it seems that as soon as I close a NN tutorial, the math just goes away. So I collected the math and the PyTorch specs under the same notation.

I wrote everything up in a nice PDF, with the LaTeX sources attached, with colors and step-by-step explanations. And I linked my sources, because you never know. Maybe I did something wrong.

First, I refreshed the chain rule and how to differentiate multivariate functions, and solved a “textbook” example that I found on the internet. The math checked out! My result was on par with the “official” answer.

Moving on, I wrote down how PyTorch wants our derivatives. I derived them for a single-neuron layer, then for a multi-neuron layer. This part was a bit pesky until I figured out how the indices must be summed!
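In index notation, what backward() must return is the vector-Jacobian product: grad_output holds the incoming derivatives of the loss with respect to our outputs, and the sum over the output index $j$ is exactly the summation that had to fall into place:

```latex
\frac{\partial L}{\partial x_i}
  \;=\; \sum_j \frac{\partial L}{\partial y_j}\,
               \frac{\partial y_j}{\partial x_i}
```

Here $x_i$ are the layer inputs (or parameters), $y_j$ its outputs, and $\partial L / \partial y_j$ is what PyTorch hands us as grad_output.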

Finally, after all my math intuitions checked out, I moved on to the actual “business” and derived a layer that applies a Gaussian filter, with parameters learned from the data.

Filtering in action

Suppose we have an input signal that, for some reason, gets delayed and smoothed out, usually by passing through a physical system of some kind.

We want to detect and reproduce these changes from the data! We want a NN that will learn these transformations for us!

The most basic transform that can do the above operations is a Gaussian convolutional kernel:
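A plausible form for such a kernel (including the shift term $\mu$ discussed below) is the classic Gaussian:

```latex
g(x) \;=\; \frac{1}{\sigma\sqrt{2\pi}}
           \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

where $\sigma$ controls the smoothing and $\mu$ shifts the kernel off-center.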


A Gaussian kernel depends on a single parameter, the standard deviation. But I added the mean in there, too, so the kernel can be asymmetric, giving it the ability to shift the input signal in space/time. Then I implemented the backward() equations.

The code is available! Check the tests, too! They verify that the results are numerically correct.
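PyTorch has a built-in tool for exactly this kind of check: torch.autograd.gradcheck compares a hand-written backward() against finite differences. A self-contained sketch with a toy y = x³ op (my example, not the repo's tests):

```python
import torch

class Cube(torch.autograd.Function):
    """Toy op y = x^3 with a hand-written backward, to demo gradcheck."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2   # dy/dx = 3x^2

# gradcheck compares the analytic gradient against finite differences;
# it wants double-precision inputs with requires_grad=True.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(Cube.apply, (x,)))   # True
```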

Ok, but does it learn anything? Well, let’s generate some data and proceed to learning!
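As a rough sketch of what “delayed and smoothed” training data could look like (my own toy generator, not necessarily what the notebook does):

```python
import numpy as np

def gaussian_kernel(mu, sigma, half_width=20):
    """Discrete Gaussian kernel; a nonzero mu shifts (delays) the signal."""
    x = np.arange(-half_width, half_width + 1)
    k = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return k / k.sum()   # normalize so the kernel preserves total mass

rng = np.random.default_rng(0)
clean = (rng.random(200) > 0.97).astype(float)   # sparse spike train
target = np.convolve(clean, gaussian_kernel(mu=5.0, sigma=2.0), mode="same")
# `clean` is the network input; `target` is its delayed, smoothed version
```

The learning task is then to recover mu and sigma from (clean, target) pairs.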

The code for this is also available in a Jupyter Notebook, including the script for generating the above image.

Neat? Yes! Rewarding? Yes! Needed? Well . . .

One pillar of the whole Deep Learning explosion is automatic differentiation. And it works very well, and not only for max(), min() and sum() operations! I implemented another variant of my layer, with only the forward() step. Its behavior? Unchanged. And I did not notice any compute speedups from the hand-written version, either.
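That is the whole point: if forward() is written with ordinary tensor ops, autograd derives the backward pass on its own. A minimal forward-only version of a learnable Gaussian filter might look like this (my own sketch, not the repo's code):

```python
import torch
import torch.nn as nn

class GaussianFilter(nn.Module):
    """Forward-only Gaussian filter; autograd supplies the gradients
    for mu and sigma automatically."""

    def __init__(self, half_width=10):
        super().__init__()
        self.mu = nn.Parameter(torch.tensor(0.0))
        self.log_sigma = nn.Parameter(torch.tensor(0.0))   # sigma = exp(.) > 0
        self.register_buffer("x", torch.arange(-half_width, half_width + 1.0))

    def forward(self, signal):
        sigma = self.log_sigma.exp()
        k = torch.exp(-((self.x - self.mu) ** 2) / (2 * sigma ** 2))
        k = k / k.sum()
        # conv1d expects (batch, channels, length) and an (out, in, k) kernel
        return torch.nn.functional.conv1d(
            signal[None, None, :], k[None, None, :], padding=len(k) // 2
        )[0, 0]

layer = GaussianFilter()
out = layer(torch.randn(50))
out.sum().backward()   # gradients flow to mu and log_sigma for free
```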

For other layers, with more complicated transforms, I did not even bother looking at the derivatives. Still, it was a very rewarding exercise, and not entirely futile. If a derivative has a term like this:


for small inputs, the derivative explodes. Clipping might be in order, but what if the above expression is hidden inside a larger one? Then the only option is to write backward() by hand and apply the clipping to the partial result.
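As a hedged illustration (a toy stand-in, not the author's actual expression): take y = log(x), whose derivative 1/x blows up near zero. In a hand-written backward() we can clamp exactly that partial term before it joins the rest of the chain:

```python
import torch

class SafeLog(torch.autograd.Function):
    """log(x) with a clipped gradient: d/dx log(x) = 1/x explodes
    for small x, so we clamp that partial derivative only."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.log(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad = (1.0 / x).clamp(max=1e3)   # clip the exploding term
        return grad_output * grad

x = torch.tensor([1e-6], requires_grad=True)
SafeLog.apply(x).backward()
print(x.grad)   # tensor([1000.]) instead of 1e6
```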

Another use case is when we want to replace the actual derivative with an approximation (e.g., for faster or more stable computations).
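A classic example of this is the straight-through estimator: sign() has a zero derivative almost everywhere, so the backward pass simply pretends the op was the identity (a sketch of the general technique, not code from the article):

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign() with a straight-through backward: the true derivative is 0
    almost everywhere, so we substitute the identity as an approximation."""

    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output   # pretend d(sign)/dx = 1

x = torch.tensor([-0.5, 2.0], requires_grad=True)
SignSTE.apply(x).sum().backward()
print(x.grad)   # tensor([1., 1.]) — gradients pass straight through
```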