Deep Learning with Phase Retrieval

Introduction

I spent the second year and half of the third year of my Ph.D. program developing the Chemical Holographical Imaging System (CHIS). It is designed to acquire not only the intensity of the light (electromagnetic field) but also the phase information, which is usually lost when a standard camera or detector is used to collect the photons. I will write another blog post to explain the mechanism behind the scenes. Here, I will talk about the downstream work, which happens after the phase of the field is reconstructed.

To characterize and prove the concept, we implemented a forward model and an inverse model for CHIS, simulating the process of acquiring a phase-sensitive measurement and reconstructing the sample, respectively. Here I will only talk about the inverse model, where machine learning algorithms are applied.

We use the Mie scattering model to simulate a micro-sphere whose radius is close to or larger than the wavelength of the incident light. Mie scattering describes the light scattered when an electromagnetic wave interacts with uniform, isotropic particles (spheres) in a uniform, isotropic, infinite dielectric medium. Mie theory is a rigorous solution to Maxwell's equations that describes the scattering of electromagnetic radiation by spherical particles. It has been widely applied in the characterization of atmospheric absorption and in cellular structure detection in spectroscopic imaging of biological tissues.

According to the superposition principle, any form of propagating electromagnetic field is equivalent to a combination of multiple plane waves. To simplify the model, the incident light passing through a condenser with a low numerical aperture is treated as a plane wave. While the forward model of CHIS can be described by Mie scattering, the inverse model is equivalent to solving the inverse Mie problem.

Here we first examine the forward Mie scattering. The scattered field produced by a single plane wave incident on a sphere of radius a placed at the origin can be written as:

\textbf{E}_s(\textbf{p}) = \sum_{l = 0}^{\infty} B_l(\lambda, n, a)\, h_l^{(1)}(kr)\, P_l(\cos\theta)

where h_l^{(1)}(\cdot) is the order-l spherical Hankel function. The coefficients B_l that couple the internal and external fields are given by:
B_l(\lambda, n, a) = (2l+1)i^l\frac{j_l(ka)j_l'(kna)n - j_l(kna)j_l'(ka)}{j_l(kna)h_l^{(1)'}(ka) - h_l^{(1)}(ka)j_l'(kna)n}

where j_l'(\cdot) is the first derivative of the spherical Bessel function of the first kind, and h_l^{(1)'}(\cdot) is the first derivative of the spherical Hankel function.
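To make this concrete, here is a minimal SciPy sketch of the B coefficient following the (reconstructed) formula above. The wavelength, refractive index, and radius in the example are made-up values for illustration, not parameters from the actual system.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def h1(l, x, derivative=False):
    """Spherical Hankel function of the first kind (or its first derivative)."""
    return spherical_jn(l, x, derivative) + 1j * spherical_yn(l, x, derivative)

def B(l, lam, n, a):
    """Order-l B coefficient for a sphere of radius a and refractive index n."""
    k = 2 * np.pi / lam  # wavenumber
    num = spherical_jn(l, k * a) * spherical_jn(l, k * n * a, True) * n \
        - spherical_jn(l, k * n * a) * spherical_jn(l, k * a, True)
    den = spherical_jn(l, k * n * a) * h1(l, k * a, True) \
        - h1(l, k * a) * spherical_jn(l, k * n * a, True) * n
    return (2 * l + 1) * 1j**l * num / den

# real part of the first 30 orders for a 1 um sphere, n = 1.4, lam = 3 um
print(np.real([B(l, lam=3e-6, n=1.4, a=1e-6) for l in range(30)]))
```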

If we take a closer look at these two equations, it is not difficult to notice that the material properties of the sphere (radius a and refractive index n) are embedded only in the B coefficient. In other words, to build an inverse model of the system, our first step is to extract the B coefficient from the field \textbf{E}_s.

Since the Hankel function and its derivatives can be derived from the spherical Bessel functions of the first and second kinds and their derivatives, the B coefficient behaves similarly to Bessel functions. Figure 1 plots the B coefficients of fields scattered by spheres of different radii (only the real part is shown).

Figure 1. B coefficients of the scattered field for different sphere radii

The animation shows how the B coefficient changes with a \in [1, 10] \mu m. You can run this Jupyter Notebook demo, which includes an interactive Bokeh plot, to examine the properties of the B coefficient.

The point is, the B coefficient tends to converge to zero with increasing order, after oscillating up and down. In other words, it is highly sparse and fluctuating, which makes a lot of sense: because the incident light is scattered by a sphere, the scattered field near the surface of the sphere has the highest intensity. The intensity then oscillates due to scattering and interference, and as the scattered field travels further from the sphere, it eventually converges to zero (the energy spreads into a broader space). In fact, on top of the sparsity of the B coefficient, the rest of the equation is highly sparse and fluctuating as well. If we treat the scattered field as a system of linear equations Ax = b, where the scatter matrix A = h_l^{(1)}(kr)P_l(\cos\theta) is the transfer matrix, x is the B coefficient, and b is the field \textbf{E}_s, the condition number of this linear system can go up to 10^{21}.

In numerical analysis, the condition number of a function with respect to an argument measures how much the output changes in response to a small shift in that argument. A low condition number means the system is well-conditioned and thus robust to disturbances in the input. In contrast, a high condition number means the system is sensitive to disturbance: a small change in the input can cause a dramatic change in the output, hence an ill-conditioned system. For a linear system with condition number \kappa = 10^k, besides the arithmetic error during the calculation, up to k additional digits of accuracy may be lost.
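To make the ill-conditioning concrete, here is a minimal sketch that assembles the scatter matrix A and computes its condition number. The wavelength, sample geometry, and series truncation below are made-up values for illustration, not parameters from the actual system.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

lam = 3e-6                               # wavelength (made-up value)
k = 2 * np.pi / lam
L = 40                                   # truncation order of the series
x = np.linspace(-20e-6, 20e-6, 200)      # sample positions along a detector line
z = 5e-6                                 # distance of the line from the sphere center
r = np.sqrt(x**2 + z**2)
cos_theta = z / r

# A[m, l] = h_l^(1)(k r_m) * P_l(cos theta_m)
A = np.empty((x.size, L), dtype=complex)
for l in range(L):
    h1 = spherical_jn(l, k * r) + 1j * spherical_yn(l, k * r)
    A[:, l] = h1 * eval_legendre(l, cos_theta)

print(f"condition number: {np.linalg.cond(A):.3e}")  # explodes as L grows
```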

In our case, if we apply traditional analytical methods such as the pseudo-inverse or Gaussian elimination to solve the inverse linear system and recover the B coefficient from \textbf{E}_s (our measurement), the ill-conditioned property makes recovery in the presence of noise impossible, since a tiny amount of noise in the measurement results in a massive change in the computed B. Another way to think about it: the relative error in the solution is roughly the relative noise multiplied by the condition number.
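Here is a sketch of that failure mode, reusing the scatter matrix construction from the previous snippet: noise at even the 10^{-9} level is amplified by roughly the condition number when the system is inverted with a pseudo-inverse. The ground-truth coefficients below are a made-up stand-in, not real B coefficients.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

# rebuild the scatter matrix from the previous sketch
k = 2 * np.pi / 3e-6
x_pos = np.linspace(-20e-6, 20e-6, 200)
r = np.sqrt(x_pos**2 + (5e-6)**2)
cos_t = 5e-6 / r
A = np.array([(spherical_jn(l, k * r) + 1j * spherical_yn(l, k * r))
              * eval_legendre(l, cos_t) for l in range(40)]).T

rng = np.random.default_rng(0)
x_true = np.exp(-0.2 * np.arange(40))              # stand-in "B coefficients"
b = A @ x_true
b_noisy = b + 1e-9 * rng.standard_normal(b.shape)  # tiny synthetic noise

x_clean = np.linalg.pinv(A) @ b                    # recovery from the clean measurement
x_noisy = np.linalg.pinv(A) @ b_noisy              # recovery from the noisy measurement
print(np.linalg.norm(x_clean - x_true) / np.linalg.norm(x_true))
print(np.linalg.norm(x_noisy - x_true) / np.linalg.norm(x_true))  # far larger
```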

Deep Learning Approach

The fact that one pathway to the solution is broken does not mean the answer is unreachable. Before we jump into machine learning, let us take a look at some of the advantages and disadvantages of different strategies:

  1. Analytical solutions are mathematically grounded. We can break the solution into pieces and prove each part rigorously. We know what we are doing and what we should expect.
  2. Analytical solutions are universal and scalable. Since top-level applications are grounded in similar fundamental theories, the analytical solution of our system (if we have one) can be applied to any other similar imaging system.
  3. Analytical solutions are one-time work. They can be applied directly once the strategy is finalized, whereas machine learning methods cannot: to maintain accuracy, a different training set has to be provided if we want to use the model for another system, even if the difference is small.
  4. Machine learning is a black box, as is well known in the community. We do not have detailed information about what our models are doing when they give a prediction, nor do we know what to expect when we tweak some fine gears (not counting the ability to judge whether the model has high bias or high variance, of course).

At this point you might want to ask: it seems like there must be better choices than machine learning algorithms for solving this ill-conditioned inverse problem, so why don't you try something else? The answer is yes, we did try something else. That something is called Compressive Sensing, which specializes in sparse reconstruction and convex optimization. The short story is: 1. the linear system we have is not a convex problem; 2. the accuracy of sparse reconstruction algorithms seems to suffer from the synthetic noise; 3. they all require training of some sort anyway.

Looking over the pros and cons of all available methods, we stuck with machine learning algorithms (ANNs specifically, more on that later) despite all the disadvantages they might have: for a proof of concept, we believe better accuracy should be weighted higher than scalability and generalizability, and the training process can be optimized in various ways, such as feature selection and smart sampling.

Convolutional Neural Network

Even though Convolutional Neural Networks (CNNs) were invented in the 1980s, they did not become popular until GPU implementations arrived and spread in the early 2010s. They are mighty at image processing, "convolving" images to extract higher-dimensional features. One would normally take a CNN as the first choice for solving the problem (at least we did). The structure of the CNN we applied to our system is shown in Figure 2:

Figure 2. CNN Structure

As shown in Figure 2, a convolution and max-pooling based set of layers is adopted as the basic structure. Three pairs of convolutional and max-pooling layers are cascaded to enable the extraction of higher-level abstract features from the scattering images. The number of filters for each convolutional layer is 128, 64, and 32, respectively. The window sizes are set to 3\times 3 for the convolutional layers and 2\times 2 for the subsequent max-pooling layers. A fully connected layer with 128 units is attached to the last max-pooling layer, followed by the output layer. Given a complex field as input to the trained CNN model, the final output is represented by the three units in the output layer, corresponding to the radius a, the refractive index \eta, and the attenuation coefficient \kappa, respectively.
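For reference, here is a minimal Keras sketch consistent with the structure described above. The two-channel (real/imaginary) 128\times 128 input shape, the ReLU activations, the padding, and the optimizer are my assumptions, so the exact parameter count will differ somewhat from the original model.

```python
from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    keras.Input(shape=(128, 128, 2)),   # assumed real/imaginary channels
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(3),                    # radius a, refractive index eta, attenuation kappa
])
cnn.compile(optimizer="adam", loss="mse")
cnn.summary()
```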

Since we have the forward model implemented for the system, one advantage is that we do not need to worry about the amount of data available for training. However, the trade-off between model structural complexity and training time is not negligible. The three-layer structure is the result of balancing accuracy against time cost.

CNN Performance

To ensure the CNN is trained with low bias and low variance, 8000 phase-reconstructed images with 640 \times 640 spatial resolution are generated by the forward model, using different material properties for the simulated spheres. Each of the three features takes 20 values uniformly distributed within its predefined range (20^3 = 8000 combinations). Specifically, the radius a of the sphere ranges over [1, 2]\mu m, the refractive index \eta varies within [1.1, 2.0], and the attenuation coefficient \kappa \in [0.01, 0.05].

Before training the model, stratified sampling is applied to split the training and testing data sets, since the overall data set is too large to fit in memory. The sphere radius clusters the 8000 images into 20 subsets. Within each subset, the training and test sets are randomly sampled and shuffled with a 20\% test ratio, and all images are cropped to 128 \times 128 resolution at this stage. As a result, there are 6400 and 1600 images in the training and test sets, respectively. The order of the features is randomized for \eta and \kappa but ascending for a.
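A sketch of how that split might look with scikit-learn; `images`, `labels`, and `radius_id` are hypothetical variable names of mine, not from the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

images = np.zeros((8000, 128, 128, 2))     # stand-in for the cropped complex images
labels = np.zeros((8000, 3))               # (a, eta, kappa) per image
radius_id = np.repeat(np.arange(20), 400)  # which of the 20 radii each sample uses

X_train, X_test, y_train, y_test = train_test_split(
    images, labels,
    test_size=0.2,          # 6400 training / 1600 test images
    stratify=radius_id,     # keep all 20 radius subsets represented in both splits
    shuffle=True,
    random_state=0,
)
```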

During training, the training set is further divided into training and validation sets with the same ratio (20% for validation), which results in 5120 images for training and 1280 for validation. The training process runs for 25 epochs (i.e., the training and cross-validation are repeated 25 times) with the batch size set to 50. Meanwhile, label scaling is applied to the attenuation coefficients due to scale differences.
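In Keras terms, the training configuration might look like the sketch below; the MinMaxScaler is my guess at the "label scaling" applied to \kappa, and `cnn`, `X_train`, `y_train` refer to the earlier sketches.

```python
from sklearn.preprocessing import MinMaxScaler

# rescale the attenuation labels (kappa is much smaller in magnitude than a and eta)
kappa_scaler = MinMaxScaler()
y_train[:, 2:3] = kappa_scaler.fit_transform(y_train[:, 2:3])

history = cnn.fit(
    X_train, y_train,
    validation_split=0.2,   # 5120 training / 1280 validation images
    epochs=25,
    batch_size=50,
)
```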

Figure 3. Relative RMSE of (left) complex CNN (with phase) and (right) intensity CNN (without phase)

Figure 3 shows the performance of the CNN trained on 6400 complex images and tested on the remaining 1600 images. The plots show only 50 samples, uniformly down-sampled from the testing and prediction sets by a factor of 32. The relative root-mean-square error (RMSE) \epsilon_R is calculated by:

\epsilon_R = \frac{1}{M}\sum_{m=1}^{M}\frac{\sqrt{\left(y_{pred}^{(m)} - y_{test}^{(m)}\right)^2}}{y_{test}^{(m)}}

where y_{pred} is the prediction given by the CNN and y_{test} is the ground truth (label) of the corresponding sample. The total number of testing samples is denoted by M = 1600.
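As a side note, the per-sample square root above collapses to an absolute value, so \epsilon_R is simply the mean absolute relative error; a direct NumPy translation:

```python
import numpy as np

def relative_rmse(y_pred, y_test):
    """Relative RMSE as defined above, averaged over the M test samples."""
    return np.mean(np.abs(y_pred - y_test) / y_test)
```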

To demonstrate the merits of the phase-sensitive measurements achieved by chemical holography, another CNN model with an identical structure is trained on traditional intensity images, using the same number of training and testing samples. The intensity images are obtained by taking the squared absolute value of the complex images. The performance of the intensity CNN is shown on the right of Figure 3.
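Dropping the phase is a one-liner; `complex_images` below is a stand-in name for the simulated fields.

```python
import numpy as np

complex_images = np.zeros((8000, 128, 128), dtype=complex)  # stand-in for the fields
intensity_images = np.abs(complex_images) ** 2              # I = |E|^2, phase discarded
```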

Everything seems beautiful if we stop here: not only is the CNN working with pretty lovely regression accuracy, but the one trained with complex (phase-reconstructed) images also outperforms the one trained with standard intensity images. To generalize our model, or rather, to demonstrate the robustness of CHIS, we added more experiments to back it up.

Table 1. Relative Root Mean Square Error (RMSE) for CNNs (%)

Table 1 shows the performance of the complex CNNs and intensity CNNs, summarized from 30 repeated trials. It is clear that the complex CNNs perform better than the intensity CNNs, which can be partially explained by the fact that the phase information present in the complex training set is no longer available in the intensity training set (it is lost during the conversion from complex to intensity).

Artificial Neural Network

It may seem a little weird to see Artificial Neural Networks (ANNs) mentioned at this point since, for most of us, CNNs are far more advanced and powerful than ANNs. We already implemented the model with a CNN, so why bother even talking about ANNs? Well, one of the most useful lessons I learned from my advisor, Dr. David Mayerich, is that as a scientist, or an engineer, we often need to calm down and take a step back. When we are doing research or trying hard to get things done, it is quite natural to become tunnel-visioned. Once we are stuck, it is also easy to become too extreme and anxious. Only when we take a step back and make sure all the basics have been done correctly does the path forward show up.

Back to the topic: why did we choose an ANN? The answer can be as simple as: we don't need a CNN because we can achieve the same performance with an ANN, only faster, easier, and better. And I will explain the reasons explicitly and rigorously.

First of all, CNNs are widely applied in image processing to handle complicated high-dimensional feature extraction. The goal is to train the network so it can distinguish features or patterns in a 2-D image, or even a 3-D volume, and CNNs are extremely good at this. Now let's consider the following statement:

We have a dataset and a model trained on this dataset. If the model is a CNN and it is the best model we tried, then this dataset must contain two (or higher) dimensional images.

This statement is true (at least in most cases; you are welcome to correct me if I am wrong). But how about the converse of the statement:

If we have a dataset that contains two (or higher) dimensional images, then a CNN will be the best model for this dataset.

Is this converse statement also true? There are numerous examples where it does not hold, including our case. Consider an image whose pixel values are constant along the x axis and vary only along the y axis: we can "compress" this 2-D image into a 1-D vector, one of its vertical lines, without losing any information. Furthermore, if all the pixel values are the same, we can even compress the whole image into a single pixel, and this is where symmetry comes into play. In highly symmetrical situations, such as central symmetry, radial symmetry, or even reflectional symmetry, the more symmetrical the original 2-D image is, the more redundant information it carries, and thus the less spatial diversity is encoded in the local pixels. If there are few or no higher-dimensional features to extract, a simpler model such as an ANN can handle all the work adequately.

In our case, assuming there is no systematic noise, or the noise is low, the whole image will be radially symmetric if we place the sphere at the center of the image. Hence we can reduce the dimension of our dataset from 2-D images to 1-D vectors to train the model without losing any spatial information.
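A sketch of that reduction: for a field centered in the image, one radial half-line carries all the information.

```python
import numpy as np

def to_radial_profile(image):
    """Reduce a centered, radially symmetric 2-D field to a 1-D radial vector."""
    cy, cx = image.shape[0] // 2, image.shape[1] // 2
    return image[cy, cx:]                    # half of the central horizontal line

# a 640x640 simulation collapses to a 320-point vector
profile = to_radial_profile(np.zeros((640, 640), dtype=complex))
print(profile.shape)                         # (320,)
```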

Once we switched from the CNN to an ANN using 1-D vectors reduced from the 2-D images, several obvious advantages appeared naturally:

  1. Dramatically reduced data size, computational cost, and time cost: if the original 2-D simulation has N\times N resolution, there are N^2 points to be calculated. Once reduced to 1-D, the number of calculations decreases to N/2 (one vertical or horizontal half-line of the image). Therefore, the dataset for training and testing the 1-D ANN model is 2N times smaller than in the 2-D case. Previously, our training images after padding were 640\times 640, totaling more than 100 GB of data; now that shrinks to 1/1280 of the original size, no more than 100 MB, without losing any information. This 2N factor also applies to the reduction in computational cost and time cost.
  2. Simpler network model and more sophisticated training: as one of the more fundamental machine learning models, ANNs are more mathematically grounded, especially single-layer ANNs. The weights between layers can be interpreted more concretely. That is, for each pixel in the input 1-D vector, there is a combination of weights (the hidden layer) being optimized to minimize the cost function, including the bias term. The more units in the hidden layer, the better the model will fit (and the greater the chance of overfitting). This point-to-point mapping is less abstract than a CNN's higher-dimensional feature extraction mechanism and thus easier to control. That being said, ANN models can be trained more thoroughly with much less work compared to CNN models.

Figure 5 demonstrates the radial symmetry of the simulation and the effectiveness of the dimension reduction. As mentioned before, for the simulation to be radially symmetric, the sphere must be placed at the center of the field of view. One thing to keep in mind is that this condition is hard to meet in real data. In other words, in an experimental dataset, the error induced by asymmetry would impact the performance of the proposed pipeline. However, this drawback also applies to other models such as CNNs, and adding more synthetic data for training is a potential solution.

Figure 5. (a) 2-D Mie scattering simulation in far-field (top) and near-field (bottom). (b) 1-D far-field Mie scattering simulation corresponding to the red dashed line in (a). (c) 1-D near-field Mie scattering simulation corresponding to the green dashed line in (a).

With one-dimensional data, the structure of the model can be simplified dramatically, as shown in Figure 6. As mentioned before, the 1-D dataset occupies only 1/1280 of the memory space of the 2-D dataset, so the memory bottleneck no longer exists. The number of values per feature is increased to 30, resulting in 30^3 = 27000 total samples. Each sample contains a 320\times 1 vector as the input for the neural network.

Figure 6. ANN structure

In terms of fine-tuning the ANN model, there are several things to pay attention to:

  1. Unlike a CNN, where processes like convolution and max-pooling generalize the local differences in the data, an ANN takes all the local information and assigns a combination of weights to each piece of it. As a result, ANNs with more complicated structures are more likely to struggle with overfitting, which becomes more noticeable when applying deep ANNs to real-life data where lots of noise is present. In short, complicated ANNs with higher accuracy are more sensitive to noise than simple ANNs, assuming they are trained without noise. How to balance the trade-off between noise sensitivity and accuracy depends on the problem to be solved.
  2. Random initialization. To ensure the number of stochastic processes is minimized while developing a model, each operation involving randomness, such as weight initialization, should be assigned a seed or random state.
  3. Activation function. While the most popular activation function is ReLU (Rectified Linear Unit), other activation functions should also be considered to optimize the training process. One thing to keep in mind is the dying ReLU problem: since ReLU sets everything negative to zero, the gradient can converge towards zero. Once a neuron (unit) hits this state, the corresponding weights stop getting updated, hence the so-called dead neuron. In our case, there is a fair chance that the weighted sum hits negative numbers, so a tanh (hyperbolic tangent) activation function is a better choice. In fact, the training process does converge significantly faster with tanh than with sigmoid or ReLU when the network has a small number of hidden units. When the network grows deeper or has more hidden units, ReLU becomes the better choice: if a lot of units get stuck at the saturated ends of the sigmoid or tanh functions, vanishing gradients appear, which can make the training unstable and harder to converge. Dying ReLU, on the other hand, is not a problem in that case since the number of units is large enough.

ANN Performance

Starting with the simple model shown in Figure 6, there is only one hidden layer with five hidden units. The number of parameters of the intensity ANN is shown below.

For the complex case, since the 1-D vector contains both real and imaginary components, it is split into two channels for training. The model has the same structure, with 1 hidden layer and 5 hidden units. The number of parameters in the complex model is shown below.
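The parameter tables are images and are not reproduced here, but the quoted counts can be recovered: one wiring that yields exactly the 4818 trainable parameters mentioned below for the complex model is a per-position Dense layer over the two channels, followed by flattening. This is my reconstruction, not necessarily the original code; the tanh activation and fixed seed follow the fine-tuning notes above.

```python
from tensorflow import keras
from tensorflow.keras import layers

keras.utils.set_random_seed(0)               # pin down the random weight initialization

complex_ann = keras.Sequential([
    keras.Input(shape=(320, 2)),             # real and imaginary channels
    layers.Dense(5, activation="tanh"),      # 1 hidden layer, 5 units: 2*5 + 5 = 15 params
    layers.Flatten(),
    layers.Dense(3),                         # 320*5*3 + 3 = 4803 params, 4818 in total
])
complex_ann.compile(optimizer="adam", loss="mse")
complex_ann.summary()
```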

As shown in the tables, the number of parameters increased by only 5 for the complex ANN: the one extra input channel adds one more weight to each of the 5 hidden units. The model is trained with the Keras package (GPU implementation) for 400 epochs, which takes around 2 minutes for 19440 training samples (19440 = 27000 \times 90\% \times 80\%, after holding out 10% for the test set and 20% for validation). The results, averaged over 30 repeated trials, are shown in Table 2.

Table 2. Relative RMSE for simple ANNs (%)

Before we judge the performance of the ANNs and CNNs, let's first compare the number of parameters in both models. Recall that the number of parameters for the complex ANN is 4818, which is not a small number. However, it is just the tip of the iceberg compared to the complex CNN, whose parameter count is shown below:

Well, there are more than 1.7 \times 10^6 parameters in the CNN model, even with max-pooling layers. In contrast, 5\times 10^3 is a small number of parameters. While using less than 0.3% of the parameters, the ANNs lose only a few percentage points of accuracy, which is the point we want to make here:

Choosing the correct model is the first priority in machine learning.

If you are not convinced yet, that is totally understandable. One question could be: the accuracy we are getting with the current ANNs is still worse than the CNN version, which is unacceptable if we do not care about the resource cost of training the model. In other words, what if we simply want to achieve the best accuracy; should we switch back to the CNN?

The answer is: not yet. The ANN we currently have is just the simplest version. Instead of turning back to a CNN, what we can do is make our ANN model more complicated, for example by adding more hidden layers or more units in each layer, such as the deeper ANN shown in Figure 7:

Figure 7. A more complicated ANN with 3 hidden layers and more hidden units in each layer.

And the number of parameters of this new ANN is shown below:
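The figure and the parameter table are not reproduced here; below is a sketch of a comparable 3-hidden-layer ANN. The 64-unit width is a hypothetical choice (the text does not give the exact widths), and the ReLU activation follows the earlier note about deeper networks.

```python
from tensorflow import keras
from tensorflow.keras import layers

deep_ann = keras.Sequential([
    keras.Input(shape=(320, 2)),
    layers.Dense(64, activation="relu"),     # ReLU for the deeper network (note 3 above)
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Flatten(),
    layers.Dense(3),                         # the last fully connected layer dominates the count
])
deep_ann.compile(optimizer="adam", loss="mse")
deep_ann.summary()
```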

Without max-pooling or dropout layers, the number of parameters in the new ANN increases significantly due to the last fully connected layer. Still, the training process costs less time compared to training a deep CNN (no convolution is needed for an ANN). In this case, the network was trained for 100 epochs in 13 minutes on an Nvidia GeForce GTX 970 GPU. Its performance is shown in Table 3.

Table 3. Relative RMSE for deep ANNs (%)

In this single experiment, the ANN trained with phase-reconstructed 1-D data outperforms the CNN trained with 2-D images, despite the comparable numbers for the intensity ANN, which in turn demonstrates the merits of phase retrieval.

Conclusion

Machine learning is a profound and sophisticated discipline that should be carefully considered and examined before implementation. An ANN seems weaker than a CNN at first glance; at least the author had some trouble giving up the CNN until the ANN results came out.

In other words, model selection should be given top priority before taking any action. Only after taking every factor related to a specific problem into account (even considering changing the data generation strategy) should the model be determined.

There might be errors or confusing points in this blog; feel free to correct me if I am wrong, or let me know if you have any questions.

Finally, the code for

  • Generating the Mie scattering data can be found here.
  • Training and evaluating the CNN model can be found here.
  • Training and evaluating the ANN model can be found here.

Thanks for reading 🙂