With the emergence of ChatGPT, big tech companies are now obsessing over the uses of AI and machine learning algorithms. The biggest problem with taking your first steps in machine learning, with no prior ideas beyond watching The Matrix and The Terminator, is that you can't just say "I want to learn AI." Instead, ask yourself "Why do I want to learn AI?"

Many students and graduates will go along the naive route and say “I hear you can be very employable, so I’m learning AI to become more employable.”

Again, this is not the correct approach. AI is about solving problems by giving an algorithm immense amounts of data to process, work that would otherwise be strenuous and counterproductive for a human being to do by hand. So, with regard to employability, your answer should be: "I want to provide tools and skills to a company that will make it more productive."

For example, let's say you are the online manager of a globally recognised clothing website. You sell a huge variety of clothes, shipped from all over the world. You want to organise your online store by a few categories: type (e.g. shoes, t-shirts, jumpers, trousers), gender, colour, and shipping location. You would use an image and category classifier which places each item in its correct location. You wouldn't have a human being spend hours clicking, dragging, and typing through thousands and thousands of items, unless, of course, you're a tyrannical and power-hungry manager.

True power is much faster than this. True power begins to emerge when you hand this entire process over to a computer. If power means the rate at which you use energy to change something, then an algorithm which significantly reduces your processing time, without increasing the energy you exert, raises the power of your system significantly.

If you want to be more employable, or you think it would be "cool" to know how AI works, first consider the problems and inefficiencies around you. What are some of the tedious tasks you do every day? How could you automate them? How can you make your day-to-day easier?

The biggest problem I have, in the area of Raman spectroscopy, is handling a large hoard of spectra and data sets. I am beginning to use AI to classify chemicals in Raman spectroscopy. However, as with any set of data, you have to dig your way through the noise, biases and inconsistencies to understand the characteristics you're dealing with: in my case, the chemical I'm analysing under a laser.

Even before exploring an AI classifier, one first needs to correct the spectrum such that it is more manageable. The presence of baselines and random noise can negatively affect the results of the qualitative analysis of substances. Therefore, the raw Raman spectrum cannot be directly used for identification without a few pre-processing techniques.

*Fig. 1: Raman signals for biological samples, with the cleaned, de-noised and baseline-corrected signal (top) and the original raw spectra (bottom) [1].*

The pre-processing covered in this article consists of three steps:

1. Cosmic ray artefact removal
2. Noise filter smoothing
3. Baseline correction

## Cosmic Ray Artefact Removal

A cosmic ray is a high-energy particle, typically a proton or atomic nucleus, that travels through space at nearly the speed of light. These particles are thought to originate from various sources, such as supernovae, black holes, and other energetic events in the universe.

When cosmic rays collide with particles in Earth's atmosphere, they can produce secondary particles, including muons, electrons, and photons. Such cosmic rays show up as artefacts on the Raman spectrum as single, discrete-wavelength, high-energy spikes. To an algorithm reading the spectrum for peaks, these may be mistaken for a Raman mode.
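As a minimal sketch of one common despiking approach (not a method from this article; the kernel size and threshold below are illustrative choices): flag points that deviate from a median-filtered copy of the spectrum by more than a threshold, and replace them with the local median. A single-pixel spike fails to survive a median filter, while genuine, broader Raman peaks pass through largely unchanged.

```python
import numpy as np
from scipy.signal import medfilt

def remove_cosmic_rays(spectrum, kernel_size=5, threshold=5.0):
    """Replace narrow, high-intensity spikes with the local median.

    A point is flagged as a cosmic ray artefact if it deviates from the
    median-filtered spectrum by more than `threshold` times the standard
    deviation of the residuals.
    """
    smoothed = medfilt(spectrum, kernel_size=kernel_size)
    residual = spectrum - smoothed
    spikes = np.abs(residual) > threshold * residual.std()
    cleaned = spectrum.copy()
    cleaned[spikes] = smoothed[spikes]
    return cleaned

# Synthetic spectrum: one broad Raman-like peak plus one cosmic ray spike
x = np.linspace(0, 100, 501)
spectrum = np.exp(-((x - 50) ** 2) / 20)
spectrum[250] += 10.0  # single-pixel, high-energy spike
cleaned = remove_cosmic_rays(spectrum)
```

The broad peak is left intact because, within any 5-point window, the median tracks a slowly varying signal closely; only the isolated spike produces a large residual.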

*Fig. 2: Example of a Raman spectrum. (1, green) shows a cosmic ray spike, (2, blue) shows instrumental influences on the spectrum, and (3, red) shows a 5th-degree polynomial fit for background correction [2].*

## Noise Filter Smoothing

The Savitzky-Golay smoothing filter is a digital signal processing algorithm used to smooth noisy data, reducing high-frequency noise while preserving the shape of the signal. It works by fitting a window of adjacent data points with a polynomial function, and then estimating the smoothed value of each point as the value of the polynomial at that point. The coefficients of the polynomial are determined using a least-squares regression method. A useful side effect of SG smoothing is that it also attenuates cosmic ray artefacts in your spectrum.

The filter is named after its inventors, Abraham Savitzky and Marcel J. E. Golay, who first published the algorithm in 1964. It is commonly used in various fields other than spectroscopy, including chromatography and image processing.

One caveat of the Savitzky-Golay filter is that, in its standard form, it assumes evenly spaced data points, so unevenly sampled signals should be resampled first. One of its advantages is that it can be used to estimate derivatives of a signal, by differentiating the fitted polynomial.
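The derivative point can be sketched with SciPy's `savgol_filter`, whose `deriv` parameter returns the smoothed derivative and whose `delta` parameter supplies the (even) sample spacing. The signal and parameter values below are illustrative, not taken from the article:

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy sine wave sampled on an evenly spaced grid
x = np.linspace(0, 2 * np.pi, 201)
dx = x[1] - x[0]
rng = np.random.default_rng(0)
y = np.sin(x) + 0.05 * rng.standard_normal(len(x))

# deriv=1 returns the smoothed first derivative;
# delta scales it by the sample spacing
dy = savgol_filter(y, window_length=21, polyorder=3, deriv=1, delta=dx)

# dy should track cos(x), the analytic derivative of sin(x)
```

Naively differencing the noisy samples would amplify the noise; fitting a polynomial per window and differentiating it keeps the derivative estimate smooth.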

*Fig. 3: The profile results from Savitzky–Golay filtering of the test function [3].*

The filter can be implemented using different orders of polynomial functions, and different window sizes (the number of data points used to fit the polynomial). The choice of these parameters depends on the nature of the signal and the desired level of smoothing. Typically, the window size is chosen to be odd, and the order of the polynomial is lower than the window size.
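The effect of the window-size choice can be sketched as follows (a toy comparison with illustrative parameters, not values from the article): a small window follows the data closely and leaves more noise, while a larger window smooths harder.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 201)
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))

# Same polynomial order, two (odd) window sizes
light = savgol_filter(y, window_length=7, polyorder=3)
heavy = savgol_filter(y, window_length=51, polyorder=3)

# Compare each result against the known clean signal
for label, smoothed in [("window=7", light), ("window=51", heavy)]:
    rmse = np.sqrt(np.mean((smoothed - np.sin(x)) ** 2))
    print(f"{label}: RMSE vs clean signal = {rmse:.3f}")
```

For a slowly varying signal like this, the larger window wins; for spectra with sharp Raman peaks, too large a window would flatten and broaden the peaks, which is exactly the trade-off the parameters control.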

In summary, the Savitzky-Golay smoothing filter is a useful tool for smoothing noisy data, which is widely used in various fields of signal processing.

An example of how to use the Savitzky-Golay smoothing filter in Python with a small data set is included in the code section at the end of this article.

## Baseline Correction

Typically the last process to complete is removing the background fluorescence from our spectrum. Background fluorescence is mainly a consequence of impurities or contaminants in the sample. These impurities can absorb light at certain wavelengths, leading to fluorescence emission at different wavelengths, which can interfere with the Raman signal. To minimise this effect, it is important to use high-quality samples and to carefully clean and prepare the sample prior to measurement.

Another potential source of background fluorescence is the sample preparation process itself. For example, if the sample is heated or irradiated during preparation, this can lead to the creation of new fluorescent species, which can contribute to the background signal. To minimize this effect, it is important to use gentle sample preparation methods and to avoid exposing the sample to excessive heat or radiation.

Finally, the instrument itself can contribute to background fluorescence, particularly if the detector or other optical components are not optimized for Raman measurements. To minimize this effect, it is important to use a high-quality Raman instrument with optimized optical components and detector settings.

During my own Final Year Project, the fluorescence level actually played a *positive role* in classifying different types of whisky. Each whisky contained the same level of 40% alcohol but had variations in texture and colour, leading to a unique classification based on the fluorescence pattern. In my project, a baseline correction would *not* be advantageous to the classification.

*Fig. 4: Spectra of Glenfiddich whisky at various ages [4].*

Typical processing techniques involve identifying the endpoints of the Raman signal peaks and using a **piecewise linear approximation** to model the underlying curve as the baseline. For example, if the segment to the next point has a large gradient, the signal may be approaching a peak, and that point would then be close to one of the two endpoints around the peak.

**Piecewise linear approximation**, also known as linear interpolation, is a numerical approximation method used to approximate a function with a piecewise linear function. It involves dividing the domain of the function into smaller intervals and approximating the function with a straight line within each interval.

Furthermore, to generalise a function which closely matches our piecewise function, we fit a polynomial which closely approximates it. Using methods from **Response Surface Methodology**, a `polyfit` function may be applied to determine the coefficients of our polynomial.

Lastly, to remove the baseline, we simply subtract the fitted polynomial's values from the spectrum's intensity across the range of our function (i.e. wavelengths, or Raman shifts).

*Fig. 5: Effect of the processing on instrumental influences. (a) shows a graph without instrumental influences (PBS) and (b) shows a graph with instrumental influences (rabbit eye). The upper line shows a raw Raman spectrum and the lower line represents a processed Raman spectrum [2].*
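The whole baseline-correction step can be sketched end to end. The synthetic spectrum, peak mask, and polynomial degree below are illustrative choices, not values from this article; the key idea is to fit the polynomial only to peak-free regions and then subtract its values from the spectrum:

```python
import numpy as np

# Synthetic spectrum: two Raman-like peaks on a sloping fluorescence baseline
shift = np.linspace(400, 1800, 700)  # Raman shift axis (cm^-1)
baseline_true = 1e-4 * (shift - 400) ** 1.2 + 2.0
peaks = (5 * np.exp(-((shift - 1000) ** 2) / 200)
         + 3 * np.exp(-((shift - 1450) ** 2) / 300))
spectrum = baseline_true + peaks

# Fit a low-degree polynomial to the peak-free regions only
# (here identified crudely by masking out the known peak windows)
mask = (np.abs(shift - 1000) > 60) & (np.abs(shift - 1450) > 60)
coeffs = np.polyfit(shift[mask], spectrum[mask], deg=3)
baseline_fit = np.polyval(coeffs, shift)

# Subtract the fitted baseline values from the spectrum's intensity,
# leaving only the Raman peaks
corrected = spectrum - baseline_fit
```

In practice the peak regions are not known in advance, which is where the endpoint-detection and piecewise-linear steps described above come in; here the mask is hard-coded purely to keep the sketch self-contained.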

## Part 2

Now that we have a smooth spectrum with recognisable peaks, we need to process it through a variety of AI models to classify the chemical it represents.

Head over to Part 2.

**References**

https://www.sciencedirect.com/science/article/pii/S2215016120301023

Fredrik Jonsson, *SGFilter: A stand-alone implementation of the Savitzky–Golay smoothing filter*, December 2011.

Clark Gray BSc, *A compact Raman system for food and liquor inspection*, project report for final year project in Physics & Astronomy, 2022.

Furthermore, if you would like to run the code touched upon in this article yourself, copy the snippets below into your own IDE or Python environment.

### Noise Filter Smoothing

```
import numpy as np
from scipy.signal import savgol_filter
import matplotlib.pyplot as plt
# generate sample data
x = np.linspace(0, 2*np.pi, 21)
y = np.sin(x) + 0.1*np.random.randn(len(x))
# apply Savitzky-Golay filter
y_smooth = savgol_filter(y, window_length=7, polyorder=3)
# plot the original and smoothed data
plt.plot(x, y, 'o', label='original')
plt.plot(x, y_smooth, label='smoothed')
plt.legend()
plt.show()
```

### Baseline Correction

### Piecewise Linear Approximation

```
import numpy as np
import matplotlib.pyplot as plt
# Define the original function to be approximated
def original_function(x):
    return x**2 + 2*x + 1
# Define the interval over which the function will be approximated
x_values = np.linspace(0, 5, num=11) # x values from 0 to 5 with 11 points
# Compute the y values of the original function at the x values
y_values = original_function(x_values)
# Define the points at which the function will be approximated
x_interpolation = np.array([1, 3, 4.5]) # x values at which interpolation is performed
# Perform piecewise linear approximation
y_interpolation = np.interp(x_interpolation, x_values, y_values)
# Plot the original function
plt.plot(x_values, y_values, 'bo-', label='Original Function')
# Plot the piecewise linear approximation
plt.plot(x_interpolation, y_interpolation, 'rx-', label='Piecewise Linear Approximation')
# Add labels and legend
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
# Show the plot
plt.show()
```

### Polyfit

```
import numpy as np
import matplotlib.pyplot as plt
# Define some example data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 7, 8, 9])
# Plot the points
plt.plot(x, y, 'ro', label='Data Points')
# Fit a polynomial of degree 3 to the data
p = np.polyfit(x, y, 3)
f = np.poly1d(p)
# Generate some x values to plot the fitted function
x_fit = np.linspace(x.min(), x.max(), 100)
# Plot the fitted function
plt.plot(x_fit, f(x_fit), 'b-', label='Polyfit Function')
# Add axis labels and a legend
plt.xlabel('X')
plt.ylabel('Y')
plt.legend(loc='best')
print(f)
# Display the plot
plt.show()
```