CS 426 Senior Project in Computer Science, Spring 2024, at UNR, CSE Department
Technology has advanced vastly over the past 20 years, to the point where many devices take advantage of smart voice technology, from phones to dedicated devices that sit in your home and help with different tasks. Many devices today rely on Automatic Speech Recognition (ASR) to identify the keywords and phrases that trigger their functionality. Our project aims to create Adversarial Examples (AEs) by perturbing audio and then playing the AEs concurrently with a voice command given to an ASR device, causing the device to misclassify the command. We first attacked the Wav2Vec2 and Whisper models to determine how effective the AEs are at causing a misclassification. We then tested our AEs against APIs such as Google, Amazon, and DeepSpeech to cause a misclassification of a given command. To deliver the attacks, we used a Raspberry Pi that listens for the trigger phrase and plays the AE at the same time the user gives a command. The ASR device then receives the user input and the AE simultaneously and misclassifies the command.
The FFT is an algorithm that efficiently computes the Discrete Fourier Transform (DFT) of a series. Any signal can be represented as a series of sine functions that sum to the original signal, and the DFT is used to extract this frequency information from audio in any ASR (Automatic Speech Recognition) system. This opens an attack vector: certain frequency ranges that are less perceptible to people can be downplayed or manipulated in ways that cause misclassifications for ASR systems.
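As a minimal sketch of the idea (illustrative only, not the project's actual attack code), the following uses numpy's FFT to recover the frequency content of a test signal and then attenuates a high, less-perceptible band before transforming back; the test tones and 6 kHz cutoff are assumptions chosen for the example:

```python
import numpy as np

# 1 s test signal: a 440 Hz tone plus a quieter 1200 Hz tone,
# sampled at 16 kHz (a common rate for ASR audio).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)

# rfft computes the DFT coefficients for the non-negative
# frequencies of a real-valued signal.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two strongest bins land at the tone frequencies.
top_two = sorted(freqs[np.argsort(np.abs(spectrum))[-2:]])
print(top_two)  # [440.0, 1200.0]

# Attack vector sketch: scale down a band that is harder for
# people to perceive (here, everything above 6 kHz), then
# invert back to the time domain.
spectrum[freqs > 6000] *= 0.1
modified = np.fft.irfft(spectrum, n=len(signal))
```

Because the modification is confined to one frequency band, the rest of the audio is untouched when transformed back.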
This method deconstructs a time series into a small number of component subsets representing various trends and noise. The components can be summed back into the original data, and individual subsets, including the noise, can be isolated or removed from the time series. Lowering the intensity of frequencies less perceptible to the human ear can induce misclassifications in ASR systems while the overall audio remains perceivable to people.
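One simple way to sketch this decomposition (an assumption for illustration; the project may use a different decomposition method) is a moving-average split into a slow trend component and a residual detail/noise component, which sum exactly back to the original series:

```python
import numpy as np

# Synthetic series: a slow oscillation (trend) plus random noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000)
series = np.sin(2 * np.pi * 3 * t) + 0.2 * rng.standard_normal(1000)

# A moving average isolates the slow trend; subtracting it
# leaves the detail/noise component.
window = 25
kernel = np.ones(window) / window
trend = np.convolve(series, kernel, mode="same")
detail = series - trend

# The components sum back to the original data exactly.
assert np.allclose(trend + detail, series)

# The detail component can be scaled down (or dropped) before
# re-summing, lowering the less-perceptible content.
attenuated = trend + 0.5 * detail
```

Isolating a component this way is what lets its intensity be manipulated independently before reconstruction.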
This is a compression algorithm in which low-energy parts of the signal can be discarded while the coefficients that sum to the most significant part of the signal's energy are maintained. Keeping these parts leaves the original audio still perceivable, if noisier, while requiring less overall information. In our case, the intensity of these low-energy parts can be manipulated in order to generate a misclassification in an ASR system while the audio remains perceivable to people.
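A hedged sketch of the energy-compaction step (the FFT basis and the 1% keep ratio are assumptions for the example): rank the transform coefficients by energy, keep only the top fraction, and zero the low-energy rest before reconstructing.

```python
import numpy as np

# Test audio: a 440 Hz tone with a little added noise.
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * 440 * t) + 0.05 * rng.standard_normal(sr)

coeffs = np.fft.rfft(signal)
energy = np.abs(coeffs) ** 2

# Keep only the top 1% of coefficients by energy; discard the rest.
k = len(coeffs) // 100
keep = np.argsort(energy)[-k:]
mask = np.zeros(len(coeffs), dtype=bool)
mask[keep] = True
compressed = np.where(mask, coeffs, 0)

# Fraction of total energy retained by the kept coefficients.
retained = energy[mask].sum() / energy.sum()

# Reconstruct: perceptually similar audio from far fewer coefficients.
approx = np.fft.irfft(compressed, n=len(signal))
```

Because the tone concentrates almost all of the energy in a few bins, `retained` stays close to 1 even though 99% of the coefficients are discarded; it is those discarded low-energy parts whose intensity an attack can manipulate.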
This method of attack combines two different sources of audio on top of each other in order to produce a misclassification by ASR systems.
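At the signal level, the overlay amounts to an element-wise sum of the two sources. A minimal sketch, with placeholder tones standing in for the spoken command and the adversarial audio:

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr

# Placeholders: a low tone stands in for the user's spoken command,
# a quiet high tone for the concurrently played adversarial example.
command = np.sin(2 * np.pi * 300 * t)
perturbation = 0.1 * np.sin(2 * np.pi * 5000 * t)

# Overlay: sum the two sources sample-by-sample, clipping to the
# valid amplitude range before playback or encoding.
mixed = np.clip(command + perturbation, -1.0, 1.0)
```

In the physical attack the summing happens in the air at the device's microphone, but the received signal is modeled the same way.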