CS 426 Senior Project in Computer Science, Spring 2024, at UNR, CSE Department
Technology has advanced vastly over the past 20 years, to the point where many devices take advantage of smart voice technology, from phones to dedicated devices that sit in your home and help with different tasks. Many devices today rely on Automatic Speech Recognition (ASR) to identify the keywords and phrases that trigger their functionality. Our project aims to create Adversarial Examples (AEs) by perturbing audio and then playing the AEs concurrently with a voice command given to an ASR device, causing the device to misclassify the command. We first attacked the Wav2Vec2 and Whisper models to determine how effective the AEs are at causing a misclassification. We then tested our AEs against APIs such as Google, Amazon, and DeepSpeech to cause a misclassification of a given command. To deliver the attacks, we used a Raspberry Pi that listens for the trigger phrase and plays the AE at the same time the user gives a command. The ASR device then receives the user input and the AE simultaneously and misclassifies the command.
The FFT is an algorithm that efficiently computes the Discrete Fourier Transform (DFT) of a series. Any signal can be represented as a series of sine functions that sum to the original signal, and the DFT is used to extract this frequency information from audio in any ASR (Automatic Speech Recognition) system. This opens an attack vector: certain frequency ranges that are less perceptible to people can be downplayed or manipulated in ways that cause misclassifications for ASR systems.
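As a minimal sketch of the idea (illustrative only, not the project's actual attack code), the following uses numpy's FFT to recover the frequency content of a test signal and then attenuates a high, less-perceptible band before transforming back; the test tones and 6 kHz cutoff are assumptions chosen for the example:

```python
import numpy as np

# 1 s test signal: a 440 Hz tone plus a quieter 1200 Hz tone,
# sampled at 16 kHz (a common rate for ASR audio).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)

# rfft computes the DFT coefficients for the non-negative
# frequencies of a real-valued signal.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two strongest bins land at the tone frequencies.
top_two = sorted(freqs[np.argsort(np.abs(spectrum))[-2:]])
print(top_two)  # [440.0, 1200.0]

# Attack vector sketch: scale down a band that is harder for
# people to perceive (here, everything above 6 kHz), then
# invert back to the time domain.
spectrum[freqs > 6000] *= 0.1
modified = np.fft.irfft(spectrum, n=len(signal))
```

Because the modification is confined to one frequency band, the rest of the audio is untouched when transformed back.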
This method deconstructs a time series into a small number of component subsets representing various trends and noise. The components can be summed back into the original data, and individual subsets, including the noise, can be isolated or removed from the time series. Lowering the intensity of frequencies less perceptible to the human ear can induce misclassifications in ASR systems while the overall audio remains perceivable to people.
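One simple way to sketch this decomposition (an assumption for illustration; the project may use a different decomposition method) is a moving-average split into a slow trend component and a residual detail/noise component, which sum exactly back to the original series:

```python
import numpy as np

# Synthetic series: a slow oscillation (trend) plus random noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000)
series = np.sin(2 * np.pi * 3 * t) + 0.2 * rng.standard_normal(1000)

# A moving average isolates the slow trend; subtracting it
# leaves the detail/noise component.
window = 25
kernel = np.ones(window) / window
trend = np.convolve(series, kernel, mode="same")
detail = series - trend

# The components sum back to the original data exactly.
assert np.allclose(trend + detail, series)

# The detail component can be scaled down (or dropped) before
# re-summing, lowering the less-perceptible content.
attenuated = trend + 0.5 * detail
```

Isolating a component this way is what lets its intensity be manipulated independently before reconstruction.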
This is a compression algorithm in which low-energy parts of the signal can be discarded while the coefficients that sum to the most significant part of the signal's energy are maintained. Keeping these parts leaves the original audio still perceivable, if noisier, while requiring less overall information. In our case, the intensity of these low-energy parts can be manipulated in order to generate a misclassification in an ASR system while the audio remains perceivable to people.
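A hedged sketch of the energy-compaction step (the FFT basis and the 1% keep ratio are assumptions for the example): rank the transform coefficients by energy, keep only the top fraction, and zero the low-energy rest before reconstructing.

```python
import numpy as np

# Test audio: a 440 Hz tone with a little added noise.
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * 440 * t) + 0.05 * rng.standard_normal(sr)

coeffs = np.fft.rfft(signal)
energy = np.abs(coeffs) ** 2

# Keep only the top 1% of coefficients by energy; discard the rest.
k = len(coeffs) // 100
keep = np.argsort(energy)[-k:]
mask = np.zeros(len(coeffs), dtype=bool)
mask[keep] = True
compressed = np.where(mask, coeffs, 0)

# Fraction of total energy retained by the kept coefficients.
retained = energy[mask].sum() / energy.sum()

# Reconstruct: perceptually similar audio from far fewer coefficients.
approx = np.fft.irfft(compressed, n=len(signal))
```

Because the tone concentrates almost all of the energy in a few bins, `retained` stays close to 1 even though 99% of the coefficients are discarded; it is those discarded low-energy parts whose intensity an attack can manipulate.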
This method of attack combines two different sources of audio on top of each other in order to produce a misclassification by ASR systems.
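At the signal level, the overlay amounts to an element-wise sum of the two sources. A minimal sketch, with placeholder tones standing in for the spoken command and the adversarial audio:

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr

# Placeholders: a low tone stands in for the user's spoken command,
# a quiet high tone for the concurrently played adversarial example.
command = np.sin(2 * np.pi * 300 * t)
perturbation = 0.1 * np.sin(2 * np.pi * 5000 * t)

# Overlay: sum the two sources sample-by-sample, clipping to the
# valid amplitude range before playback or encoding.
mixed = np.clip(command + perturbation, -1.0, 1.0)
```

In the physical attack the summing happens in the air at the device's microphone, but the received signal is modeled the same way.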