
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text, is a capability that enables a program to process human speech into a written format.

Speech recognition is commonly confused with voice recognition, but the two are distinct: speech recognition translates speech from a verbal format into text, whereas voice recognition seeks only to identify an individual user's voice.

IBM has had a prominent role in speech recognition since its inception, releasing "Shoebox" in 1962. This machine could recognize 16 different words, advancing the initial work from Bell Labs in the 1950s. IBM continued to innovate over the years, launching the VoiceType Simply Speaking application in 1996. This speech recognition software had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.

While speech technology had a limited vocabulary in its early days, it is used across a wide range of industries today, such as automotive, technology, and healthcare. Its adoption has only continued to accelerate in recent years due to advancements in deep learning and big data. Research shows that this market is expected to be worth USD 24.9 billion by 2025.



Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning. They integrate the grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go, evolving their responses with each interaction.

The best systems also allow organizations to customize and adapt the technology to their specific requirements, everything from language and nuances of speech to brand recognition. For example:

  • Language weighting: Improve precision by weighting specific words that are spoken frequently (such as product names or industry jargon), beyond terms already in the base vocabulary.
  • Speaker labeling: Output a transcription that cites or tags each speaker’s contributions to a multi-participant conversation.
  • Acoustics training: Attend to the acoustical side of the business. Train the system to adapt to an acoustic environment (like the ambient noise in a call center) and speaker styles (like voice pitch, volume and pace).
  • Profanity filtering: Use filters to identify certain words or phrases and sanitize speech output (a minimal post-processing sketch follows this list).
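To make the customization ideas above concrete, here is a minimal Python sketch of how profanity filtering and speaker labeling might be applied to a finished transcript. The transcript structure, the placeholder word list, and the helper names are illustrative assumptions, not the output format of any particular speech service.

```python
# Minimal sketch: post-processing a finished transcript with profanity
# filtering and speaker labels. The transcript structure and word list
# are illustrative assumptions, not any vendor's actual API output.
import re

PROFANITY = {"darn", "heck"}  # placeholder word list

def filter_profanity(text: str) -> str:
    """Replace flagged words with asterisks of the same length."""
    def mask(match: re.Match) -> str:
        word = match.group(0)
        return "*" * len(word) if word.lower() in PROFANITY else word
    return re.sub(r"[A-Za-z']+", mask, text)

def format_transcript(segments: list) -> str:
    """Render diarized segments as 'Speaker N: text' lines."""
    return "\n".join(
        f"Speaker {seg['speaker']}: {filter_profanity(seg['text'])}"
        for seg in segments
    )

segments = [
    {"speaker": 1, "text": "Where the heck is my order?"},
    {"speaker": 2, "text": "Let me check that for you."},
]
print(format_transcript(segments))
```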

Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads in several areas to improve human and machine interaction.

The vagaries of human speech have made development challenging. It’s considered to be one of the most complex areas of computer science – involving linguistics, mathematics and statistics. Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.
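As a rough illustration of the front end of such a pipeline, the sketch below loads an audio file and extracts MFCC feature vectors, the kind of representation a decoder would consume. It assumes the librosa library is installed and uses a placeholder file name.

```python
# Sketch of the recognizer front end: load audio, extract MFCC feature
# vectors that a decoder would consume. Assumes librosa is installed;
# "utterance.wav" is a placeholder path.
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000)       # mono, 16 kHz
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)     # 25 ms windows, 10 ms hop
print(mfcc.shape)  # (13, number_of_frames): one 13-dim feature vector per frame
```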

Speech recognition technology is evaluated on its accuracy rate, i.e. word error rate (WER), and speed. A number of factors can impact word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity, meaning an error rate on par with that of two humans speaking, has long been the goal of speech recognition systems. Research from Lippmann estimates the human word error rate to be around 4 percent, but it has been difficult to replicate the results from this paper.

Various algorithms and computation techniques are used to convert speech into text and improve the accuracy of transcription. Below are brief explanations of some of the most commonly used methods:

  • Natural language processing (NLP): While NLP isn't necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence that focuses on the interaction between humans and machines through language, including speech and text. Many mobile devices incorporate speech recognition into their systems to conduct voice search (for example, Siri) or to provide more accessibility around texting.
  • Hidden Markov models (HMM): Hidden Markov models build on the Markov chain model, which stipulates that the probability of a given state hinges on the current state, not its prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden Markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are utilized as sequence models within speech recognition, assigning labels to each unit (words, syllables, sentences, and so on) in the sequence. These labels create a mapping with the provided input, allowing the model to determine the most appropriate label sequence.
  • N-grams: This is the simplest type of language model (LM), which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words. For example, “order the pizza” is a trigram or 3-gram and “please order the pizza” is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition accuracy (see the bigram sketch after this list).
  • Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent.  While neural networks tend to be more accurate and can accept more data, this comes at a performance efficiency cost as they tend to be slower to train compared to traditional language models.
  • Speaker Diarization (SD): Speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied in call centers to distinguish customers from sales agents.
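The bigram sketch below shows, under simplifying assumptions (a tiny made-up corpus and no smoothing), how N-gram probabilities are estimated from raw counts.

```python
# Toy bigram language model illustrating how N-gram probabilities are
# estimated from counts; the tiny corpus is made up for illustration.
from collections import Counter

corpus = [
    "please order the pizza",
    "order the pizza now",
    "please order the salad",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(w1: str, w2: str) -> float:
    """P(w2 | w1) by maximum likelihood (no smoothing)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("order", "the"))   # 1.0: "the" always follows "order" here
print(bigram_prob("the", "pizza"))   # about 0.67
```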

A wide number of industries are utilizing different applications of speech technology today, helping businesses and consumers save time and even lives. Some examples include:

Automotive: Speech recognition improves driver safety by enabling voice-activated navigation systems and search capabilities in car radios.

Technology: Virtual agents are increasingly becoming integrated within our daily lives, particularly on our mobile devices. We use voice commands to access them through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks, such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.

Healthcare: Doctors and nurses leverage dictation applications to capture and log patient diagnoses and treatment notes.

Sales: Speech recognition technology has a couple of applications in sales. It can help a call center transcribe thousands of phone calls between customers and agents to identify common call patterns and issues. AI chatbots can also talk to people via a webpage, answering common queries and solving basic requests without needing to wait for a contact center agent to be available. In both instances, speech recognition systems help reduce time to resolution for consumer issues.

Security: As technology integrates into our daily lives, security protocols are an increasing priority. Voice-based authentication adds a viable level of security.



What is Speech Recognition?


Speech recognition, or speech-to-text recognition, is the capacity of a machine or program to recognize spoken words and transform them into text. Speech recognition is an important feature in several applications, such as home automation and artificial intelligence. In this article, we discuss every aspect of speech recognition.

What is speech recognition in a Computer?

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, focuses on enabling computers to understand and interpret human speech. Speech recognition involves converting spoken language into text or executing commands based on the recognized words. This technology relies on sophisticated algorithms and machine learning models to process and understand human speech in real time, despite variations in accents, pitch, speed, and slang.

Key Features of Speech Recognition

  • Accuracy and Speed: Speech recognition systems can process speech in real time or near real time, providing quick responses to user inputs.
  • Natural Language Understanding (NLU): NLU enables systems to handle complex commands and queries, making technology more intuitive and user-friendly.
  • Multi-Language Support: Support for multiple languages and dialects allows users from different linguistic backgrounds to interact with technology in their native language.
  • Background Noise Handling: The ability to handle background noise is crucial for voice-activated systems used in public or outdoor settings.

Speech Recognition Algorithms

Speech recognition technology relies on complex algorithms to translate spoken language into text or commands that computers can understand and act upon. Here are the algorithms and approaches used in speech recognition:

1. Hidden Markov Models (HMM)

Hidden Markov Models have been the backbone of speech recognition for many years. They model speech as a sequence of states, with each state representing a phoneme (a basic unit of sound) or a group of phonemes. HMMs are used to estimate the probability of a given sequence of sounds, making it possible to determine the most likely words spoken. Usage: Although newer methods have surpassed HMMs in performance, they remain a fundamental concept in speech recognition, often used in combination with other techniques.
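To make the idea of decoding with an HMM concrete, here is a toy Viterbi decoder in Python. The states, transition probabilities, and per-frame likelihoods are all invented for illustration; a real recognizer would estimate them from data.

```python
# Toy Viterbi decoder for an HMM: given per-frame observation likelihoods
# and transition probabilities, recover the most likely state (phoneme)
# sequence. All probabilities are invented for illustration.
import numpy as np

states = ["sil", "k", "ae", "t"]                  # silence + phonemes of "cat"
trans = np.array([[0.6, 0.4, 0.0, 0.0],           # transition probabilities
                  [0.0, 0.5, 0.5, 0.0],
                  [0.0, 0.0, 0.5, 0.5],
                  [0.3, 0.0, 0.0, 0.7]])
obs = np.array([[0.7, 0.1, 0.1, 0.1],             # per-frame state likelihoods
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.1, 0.1, 0.7]])
start = np.array([0.9, 0.0333, 0.0333, 0.0334])   # initial state distribution

log_trans, log_obs = np.log(trans + 1e-12), np.log(obs)
n_frames, n_states = log_obs.shape
score = np.zeros((n_frames, n_states))
back = np.zeros((n_frames, n_states), dtype=int)
score[0] = np.log(start) + log_obs[0]
for t in range(1, n_frames):
    for j in range(n_states):
        cand = score[t - 1] + log_trans[:, j]
        back[t, j] = int(np.argmax(cand))
        score[t, j] = cand[back[t, j]] + log_obs[t, j]

# backtrack from the best final state
path = [int(np.argmax(score[-1]))]
for t in range(n_frames - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print([states[s] for s in reversed(path)])        # ['sil', 'k', 'ae', 't']
```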

2. Natural language processing (NLP)

NLP is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both speech and text. Many mobile devices incorporate speech recognition to conduct voice search (for example, Siri) or to provide more accessibility around texting.

3. Deep Neural Networks (DNN)

DNNs have substantially improved the accuracy of speech recognition. These networks can learn hierarchical representations of data, making them particularly effective at modeling complex patterns like those found in human speech. DNNs are used both for acoustic modeling, to better understand the sound of speech, and for language modeling, to predict the likelihood of certain word sequences.
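The following sketch shows what a simple DNN acoustic model might look like in PyTorch: a small feed-forward network that maps each acoustic feature vector to phoneme posteriors. The layer sizes, feature dimension, and phoneme inventory are assumptions chosen only for illustration.

```python
# Sketch of a DNN acoustic model: a small feed-forward network that maps
# each acoustic feature vector (e.g. MFCCs plus context) to phoneme
# posteriors. Sizes and data are placeholders, not a production model.
import torch
import torch.nn as nn

N_FEATURES, N_PHONEMES = 39, 40   # assumed feature and phoneme inventory sizes

acoustic_model = nn.Sequential(
    nn.Linear(N_FEATURES, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_PHONEMES),   # one score per phoneme class
)

frames = torch.randn(100, N_FEATURES)             # 100 random "frames"
log_posteriors = acoustic_model(frames).log_softmax(dim=-1)
print(log_posteriors.shape)                       # torch.Size([100, 40])
```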

4. End-to-End Deep Learning

Now, the trend has shifted towards end-to-end deep learning models, which can directly map speech inputs to text outputs without the need for intermediate phonetic representations. These models, often based on advanced RNNs, Transformers, or attention mechanisms, can learn more complex patterns and dependencies in the speech signal.

What is Automatic Speech Recognition?

Automatic Speech Recognition (ASR) is a technology that enables computers to understand and transcribe spoken language into text. It works by analyzing audio input, such as spoken words, and converting them into written text, typically in real time. ASR systems use algorithms and machine learning techniques to recognize and interpret speech patterns, phonemes, and language models to accurately transcribe spoken words. This technology is widely used in various applications, including virtual assistants, voice-controlled devices, dictation software, customer service automation, and language translation services.

What is Dragon speech recognition software?

Dragon speech recognition software is a program developed by Nuance Communications that allows users to dictate text and control their computer using voice commands. It transcribes spoken words into written text in real time, enabling hands-free operation of computers and devices. Dragon software is widely used for various purposes, including dictating documents, composing emails, navigating the web, and controlling applications. It also features advanced capabilities such as voice commands for editing and formatting text, as well as custom vocabulary and voice profiles for improved accuracy and personalization.

What is a normal speech recognition threshold?

The normal speech recognition threshold refers to the level of sound, typically measured in decibels (dB), at which a person can accurately recognize speech. In quiet environments, this threshold is typically around 0 to 10 dB for individuals with normal hearing. However, in noisy environments or for individuals with hearing impairments, the threshold may be higher, meaning they require a louder volume to accurately recognize speech.

Speech Recognition Use Cases

  • Virtual Assistants: These are like digital helpers that understand what you say. They can do things like set reminders, search the internet, and control smart home devices, all without you having to touch anything. Examples include Siri, Alexa, and Google Assistant.
  • Accessibility Tools: Speech recognition makes technology easier to use for people with disabilities. Features like voice control on phones and computers help them interact with devices more easily. There are also special apps for people with disabilities.
  • Automotive Systems: In cars, you can use your voice to control things like navigation and music. This helps drivers stay focused and safe on the road. Examples include voice-activated navigation systems in cars.
  • Healthcare: Doctors use speech recognition to quickly write down notes about patients, so they have more time to spend with them. There are also voice-controlled bots that help with patient care. For example, doctors use dictation tools to write down patient information quickly.
  • Customer Service: Speech recognition is used to direct customer calls to the right place or provide automated help. This makes things run smoother and keeps customers happy. Examples include call centers that you can talk to and customer service bots.
  • Education and E-Learning: Speech recognition helps people learn languages by giving them feedback on their pronunciation. It also transcribes lectures, making them easier to understand. Examples include language learning apps and lecture transcribing services.
  • Security and Authentication: Voice recognition, combined with biometrics, keeps things secure by making sure it's really you accessing your stuff. This is used in banking and for secure facilities. For example, some banks use your voice to make sure it's really you logging in.
  • Entertainment and Media: Voice recognition helps you find stuff to watch or listen to by just talking. This makes it easier to use things like TV and music services. There are also games you can play using just your voice.

Speech recognition is a powerful technology that lets computers understand and process human speech. It’s used everywhere, from asking your smartphone for directions to controlling your smart home devices with just your voice. This tech makes life easier by helping with tasks without needing to type or press buttons, making gadgets like virtual assistants more helpful. It’s also super important for making tech accessible to everyone, including those who might have a hard time using keyboards or screens. As we keep finding new ways to use speech recognition, it’s becoming a big part of our daily tech life, showing just how much we can do when we talk to our devices.

What is Speech Recognition? - FAQs

What are examples of speech recognition?

Note Taking/Writing: An example of speech recognition technology in use is speech-to-text platforms such as Speechmatics or Google’s speech-to-text engine. In addition, many voice assistants offer speech-to-text translation.

Is speech recognition secure?

Security concerns related to speech recognition primarily involve the privacy and protection of audio data collected and processed by speech recognition systems. Ensuring secure data transmission, storage, and processing is essential to address these concerns.

What is speech recognition in AI?

Speech recognition is the process of converting sound signals to text transcriptions. The first steps in converting a sound wave to a text transcription in a speech recognition system are recording (audio is captured using a voice recorder) and sampling (the continuous audio wave is converted to discrete values).

How accurate is speech recognition technology?

The accuracy of speech recognition technology can vary depending on factors such as the quality of audio input, language complexity, and the specific application or system being used. Advances in machine learning and deep learning have improved accuracy significantly in recent years.


Speech Recognition: Everything You Need to Know in 2024


Speech recognition, also known as automatic speech recognition (ASR), enables seamless communication between humans and machines. This technology empowers organizations to transform human speech into written text. Speech recognition technology can revolutionize many business applications, including customer service, healthcare, finance, and sales.

In this comprehensive guide, we will explain speech recognition, exploring how it works, the algorithms involved, and the use cases of various industries.

If you require training data for your speech recognition system, here is a guide to finding the right speech data collection services.

What is speech recognition?

Speech recognition, also known as automatic speech recognition (ASR), speech-to-text (STT), and computer speech recognition, is a technology that enables a computer to recognize and convert spoken language into text.

Speech recognition technology uses AI and machine learning models to accurately identify and transcribe different accents, dialects, and speech patterns.

What are the features of speech recognition systems?

Speech recognition systems have several components that work together to understand and process human speech. Key features of effective speech recognition are:

  • Audio preprocessing: After you have obtained the raw audio signal from an input device, you need to preprocess it to improve the quality of the speech input. The main goal of audio preprocessing is to capture relevant speech data by removing unwanted artifacts and reducing noise (a minimal preprocessing sketch follows this list).
  • Feature extraction: This stage converts the preprocessed audio signal into a more informative representation. This makes raw audio data more manageable for machine learning models in speech recognition systems.
  • Language model weighting: Language weighting gives more weight to certain words and phrases, such as product references, in audio and voice signals. This makes those keywords more likely to be recognized by speech recognition systems in subsequent speech.
  • Acoustic modeling: It enables speech recognizers to capture and distinguish phonetic units within a speech signal. Acoustic models are trained on large datasets containing speech samples from a diverse set of speakers with different accents, speaking styles, and backgrounds.
  • Speaker labeling: It enables speech recognition applications to determine the identities of multiple speakers in an audio recording by assigning a unique label to each speaker, allowing the identification of who was speaking at any given time.
  • Profanity filtering: The process of removing offensive, inappropriate, or explicit words or phrases from audio data.
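As promised above, here is a minimal preprocessing sketch in Python: it peak-normalizes a signal, applies pre-emphasis, and slices it into overlapping windows ready for feature extraction. The synthetic signal and the 25 ms / 10 ms framing values are illustrative choices, not requirements.

```python
# Minimal audio preprocessing sketch: normalize, apply pre-emphasis, and
# slice the signal into overlapping frames ready for feature extraction.
# The synthetic signal stands in for real microphone input.
import numpy as np

sr = 16000
signal = np.random.randn(sr)                     # 1 s of fake audio

signal = signal / np.max(np.abs(signal))         # peak-normalize
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis

frame_len, hop = int(0.025 * sr), int(0.010 * sr)   # 25 ms windows, 10 ms hop
n_frames = 1 + (len(emphasized) - frame_len) // hop
frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
frames *= np.hamming(frame_len)                  # taper each frame
print(frames.shape)                              # (n_frames, 400)
```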

What are the different speech recognition algorithms?

Speech recognition uses various algorithms and computation techniques to convert spoken language into written language. The following are some of the most commonly used speech recognition methods:

  • Hidden Markov Models (HMMs): Hidden Markov model is a statistical Markov model commonly used in traditional speech recognition systems. HMMs capture the relationship between the acoustic features and model the temporal dynamics of speech signals.
  • Language models and natural language processing: Working alongside the acoustic model, these components estimate the probability of word sequences in the recognized text, convert colloquial expressions and abbreviations in spoken language into a standard written form, and map phonetic units obtained from acoustic models to their corresponding words in the target language.
  • Speaker Diarization (SD): Speaker diarization, or speaker labeling, is the process of identifying and attributing speech segments to their respective speakers (Figure 1). It allows for speaker-specific voice recognition and the identification of individuals in a conversation.

Figure 1: A flowchart illustrating the speaker diarization process, in which multiple speakers in an audio recording are segmented and identified.

  • Dynamic Time Warping (DTW): Speech recognition algorithms use the Dynamic Time Warping (DTW) algorithm to find an optimal alignment between two sequences that may vary in speed or length (Figure 2; a toy implementation follows the figure).

Figure 2: A speech recognizer using dynamic time warping to determine the optimal distance between the elements of two sequences.
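The toy implementation below computes a DTW alignment cost between two one-dimensional sequences; real systems align multi-dimensional feature vectors, but the recurrence is the same.

```python
# Toy dynamic time warping: compute the optimal alignment cost between two
# feature sequences of different lengths. Sequences here are 1-D for brevity.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

template = np.array([1.0, 2.0, 3.0, 2.0, 1.0])              # stored reference
utterance = np.array([1.0, 1.5, 2.0, 3.0, 3.0, 2.0, 1.0])   # slower rendition
print(dtw_distance(template, utterance))                    # small cost = good match
```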

  • Deep neural networks (DNN): Neural networks process and transform input data by simulating the non-linear frequency perception of the human auditory system.

  • Connectionist Temporal Classification (CTC): It is a training objective introduced by Alex Graves in 2006. CTC is especially useful for sequence labeling tasks and end-to-end speech recognition systems. It allows the neural network to discover the relationship between input frames and align input frames with output labels.
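As a rough sketch of how CTC is used in practice, the snippet below computes a CTC loss in PyTorch for randomly generated per-frame label probabilities. The sequence lengths, label inventory, and data are placeholders; only the tensor shapes and the call to nn.CTCLoss reflect the usual usage pattern.

```python
# Sketch of CTC training in PyTorch: the network emits per-frame label
# probabilities (including a "blank"), and CTC loss aligns them with the
# shorter target transcript. Shapes and data are placeholders.
import torch
import torch.nn as nn

T, N, C = 50, 1, 28            # frames, batch size, labels (27 chars + blank=0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1).requires_grad_()
targets = torch.randint(1, C, (N, 10))           # a 10-label target transcript
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                # gradients flow back toward the acoustic network
print(float(loss))
```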

Speech recognition vs voice recognition

Speech recognition is commonly confused with voice recognition, yet they refer to distinct concepts. Speech recognition converts spoken words into written text, focusing on identifying the words and sentences spoken by a user, regardless of the speaker’s identity.

On the other hand, voice recognition is concerned with recognizing or verifying a speaker’s voice, aiming to determine the identity of an unknown speaker rather than focusing on understanding the content of the speech.

What are the challenges of speech recognition with solutions?

While speech recognition technology offers many benefits, it still faces a number of challenges that need to be addressed. Some of the main limitations of speech recognition include:

Acoustic Challenges:

  • Assume a speech recognition model has been primarily trained on American English accents. If a speaker with a strong Scottish accent uses the system, they may encounter difficulties due to pronunciation differences. For example, the word “water” is pronounced differently in both accents. If the system is not familiar with this pronunciation, it may struggle to recognize the word “water.”

Solution: Addressing these challenges is crucial to enhancing  speech recognition applications’ accuracy. To overcome pronunciation variations, it is essential to expand the training data to include samples from speakers with diverse accents. This approach helps the system recognize and understand a broader range of speech patterns.

  • For instance, you can use data augmentation techniques to reduce the impact of noise on audio data. Data augmentation helps train speech recognition models with noisy data to improve model accuracy in real-world environments (a minimal augmentation sketch follows Figure 3).

Figure 3: Examples of a target sentence (“The clown had a funny face”) in the background noise of babble, car, and rain. Background noise makes it difficult for speech recognition software to distinguish speech from noise.
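Here is the augmentation sketch referred to above: it mixes background noise into a clean signal at a chosen signal-to-noise ratio. Both signals are synthetic stand-ins for real speech and noise recordings.

```python
# Simple data augmentation sketch: mix background noise into clean speech
# at a chosen signal-to-noise ratio to make training data noise-robust.
# Both signals are synthetic placeholders.
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # scale the noise so the mixture has the requested SNR
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

clean = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))   # fake "speech"
babble = np.random.randn(16000)                              # fake noise
noisy = add_noise(clean, babble, snr_db=10)
```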

Linguistic Challenges:

  • Out-of-vocabulary (OOV) words: Since the speech recognition model has not been trained on OOV words, it may incorrectly recognize them as different words or fail to transcribe them when it encounters them.

Figure 4: An example of detecting an OOV word.

Solution: Word Error Rate (WER) is a common metric used to measure the accuracy of a speech recognition or machine translation system. It can be computed as WER = (S + D + I) / N, where S is the number of substituted words, D the number of deleted words, I the number of inserted words, and N the number of words in the reference transcript.

Figure 5: Calculating the word error rate (WER), a metric used to evaluate the performance and accuracy of speech recognition systems.
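A simple way to compute WER in code is an edit-distance calculation over words, as in the sketch below (the example sentence reuses the phrase from Figure 3).

```python
# Sketch of WER computation via edit distance between reference and
# hypothesis word sequences: (substitutions + deletions + insertions) / N.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the clown had a funny face", "the crown had funny face"))  # 2/6 = 0.33
```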

  • Homophones: Homophones are words that are pronounced identically but have different meanings, such as “to,” “too,” and “two”. Solution: Semantic analysis allows speech recognition programs to select the appropriate homophone based on its intended meaning in a given context. Addressing homophones improves the ability of the speech recognition process to understand and transcribe spoken words accurately.

Technical/System Challenges:

  • Data privacy and security: Speech recognition systems involve processing and storing sensitive and personal information, such as financial information. An unauthorized party could use the captured information, leading to privacy breaches.

Solution: You can encrypt sensitive and personal audio information transmitted between the user’s device and the speech recognition software. Another technique for addressing data privacy and security in speech recognition systems is data masking. Data masking algorithms mask and replace sensitive speech data with structurally identical but acoustically different data.

Figure 6: An example of how data masking works: sensitive or confidential audio information in speech recognition applications is replaced or encrypted.

  • Limited training data: Limited training data directly impacts  the performance of speech recognition software. With insufficient training data, the speech recognition model may struggle to generalize different accents or recognize less common words.

Solution: To improve the quality and quantity of training data, you can expand the existing dataset using data augmentation and synthetic data generation technologies.

13 speech recognition use cases and applications

In this section, we will explain how speech recognition revolutionizes the communication landscape across industries and changes the way businesses interact with machines.

Customer Service and Support

  • Interactive Voice Response (IVR) systems: Interactive voice response (IVR) is a technology that automates the process of routing callers to the appropriate department. It understands customer queries and routes calls to the relevant departments. This reduces the call volume for contact centers and minimizes wait times. IVR systems address simple customer questions without human intervention by employing pre-recorded messages or text-to-speech technology . Automatic Speech Recognition (ASR) allows IVR systems to comprehend and respond to customer inquiries and complaints in real time.
  • Customer support automation and chatbots: According to a survey, 78% of consumers interacted with a chatbot in 2022, but 80% of respondents said using chatbots increased their frustration level.
  • Sentiment analysis and call monitoring: Speech recognition technology converts spoken content from a call into text. After  speech-to-text processing, natural language processing (NLP) techniques analyze the text and assign a sentiment score to the conversation, such as positive, negative, or neutral. By integrating speech recognition with sentiment analysis, organizations can address issues early on and gain valuable insights into customer preferences.
  • Multilingual support: Speech recognition software can be trained in various languages to recognize and transcribe the language spoken by a user accurately. By integrating speech recognition technology into chatbots and Interactive Voice Response (IVR) systems, organizations can overcome language barriers and reach a global audience (Figure 7). Multilingual chatbots and IVR automatically detect the language spoken by a user and switch to the appropriate language model.

Figure 7: Showing how a multilingual chatbot recognizes words in another language


  • Customer authentication with voice biometrics: Voice biometrics use speech recognition technologies to analyze a speaker’s voice and extract features such as accent and speed to verify their identity.

Sales and Marketing:

  • Virtual sales assistants: Virtual sales assistants are AI-powered chatbots that assist customers with purchasing and communicate with them through voice interactions. Speech recognition allows virtual sales assistants to understand the intent behind spoken language and tailor their responses based on customer preferences.
  • Transcription services : Speech recognition software records audio from sales calls and meetings and then converts the spoken words into written text using speech-to-text algorithms.

Automotive:

  • Voice-activated controls: Voice-activated controls allow users to interact with devices and applications using voice commands. Drivers can operate features like climate control, phone calls, or navigation systems.
  • Voice-assisted navigation: Voice-assisted navigation provides real-time voice-guided directions by utilizing the driver’s voice input for the destination. Drivers can request real-time traffic updates or search for nearby points of interest using voice commands without physical controls.

Healthcare:

  • Medical dictation and transcription: A typical workflow involves recording the physician’s dictation, transcribing the audio recording into written text using speech recognition technology, editing the transcribed text for accuracy and correcting errors as needed, and formatting the document in accordance with legal and medical requirements.
  • Virtual medical assistants: Virtual medical assistants (VMAs) use speech recognition, natural language processing, and machine learning algorithms to communicate with patients through voice or text. Speech recognition software allows VMAs to respond to voice commands, retrieve information from electronic health records (EHRs) and automate the medical transcription process.
  • Electronic Health Records (EHR) integration: Healthcare professionals can use voice commands to navigate the EHR system , access patient data, and enter data into specific fields.

Technology:

  • Virtual agents: Virtual agents utilize natural language processing (NLP) and speech recognition technologies to understand spoken language and convert it into text. Speech recognition enables virtual agents to process spoken language in real-time and respond promptly and accurately to user voice commands.

Further reading

  • Top 5 Speech Recognition Data Collection Methods in 2023
  • Top 11 Speech Recognition Applications in 2023




Speech Recognition

Speech recognition is the capability of an electronic device to understand spoken words. A microphone records a person's voice and the hardware converts the signal from analog sound waves to digital audio. The audio data is then processed by software, which interprets the sound as individual words.

A common type of speech recognition is "speech-to-text" or "dictation" software, such as Dragon Naturally Speaking, which outputs text as you speak. While you can buy speech recognition programs, modern versions of the Macintosh and Windows operating systems include a built-in dictation feature. This capability allows you to record text as well as perform basic system commands.

In Windows, some programs support speech recognition automatically while others do not. You can enable speech recognition for all applications by selecting All Programs → Accessories → Ease of Access → Windows Speech Recognition and clicking "Enable dictation everywhere." In OS X, you can enable dictation in the "Dictation & Speech" system preference pane. Simply check the "On" button next to Dictation to turn on the speech-to-text capability. To start dictating in a supported program, select Edit → Start Dictation . You can also view and edit spoken commands in OS X by opening the "Accessibility" system preference pane and selecting "Speakable Items."

Another type of speech recognition is interactive speech, which is common on mobile devices, such as smartphones and tablets. Both iOS and Android devices allow you to speak to your phone and receive a verbal response. The iOS version is called "Siri," and serves as a personal assistant. You can ask Siri to save a reminder on your phone, tell you the weather forecast, give you directions, or answer many other questions. This type of speech recognition is considered a natural user interface (or NUI), since it responds naturally to your spoken input.

While many speech recognition systems only support English, some speech recognition software supports multiple languages. This requires a unique dictionary for each language and extra algorithms to understand and process different accents. Some dictation systems, such as Dragon Naturally Speaking, can be trained to understand your voice and will adapt over time to understand you more accurately.


How Does Speech Recognition Work? (9 Simple Questions Answered)

  • by Team Experts
  • July 2, 2023 (updated July 3, 2023)

Discover the Surprising Science Behind Speech Recognition – Learn How It Works in 9 Simple Questions!

Speech recognition is the process of converting spoken words into written or machine-readable text. It is achieved through a combination of natural language processing, audio inputs, machine learning, and voice recognition. Speech recognition systems analyze speech patterns to identify phonemes, the basic units of sound in a language. Acoustic modeling is used to match the phonemes to words, and word prediction algorithms are used to determine the most likely words based on context analysis. Finally, the words are converted into text.

What is Natural Language Processing and How Does it Relate to Speech Recognition?

Natural language processing (NLP) is a branch of artificial intelligence that deals with the analysis and understanding of human language. It is used to enable machines to interpret and process natural language, such as speech, text, and other forms of communication. NLP is used in a variety of applications, including automated speech recognition, voice recognition technology, language models, text analysis, text-to-speech synthesis, natural language understanding, natural language generation, semantic analysis, syntactic analysis, pragmatic analysis, sentiment analysis, and speech-to-text conversion. NLP is closely related to speech recognition, as it is used to interpret and understand spoken language in order to convert it into text.

How do audio inputs enable speech recognition?

Audio inputs enable speech recognition by providing digital audio recordings of spoken words. These recordings are then analyzed to extract acoustic features of speech, such as pitch, frequency, and amplitude. Feature extraction techniques, such as spectral analysis of sound waves, are used to identify and classify phonemes. Natural language processing (NLP) and machine learning models are then used to interpret the audio recordings and recognize speech. Neural networks and deep learning architectures are used to further improve the accuracy of voice recognition. Finally, Automatic Speech Recognition (ASR) systems are used to convert the speech into text, and noise reduction techniques and voice biometrics are used to improve accuracy.

What role does machine learning play in speech recognition?

Machine learning plays a key role in speech recognition, as it is used to develop algorithms that can interpret and understand spoken language. Natural language processing, pattern recognition techniques, artificial intelligence, neural networks, acoustic modeling, language models, statistical methods, feature extraction, hidden Markov models (HMMs), deep learning architectures, voice recognition systems, speech synthesis, and automatic speech recognition (ASR) are all used to create machine learning models that can accurately interpret and understand spoken language. Natural language understanding is also used to further refine the accuracy of the machine learning models.

How does voice recognition work?

Voice recognition works by using machine learning algorithms to analyze the acoustic properties of a person’s voice. This includes using voice recognition software to identify phonemes, speaker identification, text normalization, language models, noise cancellation techniques, prosody analysis, contextual understanding, artificial neural networks, voice biometrics, speech synthesis, and deep learning. The data collected is then used to create a voice profile that can be used to identify the speaker.

What are the different types of speech patterns used for speech recognition?

The different types of speech patterns used for speech recognition include prosody, contextual speech recognition, speaker adaptation, language models, hidden Markov models (HMMs), neural networks, Gaussian mixture models (GMMs), discrete wavelet transform (DWT), Mel-frequency cepstral coefficients (MFCCs), vector quantization (VQ), dynamic time warping (DTW), continuous density hidden Markov models (CDHMM), support vector machines (SVM), and deep learning.

How is acoustic modeling used for accurate phoneme detection in speech recognition systems?

Acoustic modeling is used for accurate phoneme detection in speech recognition systems by utilizing statistical models such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). Feature extraction techniques such as Mel-frequency cepstral coefficients (MFCCs) are used to extract relevant features from the audio signal. Context-dependent models are also used to improve accuracy. Training techniques such as maximum likelihood estimation and the Viterbi algorithm are used to train the models. In recent years, neural networks and deep learning algorithms have been used to improve accuracy, as well as natural language processing techniques.
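As a small illustration of GMM-based acoustic scoring, the sketch below fits one Gaussian mixture per phoneme on randomly generated feature frames and scores a new frame against each model. It assumes scikit-learn is available; the data and the two-phoneme inventory are invented for illustration.

```python
# Sketch of GMM-based acoustic scoring: fit one Gaussian mixture per phoneme
# on that phoneme's feature frames, then score a new frame against each model.
# The random "MFCC" data is a stand-in for real labeled speech frames.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
training_frames = {                      # phoneme -> (n_frames, 13) features
    "ae": rng.normal(0.0, 1.0, (200, 13)),
    "iy": rng.normal(2.0, 1.0, (200, 13)),
}

models = {
    phone: GaussianMixture(n_components=4, covariance_type="diag").fit(frames)
    for phone, frames in training_frames.items()
}

new_frame = rng.normal(2.0, 1.0, (1, 13))     # should look most like "iy"
scores = {phone: float(m.score(new_frame)) for phone, m in models.items()}
print(max(scores, key=scores.get))            # expected: "iy"
```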

What is word prediction and why is it important for effective speech recognition technology?

Word prediction is a feature of natural language processing and artificial intelligence that uses machine learning algorithms to predict the next word or phrase a user is likely to type or say. It is used in automated speech recognition systems to improve the accuracy of the system by reducing the amount of user effort and time spent typing or speaking words. Word prediction also enhances the user experience by providing faster response times and increased efficiency in data entry tasks. Additionally, it reduces errors due to incorrect spelling or grammar, and improves the understanding of natural language by machines. By using word prediction, speech recognition technology can be more effective, providing improved accuracy and an enhanced ability for machines to interpret human speech.

How can context analysis improve accuracy of automatic speech recognition systems?

Context analysis can improve the accuracy of automatic speech recognition systems by utilizing language models, acoustic models, statistical methods, and machine learning algorithms to analyze the semantic, syntactic, and pragmatic aspects of speech. This analysis can include word-level, sentence-level, and discourse-level context, as well as utterance understanding and ambiguity resolution. By taking into account the context of the speech, the accuracy of the automatic speech recognition system can be improved.

Common mistakes and misconceptions

  • Misconception: Speech recognition requires a person to speak in a robotic, monotone voice. Correct viewpoint: Speech recognition technology is designed to recognize natural speech patterns and does not require users to speak in any particular way.
  • Misconception: Speech recognition can understand all languages equally well. Correct viewpoint: Different speech recognition systems are designed for different languages and dialects, so the accuracy of the system will vary depending on which language it is programmed for.
  • Misconception: Speech recognition only works with pre-programmed commands or phrases. Correct viewpoint: Modern speech recognition systems are capable of understanding conversational language as well as specific commands or phrases that have been programmed into them by developers.


Automatic Speech Recognition


What is Automatic Speech Recognition?

Automatic Speech Recognition (ASR), also known as speech-to-text, is the process by which a computer or electronic device converts human speech into written text. This technology is a subset of computational linguistics that deals with the interpretation and translation of spoken language into text by computers. It enables humans to speak commands into devices, dictate documents, and interact with computer-based systems through natural language.

How Does Automatic Speech Recognition Work?

ASR systems typically involve several processing stages to accurately transcribe speech. The process begins with the acoustic signal being captured by a microphone. This signal is then digitized and processed to filter out noise and improve clarity.

The core of ASR technology involves two main models:

  • Acoustic Model: This model is trained to recognize the basic units of sound in speech, known as phonemes. It maps segments of audio to these phonemes and considers variations in pronunciation, accent, and intonation.
  • Language Model: This model is used to understand the context and semantics of the spoken words. It predicts the sequence of words that form a sentence, based on the likelihood of word sequences in the language. This helps in distinguishing between words that sound similar but have different meanings.

Once the audio has been processed through these models, the ASR system generates a transcription of the spoken words. Advanced systems may also include additional components, such as a dialogue manager in interactive voice response systems, or a natural language understanding module to interpret the intent behind the words.
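A toy example of how a decoder weighs these two models: each candidate transcription receives an acoustic score plus a scaled language-model score, and the highest combined score wins. The candidate strings and all scores below are invented for illustration.

```python
# Toy illustration of combining acoustic and language model scores:
# the candidate with the best weighted total wins. Numbers are invented.
candidates = {
    "recognize speech": {"acoustic": -12.1, "lm": -4.0},
    "wreck a nice beach": {"acoustic": -11.8, "lm": -9.5},
}
LM_WEIGHT = 1.0  # tunable trade-off between the two models

def combined_score(scores: dict) -> float:
    return scores["acoustic"] + LM_WEIGHT * scores["lm"]

best = max(candidates, key=lambda hyp: combined_score(candidates[hyp]))
print(best)  # "recognize speech": slightly worse acoustically, far more likely language
```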

Challenges in Automatic Speech Recognition

Despite significant advancements, ASR systems face numerous challenges that can affect their accuracy and performance:

  • Variability in Speech: Differences in accents, dialects, and individual speaker characteristics can make it difficult for ASR systems to accurately recognize words.
  • Background Noise: Noisy environments can interfere with the system's ability to capture clear audio, leading to transcription errors.
  • Homophones and Context: Words that sound the same but have different meanings can be challenging for ASR systems to differentiate without understanding the context.
  • Continuous Speech: Unlike written text, spoken language does not have clear boundaries between words, making it challenging to segment speech accurately.
  • Colloquialisms and Slang: Everyday speech often includes informal language and slang, which may not be present in the training data used for ASR models.

Applications of Automatic Speech Recognition

ASR technology has a wide range of applications across various industries:

  • Virtual Assistants: Devices like smartphones and smart speakers use ASR to enable voice commands and provide user assistance.
  • Accessibility: ASR helps individuals with disabilities by enabling voice control over devices and converting speech to text for those who are deaf or hard of hearing.
  • Transcription Services: ASR is used to automatically transcribe meetings, lectures, and interviews, saving time and effort in documentation.
  • Customer Service: Call centers use ASR to route calls and handle inquiries through interactive voice response systems.
  • Healthcare: ASR enables hands-free documentation for medical professionals, allowing them to dictate notes and records.

The Future of Automatic Speech Recognition

The future of ASR is promising, with ongoing research focused on improving accuracy, reducing latency, and understanding natural language more effectively. As machine learning algorithms become more sophisticated, we can expect ASR systems to become more reliable and integrated into an even broader array of applications, making human-computer interaction more seamless and natural.

Automatic Speech Recognition technology has revolutionized the way we interact with machines, making it possible to communicate with computers using our most natural form of communication: speech. While challenges remain, the continuous improvements in ASR systems are opening up new possibilities for innovation and convenience in our daily lives.


Speech recognition software

by Chris Woodford. Last updated: August 17, 2023.

It's just as well people can understand speech. Imagine if you were like a computer: friends would have to "talk" to you by prodding away at a plastic keyboard connected to your brain by a long, curly wire. If you wanted to say "hello" to someone, you'd have to reach out, chatter your fingers over their keyboard, and wait for their eyes to light up; they'd have to do the same to you. Conversations would be a long, slow, elaborate nightmare—a silent dance of fingers on plastic; strange, abstract, and remote. We'd never put up with such clumsiness as humans, so why do we talk to our computers this way?

Scientists have long dreamed of building machines that can chatter and listen just like humans. But although computerized speech recognition has been around for decades, and is now built into most smartphones and PCs, few of us actually use it. Why? Possibly because we never even bother to try it out, working on the assumption that computers could never pull off a trick so complex as understanding the human voice. It's certainly true that speech recognition is a complex problem that's challenged some of the world's best computer scientists, mathematicians, and linguists. How well are they doing at cracking the problem? Will we all be chatting to our PCs one day soon? Let's take a closer look and find out!

Photo: A court reporter dictates notes into a laptop with a noise-cancelling microphone and speech-recognition software. Photo by Micha Pierce courtesy of US Marine Corps and DVIDS.

What is speech?

Language sets people far above our creeping, crawling animal friends. While the more intelligent creatures, such as dogs and dolphins, certainly know how to communicate with sounds, only humans enjoy the rich complexity of language. With just a couple of dozen letters, we can build any number of words (most dictionaries contain tens of thousands) and express an infinite number of thoughts.

Photo: Speech recognition has been popping up all over the place for quite a few years now. Even my old iPod Touch (dating from around 2012) has a built-in "voice control" program that let you pick out music just by saying "Play albums by U2," or whatever band you're in the mood for.

When we speak, our voices generate little sound packets called phones (which correspond to the sounds of letters or groups of letters in words); so speaking the word cat produces phones that correspond to the sounds "c," "a," and "t." Although you've probably never heard of these kinds of phones before, you might well be familiar with the related concept of phonemes : simply speaking, phonemes are the basic LEGO™ blocks of sound that all words are built from. Although the difference between phones and phonemes is complex and can be very confusing, this is one "quick-and-dirty" way to remember it: phones are actual bits of sound that we speak (real, concrete things), whereas phonemes are ideal bits of sound we store (in some sense) in our minds (abstract, theoretical sound fragments that are never actually spoken).

Computers and computer models can juggle around with phonemes, but the real bits of speech they analyze always involves processing phones. When we listen to speech, our ears catch phones flying through the air and our leaping brains flip them back into words, sentences, thoughts, and ideas—so quickly, that we often know what people are going to say before the words have fully fled from their mouths. Instant, easy, and quite dazzling, our amazing brains make this seem like a magic trick. And it's perhaps because listening seems so easy to us that we think computers (in many ways even more amazing than brains) should be able to hear, recognize, and decode spoken words as well. If only it were that simple!

Why is speech so hard to handle?

The trouble is, listening is much harder than it looks (or sounds): there are all sorts of different problems going on at the same time... When someone speaks to you in the street, there's the sheer difficulty of separating their words (what scientists would call the acoustic signal ) from the background noise —especially in something like a cocktail party, where the "noise" is similar speech from other conversations. When people talk quickly, and run all their words together in a long stream, how do we know exactly when one word ends and the next one begins? (Did they just say "dancing and smile" or "dance, sing, and smile"?) There's the problem of how everyone's voice is a little bit different, and the way our voices change from moment to moment. How do our brains figure out that a word like "bird" means exactly the same thing when it's trilled by a ten year-old girl or boomed by her forty-year-old father? What about words like "red" and "read" that sound identical but mean totally different things (homophones, as they're called)? How does our brain know which word the speaker means? What about sentences that are misheard to mean radically different things? There's the age-old military example of "send reinforcements, we're going to advance" being misheard for "send three and fourpence, we're going to a dance"—and all of us can probably think of song lyrics we've hilariously misunderstood the same way (I always chuckle when I hear Kate Bush singing about "the cattle burning over your shoulder"). On top of all that stuff, there are issues like syntax (the grammatical structure of language) and semantics (the meaning of words) and how they help our brain decode the words we hear, as we hear them. Weighing up all these factors, it's easy to see that recognizing and understanding spoken words in real time (as people speak to us) is an astonishing demonstration of blistering brainpower.

It shouldn't surprise or disappoint us that computers struggle to pull off the same dazzling tricks as our brains; it's quite amazing that they get anywhere near!

Photo: Using a headset microphone like this makes a huge difference to the accuracy of speech recognition: it reduces background sound, making it much easier for the computer to separate the signal (the all-important words you're speaking) from the noise (everything else).

How do computers recognize speech?

Speech recognition is one of the most complex areas of computer science —and partly because it's interdisciplinary: it involves a mixture of extremely complex linguistics, mathematics, and computing itself. If you read through some of the technical and scientific papers that have been published in this area (a few are listed in the references below), you may well struggle to make sense of the complexity. My objective is to give a rough flavor of how computers recognize speech, so—without any apology whatsoever—I'm going to simplify hugely and miss out most of the details.

Broadly speaking, there are four different approaches a computer can take if it wants to turn spoken sounds into written words:

1: Simple pattern matching


Ironically, the simplest kind of speech recognition isn't really anything of the sort. You'll have encountered it if you've ever phoned an automated call center and been answered by a computerized switchboard. Utility companies often have systems like this that you can use to leave meter readings, and banks sometimes use them to automate basic services like balance inquiries, statement orders, checkbook requests, and so on. You simply dial a number, wait for a recorded voice to answer, then either key in or speak your account number before pressing more keys (or speaking again) to select what you want to do. Crucially, all you ever get to do is choose one option from a very short list, so the computer at the other end never has to do anything as complex as parsing a sentence (splitting a string of spoken sound into separate words and figuring out their structure), much less trying to understand it; it needs no knowledge of syntax (language structure) or semantics (meaning). In other words, systems like this aren't really recognizing speech at all: they simply have to be able to distinguish between ten different sound patterns (the spoken words zero through nine) either using the bleeping sounds of a Touch-Tone phone keypad (technically called DTMF) or the spoken sounds of your voice.

From a computational point of view, there's not a huge difference between recognizing phone tones and spoken numbers "zero", "one," "two," and so on: in each case, the system could solve the problem by comparing an entire chunk of sound to similar stored patterns in its memory. It's true that there can be quite a bit of variability in how different people say "three" or "four" (they'll speak in a different tone, more or less slowly, with different amounts of background noise) but the ten numbers are sufficiently different from one another for this not to present a huge computational challenge. And if the system can't figure out what you're saying, it's easy enough for the call to be transferred automatically to a human operator.
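If you want a feel for how crude that kind of whole-word matching can be, here's a minimal sketch in Python. Everything in it is invented for illustration (the "templates" are just random arrays standing in for one-second recordings, and the recognize() helper is hypothetical); a real switchboard does far more signal conditioning than this, but the "compare the whole chunk of sound against stored patterns" idea is the same.

```python
# A minimal sketch of whole-word template matching (illustrative only, not a
# production switchboard). Each stored "template" and the incoming utterance
# are 1-D numpy arrays of audio samples; recognize() picks the template whose
# normalized cross-correlation with the utterance peaks highest.
import numpy as np

def similarity(utterance: np.ndarray, template: np.ndarray) -> float:
    """Peak of the normalized cross-correlation between two signals."""
    u = (utterance - utterance.mean()) / (utterance.std() + 1e-9)
    t = (template - template.mean()) / (template.std() + 1e-9)
    corr = np.correlate(u, t, mode="full") / len(t)
    return float(corr.max())

def recognize(utterance: np.ndarray, templates: dict[str, np.ndarray]) -> str:
    """Return the label of the best-matching stored pattern."""
    return max(templates, key=lambda label: similarity(utterance, templates[label]))

# Toy usage: pretend we stored one template per spoken digit.
rng = np.random.default_rng(0)
templates = {str(d): rng.standard_normal(8000) for d in range(10)}  # fake 1-second clips
spoken = templates["3"] + 0.1 * rng.standard_normal(8000)           # "3" plus a little noise
print(recognize(spoken, templates))                                  # likely prints "3"
```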

Photo: Voice-activated dialing on cellphones is little more than simple pattern matching. You simply train the phone to recognize the spoken version of a name in your phonebook. When you say a name, the phone doesn't do any particularly sophisticated analysis; it simply compares the sound pattern with ones you've stored previously and picks the best match. No big deal—which explains why even an old phone like this 2001 Motorola could do it.

2: Pattern and feature analysis

Automated switchboard systems generally work very reliably because they have such tiny vocabularies: usually, just ten words representing the ten basic digits. The vocabulary that a speech system works with is sometimes called its domain. Early speech systems were often optimized to work within very specific domains, such as transcribing doctors' notes, computer programming commands, or legal jargon, which made the speech recognition problem far simpler (because the vocabulary was smaller and technical terms were explicitly trained beforehand). Modern speech recognition programs are so good that, much like humans, they work in any domain and can recognize tens of thousands of different words. How do they do it?

Most of us have relatively large vocabularies, made from hundreds of common words ("a," "the," "but" and so on, which we hear many times each day) and thousands of less common ones (like "discombobulate," "crepuscular," "balderdash," or whatever, which we might not hear from one year to the next). Theoretically, you could train a speech recognition system to understand any number of different words, just like an automated switchboard: all you'd need to do would be to get your speaker to read each word three or four times into a microphone, until the computer generalized the sound pattern into something it could recognize reliably.

The trouble with this approach is that it's hugely inefficient. Why learn to recognize every word in the dictionary when all those words are built from the same basic set of sounds? No-one wants to buy an off-the-shelf computer dictation system only to find they have to read three or four times through a dictionary, training it up to recognize every possible word they might ever speak, before they can do anything useful. So what's the alternative? How do humans do it? We don't need to have seen every Ford, Chevrolet, and Cadillac ever manufactured to recognize that an unknown, four-wheeled vehicle is a car: having seen many examples of cars throughout our lives, our brains somehow store what's called a prototype (the generalized concept of a car, something with four wheels, big enough to carry two to four passengers, that creeps down a road) and we figure out that an object we've never seen before is a car by comparing it with the prototype. In much the same way, we don't need to have heard every person on Earth read every word in the dictionary before we can understand what they're saying; somehow we can recognize words by analyzing the key features (or components) of the sounds we hear. Speech recognition systems take the same approach.
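Here's a toy sketch of the prototype idea in Python. The "feature vectors" are just made-up numbers standing in for whatever acoustic features a real system would extract, and the two-word vocabulary is invented; the point is only to show the "average your examples into a prototype, then compare new input against it" step in its simplest possible form.

```python
# A small sketch of the "prototype" idea: average the feature vectors of the
# training examples for each word to get one prototype per word, then label a
# new utterance by its nearest prototype. Purely illustrative.
import numpy as np

rng = np.random.default_rng(2)
words = ["yes", "no"]
true_centres = {"yes": np.array([1.0, 0.0]), "no": np.array([0.0, 1.0])}

# Pretend each word was spoken a few times, giving noisy feature vectors.
examples = {w: true_centres[w] + 0.2 * rng.standard_normal((5, 2)) for w in words}
prototypes = {w: examples[w].mean(axis=0) for w in words}      # one prototype per word

def classify(features: np.ndarray) -> str:
    """Label an unknown utterance by its nearest prototype (Euclidean distance)."""
    return min(prototypes, key=lambda w: np.linalg.norm(features - prototypes[w]))

print(classify(np.array([0.9, 0.1])))   # near the "yes" prototype, so prints "yes"
```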

The recognition process

Practical speech recognition systems start by listening to a chunk of sound (technically called an utterance) read through a microphone. The first step involves digitizing the sound (so the up-and-down, analog wiggle of the sound waves is turned into digital format, a string of numbers) by a piece of hardware (or software) called an analog-to-digital (A/D) converter (for a basic introduction, see our article on analog versus digital technology). The digital data is converted into a spectrogram (a graph showing how the component frequencies of the sound change in intensity over time) using a mathematical technique called a Fast Fourier Transform (FFT), then broken into a series of overlapping chunks called acoustic frames, each one typically lasting 1/25 to 1/50 of a second. These are digitally processed in various ways and analyzed to find the components of speech they contain. Assuming we've separated the utterance into words, and identified the key features of each one, all we have to do is compare what we have with a phonetic dictionary (a list of known words and the sound fragments or features from which they're made) and we can identify what's probably been said. "Probably" is always the operative word in speech recognition: no-one but the speaker can ever know exactly what was said.
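To make that front end a little more concrete, here's a hugely simplified Python sketch of the digitize–frame–FFT steps. The sample rate, the 25-millisecond frame length, and the synthesized test tone are all just illustrative choices; real recognizers add windowing refinements, mel filterbanks, and a lot of further processing on top of this.

```python
# A stripped-down sketch of the front end described above: take a digitized
# waveform (here we just synthesize one), slice it into short overlapping
# frames, and take the magnitude of an FFT of each frame to get a crude
# spectrogram.
import numpy as np

SAMPLE_RATE = 16000             # samples per second from the A/D converter
FRAME_LEN = SAMPLE_RATE // 40   # 25 ms frames (1/40 of a second)
HOP = FRAME_LEN // 2            # 50% overlap between frames

def spectrogram(signal: np.ndarray) -> np.ndarray:
    """Return an array of shape (num_frames, FRAME_LEN // 2 + 1) of magnitudes."""
    window = np.hanning(FRAME_LEN)
    frames = [
        signal[start:start + FRAME_LEN] * window
        for start in range(0, len(signal) - FRAME_LEN, HOP)
    ]
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy usage: one second of a 440 Hz tone; the energy shows up in one FFT bin.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape, spec[0].argmax())   # peak bin = 440 Hz / (SAMPLE_RATE / FRAME_LEN) = 11
```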

Seeing speech

In theory, since spoken languages are built from only a few dozen phonemes (English uses about 46, while Spanish has only about 24), you could recognize any possible spoken utterance just by learning to pick out phones (or similar key features of spoken language such as formants, which are prominent frequencies that can be used to help identify vowels). Instead of having to recognize the sounds of (maybe) 40,000 words, you'd only need to recognize the 46 basic component sounds (or however many there are in your language), though you'd still need a large phonetic dictionary listing the phonemes that make up each word. This method of analyzing spoken words by identifying phones or phonemes is often called the beads-on-a-string model: a chunk of unknown speech (the string) is recognized by breaking it into phones or bits of phones (the beads); figure out the phones and you can figure out the words.
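And here's a toy beads-on-a-string lookup in Python. The three-entry phonetic dictionary (with ARPAbet-style symbols), the greedy longest-match scan, and the input phoneme string are all invented for illustration; real systems juggle tens of thousands of dictionary entries and many competing phoneme hypotheses at once, rather than a single certain string of beads.

```python
# A toy illustration of the beads-on-a-string idea: given a recognized sequence
# of phonemes (the "beads"), scan along it and look up runs of beads in a tiny,
# made-up phonetic dictionary.
PHONETIC_DICT = {                    # hypothetical entries, ARPAbet-style symbols
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("B", "ER", "D"): "bird",
}

def beads_to_words(phones: list[str]) -> list[str]:
    """Greedily match the longest known phoneme run at each position."""
    words, i = [], 0
    while i < len(phones):
        for length in range(len(phones) - i, 0, -1):      # try the longest run first
            candidate = tuple(phones[i:i + length])
            if candidate in PHONETIC_DICT:
                words.append(PHONETIC_DICT[candidate])
                i += length
                break
        else:
            i += 1                                        # skip an unrecognized bead
    return words

print(beads_to_words(["HH", "EH", "L", "OW", "W", "ER", "L", "D"]))  # ['hello', 'world']
```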

Most speech recognition programs get better as you use them because they learn as they go along using feedback you give them, either deliberately (by correcting mistakes) or by default (if you don't correct any mistakes, you're effectively saying everything was recognized perfectly—which is also feedback). If you've ever used a program like one of the Dragon dictation systems, you'll be familiar with the way you have to correct your errors straight away to ensure the program continues to work with high accuracy. If you don't correct mistakes, the program assumes it's recognized everything correctly, which means similar mistakes are even more likely to happen next time. If you force the system to go back and tell it which words it should have chosen, it will associate those corrected words with the sounds it heard—and do much better next time.

Screenshot: With speech dictation programs like Dragon NaturallySpeaking, shown here, it's important to go back and correct your mistakes if you want your words to be recognized accurately in future.

3: Statistical analysis

In practice, recognizing speech is much more complex than simply identifying phones and comparing them to stored patterns, and for a whole variety of reasons:

  • Speech is extremely variable: different people speak in different ways (even though we're all saying the same words and, theoretically, they're all built from a standard set of phonemes).
  • You don't always pronounce a certain word in exactly the same way; even if you did, the way you spoke a word (or even part of a word) might vary depending on the sounds or words that came before or after.
  • As a speaker's vocabulary grows, the number of similar-sounding words grows too: the digits zero through nine all sound different when you speak them, but "zero" sounds like "hero," "one" sounds like "none," "two" could mean "two," "to," or "too"... and so on. So recognizing numbers is a tougher job for voice dictation on a PC, with a general 50,000-word vocabulary, than for an automated switchboard with a very specific, 10-word vocabulary containing only the ten digits.
  • The more speakers a system has to recognize, the more variability it's going to encounter and the bigger the likelihood of making mistakes.

For something like an off-the-shelf voice dictation program (one that listens to your voice and types your words on the screen), simple pattern recognition is clearly going to be a bit hit and miss. The basic principle of recognizing speech by identifying its component parts certainly holds good, but we can do an even better job of it by taking into account how language really works. In other words, we need to use what's called a language model.

When people speak, they're not simply muttering a series of random sounds. Every word you utter depends on the words that come before or after. For example, unless you're a contrary kind of poet, the word "example" is much more likely to follow words like "for," "an," "better," "good", "bad," and so on than words like "octopus," "table," or even the word "example" itself. Rules of grammar make it unlikely that a noun like "table" will be spoken before another noun ("table example" isn't something we say) while—in English at least—adjectives ("red," "good," "clear") come before nouns and not after them ("good example" is far more probable than "example good"). If a computer is trying to figure out some spoken text and gets as far as hearing "here is a ******* example," it can be reasonably confident that ******* is an adjective and not a noun. So it can use the rules of grammar to exclude nouns like "table" and the probability of pairs like "good example" and "bad example" to make an intelligent guess. If it's already identified a "g" sound instead of a "b", that's an added clue.
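A language model can be surprisingly simple. Here's a bare-bones bigram model in Python that learns, from a few words of made-up text, how likely each word is to follow the previous one, and then uses those probabilities to prefer one candidate transcription over another. The toy training text, the add-one smoothing, and the competing transcriptions are all invented for illustration; real language models are trained on billions of words and are far more sophisticated.

```python
# A minimal bigram language model sketch: count which words follow which in a
# toy text, then score competing transcriptions by summing log probabilities.
from collections import Counter, defaultdict
import math

training_text = "here is a good example here is a bad example for example".split()

bigram_counts = defaultdict(Counter)
for prev, word in zip(training_text, training_text[1:]):
    bigram_counts[prev][word] += 1

def sentence_log_prob(words: list[str]) -> float:
    """Sum of log P(word | previous word), with crude add-one smoothing."""
    vocab = len(set(training_text))
    total = 0.0
    for prev, word in zip(words, words[1:]):
        counts = bigram_counts[prev]
        total += math.log((counts[word] + 1) / (sum(counts.values()) + vocab))
    return total

# The acoustics can't decide between "good example" and "good egg sample";
# the language model prefers the word pair it has actually seen before.
print(sentence_log_prob("here is a good example".split()))      # higher score
print(sentence_log_prob("here is a good egg sample".split()))   # lower score
```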

Virtually all modern speech recognition systems also use a bit of complex statistical hocus-pocus to help figure out what's being said. The probability of one phone following another, the probability of bits of silence occurring in between phones, and the likelihood of different words following other words are all factored in. Ultimately, the system builds what's called a hidden Markov model (HMM) of each speech segment, which is the computer's best guess at which beads are sitting on the string, based on all the things it's managed to glean from the sound spectrum and all the bits and pieces of phones and silence that it might reasonably contain. It's called a Markov model (or Markov chain), for Russian mathematician Andrey Markov, because it's a sequence of different things (bits of phones, words, or whatever) that change from one to the next with a certain probability. Confusingly, it's referred to as a "hidden" Markov model even though it's worked out in great detail and anything but hidden! "Hidden," in this case, simply means the contents of the model aren't observed directly but figured out indirectly from the sound spectrum. From the computer's viewpoint, speech recognition is always a probabilistic "best guess" and the right answer can never be known until the speaker either accepts or corrects the words that have been recognized. (Markov models can be processed with an extra bit of computer jiggery-pokery called the Viterbi algorithm, but that's beyond the scope of this article.)
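For the curious, here's what that Viterbi "best guess" bookkeeping looks like for a made-up two-state model. The states ("silence" and "phone"), the probabilities, and the observation sequence are all invented; this is a sketch of the algorithm itself, not of any real recognizer's model.

```python
# A bare-bones Viterbi decoder over a made-up two-state HMM, just to show the
# kind of probabilistic bookkeeping described above.
import numpy as np

states = ["silence", "phone"]                 # hidden "beads"
observations = [0, 1, 1, 0]                   # 0 = low-energy frame, 1 = high-energy frame

start_p = np.array([0.8, 0.2])                # P(first state)
trans_p = np.array([[0.7, 0.3],               # P(next state | current state)
                    [0.4, 0.6]])
emit_p = np.array([[0.9, 0.1],                # P(observation | state)
                   [0.2, 0.8]])

def viterbi(obs):
    """Return the most probable hidden state sequence for the observations."""
    n, t_len = len(states), len(obs)
    prob = np.zeros((t_len, n))               # best path probability so far
    back = np.zeros((t_len, n), dtype=int)    # which previous state that path used
    prob[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, t_len):
        for s in range(n):
            scores = prob[t - 1] * trans_p[:, s] * emit_p[s, obs[t]]
            back[t, s] = scores.argmax()
            prob[t, s] = scores.max()
    path = [int(prob[-1].argmax())]           # trace the best path backwards
    for t in range(t_len - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(observations))   # prints ['silence', 'phone', 'phone', 'silence']
```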

4: Artificial neural networks

HMMs have dominated speech recognition since the 1970s—for the simple reason that they work so well. But they're by no means the only technique we can use for recognizing speech. There's no reason to believe that the brain itself uses anything like a hidden Markov model. It's much more likely that we figure out what's being said using dense layers of brain cells that excite and suppress one another in intricate, interlinked ways according to the input signals they receive from our cochleas (the parts of our inner ear that recognize different sound frequencies).

Back in the 1980s, computer scientists developed "connectionist" computer models that could mimic how the brain learns to recognize patterns, which became known as artificial neural networks (sometimes called ANNs). A few speech recognition scientists explored using neural networks, but the dominance and effectiveness of HMMs relegated alternative approaches like this to the sidelines. More recently, scientists have explored using ANNs and HMMs side by side and found they give significantly higher accuracy than HMMs used alone.

Artwork: Neural networks are hugely simplified, computerized versions of the brain (or a tiny part of it). They have inputs (where you feed in information), outputs (where results appear), and hidden units (connecting the two). If you train them with enough examples, they learn by gradually adjusting the strength of the connections between the different layers of units. Once a neural network is fully trained, you can show it an unknown example and it will attempt to recognize what it is based on the examples it's seen before.
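To make the "adjust the connection strengths" idea concrete, here's a tiny feed-forward network in Python that learns a toy two-class problem by gradient descent. The data, the network size, and the learning rate are all arbitrary illustrative choices; real acoustic models are many orders of magnitude bigger and are trained on huge amounts of labelled speech.

```python
# A very small feed-forward network with one hidden layer, trained by gradient
# descent on a toy two-class problem (is x0 + x1 positive?). Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))                           # 200 fake 2-D "feature vectors"
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)    # toy labels

W1 = rng.standard_normal((2, 8)) * 0.5                      # input -> hidden connections
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)) * 0.5                      # hidden -> output connections
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(2000):                                    # training loop
    hidden = np.tanh(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    grad_out = (out - y) / len(X)                           # gradient of cross-entropy loss
    grad_hidden = (grad_out @ W2.T) * (1 - hidden ** 2)
    W2 -= 0.5 * hidden.T @ grad_out                         # adjust connection strengths
    b2 -= 0.5 * grad_out.sum(axis=0)
    W1 -= 0.5 * X.T @ grad_hidden
    b1 -= 0.5 * grad_hidden.sum(axis=0)

accuracy = ((out > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")                 # should be close to 1.00
```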

Speech recognition: a summary

Artwork: A summary of some of the key stages of speech recognition and the computational processes happening behind the scenes.

What can we use speech recognition for?

We've already touched on a few of the more common applications of speech recognition, including automated telephone switchboards and computerized voice dictation systems. But there are plenty more examples where those came from.

Many of us (whether we know it or not) have cellphones with voice recognition built into them. Back in the late 1990s, state-of-the-art mobile phones offered voice-activated dialing, where, in effect, you recorded a sound snippet for each entry in your phonebook (the spoken word "Home," or whatever) that the phone could then recognize when you spoke it in future. A few years later, systems like SpinVox became popular, helping mobile phone users make sense of voice messages by converting them automatically into text (although a sneaky BBC investigation eventually claimed that some of its state-of-the-art automated speech recognition was actually being done by humans in developing countries!).

Today's smartphones make speech recognition even more of a feature. Apple's Siri, Google Assistant ("Hey Google..."), and Microsoft's Cortana are smartphone "personal assistant apps" that'll listen to what you say, figure out what you mean, then attempt to do what you ask, whether it's looking up a phone number or booking a table at a local restaurant. They work by linking speech recognition to complex natural language processing (NLP) systems, so they can figure out not just what you say, but what you actually mean, and what you really want to happen as a consequence. Pressed for time and hurtling down the street, mobile users theoretically find this kind of system a boon, at least if you believe the hype in the TV advertisements that Google and Microsoft have been running to promote their systems. (Google quietly incorporated speech recognition into its search engine some time ago, so you can Google just by talking to your smartphone, if you really want to.) If you have one of the latest voice-powered electronic assistants, such as Amazon's Echo/Alexa or Google Home, you don't need a computer of any kind (desktop, tablet, or smartphone): you just ask questions or give simple commands in your natural language to a thing that resembles a loudspeaker... and it answers straight back.

Screenshot: When I asked Google "does speech recognition really work," it took it three attempts to recognize the question correctly.

Will speech recognition ever take off?

I'm a huge fan of speech recognition. After suffering with repetitive strain injury on and off for some time, I've been using computer dictation to write quite a lot of my stuff for about 15 years, and it's been amazing to see the improvements in off-the-shelf voice dictation over that time. The early Dragon NaturallySpeaking system I used on a Windows 95 laptop was fairly reliable, but I had to speak relatively slowly, pausing slightly between each word or word group, giving a horribly staccato style that tended to interrupt my train of thought. This slow, tedious one-word-at-a-time approach ("can – you – tell – what – I – am – saying – to – you") went by the name discrete speech recognition. A few years later, things had improved so much that virtually all the off-the-shelf programs like Dragon were offering continuous speech recognition, which meant I could speak at normal speed, in a normal way, and still be assured of very accurate word recognition. When you can speak normally to your computer, at a normal talking pace, voice dictation programs offer another advantage: they give clumsy, self-conscious writers a much more attractive, conversational style: "write like you speak" (always a good tip for writers) is easy to put into practice when you speak all your words as you write them!

Despite the technological advances, I still generally prefer to write with a keyboard and mouse. Ironically, I'm writing this article that way now. Why? Partly because it's what I'm used to. I often write highly technical stuff with a complex vocabulary that I know will defeat the best efforts of all those hidden Markov models and neural networks battling away inside my PC. It's easier to type "hidden Markov model" than to mutter those words somewhat hesitantly, watch "hiccup half a puddle" pop up on screen and then have to make corrections.

Screenshot: You can always add more words to a speech recognition program. Here, I've decided to train the Microsoft Windows built-in speech recognition engine to spot the words 'hidden Markov model.'

Mobile revolution?

You might think mobile devices, with their slippery touchscreens, would benefit enormously from speech recognition: no-one really wants to type an essay with two thumbs on a pop-up QWERTY keyboard. Ironically, mobile devices are heavily used by younger, tech-savvy kids who still prefer typing and pawing at screens to speaking out loud. Why? All sorts of reasons, from sheer familiarity (it's quick to type once you're used to it, and faster than fixing a computer's goofed-up guesses) to privacy and consideration for others (many of us use our mobile phones in public places and we don't want our thoughts wide open to scrutiny or howls of derision), and the sheer difficulty of speaking clearly and being clearly understood in noisy environments. Recently, I was walking down a street and overheard a small garden party where the sounds of happy laughter, drinking, and discreet background music were punctuated by a sudden grunt of "Alexa, play Copacabana by Barry Manilow", which silenced the conversation entirely and seemed jarringly out of place. Speech recognition has never been so indiscreet.

What you're doing with your computer also makes a difference. If you've ever used speech recognition on a PC, you'll know that writing something like an essay (dictating hundreds or thousands of words of ordinary text) is a whole lot easier than editing it afterwards (where you laboriously try to select words or sentences and move them up or down so many lines with awkward cut and paste commands). And trying to open and close windows, start programs, or navigate around a computer screen by voice alone is clumsy, tedious, error-prone, and slow. It's far easier just to click your mouse or swipe your finger.

Photo: Here I'm using Google's Live Transcribe app to dictate the last paragraph of this article. As you can see, apart from the punctuation, the transcription is flawless, without any training at all. This is the fastest and most accurate speech recognition software I've ever used. It's mainly designed as an accessibility aid for deaf and hard of hearing people, but it can be used for dictation too.

Developers of speech recognition systems insist everything's about to change, largely thanks to natural language processing and smart search engines that can understand spoken queries. ("OK Google...") But people have been saying that for decades now: the brave new world is always just around the corner. According to speech pioneer James Baker, better speech recognition "would greatly increase the speed and ease with which humans could communicate with computers, and greatly speed and ease the ability with which humans could record and organize their own words and thoughts"—but he wrote (or perhaps voice dictated?) those words 25 years ago! Just because Google can now understand speech, it doesn't follow that we automatically want to speak our queries rather than type them—especially when you consider some of the wacky things people look for online. Humans didn't invent written language because others struggled to hear and understand what they were saying. Writing and speaking serve different purposes. Writing is a way to set out longer, more clearly expressed and elaborated thoughts without having to worry about the limitations of your short-term memory; speaking is much more off-the-cuff. Writing is grammatical; speech doesn't always play by the rules. Writing is introverted, intimate, and inherently private; it's carefully and thoughtfully composed. Speaking is an altogether different way of expressing your thoughts—and people don't always want to speak their minds. While technology may be ever advancing, it's far from certain that speech recognition will ever take off in quite the way that its developers would like. I'm typing these words, after all, not speaking them.

If you liked this article...

Find out more on this website.

  • Microphones
  • Neural networks
  • Speech synthesis
  • Automatic Speech Recognition: A Deep Learning Approach by Dong Yu and Li Deng. Springer, 2015. Two Microsoft researchers review state-of-the-art, neural-network approaches to recognition.
  • Theory and Applications of Digital Speech Processing by Lawrence R. Rabiner and Ronald W. Schafer. Pearson, 2011. An up-to-date review at undergraduate level.
  • Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky, James Martin. Prentice Hall, 2009. An up-to-date, interdisciplinary review of speech recognition technology.
  • Statistical Methods for Speech Recognition by Frederick Jelinek. MIT Press, 1997. A detailed guide to Hidden Markov Models and the other statistical techniques that computers use to figure out human speech.
  • Fundamentals of Speech Recognition by Lawrence R. Rabiner and Biing-Hwang Juang. PTR Prentice Hall, 1993. A little dated now, but still a good introduction to the basic concepts.
  • Speech Recognition: Invited Papers Presented at the 1974 IEEE Symposium by D. R. Reddy (ed). Academic Press, 1975. A classic collection of pioneering papers from the golden age of the 1970s.

Easy-to-understand

  • Lost voices, ignored words: Apple's speech recognition needs urgent reform by Colin Hughes, The Register, 16 August 2023. How speech recognition software ignores the needs of the people who need it most—disabled people with different accessibility needs.
  • Android's Live Transcribe will let you save transcriptions and show 'sound events' by Dieter Bohn, The Verge, 16 May 2019. An introduction to Google's handy, 70-language transcription app.
  • Hey, Siri: Read My Lips by Emily Waltz, IEEE Spectrum, 8 February 2019. How your computer can translate your words... without even listening.
  • Interpol's New Software Will Recognize Criminals by Their Voices by Michael Dumiak, 16 May 2018. Is it acceptable for law enforcement agencies to store huge quantities of our voice samples if it helps them trap the occasional bad guy?
  • Cypher: The Deep-Learning Software That Will Help Siri, Alexa, and Cortana Hear You by Amy Nordrum, IEEE Spectrum, 24 October 2016. Cypher helps voice recognition programs to separate speech signals from background noise.
  • In the Future, How Will We Talk to Our Technology? by David Pierce, Wired, 27 September 2015. What sort of hardware will we use with future speech recognition software?
  • The Holy Grail of Speech Recognition by Janie Chang: Microsoft Research, 29 August 2011. How neural networks are making a comeback in speech recognition research. [Archived via the Wayback Machine.]
  • Audio Alchemy: Getting Computers to Understand Overlapping Speech by John R. Hershey et al. Scientific American, April 12, 2011. How can computers make sense of two people talking at once?
  • How Siri Works: Interview with Tom Gruber by Nova Spivack, Minding the Planet, 26 January 2010. Gruber explains some of the technical tricks that allow Siri to understand natural language.
  • A sound start for speech tech by LJ Rich, BBC News, 15 May 2009. Cambridge University's Dr Tony Robinson talks us through the science of speech recognition.
  • Speech Recognition by Computer by Stephen E. Levinson and Mark Y. Liberman, Scientific American, Vol. 244, No. 4 (April 1981), pp. 64–77. A more detailed overview of the basic concepts. A good article to continue with after you've read mine.

More technical

  • An All-Neural On-Device Speech Recognizer by Johan Schalkwyk, Google AI Blog, March 12, 2019. Google announces a state-of-the-art speech recognition system based entirely on what are called recurrent neural network transducers (RNN-Ts).
  • Improving End-to-End Models For Speech Recognition by Tara N. Sainath, and Yonghui Wu, Google Research Blog, December 14, 2017. A cutting-edge speech recognition model that integrates traditionally separate aspects of speech recognition into a single system.
  • A Historical Perspective of Speech Recognition by Xuedong Huang, James Baker, Raj Reddy. Communications of the ACM, January 2014 (Vol. 57 No. 1), Pages 94–103.
  • [PDF] Application Of Pretrained Deep Neural Networks To Large Vocabulary Speech Recognition by Navdeep Jaitly, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke. Proceedings of Interspeech 2012. An insight into Google's use of neural networks for speech recognition.
  • Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition by George Dahl et al. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20 No. 1, January 2012. A review of Microsoft's recent research into using neural networks with HMMs.
  • Speech Recognition Technology: A Critique by Stephen E. Levinson, Proceedings of the National Academy of Sciences of the United States of America. Vol. 92, No. 22, October 24, 1995, pp. 9953–9955.
  • Hidden Markov Models for Speech Recognition by B. H. Juang and L. R. Rabiner, Technometrics, Vol. 33, No. 3, August, 1991, pp. 251–272.
  • A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition by Lawrence R. Rabiner. Proceedings of the IEEE, Vol 77 No 2, February 1989. A classic introduction to Markov models, though non-mathematicians will find it tough going.
  • US Patent: 4,783,803: Speech recognition apparatus and method by James K. Baker, Dragon Systems, 8 November 1988. One of Baker's first Dragon patents. Another Baker patent filed the following year follows on from this. See US Patent: 4,866,778: Interactive speech recognition apparatus by James K. Baker, Dragon Systems, 12 September 1989.
  • US Patent 4,783,804: Hidden Markov model speech recognition arrangement by Stephen E. Levinson, Lawrence R. Rabiner, and Man M. Sondi, AT&T Bell Laboratories, 6 May 1986. Sets out one approach to probabilistic speech recognition using Markov models.
  • US Patent: 4,363,102: Speaker identification system using word recognition templates by John E. Holmgren, Bell Labs, 7 December 1982. A method of recognizing a particular person's voice using analysis of key features.
  • US Patent 2,938,079: Spectrum segmentation system for the automatic extraction of formant frequencies from human speech by James L. Flanagan, US Air Force, 24 May 1960. An early speech recognition system based on formant (peak frequency) analysis.
  • A Historical Perspective of Speech Recognition by Raj Reddy (an AI researcher at Carnegie Mellon), James Baker (founder of Dragon), and Xuedong Huang (of Microsoft). Speech recognition pioneers look back on the advances they helped to inspire in this four-minute discussion.

Text copyright © Chris Woodford 2007, 2020. All rights reserved.


What Is Word Recognition?


When we teach children to read, we spend a lot of time in the early years teaching them word recognition skills, including phonological awareness and phonics. We need to teach them how to break the code of our alphabet — decoding — and then to become fluent with that code, like we are. This will allow your students to focus on the meaning of what they read, which is the main thing! Word recognition is a tool or a means to the end goal of reading comprehension.

What Does Word Recognition Look Like?

Here's a child in the early stages of decoding. She's working hard and having success sounding out individual words — and quickly recognizing some that she's read a few times before.

Video: "First grade reader 2" (produced by Reading Universe, a partnership of WETA, Barksdale Reading Institute, and First Book)

This next child is a little bit further along with her decoding skills.

Video: "First grade reader 1"

The second child's brain can focus on the story. We can tell because she's able to use expression. She only has to slow down when she comes across a new word. 

How Does Word Recognition Connect to Reading for Meaning?

Of course, it helps immensely that the young girl in the second video knows the meanings of the words she's decoding. And that's an understatement: word recognition without language comprehension won't work. In order to be able to read for meaning … to comprehend what they read … children need to be able to recognize words and apply meaning to those words. The faster and easier they can do both, the more they'll be able to gain from reading in their life.

The simple view of reading: word recognition × language comprehension = reading comprehension (Gough and Tunmer, 1990).

The Integration of Word Recognition and Language Comprehension

When we read, these two sets of skills — word recognition and language comprehension — intertwine and overlap, and your children will need your help to integrate them. As they read a new story word by word, they'll need to be able to sound out each word (or recognize it instantly), call up its meaning, connect it with their knowledge about the meaning, and apply it to the context of what they're reading — as quickly as possible.

In your classroom, whether you're a pre-K teacher or a second grade teacher, you'll spend time every day on both word recognition skills and language comprehension skills, often at the same time!

Picture singing a rhyming song and talking about the characters in the rhyme … that's an integrated lesson!

The word recognition section of the Reading Universe Taxonomy is where we break each word recognition skill out, because, in the beginning, we need to spend significant time teaching skills in isolation so that students can master them.

If you'd like a meatier introduction to the two sets of skills children need to read, we've got a one-hour presentation by reading specialist Margaret Goldberg for you to watch. Orthographic mapping, anyone?

Video: How Children Learn to Read, with Margaret Goldberg

How Can Reading Universe Help You Teach Word Recognition?

Reading Universe offers a skill explainer for every word recognition skill, from syllables and suffixes to r-controlled vowels and the schwa. Each skill explainer has a detailed description of how to teach each skill, along with lesson plans, decodable texts, practice activities, and assessments. This continuum displays the many phonological awareness and phonics skills that all students need to master in order to become confident and fluent readers. We can help you teach all of them. Get started now.

The word recognition continuum offers a framework for teaching the foundation skills that make up phonological awareness and phonics.


Dictate text using Speech Recognition

On Windows 11 22H2 and later, Windows Speech Recognition (WSR) will be replaced by voice access starting in September 2024. Older versions of Windows will continue to have WSR available. To learn more about voice access, go to Use voice access to control your PC & author text with your voice.

You can use your voice to dictate text to your Windows PC. For example, you can dictate text to fill out online forms; or you can dictate text to a word-processing program, such as WordPad, to type a letter.

Dictating text

When you speak into the microphone, Windows Speech Recognition converts your spoken words into text that appears on your screen.

 To dictate text


Say "start listening" or click the Microphone button to start the listening mode.

Open the program you want to use or select the text box you want to dictate text into.

Say the text that you want to dictate.

Correcting dictation mistakes

There are several ways to correct mistakes made during dictation. You can say "correct that" to correct the last thing you said. To correct a single word, say "correct" followed by the word that you want to correct. If the word appears more than once, all instances will be highlighted and you can choose the one that you want to correct. You can also add words that are frequently misheard or not recognized by using the Speech Dictionary.

To use the Alternates panel dialog box

Do one of the following:

To correct the last thing you said, say "correct that."

To correct a single word, say "correct" followed by the word that you want to correct.

In the Alternates panel dialog box, say the number next to the item you want, and then "OK."  

Note:  To change a selection, in the Alternates panel dialog box, say "spell" followed by the number of the item you want to change, and then "OK."

To use the Speech Dictionary

Say "open Speech Dictionary."

Do any of the following:

To add a word to the dictionary, click or say Add a new word , and then follow the instructions in the wizard.

To prevent a specific word from being dictated, click or say Prevent a word from being dictated , and then follow the instructions in the wizard.

To correct or delete a word that is already in the dictionary, click or say Change existing words , and then follow the instructions in the wizard.

Note:  Speech Recognition is available only in English, French, Spanish, German, Japanese, Simplified Chinese, and Traditional Chinese.



The Oxford Handbook of Deaf Studies, Language, and Education, Volume 1 (2nd edn)


27 Speech Perception and Spoken Word Recognition

Lynne E. Bernstein, Department of Speech and Hearing Sciences, George Washington University, Washington, DC

Edward T. Auer Jr., Department of Speech-Language-Hearing, University of Kansas, Lawrence, KS

  • Published: 18 September 2012

Speech is an important mode of communication for many people with hearing loss, even for those whose hearing loss is profound. This chapter focuses on spoken communication by adults with severe or profound hearing loss. It describes several fundamental issues in speech perception and spoken word recognition, such as the use of amplification even with profound hearing loss, enhanced lipreading abilities associated with deafness, and the role of the lexicon in speech perception. The chapter describes how the lexical structure of words can assist in compensating for reduced access to speech information. Although lipreaders must frequently contend with ambiguous segmental information, many words in English nevertheless maintain distinct perceptual patterns that can be used for accurate lipreading. The chapter also describes results of a study that sought to compare word age of acquisition estimates in deaf versus hearing adults. Subjective judgments showed delayed but generally similar order of word acquisition and much greater reliance on orthography for word learning among the deaf participants. A review of some results on audiovisual speech perception and speech perception with vibrotactile stimuli underscores the importance of bimodal speech information. Individuals with severe or profound hearing loss can greatly benefit from being able to see a talker along with hearing reduced auditory information or even feeling vibrotactile information. Overall, this chapter demonstrates that speech information can withstand extreme stimulus degradation and still convey the talker’s intended message. Multimodal integration and lexical structure assist in overcoming effects of hearing loss, and speech is frequently an important communication mode for deaf individuals.

Speech is an important mode of communication for many people with hearing loss, even with losses at severe (60–89 dB HL) or profound (>90 dB HL bilaterally) levels. Individuals with hearing losses of these magnitudes occupy positions on a continuum between relying exclusively on spoken language and relying exclusively on manual language. In the case of spoken language, speech perception can depend totally on heard speech at one extreme and on seen speech (lip-reading/speechreading 1 ) at the other. In addition, communication conditions can determine where on the continuum an individual is at any particular time. For example, some of the students at Gallaudet University who relied on manual language in their classrooms and elsewhere on campus reported reliance on spoken language for communicating with their hearing friends, families, and the public (Bernstein, Demorest, & Tucker, 1998 ).

This chapter focuses on spoken communication by adults with severe or profound hearing loss, although it includes relevant discussion of results from studies involving adults with mild to moderate hearing losses or with normal hearing. The chapter describes several fundamental issues in speech perception and spoken word recognition and reviews what is known about these issues in relationship to adults with severe-to-profound hearing loss.

Speech Perception

When talkers produce speech, their articulatory gestures typically produce acoustic and optical signals that are available to perceivers. The auditory and visual perceptual systems must categorize the linguistically relevant speech information in the speech signals. The physical acoustic and optical forms of speech have a hierarchical structure. The segmental consonants and vowels comprise subsegmental features. Those features can be described in articulatory terms such as place of articulation (e.g., bilabial, dental, alveolar), manner of articulation (e.g., stop, liquid, vocalic, nasal), and voicing (voiced, unvoiced) (Catford, 1977 ). 2 The speech segments are used by language combinatorially to form morphemes (minimal units of linguistic analysis such as “un,” “reason,” “able” in “unreasonable”), which in turn combine to form words. Language differs from other animal communication systems in its generativity, not only to produce infinitely many different sentences out of a set of words but also to generate new words by combining the finite set of segmental consonants and vowels within a particular language.

That consonants and vowels are structurally key to the generation of word forms has also suggested that they are key to the perception of words. However, discovering how perceivers recognize the consonant and vowel segments in the speech signals produced by talkers has not proved straightforward and has not yet been fully accomplished (e.g., Fowler, 1986 ; Liberman & Whalen, 2000 ; Nearey, 1997 ). The reason for this difficulty is that the speech segments are not produced like beads on a string, and so do not appear as separate units in the auditory (Liberman, 1982 ) or optical stimulus. The speech articulators—the lips, tongue, velum, and larynx—produce speech gestures in a coordinated and overlapping manner that results in overlapping segmental information. The speech production gestures change the overall shape of the vocal tract tube, and those shapes are directly responsible for the resonances (formants/concentrations of energy) of the speech signal (Stevens, 1998 ). However, different vocal tract shapes can produce signals that are perceived as the same segment, further complicating matters.

Numerous experiments have been conducted using synthesized, filtered, and edited speech waveforms to isolate the parts of the speech signal that are critical to speech perception. Although it is not yet completely known how auditory perceptual processes analyze acoustic speech signals, it is known that listeners are remarkably capable of perceiving the linguistically relevant information in even highly degraded signals (Iverson, Bernstein, & Auer, 1998 ; Remez, 1994 ; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995 ). The questions of importance here are what auditory information can be obtained by individuals with severe or profound hearing loss, and how speech perception is affected by individual hearing loss configurations. Work on these problems began decades ago with examining how speech perception with normal hearing is affected by various manipulations such as filtering. For example, Miller and Nicely ( 1955 ) showed that perception of place of articulation (e.g., /b/ versus /d/ versus /g/) information depends greatly on the frequencies above 1000 Hz, but voicing (e.g., /b/ versus /p/) is well preserved only with frequencies below 1000 Hz. The manner feature (e.g., /m/ versus /b/) involves the entire range of speech frequencies and appears to be less sensitive to losses in the higher frequencies.

Auditory-only Speech Perception by Listeners with Impaired Hearing

As level of hearing loss increases, access to auditory speech information decreases. At severe or profound levels of hearing loss, hearing aids can help overcome problems with audibility of speech sounds for some individuals, particularly when listening conditions are not noisy. Amplification systems are designed to restore audibility by boosting intensity in regions of the spectrum affected by the loss. Unfortunately, when hearing loss is severe or profound, simply increasing the amplitude of the signal frequently does not restore the listener’s access to auditory speech information: At those levels of hearing loss, the speech information that can be perceived auditorily is typically highly degraded due to distortion caused by the damage in the listener’s auditory system. High sound-pressure levels required to amplify speech adequately to compensate for severe or profound hearing loss levels result in additional signal distortion (Ching, Dillon, & Byrne 1998 ). On the other hand, it is difficult to generalize across individuals. Results with hearing aids vary, and many different factors could be involved in how well a hearing aid ameliorates the effects of the hearing loss. These factors include the specific type of hearing loss (e.g., the specific frequencies and the magnitude of the loss for those frequencies), and factors involving central brain processing of the auditory information, including word knowledge, experience listening to the talker, and other discourse factors.
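To make the idea of frequency-shaped amplification concrete, the sketch below applies a different gain to each frequency band of a waveform. It is only an illustration, not anything from the chapter: the audiogram values are invented, and the gain prescription is the simple audiological "half-gain" rule of thumb rather than any clinically fitted formula.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def amplify_by_band(signal, fs, audiogram):
    """Apply band-specific gain: audiogram maps (low_hz, high_hz) -> loss in dB HL."""
    out = np.zeros_like(signal, dtype=float)
    for (lo, hi), loss_db in audiogram.items():
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, signal)
        gain_db = loss_db / 2.0            # "half-gain" rule of thumb (a simplification)
        out += band * 10 ** (gain_db / 20)
    return out

fs = 16000
speech = np.random.randn(fs)               # stand-in for one second of a speech waveform
audiogram = {(125, 500): 40, (500, 2000): 70, (2000, 7900): 95}  # hypothetical sloping loss
aided = amplify_by_band(speech, fs, audiogram)
```

Even such frequency-shaped gain cannot undo the distortions described above; the sketch only shows what "boosting intensity in regions of the spectrum affected by the loss" amounts to.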

Specific speech features are affected at different levels of hearing loss. Boothroyd ( 1984 ) conducted a study of 120 middle- and upper-school children in the Clarke School for the Deaf in Northampton, Massachusetts. The children’s hearing losses, measured in terms of pure-tone averages in decibels of hearing level (dB HL) ranged between 55 and 123 dB HL. The children were tested using a four-alternative, forced-choice procedure for several speech segment contrasts. The results showed that as the hearing losses increased, specific types of speech contrasts became inaudible, but information continued to be available even with profound losses. After correcting for chance, the point at which scores fell to 50% was 75 dB HL for consonant place, 85 dB HL for initial consonant voicing, 90 dB HL for initial consonant continuance, 100 dB HL for vowel place (front-back), and 115 dB HL for vowel height. Boothroyd thought these might be conservative estimates of the children’s listening abilities, given that their hearing aids might not have been optimized for their listening abilities.

Ching et al. ( 1998 ) reported on a study of listeners with normal hearing and listeners with hearing losses across the range from mild to profound. They presented sentence materials for listening under a range of filter and intensity level conditions. Listeners were asked to repeat each sentence after its presentation. Under the more favorable listening conditions for the listeners with severe or profound losses, performance scores covered the range from no words correct to highly accurate (approximately 80–90% correct). That is, having a severe or profound hearing loss was not highly predictive of the speech identification score, and some listeners were quite accurate in repeating the sentences. In general, the majority of the listeners, including listeners with hearing losses in the range of 90–100 dB HL (i.e., with profound losses), benefited from amplification of stimuli for the frequencies below approximately 2800 Hz. (Telephones present frequencies in a range only up to approximately 3200 Hz, supporting the conclusion that perceiving frequencies up to 2800 Hz could be very useful.)

Turner and Brus ( 2001 ) were interested in the finding that when hearing loss is greater than 40–80 dB HL for the higher frequencies of speech, very little benefit is achieved by increasing the amplification of those higher frequencies, and, in some cases, the amplification actually results in lower performance. However, amplification of lower frequency regions does seem to provide benefit. They hypothesized that there might be an interaction between effects due to the frequency regions for which hearing loss occurred and the types of speech information the listeners were able to perceive, depending on amplification characteristics. Listeners who had hearing losses from mild to severe were asked to identify consonant-vowel and vowel-consonant nonsense syllables that were low-pass filtered at the cutoff frequencies of 560, 700, 900, 1120, 1400, 2250, and 2800 Hz. That is, only the frequencies below the cutoff were in the stimuli.
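The following minimal sketch (assuming Python with NumPy and SciPy, and a placeholder waveform standing in for a recorded nonsense syllable) illustrates the low-pass filtering manipulation described above, using the cutoff frequencies listed in the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

CUTOFFS_HZ = [560, 700, 900, 1120, 1400, 2250, 2800]   # cutoffs used by Turner and Brus (2001)

def low_pass(signal, fs, cutoff_hz, order=8):
    """Retain only frequencies below cutoff_hz (zero-phase Butterworth filtering)."""
    sos = butter(order, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

fs = 16000
syllable = np.random.randn(fs)             # stand-in for a recorded consonant-vowel syllable
filtered_versions = {fc: low_pass(syllable, fs, fc) for fc in CUTOFFS_HZ}
```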

A main question for Turner and Brus ( 2001 ) was whether amplification of the lower frequencies of speech was helpful regardless of the level of hearing loss; affirmative findings were obtained across listeners and filter conditions. Turner and Brus also analyzed their data to determine how the speech features of manner, voicing, and place were independently affected by the filtering conditions and the degree of hearing loss. The manner feature referred to the distinction between consonants that are stops (e.g., /b, d, g/) versus fricatives (e.g., /f, s, z/), versus affricates (e.g., /j, č /), versus liquids (e.g., /l, r/). For this feature, performance generally improved as the filter cutoff allowed more frequencies into the stimuli. The voicing feature referred to the distinction between voiced (e.g., /b, d, g/) and voiceless (e.g., /p, t, k/) consonants. This feature was transmitted well to all the listeners, even when the low-pass filter cutoff was at its lowest levels, and even for the listeners with the more severe losses. That is, the voicing cue is robust to extreme limitations above the low frequency range of audible speech. The place feature referred to the position in the vocal tract where the consonant occlusion is formed (e.g., /b/ is formed by closure of the lips and /k/ is formed by closure of the back portion of the tongue against the velum). This feature was most sensitive to addition of higher frequencies and was most sensitive to the degree of hearing loss. Listeners with the more extreme losses were unable to benefit much as additional higher frequencies were allowed into the stimulus.

In general, Turner and Brus ( 2001 ) confirmed the Ching et al. ( 1998 ) findings, suggesting that listeners with severe or profound hearing loss benefit from amplification of the lower frequencies of speech. Nevertheless, amplification for those with severe or profound hearing losses does not restore speech perception accuracy to normal levels.

As the level of hearing loss increases, and/or as environmental noise increases, people with severe or profound hearing losses typically must rely on being able to see visual speech information to augment or substitute for auditory speech information. However, the older literature on lipreading did not necessarily encourage the view that visual information is a good substitute for auditory information. Estimates of the upper extremes for the accuracy of lipreading words in sentences have been as low as 10–30% words correct (Rönnberg, 1995 ; Rönnberg, Samuelsson, & Lyxell, 1998 ). Estimates of the ability to perceive consonants and vowels via lipreading alone have varied across studies and the particular stimuli used. Those studies typically involved presentation of a set of nonsense syllables with varied consonants or varied vowels and a forced-choice identification procedure. In general, consonant identification was reported to be less than 50% correct (e.g., Owens & Blazek, 1985 ), and vowel identification was reported to be somewhat greater than 50% correct (e.g., Montgomery & Jackson, 1983 ).

Statements in the literature by several authors asserted that the necessity to rely on visible speech due to hearing loss does not result in enhanced lipreading performance (e.g., Summerfield, 1991 ), and that lipreading in hearing people is actually better than in deaf people due to auditory experience in the former (Mogford, 1987 ). Furthermore, several authors have asserted that lip-readers can only perceive visemes (e.g., Fisher, 1968 ; Massaro, 1987 , 1998 ). That is, they have said that consonant categories of speech are so highly ambiguous to lip-readers that they can only distinguish broadly among groups of consonants, those broad groups referred to as visemes . Finally, some estimates of how words appear to lipreaders have suggested that approximately 50% of words in English are visually ambiguous with other words (Berger, 1972 ; Nitchie, 1916 ).

To investigate some of these generalizations, Bernstein, Demorest, and Tucker ( 2000 ) conducted a study of lipreading in 96 hearing students at the University of Maryland and in 72 college students at Gallaudet University with 60 dB HL or greater bilateral hearing losses. All of the Gallaudet students reported English as their native language and the language of their family, and they had been educated in a mainstream and/or oral program for 8 or more years. Seventy-one percent of the students had profound hearing losses bilaterally. Sixty-two percent had hearing losses by age 6 months. The participants were asked to lipread nonsense syllables in a forced-choice procedure and isolated words and sentences in an open set procedure. The stimuli were spoken by two different talkers who were recorded on laser video disc.

Results of the study revealed a different picture of lipreading from that of previous studies. Across all the performance measures in the study, deaf college students were significantly more accurate than were the hearing adults. Approximately 65–75% of the deaf students outperformed 75% of the hearing students. The entire upper quartile of deaf students’ scores was typically above the upper quartile of hearing students’ scores. For example, one sentence set produced upper quartile scores of percent correct words ranging between 44 and 69% for the hearing students and ranging between 73 and 88% for the deaf students. When the results were investigated in terms of the perceptual errors that were made during lipreading of sentences, the deaf students were far more systematic than the hearing students: when deaf students erred perceptually, they were nevertheless closer to being correct than were the hearing students. When the nonsense syllable data were analyzed in terms of the subsegmental (subphonemic) features perceived, the results showed that the deaf students perceived more of the features than did the hearing students. Finally, among those deaf students with the highest performance were ones with profound, congenital hearing losses, suggesting that visual speech perception had been the basis for their acquisition of knowledge of spoken language, and that reliance on visible speech can result in enhanced perceptual ability.

Auer and Bernstein ( 2007 ) carried out a follow-up normative study of lipreading in a group of 112 adults with early-onset deafness and 220 adults with normal hearing. All but two of the deaf adults reported that their hearing loss occurred prior to age three years. The lipreading test they received was 30 sentences from Bernstein et al. ( 2000 ), chosen because the sentences were sensitive to individual differences in lipreading. The accuracy of the deaf participants varied from poor to excellent with a mean of 43.55% words correct. The accuracy of the adults with normal hearing also varied but with a mean of 18.57%. Most notably, Auer and Bernstein examined effect sizes for the difference between deaf and hearing participants’ scores in their study, the Bernstein et al. ( 2000 ) study, and one by Mohammed et al. ( 2005 ). Effect size was substantial for all studies, and in Auer and Bernstein the effect size was 1.69, which suggests that the average prelingually deaf lipreader will score above 95% of all hearing lipreaders.

Bernstein, Demorest, et al. ( 1998 ) investigated possible correlations between lipreading performance levels in the Bernstein et al. ( 2000 ) study and other factors that might affect or be related to visual speech perception. They examined more than 29 variables in relationship to the deaf students’ identification scores on nonsense syllables, isolated words, and isolated sentences. The broad categories of factors that they investigated included audiological variables, parents’ educational levels, home communication practices, public communication practices, self-assessed ability to understand via speech, self-assessed ability to be understood via speech, and scores on the Gallaudet University English Placement Test. The parents’ educational levels were not correlated with lipreading scores. Neither were most of the audiological variables, such as when the hearing loss occurred, when it was discovered, or its level.

Important variables related to lipreading scores included (1) frequency of hearing aid use, which was generally positively correlated with lipreading scores, such that the more frequently the hearing aid was used the more accurate the student’s lipreading (Pearson rs ranged from .350 to .384);3 (2) communication at home with speech, which was correlated with better lipreading scores (rs ranged from .406 to .611); (3) self-assessed ability to be understood via speech in communication with the general public (rs ranged from .214 to .434); and (4) the reading subtest of the English Placement Test (rs ranged from .257 to .399).

Regression analyses were used to investigate the best predictors of lipreading scores among the variables that produced significant correlations. Only three factors survived the analysis as significant predictors for scores on words and sentences: self-assessed ability to understand the general public, communication at home with speech, and an English Placement Test score. In fact, the multiple R values obtained from the analysis were quite high, ranging from .730 to .774 for scores on lipreading words and sentences. That is, more than 50% of the variance in the scores was accounted for by the three best factors. To summarize, lipreading ability was highly related to experience communicating successfully via speech and was also related to the ability to read.
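For readers unfamiliar with this kind of analysis, the sketch below shows a multiple regression of a lipreading score on three predictors and the computation of a multiple R. The data and variable names are synthetic placeholders, not the study’s variables or coding scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 72                                            # same order of magnitude as the deaf sample
self_assessed_understanding = rng.normal(size=n)
home_speech_communication = rng.normal(size=n)
english_placement_score = rng.normal(size=n)
lipreading_score = (0.4 * self_assessed_understanding
                    + 0.3 * home_speech_communication
                    + 0.3 * english_placement_score
                    + rng.normal(scale=0.5, size=n))

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), self_assessed_understanding,
                     home_speech_communication, english_placement_score])
beta, _, _, _ = np.linalg.lstsq(X, lipreading_score, rcond=None)

pred = X @ beta
ss_res = np.sum((lipreading_score - pred) ** 2)
ss_tot = np.sum((lipreading_score - lipreading_score.mean()) ** 2)
multiple_R = np.sqrt(1 - ss_res / ss_tot)         # analogous to the reported multiple R of ~.73-.77
print(round(float(multiple_R), 3))
```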

Spoken Word Recognition

The frequent focus on perception of the segmental consonants and vowels in the speech perception literature might leave the reader with the impression that perception of speech terminates in recognition of the speech segments. For example, some researchers theorize that perception of spoken language involves perceptual evaluation of subsegmental units to categorize the consonant and vowel segments at an abstract level (e.g., Massaro, 1998 ). Recognition of words would then depend on assembling the abstract segmental categories and matching them to the segmental patterns of words stored in long-term memory. According to this view, perception terminates at the level of recognizing segments. However, research on spoken word recognition suggests that perception extends to the level of lexical processing.

Abundant evidence has been obtained showing that the speed and ease of recognizing a spoken word is a function of both its phonetic/stimulus properties (e.g., segmental intelligibility) and its lexical properties (e.g., “neighborhood density,” the number of words an individual knows that are perceptually similar to a stimulus word, and “word frequency,” an estimate of the quantity of experience an individual has with a particular word) (Lahiri & Marslen-Wilson, 1991 ; Luce, 1986 ; Luce & Pisoni, 1998 ; Luce, Pisoni, & Goldinger, 1990 ; Marslen-Wilson, 1992 ; McClelland & Elman, 1986 ; Norris, 1994 ).

“Segmental intelligibility” refers to how easily the segments (consonants and vowels) are identified by the perceiver. This is the factor that segmental studies of speech perception are concerned with. Word recognition tends to be more difficult when segmental intelligibility is low and more difficult for words that are perceptually similar to many other words (see below). This latter factor shows that perception does not terminate at the level of abstract segmental categories. If perception did terminate at that level, it would be difficult to explain stimulus-based word similarity effects. Word recognition tends to be easier for words that are or have been experienced frequently. This factor might be related to perception or it might be related to higher level decision-making processes. All of these factors have the potential to be affected by a hearing loss.
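A common way to operationalize "neighborhood density" is to count the known words that differ from a target by a single phoneme substitution, deletion, or addition. The sketch below illustrates that computation on a toy lexicon; the transcriptions and frequency values are invented for illustration.

```python
def one_phoneme_apart(a, b):
    """True if phoneme sequences a and b differ by one substitution, deletion, or addition."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

lexicon = {                       # word -> (phoneme tuple, frequency per million); toy values
    "bat": (("b", "ae", "t"), 50),
    "pat": (("p", "ae", "t"), 30),
    "mat": (("m", "ae", "t"), 20),
    "bad": (("b", "ae", "d"), 40),
    "at":  (("ae", "t"), 60),
    "dog": (("d", "ao", "g"), 80),
}

def neighborhood_density(word):
    target = lexicon[word][0]
    return sum(one_phoneme_apart(target, phons)
               for w, (phons, _) in lexicon.items() if w != word)

print(neighborhood_density("bat"))   # "pat", "mat", "bad", "at" -> 4 neighbors
```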

General Theoretical Perspective

Theories in the field of spoken word recognition attempt to account for the effects of phonetic/stimulus properties and lexical properties by positing perceptual (bottom-up) activation of multiple word candidates. Activation is a theoretical construct in perception research but is thought to be directly related to activation of relevant neuronal structures in the brain. The level of a word’s bottom-up (i.e., stimulus-driven) activation is a function of the similarity between the word’s perceptual representation and that of candidate word forms stored in long-term memory (e.g., Luce, 1986 ; Luce, Goldinger, Auer, & Vitevitch, 2000 ; Luce & Pisoni, 1998 ; Marslen-Wilson, 1987 , 1990 ; McClelland & Elman, 1986 ; Norris, 1994 ). Once activated, candidate word forms compete for recognition in memory (Luce, 1986 ; Luce & Pisoni, 1998 ; Marslen-Wilson, 1992 ; McClelland & Elman, 1986 ; Norris, 1994 ). In addition to bottom-up stimulus information, recognition of a word is influenced by the amount and perhaps the type of previous experience an individual has had with that word (Goldinger, 1998 ; Howes, 1957 ). It is important to emphasize here that the long-term memory representations of stimulus word forms are hypothesized to be similar to the perceptual information and therefore different from memory representations for other types of language input (e.g., fingerspelling), as well as different from abstract knowledge about words (e.g., semantics; McEvoy, Marschark, & Nelson, 1999 ).

An implication of the view that the perceptual word information is used to discriminate among words in the mental dictionary (lexicon) is that successful word recognition can occur even when the speech signal is degraded. This is because recognition can occur even when the speech signal contains only sufficient information to discriminate among the word forms stored in the mental lexicon: Access to complete information would be unnecessary to select the correct word in the mental lexicon. For example, an individual with hearing loss might distinguish the consonants /p/, /t/, and /k/ from the other segments in English but might not distinguish within this set. For this individual, the word “parse” could still be recognized because “tarse” and “karse” do not occur as words in English. That is, words are recognized within the context of perceptually similar words, and therefore intelligibility is a function of both segmental intelligibility as well as the distribution of word forms in the perceiver’s mental lexicon.

Visually Identifying Words with Reduced Speech Information

One fundamental question is what effect reduced speech information, such as the information available to the lipreader, has on the patterns of stimulus words that are stored in the mental lexicon. Nitchie ( 1916 ) and Berger ( 1972 ) investigated the relationship between reduced segmental intelligibility and the distribution of word forms for individuals with profound hearing losses who relied primarily on visible speech for oral communication. They argued that as a result of low consonant and vowel accuracy during lipreading, approximately 50% of words in English that sound different lose their distinctiveness, that is, they become homophenous/ambiguous with other words.

Auer and Bernstein ( 1997 ) developed computational methods to investigate this issue for lipreading and any other degraded perceptual conditions for speech. They wondered to what extent words lost their distinctive information when lipread—that is, how loss of distinctiveness would interact with the word patterns in the mental dictionary. For example, even though /b/, /m/, and /p/ are perceptually similar to the lipreader, English has only the word, “bought,” and not the words “mought” and “pought.” So “bought” remains a distinct pattern as a word in English, even for the lipreader.

Specifically, the Auer-Bernstein method incorporates rules to transcribe words so that only the segmental distinctions that are estimated to be perceivable are represented in the transcriptions. The rules comprise mappings for which one symbol is used to represent all the phonemes that are indistinct to the lipreader. 4 Then the mappings are applied to a computer-readable lexicon. For example, /b/ and /p/ are difficult to distinguish for a lipreader. So, words like “bat” and “pat” would be transcribed to be notationally identical using a new common symbol like B (e.g., “bat” is transcribed as BAT, and “pat” is transcribed as BAT). Then the transcribed words are sorted so that words rendered identical (no longer notationally distinct) are grouped together. The computer-readable lexicon used in these modeling studies was the PhLex lexicon. PhLex is a computer-readable phonemically transcribed lexicon with 35,000 words. The words include the 19,052 most frequent words in the Brown corpus (a compilation of approximately 1 million words in texts; Kucera & Francis, 1967 ).
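A minimal sketch of the transcription-and-grouping step, under the assumptions just described: each phoneme is replaced by a symbol for its perceptual equivalence class, every word is retranscribed, and words that are no longer notationally distinct are grouped. The class definitions and the tiny lexicon below are illustrative stand-ins, not the PhLex groups.

```python
from collections import Counter

PHONEME_CLASS = {"b": "B", "p": "B", "m": "B",    # visually similar consonants share a symbol
                 "f": "F", "v": "F",
                 "t": "T", "d": "T", "s": "T", "z": "T",
                 "ae": "ae", "aa": "aa"}           # vowels kept distinct in this toy example

def class_transcription(phonemes):
    """Map each phoneme to its equivalence-class symbol (unlisted phonemes map to themselves)."""
    return tuple(PHONEME_CLASS.get(p, p) for p in phonemes)

lexicon = {"bat": ("b", "ae", "t"), "pat": ("p", "ae", "t"),
           "mat": ("m", "ae", "t"), "fat": ("f", "ae", "t"),
           "bought": ("b", "aa", "t")}

counts = Counter(class_transcription(p) for p in lexicon.values())
distinct = [w for w, p in lexicon.items() if counts[class_transcription(p)] == 1]
print(distinct)   # "fat" and "bought" remain distinct; "bat"/"pat"/"mat" become identical
```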

Auer and Bernstein ( 1997 ) showed that when all the English phonemes were grouped according to the confusions made by average hearing lipreaders (12 phoneme equivalence classes, comprising the consonant groups /b, p, m/, /f, v/, /l, n, k, ŋ, g, h/, /d, t, s, z/, /w, r/, /ð, θ/, and /ʃ, tʃ, ʒ, dʒ/, along with five vowel groups), 54% of words were still distinct across the entire PhLex lexicon. With 19 phoneme groups, approximately 75% of words were distinct, approximating an excellent deaf lipreader. In other words, small perceptual enhancements lead to large increases in lipreading accuracy.

In addition to computational investigations of the lexicon, lexical modeling provides a method for generating explicit predictions about word identification accuracy. For example, Mattys, Bernstein, and Auer ( 2002 ) tested whether the number of words that a particular word might be confused with affects lipreading accuracy. Deaf and hearing individuals who were screened for above-average lipreading identified visual spoken words presented in isolation. The prediction that words would be more difficult if there were more words with which they might be confused was borne out: Word identification accuracy decreased as the number of words estimated to be visually similar for the lipreader increased. Also, words with higher frequency of occurrence were easier to lipread.

In another related study, Auer ( 2002 ) applied the neighborhood activation model (NAM) of auditory spoken word recognition (Luce, 1986 ; Luce & Pisoni, 1998 ) to the prediction of visual spoken word identification. The NAM can be used to obtain a value that predicts the relative intelligibility of specific words. High values are associated with more intelligible words. Deaf and hearing participants identified visual spoken words presented in isolation. The pattern of results was similar across the two participant groups. The obtained results were significantly correlated with the predicted intelligibility scores (hearing: r = .44; deaf: r = .48). Words with many neighbors were more difficult to identify than words with few neighbors. One question that might be asked is whether confusions among words really depend on the physical stimuli as opposed to their abstract linguistic structure. Auer correlated the lipreading results with results predicted on the basis of phoneme confusion patterns from identification of acoustic speech in noise, a condition that produces different patterns of phoneme confusions from those in lipreading. When the auditory confusions replaced the visual confusions in the computational model, the correlations were no longer significant. This result would be difficult to understand if visual spoken word recognition were based on abstract phoneme patterns and not on the visual speech information.
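The sketch below illustrates, in simplified form, the kind of frequency-weighted computation used in neighborhood-activation-style models: a word’s predicted identifiability grows with the stimulus support for the word itself (weighted by its frequency) and shrinks as support for confusable neighbors (weighted by their frequencies) grows. The confusion probabilities and frequencies are invented, and this is a schematic sketch rather than Auer’s ( 2002 ) implementation.

```python
def segment_support(stimulus, candidate, confusion):
    """Product over positions of p(candidate segment | stimulus segment)."""
    p = 1.0
    for s, c in zip(stimulus, candidate):
        p *= confusion.get((s, c), 0.0)
    return p

confusion = {("b", "b"): 0.5, ("b", "p"): 0.3, ("b", "m"): 0.2,   # invented confusion probabilities
             ("ae", "ae"): 1.0, ("t", "t"): 1.0}
frequency = {"bat": 50.0, "pat": 30.0, "mat": 20.0}               # invented word frequencies
phonemes = {"bat": ("b", "ae", "t"), "pat": ("p", "ae", "t"), "mat": ("m", "ae", "t")}

target = "bat"
num = segment_support(phonemes[target], phonemes[target], confusion) * frequency[target]
den = sum(segment_support(phonemes[target], phonemes[w], confusion) * frequency[w]
          for w in phonemes)
predicted_identification = num / den
print(round(predicted_identification, 3))   # higher values -> predicted to be easier to identify
```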

Auditorily Identifying Words Under Conditions of Hearing Loss

The NAM has also been used to investigate auditory spoken word recognition in older listeners (52–84 years of age) with mild to moderate hearing loss (Dirks, Takayanagi, Moshfegh, Noffsinger, & Fausti, 2001 ). Words were presented for identification from word lists that varied the factors of neighborhood density (word form similarity), mean neighborhood frequency (frequency of occurrence of words in the neighborhood), and word frequency. All of the factors were significant in the results. Overall, high-frequency words were identified more accurately than low-frequency words. Words in low-density neighborhoods (few similar neighbors) were recognized more frequently than words in high-density neighborhoods. Words in neighborhoods of words that were generally low in frequency were recognized more accurately than words in neighborhoods of words that were generally high in frequency. The pattern of results was overall essentially similar to results with a different group of listeners with normal hearing. However, the difference between best and worst conditions for listeners with hearing losses (20 percentage points) was greater than for listeners with normal hearing (15 percentage points). This difference among listeners suggests that lexical factors become more important as listening becomes more difficult. Although the participants in this study had mild to moderate hearing losses, the study suggests that the processes of spoken word recognition are substantially similar across listeners.

In a related study, characteristics of the listeners included hearing loss versus normal hearing and native versus non-native listeners to English (Takayanagi, Dirks, & Moshfegh, 2002). Participants were 20 native English listeners with normal hearing, 20 native English listeners with hearing loss, 20 nonnative listeners with normal hearing, and 20 nonnative listeners with hearing loss. Hearing losses were bilateral and mild to moderate. In this study, there were two groups of words, ones with high word frequency and in low-density neighborhoods (easy words), and ones with low word frequency and in high-density neighborhoods (hard words). Familiarity ratings were obtained on each of the words from each of the participants to statistically control for differences in long-term language experience. In general, there were significant effects obtained for hearing differences and for native language differences: listeners with normal hearing were more accurate than listeners with hearing losses, and native listeners were more accurate than non-native listeners. Predicted easy words were in fact easier than hard words for all of the listeners. However, the difference between native and nonnative listeners was greater for the easy words than for the hard words. These results suggest that the neighborhood structure affects both native and nonnative listeners, with and without hearing losses. Additional analyses showed that important factors in accounting for the results included the audibility of the words (how loud they had to be to be heard correctly) and also the listener’s subjective rating of their familiarity with each of the words.

Estimating Lexical Knowledge

An individual’s knowledge of words arises as a function of his or her linguistic experience. Several factors related to lexical experience have been demonstrated to have some impact on the word recognition process, including the age at which words are acquired, the form of the language input (e.g., spoken or printed), and the frequency of experience with specific words (as discussed earlier). Prelingually deaf individuals’ linguistic experience varies along all of these factors. Impoverishment in the available auditory information typically leads to delayed acquisition of a spoken language, often resulting in reductions in total exposure to spoken language. Prelingually deaf individuals are also likely to use some form of manual communication as their preferred communication mode, and/or as a supplement to lipreading. Several forms of manual communication can fulfill this role, including a form of English-based signing, American Sign Language (ASL), and cued speech (see Leybaert, Aparicio, & Alegria, this volume). As a result of variation in these experiential factors, the prelingually deaf population comprises individuals who differ dramatically in the quantity and quality of their perceptual and linguistic experience with spoken words.

In this section, some studies are discussed that focused on lexical knowledge in expert lipreaders. The participants were all individuals who reported English as their native language and as the language of their families, were educated in a mainstream and/or oral program for 8 or more years, and were skilled as lipreaders.

Estimates of the relative quantity of word experience for undergraduates with normal hearing are based on objective word frequency counts based on text corpora (e.g., Kucera & Francis, 1967 ). However, this approach has its detractors, especially for estimating experience with words that occur infrequently in the language (Gernsbacher, 1984 ). Furthermore, the approach is clearly insensitive to individual differences that may occur within or between populations of English language users with different lexical experience.

An alternative to using objective counts to estimate word experience is to collect subjective familiarity ratings by having participants rate their familiarity with each word using a labeled scale. Although several sources of knowledge likely contribute to these ratings, general agreement exists that familiarity partly reflects quantity of exposure to individual words. Auer, Bernstein, and Tucker ( 2000 ) compared and contrasted familiarity ratings collected from 50 hearing and 50 deaf college students. Judgments were made on a labeled scale from 1 (never seen, heard, or read the word before) to 7 (know the word and confident of its meaning). The within-group item ratings were similar ( r =.90) for the two participant groups. However, deaf participants consistently judged words to be less familiar than did hearing participants.

Another difference between the groups emerged upon more detailed analysis of the ratings within and across participant groups. Each participant group was split into 5 subgroups of 10 randomly selected participants. Mean item ratings for each subgroup were then correlated with those of the other nine subgroups (four within a participant group and five between). The correlation coefficients were always highest within a participant group. That is, deaf participants used the familiarity scale more like other deaf participants than like hearing participants. The results suggested that despite the global similarity between the two participant groups noted above, the two groups appear to have experienced different ambient language samples. Thus, these results point to the importance of taking into account individual experiential differences in studies of spoken word recognition.
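The subgroup analysis described above can be pictured as follows: treat the ratings as a participants-by-words matrix, split participants into subgroups, average each subgroup’s ratings per word, and correlate the subgroup profiles. The sketch below uses random placeholder data simply to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)
n_participants, n_words, n_subgroups = 50, 100, 5           # sizes chosen for illustration
ratings = rng.integers(1, 8, size=(n_participants, n_words)).astype(float)  # 1-7 scale

order = rng.permutation(n_participants)
subgroup_means = [ratings[idx].mean(axis=0)                  # mean rating per word in each subgroup
                  for idx in np.array_split(order, n_subgroups)]

corrs = [np.corrcoef(subgroup_means[i], subgroup_means[j])[0, 1]
         for i in range(n_subgroups) for j in range(i + 1, n_subgroups)]
print(round(float(np.mean(corrs)), 3))                       # mean between-subgroup correlation
```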

Another factor in the developmental history of an individual’s lexicon is the age at which words are acquired. The age of acquisition (AOA) effect—faster and more accurate recognition and production of earlier acquired words—has been demonstrated in hearing participants using several measures of lexical processing (for a review, see Morrison & Ellis, 1995 ). Ideally, AOA for words would be based on some objective measure of when specific words were learned. However, AOA is typically estimated by the subjective ratings of adults. These ratings have been shown to have both high reliability among raters and high validity when compared to objective measures of word acquisition (Gilhooly & Gilhooly, 1980 ).

Auer and Bernstein ( 2002 ) investigated the impact of prelingual hearing loss on AOA. Fifty hearing and 50 deaf participants judged AOA for the 175 words in form M of the Peabody Picture Vocabulary Test-Revised (PPVT; Dunn & Dunn, 1981 ) using an 11-point scale labeled both with age in years and a schooling level. In addition, the participants rated whether the words were acquired through speech, sign language, or orthography.

The average AOA ratings for stimulus items were highly correlated across participant groups ( r = .97) and with the normative order in the PPVT ( r = .95 for the deaf group and r = .95 for the hearing group), suggesting that the groups rated the words as learned in the same order as the PPVT assumes. However, the two groups differed in when (∼1.5 years difference on average) and how (hearing: 70% speech and 30% orthography; deaf: 38% speech, 45% orthography, 17% sign language) words were judged to have been acquired. Interestingly, a significant correlation ( r = .43) was obtained in the deaf participant group between the percent words correct on a lipreading screening test and the percentage of words an individual reported as having been learned through spoken language, with the better lipreaders reporting more words learned through spoken language. Taken together, the results suggested that despite global similarity between the two participant groups, they have learned words at different times and through different language modes.

Bimodal Speech Perception

The preceding sections reveal that individuals with severe or profound hearing losses can potentially obtain substantial speech information from auditory-only or visual-only speech stimuli. That visual speech can substantially enhance perception of auditory speech has been shown with listeners having normal hearing and hearing losses (e.g., Grant, Walden, & Seitz, 1998 ; Sumby & Pollack, 1954 ).

Estimates of how audiovisual speech stimuli can improve speech perception have been obtained from children and adults with hearing losses. Lamoré, Huiskamp, van Son, Bosman, and Smoorenburg ( 1998 ) studied 32 children with pure-tone average hearing losses in a narrow range around 90 dB HL. They presented the children with consonant-vowel-consonant stimuli and asked them to say and write down exactly what they heard, saw, or heard and saw. Extensive analyses of the results were carried out, but of particular interest here were the mean scores for totally correct responses in the auditory-only, visual-only, and audiovisual conditions. When the children were subdivided into groups according to their pure-tone averages, the group with the least hearing losses (mean 85.9 dB HL) scored 80% correct auditory-only, 58% visual-only, and 93% audiovisual. The group with the greatest hearing losses (mean 94.0 dB HL) scored 30% auditory-only, 53% visual only, and 74% audiovisual. The audiovisual combination of speech information was helpful at both levels, but especially for those with the greater hearing loss.

Grant et al. ( 1998 ) presented auditory, visual, and audiovisual sentence stimuli to adult listeners from across a range of hearing losses from mild to severe. Overall, sentence scores were audiovisual, 23–94% key words correct; audio only, 5–70% key words correct; and visual only, 0–20% key words correct. Every one of the listeners was able to improve performance when the stimuli were audiovisual. This was true even when the lipreading-only stimuli resulted in 0% correct scores. Benefit from being able to see the talker was calculated for each participant (benefit = (AV−A)/(100−A); A = audio only, AV = audiovisual). Across individuals, the variation was large in the ability to benefit from the audiovisual combinations of speech information: the mean benefit was 44% with a range from 8.5–83%.
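The benefit measure quoted above is easy to compute directly; the short sketch below applies it to a hypothetical listener (values invented). Multiplying by 100 gives the percentage form in which the chapter reports benefit.

```python
def av_benefit(audio_only_pct, audiovisual_pct):
    """Grant et al. (1998) benefit: proportion of the available headroom recovered audiovisually."""
    return (audiovisual_pct - audio_only_pct) / (100.0 - audio_only_pct)

# Hypothetical listener: 40% correct audio-only, 70% correct audiovisually
print(av_benefit(40, 70))   # 0.5, i.e., 50% benefit in the chapter's percentage terms
```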

That even highly degraded auditory information can provide substantial benefit in combination with lipreading has also been shown in adult listeners with normal hearing. Breeuwer and Plomp ( 1984 ) presented spoken sentences visually in combination with a range of processed auditory signals based on speech. Lipreading scores for the sentences were approximately 18% words correct. One particularly useful auditory signal combined with lipreading was a 500-Hz pure tone whose amplitude changed as a function of the amplitude in the original speech around that frequency. When this signal was combined with lipreading, the mean score for the audiovisual combination was 66% words correct. When the same stimulus was then combined with another pure tone at 3160 Hz, also changing in amplitude as a function of the amplitude changes in the original speech around that frequency, performance rose to a mean of 87% words correct. For neither type of auditory signal alone would there likely have been any words correctly identified. These results demonstrate that being able to hear even extremely limited speech information can be effective, as long as it is combined with visual speech.

Moody-Antonio and colleagues (Moody-Antonio et al., 2005 ) carried out a study of prelingually deaf adults who obtained cochlear implants as adults. The cochlear implant directly stimulates the auditory nerve to restore auditory function, but prelingually deaf adults have not been considered good candidates for cochlear implants. The participants in the study were tested using difficult isolated sentence materials. Their auditory perception was least accurate, followed by their lipreading. However, for almost all of the participants, the combination of lipreading and auditory information resulted in substantial improvements in sentence perception. In several cases, they obtained scores that were greater than the sum of the lipreading alone and the auditory alone scores. This study showed that even individuals with life-long profound hearing loss can benefit from audiovisual speech information, when the audio is derived from a cochlear implant.

Vibrotactile Cues

Under certain conditions, a hearing aid could provide useful vibrotactile information that could combine with seeing speech. Frequencies in the range of the voice pitch (approximately between 70 and 300 Hz) can be perceived by vibrotactile perception (Cholewiak & Collins, 1991 ). When hearing loss is profound, hearing aids must operate at high output levels that result in perceptible mechanical vibration (Bernstein, Tucker, & Auer, 1998 ). Boothroyd and Cawkwell ( 1970 ; see also Nober, 1967 ) studied the problem of distinguishing vibrotactile from auditory perception in adolescents with hearing losses. They found that sensation thresholds below 100 dB HL for frequencies as high as 1000 and even 2000 Hz might be attributable to detection of mechanical vibration of the skin rather than acoustic vibration.

Perception of information for voicing might be obtained via a hearing aid through mechanical stimulation of the skin and might account for why some individuals with profound hearing losses obtain benefit from their hearing aids when communicating via speech. That voicing information can combine effectively with lipreading has been demonstrated in a number of studies. For example, Boothroyd, Hnath-Chisolm, Hanin, and Kishon-Rabin ( 1988 ) presented an acoustic signal derived from the voice pitch in combination with sentences presented visually to hearing participants. The mean visual-only sentence score was 26% words correct, and the audiovisual sentence score was 63%. Furthermore, we and others have demonstrated, using custom vibrotactile devices, that lipreading can be enhanced when voice fundamental frequency information is presented as vibration patterns on the skin, although the vibrotactile studies have generally failed to produce the same impressive gains obtained with analogous auditory signals and hearing participants (Auer, Bernstein, & Coulter, 1998 ; Eberhardt, Bernstein, Demorest, & Goldstein, 1990 ; Boothroyd, Kishon-Rabin, & Waldstein, 1995 ).

Recently, in a functional magnetic resonance imaging (fMRI) study, Auer and colleagues (Auer, Bernstein, Sungkarat, & Singh, 2007 ) showed that prelingually deaf adults with lifelong hearing aid experience show activation of their primary auditory cortex when vibrotactile patterns derived from speech are presented to their hand. These results are consistent with the suggestion that high-power hearing aids deliver vibrotactile stimulation to the skin of the ear canal, thereby giving the brain the opportunity to learn vibrotactile patterns and effecting changes in areas of the auditory cortex.

Summary and Conclusions

Speech information can undergo extreme degradation and still convey the talker’s intended message. This fact explains why severe or profound hearing loss does not preclude perceiving a spoken language. Studies reviewed above suggest that listeners with hearing loss can profit from even minimal auditory information, if it is combined with visual speech information. Some individuals with profound hearing loss are able to perform remarkably well in auditory-only conditions and/or in visual-only conditions. However, the performance level that is achieved by any particular individual with hearing loss likely depends on numerous factors that are not yet well understood, including when their hearing loss occurred, the severity and type of the loss, their family linguistic environment, and their exposure to language (including their relative reliance on spoken vs. manual language).

Early research on speech perception in hearing people focused on perception of the segmental consonants and vowels. More recently, research has revealed the importance of perceptual processes at the level of recognizing words. The studies reviewed above suggest the possibility that factors at the level of the lexicon might interact in complex ways with specific hearing loss levels. A complete understanding of the effectiveness of speech perception for individuals with hearing loss will require understanding relationships among the configuration of the hearing loss, the ability to amplify selected frequency regions, and the distinctiveness of words in the mental lexicon. These complex relationships will, in addition, need to be considered in light of developmental factors, genetic predispositions, linguistic environment, linguistic experience, educational and training opportunities, effects of multisensory (i.e., visual and vibrotactile, in addition to auditory) stimulus conditions, and cultural conditions.

The terms “lipreading” and “speechreading” are sometimes used interchangeably and sometimes used to distinguish between, respectively, visual-only speech perception and audiovisual speech perception in people with hearing losses. We have used both terms for visual-only speech perception. In this chapter, “lipreading” refers to perception of speech information via the visual modality.

The place distinction concerns the position in the vocal tract at which there is critical closure during consonant production. For example, /b/ is a bilabial due to closure of the lips, and /d/ is a dental due to the closure of the tongue against the upper teeth. Manner concerns the degree to which the vocal tract is closed. For example, /b/ is a stop because the tract reaches complete closure. But /s/ is a fricative because air passes through a small passage. Voicing concerns whether or not and when the vocal folds vibrate. For example, /b/ is produced with vocal fold vibration almost from its onset, and /p/ is produced with a delay in the onset of vibration.

This correlation could have arisen because, at Gallaudet University, students who used their hearing aids more frequently were also more reliant on speech communication. That is, hearing aid use was a proxy in this correlation for communication preference/skill.

A phoneme is a consonant or vowel of a language that serves to distinguish minimal word pairs such as /b/ versus /p/ in “bat” versus “pat.”

Auer, E. T., Jr. ( 2002 ). The influence of the lexicon on speechread word recognition: Contrasting segmental and lexical distinctiveness. Psychonomic Bulletin and Review, 9, 341–347.

Auer, E. T., Jr., & Bernstein, L. E. ( 1997 ). Speechreading and the structure of the lexicon: Computationally modelling the effects of reduced phonetic distinctiveness on lexical uniqueness.   Journal of the Acoustical Society of America, 102(6), 3704–3710.

Auer, E. T., Jr., & Bernstein, L. E. ( 2002 ). Estimating when and how words are acquired: A natural experiment examining effects of perceptual experience on the growth of the mental lexicon. Manuscript submitted for publication.


Auer, E. T., Jr., & Bernstein, L. E. ( 2007 ). Enhanced visual speech perception in individuals with early onset hearing impairment.   Journal of Speech, Hearing, and Language Research, 50(5), 1157–1165.

Auer, E. T., Jr., Bernstein, L. E., & Coulter, D. C. ( 1998 ). Temporal and spatio-temporal vibrotactile displays for voice fundamental frequency: An initial evaluation of a new vibrotactile speech perception aid with normal-hearing and hearing-impaired individuals.   Journal of the Acoustical Society of America, 104, 2477–2489.

Auer, E. T., Jr., Bernstein, L. E., Sungkarat, W., & Singh, M. ( 2007 ). Vibrotactile activation of the auditory cortices in deaf versus hearing adults.   NeuroReport, 18(7), 645–648.

Auer, E. T., Jr., Bernstein, L. E., & Tucker, P. E. ( 2000 ). Is subjective word familiarity a meter of ambient language? A natural experiment on effects of perceptual experience.   Memory and Cognition, 28(5), 789–797.

Berger, K. W. ( 1972 ). Visemes and homophenous words.   Teacher of the Deaf, 70, 396–399.

Bernstein, L. E., Demorest, M. E., & Tucker, P. E. ( 1998 ). What makes a good speechreader? First you have to find one. In R. Campbell, B. Dodd, & D. Burnham (Eds.), Hearing by eye II: The psychology of speechreading and auditory-visual speech (pp. 211–228). East Sussex, UK: Psychology Press.

Bernstein, L. E., Demorest, M. E., & Tucker, P. E. ( 2000 ). Speech perception without hearing.   Perception and Psychophysics, 62, 233–252.

Bernstein, L. E., Tucker, P. E., & Auer, E. T., Jr. ( 1998 ). Potential perceptual bases for successful use of a vibrotactile speech perception aid.   Scandinavian Journal of Psychology, 39(3), 181–186.

Boothroyd, A. ( 1984 ). Auditory perception of speech contrasts by subjects with sensorineural hearing loss.   Journal of Speech and Hearing Research, 27, 134–143.

Boothroyd, A., & Cawkwell, S. ( 1970 ). Vibrotactile thresholds in pure tone audiometry.   Acta Otolaryngologica, 69, 381–387.

Boothroyd, A., Hnath-Chisolm, T., Hanin, L., & Kishon-Rabin, L. ( 1988 ). Voice fundamental frequency as an auditory supplement to the speechreading of sentences.   Ear and Hearing, 9, 306–312.

Boothroyd, A., Kishon-Rabin L., & Waldstein, R. ( 1995 ). Studies of tactile speechreading enhancement in deaf adults.   Seminars in Hearing, 16, 328–342.

Breeuwer, A., & Plomp, R. ( 1984 ). Speech reading supplemented with frequency-selective sound pressure information.   Journal of the Acoustical Society of America, 76, 686–691.

Catford, J. C. ( 1977 ). Fundamental problems in phonetics . Bloomington: Indiana University Press.

Ching, T. Y. C., Dillon, H., & Byrne, D. ( 1998 ). Speech recognition of hearing-impaired listeners: Predictions from audibility and the limited role of high-frequency amplification.   Journal of the Acoustical Society of America, 103, 1128–1139.

Cholewiak, R., & Collins, A. ( 1991 ). Sensory and physiological bases of touch. In M. A. Heller & W. Schiff (Eds.), The psychology of touch . Hillsdale, NJ: Lawrence Erlbaum Associates.

Dirks, D. D., Takayanagi, S., Moshfegh, A., Noffsinger, P. D., & Fausti, S. A. ( 2001 ). Examination of the neighborhood activation theory in normal and hearing-impaired listeners.   Ear and Hearing, 22,1– 13.

Dunn, L. M. & Dunn, L. M ( 1981 ). Peabody Picture Vocabulary Test-Revised . Circle Pines, MN: American Guidance Service.

Eberhardt, S. P., Bernstein, L. E., Demorest, M. E., & Goldstein, M. H. ( 1990 ). Speechreading sentences with single-channel vibrotactile presentation of voice fundamental frequency.   Journal of the Acoustical Society of America, 88, 1274–1285.

Fisher, C. G. ( 1968 ). Confusions among visually perceived consonants.   Journal of Speech and Hearing Research, 11, 796–804.

Fowler, C. A. ( 1986 ). An event approach to the study of speech perception from a direct-realist perspective.   Journal of Phonetics, 14, 3–28.

Gernsbacher, M. A. ( 1984 ). Resolving 20 years of inconsistent interactions between lexical familiarity and orthography, concreteness, and polysemy.   Journal of Experimental Psychology: General, 113, 256–281.

Gilhooly, K. J., & Gilhooly, M. L. M. ( 1980 ). The validity of age-of-acquisition ratings.   British Journal of Psychology, 71, 105–110.

Goldinger, S. D. ( 1998 ). Echoes of echoes? An episodic theory of lexical access.   Psychological Review, 105(2), 251–279.

Grant, K. W., Walden, B. E., & Seitz, P. F. ( 1998 ). Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration.   Journal of the Acoustical Society of America, 103, 2677–2690.

Howes, D. H. ( 1957 ). On the relation between the intelligibility and frequency of occurrence of English words.   Journal of the Acoustical Society of America, 29, 296–305.

Iverson, P., Bernstein, L. E., & Auer, E. T., Jr. ( 1998 ). Modeling the interaction of phonemic intelligibility and lexical structure in audiovisual word recognition.   Speech Communication, 26(1–2), 45–63.

Kucera, H., & Francis, W. ( 1967 ). Computational analysis of present-day American English . Providence, RI: Brown University.

Lahiri, A., & Marslen-Wilson, W. ( 1991 ). The mental representation of lexical form: A phonological approach to the recognition lexicon.   Cognition, 38, 245–294.

Lamoré, P. J. J., Huiskamp, T. M. I., van Son, N. J. D. M. M., Bosman, A. J., & Smoorenburg, G. F. ( 1998 ). Auditory, visual, and audiovisual perception of segmental speech features by severely hearing-impaired children.   Audiology, 37, 396– 419.

Liberman, A. M. ( 1982 ). On finding that speech is special.   American Psychologist, 37, 148–167.

Liberman, A. M., & Whalen, D. H. ( 2000 ). On the relation of speech to language.   Trends in Cognitive Sciences, 4, 187–196.

Luce, P. A. ( 1986 ). Neighborhoods of words in the mental lexicon (Research on Speech Perception, Technical Report No. 6). Bloomington: Speech Research Laboratory, Department of Psychology, Indiana University.

Luce, P. A., Goldinger, S. D., Auer, E. T., Jr., & Vitevitch, M. S. ( 2000 ). Phonetic priming, neighborhood activation, and PARSYN.   Perception and Psychophysics, 62(3), 615–625.

Luce, P. A., & Pisoni, D. B. ( 1998 ). Recognizing spoken words: The neighborhood activation model.   Ear and Hearing, 19, 1–36.

Luce, P. A., Pisoni, D. B., & Goldinger, S. D. ( 1990 ). Similarity neighborhoods of spoken words. In G. T. M. Altmann (Ed.), Cognitive models of speech processing (pp. 122–147). Cambridge, MA: MIT Press.

Marslen-Wilson, W. D. ( 1987 ). Functional parallelism in spoken word recognition.   Cognition, 25, 71– 102.

Marslen-Wilson, W. D. ( 1990 ). Activation, competition, and frequency in lexical access. In G. T. M. Altmann (Ed.), Cognitive models of speech processing (pp. 148–172). Cambridge, MA: MIT Press.

Marslen-Wilson, W. D. ( 1992 ) Access and integration: Projecting sound onto meaning. In W. D. Marslen-Wilson (Ed.), Lexical representation and process (pp. 3–24). Cambridge, MA: MIT Press.

Massaro, D. W. ( 1987 ). Speech perception by ear and eye: A paradigm for psychological inquiry . Hillsdale, NJ: Lawrence Erlbaum Associates.

Massaro, D. W. ( 1998 ). Perceiving talking faces: From speech perception to a behavioral principle. Cambridge, MA: Bradford Books.

Mattys, S., Bernstein, L. E., & Auer, E. T., Jr. ( 2002 ). Stimulus-based lexical distinctiveness as a general word recognition mechanism.   Perception and Psychophysics, 64, 667–679.

McClelland, J. L., & Elman, J. L. ( 1986 ). The TRACE model of speech perception.   Cognitive Psychology, 18, 1–86.

McEvoy, C., Marschark, M., & Nelson, D. L. ( 1999 ). Comparing the mental lexicons of deaf and hearing individuals.   Journal of Educational Psychology, 19, 312–320.

Miller, G. A., & Nicely, P. E. ( 1955 ). An analysis of perceptual confusions among some English consonants.   Journal of the Acoustical Society of America, 27, 338–352.

Mogford, K. ( 1987 ). Lip-reading in the prelingually deaf. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 191–211). Hillsdale, NJ: Lawrence Erlbaum Associates.

Mohammed, T., Campbell, R., MacSweeney, M., Milne, E., Hansen, P., & Coleman, M. ( 2005 ). Speechreading skill and visual movement sensitivity are related in deaf speechreaders.   Perception, 34(2), 205–216.

Montgomery, A. A., & Jackson, P. L. ( 1983 ). Physical characteristics of the lips underlying vowel lipreading performance.   Journal of the Acoustical Society of America, 73, 2134–2144.

Moody-Antonio, S., Takayanagi, S., Masuda, A., Auer, J. E. T., Fisher, L., & Bernstein, L. E. ( 2005 ). Improved speech perception in adult prelingually deafened cochlear implant recipients.   Otology and Neurotology, 26, 649–654.

Morrison, C. M., & Ellis, A. W. ( 1995 ). Roles of word frequency and age of acquisition in word naming and lexical decision.   Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(1), 116–133.

Nearey, T. M. ( 1997 ). Speech perception as pattern recognition.   Journal of the Acoustical Society of America, 101, 3241–3254.

Nitchie, E. B. ( 1916 ). The use of homophenous words.   Volta Review, 18, 83–85.

Nober, E. H. ( 1967 ). Vibrotactile sensitivity of deaf children to high intensity sound.   Larynoscope, 78, 2128–2146.

Norris, D. ( 1994 ). Shortlist: A connectionist model of continuous word recognition.   Cognition, 52,189– 234.

Owens, E., & Blazek, B. ( 1985 ). Visemes observed by hearing impaired and normal hearing adult viewers.   Journal of Speech and Hearing Research, 28, 381–393.

Remez, R. E. ( 1994 ) A guide to research on the perception of speech. In M. A. Gernsbacher (Ed.), Handbook of psycholinguistics (pp. 145–172). San Diego, CA: Academic Press.

Rönnberg, J. ( 1995 ). Perceptual compensation in the deaf and blind: Myth or reality? In R. A. Dixon & L. Bäckman (Eds.), Compensating for psychological deficits and declines (pp. 251–274). Mahwah, NJ: Lawrence Erlbaum Associates.

Rönnberg, J., Samuelsson, S., & Lyxell, B. ( 1998 ). Conceptual constraints in sentence-based lipreading in the hearing-impaired. In R. Campbell, B. Dodd, & D. Burnham (Eds.), Hearing by eye: II. The psychology of speechreading and auditory-visual speech (pp. 143–153). East Sussex, UK: Psychology Press.

Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. ( 1995 ). Speech recognition with primarily temporal cues.   Science, 270, 303–304.

Stevens, K. N. ( 1998 ). Acoustic phonetics . Cambridge, MA: MIT Press.

Sumby, W. H., & Pollack, I. ( 1954 ). Visual contribution to speech intelligibility in noise.   Journal of the Acoustical Society of America, 26, 212–215.

Summerfield, Q. ( 1991 ). Visual perception of phonetic gestures. In I. G. Mattingly & M. Studert-Kennedy (Eds.), Modularity and the motor theory of speech perception (pp. 117–137). Hillsdale, NJ: Lawrence Erlbaum Associates.

Takayanagi, S., Dirks, D. D., & Moshfegh, A. ( in press ). Lexical and talker effects on word recognition among native and non-native normal and hearing-impaired listeners.   Journal of Speech, Language, Hearing Research , 45 , 585–597.

Turner, C. W., & Brus, S. L. ( 2001 ). Providing low and mid-frequency speech information to listeners with sensorineural hearing loss.   Journal of the Acoustical Society of America, 109, 2999–3006.

COMMENTS

  1. Speech recognition

    Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT).

  2. What is Speech Recognition?

    Speech recognition, or speech-to-text recognition, is the capacity of a machine or program to recognize spoken words and transform them into text. It is an important feature in applications such as home automation and artificial intelligence. A minimal transcription sketch in Python appears after this list.

  3. 27 Spoken Word Recognition

    Spoken word recognition is the study of how lexical representations are accessed from phonological patterns in the speech signal. Because many fundamental problems in speech perception remain unsolved, researchers conventionally make two simplifying assumptions, the first being that the input is a string of phonemes produced by earlier speech perception processes. A toy sketch of mapping such a phoneme string onto lexical candidates appears after this list.

  4. What is Speech Recognition?

    A voice portal (sometimes called a vortal) is a web portal that can be accessed entirely by voice. Ideally, any type of information, service, or transaction found on the internet could be accessed through a voice portal.

  5. Spoken Word Recognition

    Listeners must integrate current speech sounds with previously heard speech in recognizing words. This motivates a hierarchy of representations that integrate the speech signal over time, localized to anterior regions of the superior temporal gyrus (STG). The lexicon is a memory component, a mental dictionary.

  6. Speech Recognition: Everything You Need to Know in 2024

    Speech recognition, also known as automatic speech recognition (ASR), speech-to-text (STT), and computer speech recognition, is a technology that enables a computer to recognize and convert spoken language into text. It uses AI and machine learning models to identify and transcribe different accents.

  7. Speech Recognition Definition

    Speech recognition is the capability of an electronic device to understand spoken words. A microphone records a person's voice and the hardware converts the signal from analog sound waves to digital audio. The audio data is then processed by software, which interprets the sound as individual words. A short sketch that inspects this digital representation appears after this list.

  8. Spoken Word Recognition: A Focus on Plasticity

    Psycholinguists define spoken word recognition (SWR) as, roughly, the processes intervening between speech perception and sentence processing, whereby a sequence of speech elements is mapped to a phonological wordform.

  9. Speech Recognition

    Speech recognition is concerned with converting the speech waveform, an acoustic signal, into a sequence of words. Today's best-performing approaches are based on statistical modelling of the speech signal. The chapter provides an overview of the main topics addressed in speech recognition, including acoustic-phonetic and lexical modelling. A toy illustration of the underlying statistical decision rule appears after this list.

  10. 2 Spoken Word Recognition

    There is a long list of factors known to influence the speed and accuracy with which a spoken word is recognized. Over 35 years ago, Cutler (1981) provided a list of several important factors known at the time to influence spoken word recognition, including the frequency with which the word occurs in the language, the length of the word, and the word's grammatical part of speech.

  11. Word Recognition

    In visual word recognition, a whole word may be viewed at once (provided that it is short enough), and recognition is achieved when the characteristics of the stimulus match the orthography (i.e., spelling) of an entry in the mental lexicon. Speech perception, in contrast, is a process that unfolds over time.

  12. How Does Speech Recognition Work? (9 Simple Questions Answered)

    Speech recognition is the process of converting spoken words into written or machine-readable text. It is achieved through a combination of natural language processing, audio inputs, machine learning, and voice recognition. Speech recognition systems analyze speech patterns to identify phonemes, the basic units of sound in a language.

  13. Automatic Speech Recognition Definition

    Automatic Speech Recognition (ASR), also known as speech-to-text, is the process by which a computer or electronic device converts human speech into written text. This technology is a subset of computational linguistics that deals with the interpretation and translation of spoken language into text by computers.

  14. Chapter 8

    This chapter introduces some key issues involved in the process of recognising spoken words: spoken word recognition is affected by how often and how recently words have been encountered, and by their similarity to other words. Words are often regarded as the basic building blocks of language.

  15. How does speech recognition software work?

    Speech recognition programs start by turning utterances into a spectrogram, a three-dimensional graph: time is shown on the horizontal axis, flowing from left to right; frequency is on the vertical axis, running from bottom to top; and energy is shown by the color of the chart, which indicates how much energy there is in each frequency of the sound at a given time. A short SciPy sketch that computes such a spectrogram appears after this list.

  16. Reading Universe

    The word recognition section of the Reading Universe Taxonomy is where each word recognition skill is broken out, because, in the beginning, significant time needs to be spent teaching skills in isolation so that students can master them.

  17. Word recognition

    Word recognition is a manner of reading based upon the immediate perception of what word a familiar grouping of letters represents. This process exists in opposition to phonetics and word analysis as a different method of recognizing and verbalizing visual language (that is, reading).

  18. Spoken word recognition in a second language: The importance of

    One example that illustrates the interaction between speech perception and word recognition difficulties is the English /ɹ/-/l/ contrast, which is misperceived by Japanese listeners. For instance, Cutler et al. (2006) suggested that the Japanese /r/ phoneme is activated for both L2-English /l/ and /ɹ/ (i.e., single-category assimilation).

  19. Dictate text using Speech Recognition

    You can also add words that are frequently misheard or not recognized by using the Speech Dictionary. To use the Alternates panel dialog box, open Speech Recognition by clicking the Start button, then All Programs, Accessories, Ease of Access, and Windows Speech Recognition.

  20. Speech Perception and Spoken Word Recognition

    The chapter describes several fundamental issues in speech perception and spoken word recognition, such as the use of amplification even with profound hearing loss, enhanced lipreading abilities associated with deafness, and the role of the lexicon in speech perception.

  21. Add, Delete, Prevent, and Edit Speech Dictionary Words in Windows 10

    Right-click or press and hold the Speech Recognition notification area icon on the taskbar and choose Open the Speech Dictionary, then click or tap Change existing words, followed by Edit a word.
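
As referenced in the entry defining speech recognition as transforming spoken words into text, here is a minimal transcription sketch in Python. It assumes the third-party SpeechRecognition package is installed and that a short recording exists at the placeholder path clip.wav; the recognize_google() call also needs an internet connection.

```python
# A minimal transcription sketch using the third-party SpeechRecognition
# package (pip install SpeechRecognition). "clip.wav" is a placeholder
# file name, and recognize_google() needs an internet connection.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:
    audio = recognizer.record(source)  # read the whole file into an AudioData object

try:
    print("Transcript:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("The recognizer could not understand the audio.")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```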
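
The spoken word recognition entry assumes the input is a string of phonemes that must be mapped onto lexical representations. The sketch below illustrates one simplified view of that mapping over an invented mini-lexicon: as each phoneme arrives, the set of compatible word candidates shrinks. The lexicon and transcriptions are hypothetical.

```python
# A toy sketch of incremental lexical access: the lexicon entries and
# phoneme transcriptions are invented, not taken from a real dictionary.
LEXICON = {
    "cat":     ["k", "ae", "t"],
    "captain": ["k", "ae", "p", "t", "ih", "n"],
    "candle":  ["k", "ae", "n", "d", "ah", "l"],
    "dog":     ["d", "ao", "g"],
}

def candidates(heard):
    """Words whose transcriptions begin with the phonemes heard so far."""
    return [w for w, p in LEXICON.items() if p[:len(heard)] == heard]

heard = []
for phoneme in ["k", "ae", "p"]:   # the listener hears /k/, /ae/, /p/ in sequence
    heard.append(phoneme)
    print(heard, "->", candidates(heard))
# ['k'] -> ['cat', 'captain', 'candle']
# ['k', 'ae'] -> ['cat', 'captain', 'candle']
# ['k', 'ae', 'p'] -> ['captain']
```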
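
The front-end entry notes that hardware converts analog sound waves into digital audio before software interprets it. The sketch below inspects that digital representation with Python's standard-library wave module; clip.wav is again a placeholder file name.

```python
# A minimal sketch that inspects the digital audio a recognizer receives.
# Standard library only; "clip.wav" is a placeholder path.
import wave

with wave.open("clip.wav", "rb") as wav:
    rate = wav.getframerate()      # samples per second (16000 Hz is common for ASR)
    channels = wav.getnchannels()  # 1 = mono, 2 = stereo
    width = wav.getsampwidth()     # bytes per sample (2 bytes = 16-bit audio)
    frames = wav.getnframes()

print(f"{rate} Hz, {channels} channel(s), {8 * width}-bit samples, "
      f"{frames / rate:.2f} seconds of audio")
```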
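
One entry above observes that the best-performing recognizers rest on a statistical model of the speech signal. A common way to express this is the noisy-channel decision rule: choose the word sequence W that maximizes P(X|W)·P(W) for the observed acoustics X. The candidate transcriptions and log-probabilities below are invented solely to show how an acoustic score and a language-model score combine.

```python
# A toy illustration of the decision rule W* = argmax_W P(X|W) * P(W).
# The candidate transcriptions and log-probabilities are invented.
candidates = {
    # candidate transcription: (log P(X|W) acoustic, log P(W) language model)
    "recognize speech":   (-12.0, -4.0),
    "wreck a nice beach": (-11.5, -7.5),
    "recognise peach":    (-13.0, -6.0),
}

def log_score(scores):
    acoustic, language = scores
    return acoustic + language  # log of P(X|W) * P(W)

for hypothesis, scores in candidates.items():
    print(f"{hypothesis!r}: combined log-score {log_score(scores):.1f}")

best = max(candidates, key=lambda w: log_score(candidates[w]))
print("Best hypothesis:", best)  # 'recognize speech' wins here
```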
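
Finally, the entry on how recognition software "sees" speech describes the spectrogram: time on the horizontal axis, frequency on the vertical axis, energy as color. The sketch below computes such a spectrogram with SciPy over a synthetic frequency sweep, so it runs without an audio file; the window and hop sizes are illustrative choices.

```python
# A minimal spectrogram sketch with SciPy; a synthetic frequency sweep
# stands in for recorded speech so the example is self-contained.
import numpy as np
from scipy.signal import chirp, spectrogram

fs = 16000                                    # 16 kHz sampling rate, common in ASR
t = np.linspace(0, 1.0, fs, endpoint=False)   # one second of samples
signal = chirp(t, f0=100, f1=4000, t1=1.0)    # sweep from 100 Hz up to 4 kHz

# 25 ms analysis windows (400 samples) with a 10 ms hop (160 new samples per frame).
freqs, times, energy = spectrogram(signal, fs=fs, nperseg=400, noverlap=240)

print("frequency bins x time frames:", energy.shape)
print("max frequency:", freqs.max(), "Hz; last frame at", round(times.max(), 3), "s")
```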