1, Sound is a wave. Sound audible to humans has a vibration frequency between 20 Hz and 20,000 Hz.
2, The process of speech production
The formation of speech: air flows from the lungs to the larynx, passes the vocal cords into the vocal tract, and finally radiates from the mouth as sound waves, forming speech.
3, Classification of sounds (concepts: understand + memorize)
Voiced sound: the vocal cords are tensed; when air flows through, the glottis opens and closes periodically, producing a periodic excitation airflow, as in a, o.
(A sound produced by vibration of the vocal cords.) Includes all vowels and some consonants.
Unvoiced sound: the vocal cords are fully relaxed; a constriction somewhere in the vocal tract forms a narrow passage that generates air turbulence, as in t, d.
(A sound not produced by vibration of the vocal cords.)
Plosive: the vocal cords are fully relaxed and some point of the vocal tract is completely closed; when the closure suddenly opens, the air pressure is released rapidly, as in b, p.
4, Two important acoustic characteristics of speech: pitch frequency and formants (memorize)
Pitch frequency (F0): determined by the size, properties, and tension of the vocal cords. Its value is the reciprocal of the time the vocal cords take to open and close once (that time is the definition of the pitch period). Human pitch frequency ranges from about 80 Hz to 500 Hz.
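The reciprocal relation between pitch period and pitch frequency can be sketched as follows. This is a minimal illustration, not a production pitch tracker: the function names, the 8 kHz sample rate, and the synthetic 200 Hz test signal are all assumptions made for the example, and plain autocorrelation is just one common way to estimate the period.

```python
import math

def pitch_frequency(period_s):
    """F0 in Hz is the reciprocal of the pitch period in seconds."""
    return 1.0 / period_s

def estimate_pitch_autocorr(x, fs, f_min=80.0, f_max=500.0):
    """Return fs / lag for the lag with the largest autocorrelation.

    Only lags that correspond to the 80-500 Hz range quoted in the text
    are searched.
    """
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

fs = 8000  # assumed sample rate in Hz
# Synthetic "voiced" signal: a 200 Hz fundamental plus its second harmonic.
x = [math.sin(2 * math.pi * 200 * n / fs) + 0.5 * math.sin(2 * math.pi * 400 * n / fs)
     for n in range(800)]
f0 = estimate_pitch_autocorr(x, fs)
```

Here the pitch period is 40 samples (5 ms), so the estimated F0 comes out at the 200 Hz fundamental rather than at the 400 Hz harmonic.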
Formants (Fn, n = 1, 2, ...): the vocal tract is a resonator. It amplifies some frequency components of the sound stream and attenuates others; the amplified frequencies are called formants, or formant frequencies.
5, Formant characteristics (understand):
Formants are an important acoustic characteristic of the vocal tract. The response of the vocal tract to an excitation signal can be approximated by a linear system with many pairs of poles, where each pole pair corresponds to one formant frequency. The frequency response of this linear system is called the formant characteristic; it determines the overall shape of the signal spectrum, that is, the spectral envelope.
The frequency characteristics of speech are determined mainly by the formants: the formant characteristic of the vocal tract determines the spectral character of the sound produced, that is, its timbre.
The timbre and distinguishing features of vowels depend mainly on the formant characteristics of the vocal tract, which can be observed in the magnitude-frequency response obtained by spectral analysis of the speech signal.
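To make the link between pole pairs and the spectral envelope concrete, the sketch below places one conjugate pole pair per formant and evaluates the magnitude response of the resulting all-pole system. The mapping used (pole angle from the formant frequency, pole radius from the formant bandwidth) is the usual textbook relation; the function names, the 8 kHz sample rate, and the three formant values are illustrative assumptions, not values from these notes.

```python
import cmath
import math

def pole_from_formant(freq_hz, bandwidth_hz, fs):
    """Upper-half-plane pole of the conjugate pair for one formant."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    angle = 2.0 * math.pi * freq_hz / fs
    return cmath.rect(radius, angle)

def formant_from_pole(pole, fs):
    """Inverse mapping: recover (frequency, bandwidth) in Hz from a pole."""
    freq_hz = cmath.phase(pole) * fs / (2.0 * math.pi)
    bandwidth_hz = -math.log(abs(pole)) * fs / math.pi
    return freq_hz, bandwidth_hz

def envelope(freq_hz, poles, fs):
    """|H| of the all-pole system at one frequency.

    Each pole in `poles` is paired with its complex conjugate.
    """
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    denom = 1.0 + 0j
    for p in poles:
        denom *= (1.0 - p * z_inv) * (1.0 - p.conjugate() * z_inv)
    return 1.0 / abs(denom)

fs = 8000  # assumed sample rate in Hz
formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]  # illustrative values
poles = [pole_from_formant(f, b, fs) for f, b in formants]

grid = [10.0 * k for k in range(401)]  # 0 to 4000 Hz in 10 Hz steps
mags = [envelope(f, poles, fs) for f in grid]
# Local maxima of the envelope: one peak is expected near each formant.
peaks = [grid[k] for k in range(1, 400) if mags[k - 1] < mags[k] > mags[k + 1]]
```

The peaks of the envelope fall close to the formant frequencies, which is the sense in which each pole pair "corresponds to" one formant.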
6, A complete digital model of speech signal generation (be able to draw the diagram + explain the principle and characteristics of each part of the model):
The speech signal can be regarded as the output of a linear shift-invariant system excited by a quasi-periodic sequence or a random noise sequence. The model has three parts: the excitation model, the vocal tract model, and the radiation model.
A complete digital model of the speech signal (key point)
One, Excitation model
a. Voiced excitation: when air flows through the tensed vocal cords, it sets them vibrating, which forms periodic pulses at the glottis that excite the vocal tract.
Because this pulse train resembles a train of triangular pulses, a unit-sample (impulse) train whose period equals the pitch period is used as the excitation.
b. Unvoiced excitation: the vocal cords are relaxed and do not vibrate, so air flows through the glottis directly into the vocal tract.
For unvoiced sounds, a constriction in the vocal tract creates turbulence, so the excitation can be modeled as random white noise.
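The two excitation types can be sketched in a few lines. The function names, the 8 kHz sample rate, the 100 Hz pitch, and the use of Gaussian noise with a fixed seed are assumptions chosen for the example; the notes only specify an impulse train at the pitch period and random white noise.

```python
import random

def voiced_excitation(n_samples, pitch_period):
    """Unit impulse train: 1 at every multiple of the pitch period (in samples), else 0."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def unvoiced_excitation(n_samples, seed=0):
    """Zero-mean, unit-variance Gaussian white noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

fs = 8000             # assumed sample rate in Hz
period = fs // 100    # a 100 Hz pitch gives 80 samples per period
voiced = voiced_excitation(400, period)
unvoiced = unvoiced_excitation(400)
```

In the voiced case the only parameter is the pitch period; in the unvoiced case there is no periodicity at all, only a noise level that the gain stage of the model would scale.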
Two, Vocal tract model
a. Acoustic tube model: the vocal tract is treated as a series connection of several tubes with different cross-sectional areas.
b. Formant model: the vocal tract is treated as a resonant cavity, and the formants are the resonance frequencies of that cavity.
Cascade form: suitable for ordinary vowels. The vocal tract is treated as a set of second-order resonators connected in series, which gives an all-pole model.
Parallel form: suitable for unusual vowels and most consonants. When these sounds are produced the cavity shows anti-resonances, so zeros must be added to the model to weaken the resonance strength, which leads to a pole-zero model.
Hybrid form: the series or parallel path is selected automatically according to the sound being produced. The parallel part also has a direct path with amplitude control factor AB, included for phonemes with a fairly flat spectrum, such as [f], [p], [b], to strengthen the anti-resonance characteristic.
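The all-pole case, second-order resonators connected in series, can be sketched with one difference equation per resonator. This is only a minimal illustration: the function names, the sample rate, and the formant values are assumptions, no gain normalization is applied, and the pole-zero and hybrid variants are not covered.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Coefficients (a1, a2) of y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle from frequency
    return 2.0 * r * math.cos(theta), -r * r

def resonator(x, a1, a2):
    """Run one second-order all-pole resonator over the sequence x."""
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + a1 * y1 + a2 * y2
        y.append(yn)
        y2, y1 = y1, yn
    return y

def cascade(x, formants, fs):
    """Pass x through one resonator per (frequency, bandwidth) pair, in series."""
    y = list(x)
    for freq_hz, bandwidth_hz in formants:
        a1, a2 = resonator_coeffs(freq_hz, bandwidth_hz, fs)
        y = resonator(y, a1, a2)
    return y

fs = 8000  # assumed sample rate in Hz
impulse = [1.0] + [0.0] * 49
h = cascade(impulse, [(700.0, 80.0), (1200.0, 100.0)], fs)  # illustrative formants
```

Because every stage is linear and shift-invariant, the order of the resonators in the series does not change the output, apart from rounding.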
Three, Radiation model
This covers the path along which the airflow from the vocal tract radiates from the lips to the listener's ear. The sound signal is attenuated along the way, and the radiation has a high-pass characteristic.
It is usually modeled by a first-order digital high-pass filter.
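A first-order high-pass filter of the kind mentioned above is commonly written as R(z) = 1 - r z^-1 with r slightly below 1. The sketch below assumes that form; the coefficient r = 0.97, the function names, and the sample rate are illustrative assumptions, not values given in these notes.

```python
import math

def radiation_filter(x, r=0.97):
    """First-order high-pass: y[n] = x[n] - r * x[n-1]."""
    y, prev = [], 0.0
    for xn in x:
        y.append(xn - r * prev)
        prev = xn
    return y

def radiation_gain(freq_hz, fs, r=0.97):
    """Magnitude of 1 - r*exp(-jw) at the given frequency."""
    w = 2.0 * math.pi * freq_hz / fs
    return abs(complex(1.0 - r * math.cos(w), r * math.sin(w)))

fs = 8000  # assumed sample rate in Hz
dc_gain = radiation_gain(0.0, fs)            # low frequencies are strongly attenuated
nyquist_gain = radiation_gain(fs / 2.0, fs)  # high frequencies pass with gain near 2
steady = radiation_filter([1.0, 1.0, 1.0, 1.0])
```

A constant input is almost cancelled after the first sample, while a rapidly alternating input would be reinforced, which is the high-pass behavior the radiation model calls for.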
Model summary:
1. This model is not fully complete, because it does not suit some sounds, such as voiced fricatives. These sounds have two kinds of excitation, voiced and unvoiced, which do not simply superpose. More accurate models can be used for such sounds.
2. The gain control in the digital speech production model (Av or AN) represents the intensity of the output speech;
the time-varying linear system mainly models the characteristics of the vocal tract;
3. The two basic problems of digital speech processing, speech analysis and speech synthesis, are both based on this model;
4. Features of this digital model:
System parameters are treated as fixed over a short interval: this is the basis of short-time analysis;
All-pole property: a zero can be approximated by many poles;
The excitation source and the vocal tract are independent of each other: this suits most digital speech processing.
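Putting the three parts together, one short frame of the model can be sketched as excitation, gain, vocal tract, then radiation. Everything concrete here is an assumption made for illustration: the function names, the all-pole cascade for the vocal tract, the radiation coefficient 0.97, the formant values, and the gains. The parameters are held fixed for the whole frame, in line with the short-time assumption above.

```python
import math
import random

def excitation(n, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, Gaussian white noise for unvoiced frames."""
    if voiced:
        return [1.0 if k % pitch_period == 0 else 0.0 for k in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def vocal_tract(x, formants, fs):
    """All-pole cascade: one second-order resonator per (frequency, bandwidth) pair."""
    y = list(x)
    for freq_hz, bandwidth_hz in formants:
        r = math.exp(-math.pi * bandwidth_hz / fs)
        a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
        a2 = -r * r
        out, y1, y2 = [], 0.0, 0.0
        for v in y:
            cur = v + a1 * y1 + a2 * y2
            out.append(cur)
            y2, y1 = y1, cur
        y = out
    return y

def radiation(x, r=0.97):
    """First-order high-pass: y[n] = x[n] - r * x[n-1]."""
    return [x[k] - (r * x[k - 1] if k > 0 else 0.0) for k in range(len(x))]

def synthesize(n, voiced, gain, formants, fs=8000, pitch_period=80):
    """One frame: the gain (Av or AN in the notes) scales the excitation."""
    e = [gain * v for v in excitation(n, voiced, pitch_period)]
    return radiation(vocal_tract(e, formants, fs))

FORMANTS = [(700.0, 80.0), (1200.0, 100.0), (2600.0, 120.0)]  # illustrative values only
voiced_frame = synthesize(1600, True, 1.0, FORMANTS)
unvoiced_frame = synthesize(1600, False, 0.1, FORMANTS)
```

Since the excitation and the vocal tract are independent in this model, switching between voiced and unvoiced output only changes the source; the filter stages stay the same.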
7, Narrowband and wideband spectrograms: definitions and characteristics (understand thoroughly)
Spectrogram: a display of how the spectrum of the speech signal changes over time. The abscissa is time and the ordinate is frequency.
Narrowband spectrogram: spectrograms are computed with the Fourier transform. With a longer analysis window (about 20 ms, corresponding to a bandwidth of about 45 Hz), the frequency resolution is high and the individual harmonic components are visible in the spectrum.
They appear on the spectrogram as equally spaced alternating dark and light horizontal bands, and the spacing between them is the fundamental frequency (F0).
Wideband spectrogram: if fewer samples are used in the transform (analysis window about 3 ms, corresponding to a bandwidth of about 300 Hz), the harmonic components cannot be seen in the spectrum,
and the equally spaced dark and light horizontal bands do not appear. The frequency resolution is low, but the time resolution is higher, and vertical striations are visible.
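The trade-off between the two window lengths can be worked through numerically. The two figures quoted above (20 ms with 45 Hz, 3 ms with 300 Hz) both give bandwidth times window length of about 0.9, so the sketch below uses that constant; it is inferred from those two figures and is an assumption, as are the function names, the 120 Hz example pitch, and the simple rules "bandwidth below F0" and "window shorter than the pitch period".

```python
def analysis_bandwidth_hz(window_s, time_bandwidth=0.9):
    """Approximate analysis bandwidth for a window of the given length in seconds."""
    return time_bandwidth / window_s

def describe(window_s, f0_hz):
    """Rough check of what a spectrogram with this window can resolve."""
    bw = analysis_bandwidth_hz(window_s)
    return {
        "bandwidth_hz": bw,
        # Harmonics are F0 apart, so they separate only if the bandwidth is below F0.
        "resolves_harmonics": bw < f0_hz,
        # Individual glottal pulses separate only if the window is shorter than one period.
        "resolves_pitch_pulses": window_s < 1.0 / f0_hz,
    }

narrow = describe(0.020, f0_hz=120.0)  # long window: narrowband spectrogram
wide = describe(0.003, f0_hz=120.0)    # short window: wideband spectrogram
```

For a 120 Hz voice, the 20 ms window separates the harmonics (horizontal bands) but smears individual pitch pulses, while the 3 ms window does the opposite and shows the pulses as vertical striations.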
In the frequency domain, the formants are where the energy is concentrated; on the spectrogram these are the darker regions.
For vowels, the sound is strong: the fundamental frequency and harmonics of the vocal cord vibration and the formants can all be seen clearly, and the energy is concentrated at low frequencies.
For a consonant produced without vocal cord vibration, no harmonic frequencies are visible. Such consonants are usually weak, so they look lighter, and their energy is concentrated at high frequencies.
During silence, when no sound is being produced, the spectrogram shows a gap.