[Speech Signal 9.5] - STFT, Filter bank

따옹 2023. 5. 10. 20:46

Found an amazing blog

https://sanghyu.tistory.com/37

STFT (Short-Time Fourier Transform) and the Spectrogram: Python Implementation and Meaning

Short-Time Fourier Analysis

Short-Time Fourier Transform

▫ Short-time windows

▫ Quasi-stationary

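Quasi-stationary speech

Speech is inherently non-stationary, but

within a short enough segment it can be assumed stationary.

For analysis, speech is cut into short segments of 20 to 30 ms,

and each segment is treated as stationary.

There are no specific model parameters; the transform is just a matrix multiplication, so it is: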

▫ Nonparametric

▫ No model-based

▫ Cf. model-based methods: Linear prediction, homomorphic filtering

(these use a model as the analysis tool, together with its coefficients)

Short-Time Analysis

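The signal is cut into short-time sections (frames).

The signal originally runs from -∞ to ∞; multiplying it by a window function produces a short-time section.

For speech, the full -∞ to ∞ range is meaningless anyway, so we look at it in short segments.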

Fourier Transform View

- w[m] non-zero [0, Nw-1] : analysis window, analysis filter

- f_n[m] = x[m] w[n-m] : short-time section of x[m] at time n

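To build the flipped window w[n-m]:

shift w[-m] by n,

so that the windowing happens at the desired position.

n : position of the window

ω : frequency variable of the Fourier transform (a continuous variable)

X(n, ω) gives two-dimensional data over time and frequency.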
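The definitions above can be evaluated literally. The sketch below assumes the usual STFT sum X(n, ω) = Σ_m x[m] w[n-m] e^(-jωm) built from f_n[m] = x[m] w[n-m]; the function name `stft_at`, the 8 kHz sampling rate, the 1 kHz test tone and the 20 ms Hamming window are illustrative choices, not values from the lecture.

```python
import numpy as np

def stft_at(x, w, n, omega):
    """X(n, omega) = sum_m x[m] w[n-m] e^{-j omega m}, with w nonzero on [0, Nw-1]."""
    Nw = len(w)
    total = 0j
    # w[n-m] is nonzero only for n-Nw+1 <= m <= n
    for m in range(max(0, n - Nw + 1), min(len(x), n + 1)):
        total += x[m] * w[n - m] * np.exp(-1j * omega * m)
    return total

fs = 8000                                  # assumed sampling rate (Hz)
t = np.arange(400) / fs
x = np.cos(2 * np.pi * 1000 * t)           # 1 kHz test tone
w = np.hamming(160)                        # 20 ms analysis window

# Evaluate at window position n = 300, on and off the tone frequency
on_tone = abs(stft_at(x, w, 300, 2 * np.pi * 1000 / fs))
off_tone = abs(stft_at(x, w, 300, 2 * np.pi * 3000 / fs))
print(on_tone, off_tone)
```

The magnitude is large at the tone's frequency and close to zero elsewhere, which is the one column of the time-frequency picture that belongs to window position n.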

Short-Time Fourier Transform

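The magnitude is an even function about the origin (for real x[n]).

If the signal is delayed (shifted) by n0:

FT : only the phase is affected, but

STFT : the result is additionally shifted by n0 along the time axis.

The lower line (the STFT case) is the one with the additional change.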

Discrete STFT

• Discrete version of the STFT: sampling in the frequency domain

Replace ω with 2πk/N (so the frequency axis is indexed by k)

2π/N : frequency sampling interval

▫ N : frequency sampling factor

• The two-dimensional function |X(n, ω)|² is called the spectrogram.

Two-dimensional over time and frequency; |X(n, ω)|² is the squared magnitude.

• Short window (= one pitch period) => wideband spectrogram : vertical striations

With a short window (one pitch period), time resolution is good but frequency resolution is poor => vertical striations appear.

• Long window (= 2~3 pitch periods) => narrowband spectrogram : horizontal striations

With a long window, frequency resolution is good but time resolution is poor => horizontal striations appear.

Time and Frequency Decimation

https://ko.wikipedia.org/wiki/%EB%8B%A4%EC%9A%B4%EC%83%98%ED%94%8C%EB%A7%81

Downsampling - Wikipedia (Korean)

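Decimation is the process of reducing a signal's sampling rate.

Here the window is moved along the waveform L samples at a time (with overlap) -> windowing

In practice, decimation of X(n, k) in time by a factor L yields X(nL, k) -> Frame shift by L

FT result:

1-D data -> 2-D data

One column is produced per window position.

Along time, the amount of data drops to 1/L, but the amount of data along the Y (frequency) axis increases.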
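A minimal sketch of the frame shift, assuming an N-point DFT per frame. The helper name `discrete_stft` and the sizes (200-sample Hamming window, L = 80, N = 256) are made up for illustration. The window is applied unflipped here, which does not matter for a symmetric window.

```python
import numpy as np

def discrete_stft(x, w, L, N):
    """X(nL, k): place the window every L samples and take an N-point DFT per frame."""
    Nw = len(w)
    n_frames = 1 + (len(x) - Nw) // L
    frames = np.stack([x[i * L : i * L + Nw] * w for i in range(n_frames)])
    return np.fft.fft(frames, n=N, axis=1)      # shape (n_frames, N)

x = np.random.default_rng(0).standard_normal(1600)
w = np.hamming(200)

X_hop1 = discrete_stft(x, w, L=1, N=256)        # window moved one sample at a time
X_hop80 = discrete_stft(x, w, L=80, N=256)      # frame shift L = 80
spectrogram = np.abs(X_hop80) ** 2

print(X_hop1.shape, X_hop80.shape)
```

Keeping every 80th row of the densely shifted result gives exactly the L = 80 result, so the frame shift is decimation along the time axis, while each remaining frame still contributes a full column of N frequency samples.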

Time-Frequency Resolution Tradeoffs (1)

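Flipping -> the minus sign on m (w[-m])

Shift by n -> a phase term e^(-jωn) is attached

Multiplication in the time domain -> convolution in the frequency domain

What we originally wanted was the spectrum of x[m], but since it is non-stationary we chop it up and multiply by a window function.

That distorts x[m]; in the end we want X(n, ω) to stay close to the true spectrum X(ω) of x[m].

For that, the window's spectrum W(ω) would have to be close to an impulse.

W(ω) close to an impulse -> a constant in the time domain (the same value from -∞ to ∞) => which defeats the purpose of limiting the time span

To get out of this dilemma,

we have to choose a window with a suitable trade-off.

Constant in the time domain -> the worst possible time resolution (everything is averaged over all time),

but the best possible frequency resolution.

Conversely,

giving up frequency resolution makes the time-domain resolution relatively better.

No window makes both as good as possible at once, so a compromise is needed.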

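FT result when the window is made narrow (favoring time resolution):

The peaks should come out sharp, but

(a) the window is so short that frequency resolution is poor and the spectrum is smeared.

(b) Giving up some time resolution improves frequency resolution,

so a proper frequency analysis becomes possible,

and the result is much sharper (clearer).

Then what if the window gets much wider?

The frequency content at the two ends and in the middle differ a lot, so the result shows three components mixed together.

So, as the last figure shows, the window duration has to be limited.

But limiting it too aggressively is bad for (frequency) resolution.

(a) With a long window each frequency component is resolved well; (b) with a short window it smears out.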
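The smearing can be reproduced numerically. This sketch assumes two tones at 1000 Hz and 1100 Hz sampled at 8 kHz and compares a 5 ms and a 50 ms Hamming window; these numbers and the helper `count_peaks` are illustrative, not taken from the slides.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1100 * t)

def count_peaks(Nw, nfft=8192):
    """Count spectral peaks above half the maximum for an Nw-sample Hamming window."""
    seg = x[:Nw] * np.hamming(Nw)
    mag = np.abs(np.fft.rfft(seg, nfft))
    thresh = 0.5 * mag.max()
    interior = mag[1:-1]
    is_peak = (interior > mag[:-2]) & (interior > mag[2:]) & (interior > thresh)
    return int(is_peak.sum())

short = count_peaks(40)     # 5 ms window: main lobe is far wider than the 100 Hz spacing
long_ = count_peaks(400)    # 50 ms window: main lobe is narrower than the spacing
print(short, long_)
```

With the short window the two tones merge into a single broad peak; with the long window they show up as two separate peaks, at the cost of averaging over ten times more signal.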

Constant-Q Analysis/Synthesis

STFT : Time-frequency resolution trade-offs

▫ Uncertainty principle

Between time and frequency: when one gets better, the other gets worse.

▫ Short window : good temporal resolution but poor frequency resolution

▫ Long window : good frequency resolution but poor time resolution

• Constant-Q resolution

▫ Time resolution increases and frequency resolution decreases with increasing frequency.

As frequency goes up, one period gets shorter => the stretch of time needed to analyze that frequency can be shorter.

-> At higher frequencies, time resolution can be raised a lot without any problem.

▫ Higher frequency, wider bandwidth

▫ This is the concept of the wavelet transform.

It is a faithful implementation of this idea.
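As a rough numeric illustration, constant Q means the ratio of center frequency to bandwidth stays fixed. The sketch below assumes octave-spaced center frequencies, Q = 4, and an analysis duration taken as 1/bandwidth; all three are assumptions made for the example, and `constant_q_bands` is a hypothetical helper.

```python
def constant_q_bands(f_min, n_octaves, Q):
    """Return (center frequency, bandwidth, analysis duration) per band."""
    bands = []
    for m in range(n_octaves + 1):
        fc = f_min * 2 ** m      # center frequency doubles from band to band
        bw = fc / Q              # bandwidth grows with frequency (Q = fc / bw)
        dur = 1.0 / bw           # time span needed shrinks as bandwidth grows
        bands.append((fc, bw, dur))
    return bands

bands = constant_q_bands(f_min=125.0, n_octaves=5, Q=4.0)
for fc, bw, dur in bands:
    print(f"fc={fc:7.1f} Hz  bandwidth={bw:6.1f} Hz  duration={1000 * dur:6.2f} ms")
```

Each step up in frequency doubles the bandwidth (coarser frequency resolution) and halves the duration (finer time resolution).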

Wavelet Transform (1)

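Short-time transform : every frequency band is analyzed with the same window size, so frequency resolution is the same everywhere (and so is time resolution).

To instead reduce frequency resolution as frequency goes up,

at high frequencies we improve time resolution at the cost of frequency resolution.

This actually suits natural signals better.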

Wavelet transform can be thought of as a collage of pieces of spectrograms based on different analysis window and thus different time-frequency resolutions

Wavelet Transform (2)

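There is a smooth waveform with (high-frequency) noise in the middle.

Apart from the noise, the waveform is a mix of two frequencies.

But the STFT struggles to analyze this properly.

Shorten the window to tell the two signals apart in time => frequency resolution gets worse,

and the two frequencies show up smeared together.

Sacrificing time resolution instead moves toward the panels on the right.

Either way it cannot be interpreted properly, so the wavelet transform is needed.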

Continuous Wavelet Transform (CWT) (1)

Scaling factor a (it ties time and frequency resolution together and determines both)

1/√a : energy normalization factor

As the width changes, this factor normalizes the wavelet so its energy stays constant.
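The effect of the 1/√a factor can be checked numerically. The sketch assumes a Gaussian-windowed complex tone as the basic wavelet h(t), chosen only for illustration since the notes do not fix h, and integrates |(1/√a) h(t/a)|² for several scales.

```python
import numpy as np

def scaled_wavelet(t, a, omega0=5.0):
    """(1/sqrt(a)) h(t/a) for an illustrative Gaussian-windowed complex tone h."""
    u = t / a
    return np.exp(-0.5 * u ** 2) * np.exp(1j * omega0 * u) / np.sqrt(a)

t = np.linspace(-40.0, 40.0, 80001)
dt = t[1] - t[0]

# Energy of the scaled wavelet for a few scales (simple Riemann sum)
energies = {a: float(np.sum(np.abs(scaled_wavelet(t, a)) ** 2) * dt)
            for a in (0.5, 1.0, 2.0, 4.0)}
print(energies)
```

The wavelet gets wider or narrower with a, but its energy stays the same (√π for this particular h), which is what the normalization factor is for.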

Continuous Wavelet Transform (CWT) (2)

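a > 1 : the wavelet stretches out to both sides -> time resolution gets worse, frequency resolution gets better

a < 1 : time resolution gets better, frequency resolution gets worse

Seen another way, it is x(t) convolved with the associated wavelet.

(1/√a) h*(-τ/a) : the wavelet

The signal can be reconstructed again.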

Inverse transform

where a smaller scale a corresponds to wider-bandwidth filters.

• We refer to |X_w(τ, a)|² as the scalogram.

• The scale is analogous to frequency.

• The x(t) can be recovered from a superposition of wavelets, h_τ,a(t), i.e., the inverse continuous wavelet transform (ICWT)

w(t) e^(jω_0 t) : a Hamming-window envelope with a frequency inside it (set by ω_0)

Because of the scaling by a, the frequency components are spaced non-uniformly rather than linearly, so the vertical-axis spacing is wildly uneven.

Discrete Wavelet Transform

What changes compared with the continuous version (scale, shift):

• Discretize scale and shift (a, τ).

Dyadic (or Octave) Sampling

The scale a_m keeps doubling.

For each scale a_m = 2^m for m = 1, 2, 3, ...

we shift at τ_n = n a_m for n = 1, 2, 3, ...

When a doubles, time resolution gets 2x worse and frequency resolution gets 2x better.
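The dyadic grid is easy to list out. This sketch enumerates the shifts τ_n = n a_m for a_m = 2^m over a short signal; the helper `dyadic_grid` and the duration of 16 samples are illustrative.

```python
def dyadic_grid(n_scales, duration):
    """Map each scale a_m = 2^m to its shift positions tau_n = n * a_m within `duration`."""
    grid = {}
    for m in range(1, n_scales + 1):
        a_m = 2 ** m
        grid[a_m] = [n * a_m for n in range(1, duration // a_m + 1)]
    return grid

grid = dyadic_grid(n_scales=3, duration=16)
for a_m, shifts in grid.items():
    print(f"a = {a_m}: shifts = {shifts}")
```

Each doubling of the scale halves the number of shift positions, matching the note that time resolution gets 2x worse as a doubles.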

Comparison of STFT and DWT

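STFT : uniform across the frequency domain, and the time length was the same everywhere => the same window function was used throughout

Time resolution is constant regardless of frequency.

cf) DWT : the scale doubles each step; the frequency band changes and time resolution drops along with it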

Implementation of DWT (1)

• Discrete wavelet transform (DWT)

Input signal * h

gives the result C.

Multiplying C by h again and summing twice (over both indices) recovers the original time-domain signal.

DWT can be implemented by an iterative cascade of identical stages, each stage consisting of a lowpass[by P(ω)] and highpass[by Q(ω)] decomposition of the signal followed by 2:1 downsampling.

To recover (invert) the original signal, a similar iterative structure can be used.

Implementation of DWT (2)

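A filter bank splits the signal into a high-frequency and a low-frequency band.

After the split there are two bands,

so the sampling rate can also be cut to 1/2.

Then split that in half again,

and keep repeating.

up-sampling -> going through the reconstruction stages recovers the original signal

So are synthesis and the inverse the same thing?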
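A small sketch of the cascade, assuming Haar filters for the lowpass/highpass pair because they are the simplest choice; the lecture's P(ω) and Q(ω) are not specified, so this is only one possible instance. Each stage filters and downsamples 2:1, and the lowpass branch is split again.

```python
import numpy as np

def haar_analysis(x):
    """One stage: Haar lowpass/highpass followed by 2:1 downsampling."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_synthesis(low, high):
    """Inverse of one stage: upsample and recombine the two bands."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

def dwt(x, levels):
    """Repeatedly split the lowpass branch; return final approximation and details."""
    approx, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        approx, d = haar_analysis(approx)
        details.append(d)
    return approx, details

def idwt(approx, details):
    """Run the synthesis stages in reverse order."""
    for d in reversed(details):
        approx = haar_synthesis(approx, d)
    return approx

x = np.arange(16.0)
approx, details = dwt(x, levels=3)
x_rec = idwt(approx, details)
print([len(d) for d in details], len(approx))
```

In this sketch, running the synthesis stages in reverse order reproduces the input exactly, so for this particular filter pair the synthesis cascade acts as the inverse transform.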

Gamma-Tone Filter (1)

The envelope of the impulse response is a gamma function.

• The gamma-tone filter (GTF) attempts to realize the tuning curves of the auditory filters.

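When extracting the signal in each band,

the filter bank's response looks like a gamma signal.

It plays the role of separating the per-band signals in the cochlea.

The filter used to build the filter bank is the gammatone filter => it models the filter bank that extracts the cochlea's band signals

The shape is a tone multiplied inside a gamma envelope.

The tone is a sinusoid; gamma envelope times sinusoid is the gammatone filter we have been talking about.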
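The "gamma envelope times tone" shape can be sketched as t^(n-1) e^(-2πbt) cos(2π f_c t). The filter order of 4 and the ERB-based bandwidth b = 1.019 · ERB(f_c) with ERB(f) = 24.7 (4.37 f / 1000 + 1) are commonly used values assumed here, not values from the lecture; `gammatone_ir` is a hypothetical helper name.

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, duration=0.05):
    """Gammatone impulse response: gamma-function envelope multiplied by a tone at fc."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)     # assumed ERB formula
    b = 1.019 * erb                             # assumed bandwidth parameter
    envelope = t ** (order - 1) * np.exp(-2 * np.pi * b * t)
    return envelope * np.cos(2 * np.pi * fc * t)

fs = 16000
g = gammatone_ir(fc=1000.0, fs=fs)

# Locate the peak of the magnitude response
freqs = np.fft.rfftfreq(4096, d=1 / fs)
peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(g, 4096)))]
print(peak_hz)
```

The magnitude response peaks near the tone frequency, so a bank of such filters with different f_c values gives one band per cochlear channel.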

Frequency response of various forms of GTF

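Doing the frequency analysis gives this:

the high-frequency side is very steep,

the low-frequency side is very gentle,

so the response is asymmetric.

Variants that push this to the extreme for mathematical simplicity: APGF, OZGF

Gamma-Tone Filter (3)

All-pole gamma-tone filter (APGF)

▫ By discarding the zeros of the original GTF

With no zeros it consists of poles only -> simple behavior

Low-frequency tail is constant (unaffected by the BW)

Simple to implement, so this is the one that gets used.

The special case N=2 is the Patterson symmetric filter (just noting that it exists).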

Mel-Frequency Filter Bank

0-1kHz: linearly-spaced center frequency (100Hz)

• 1kHz-8kHz: Mel-scaled center frequency

• Triangular window for speech recognition

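Modeling the cochlea as a set of frequency bands,

and laying those bands out along the frequency axis,

they are narrow at low frequencies and wide at high frequencies -> a non-uniform filter bank

Ways to model that:

the Bark and mel scales -> here we use mel

For convenience, the usual construction is a bank of triangular windows, each at its maximum at its own center frequency and zero at the neighboring center frequencies, so adjacent filters overlap by half.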
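A minimal sketch of such a triangular filter bank. It assumes the common formula mel = 2595 log10(1 + f/700) applied over the whole range, which approximates rather than reproduces the piecewise 100 Hz-linear / mel layout on the slide; the helper names and the sizes (24 filters, 512-point FFT, 16 kHz) are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters: 1 at their own center, 0 at the neighboring centers."""
    # n_filters + 2 edge points equally spaced on the mel axis
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, ctr, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb, edges_hz

fb, edges_hz = mel_filterbank(n_filters=24, n_fft=512, fs=16000)
print(fb.shape, np.round(edges_hz[:4], 1), np.round(edges_hz[-3:], 1))
```

The spacing between center frequencies grows with frequency, so the filters are narrow at the low end and wide at the high end. Applying the bank to a power spectrum is a single matrix product, `fb @ power_spectrum`.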

Summary

Filter bank

▫ Time-frequency resolution trade-offs

▫ Short window : good temporal resolution but poor frequency resolution

▫ Long window : good frequency resolution but poor time resolution

•Wavelet transform

▫ Low frequency region: High frequency resolution, low temporal resolution

▫ High frequency region: Low frequency resolution, high temporal resolution

These properties are imposed on purpose.

▫ Dyadic (or octave) sampling

▫ Nonuniform filter bank

Since the resolution changes with frequency, the frequency bands end up non-uniform.

▫ Quadrature mirror filters(QMF)

The two filters are in fact mirror images of each other; that is the name given to this kind of filter bank.

Introduction to Homomorphic Processing

• Linear filtering can separate signals that are added together and have disjoint spectral content.

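When two signals are added together, picking out a particular band => we can say they have been separated.

But what if they are not added, but combined like this:

How can we separate signals that are convolved (time domain) or multiplied (frequency domain) with another signal?

▫ Distortion by a transmission channel or by a flawed recording device : minimize the distorting component and pull out only the original signal

Convolution in the time domain => multiplication in the frequency domain

How can such things be separated?

It would be nice to strip out the distortion as much as possible and keep only the original signal.

Is that even possible?

That is what homomorphic processing is for.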

Homomorphic Filtering

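u_g[n] : source * vocal tract (filter) -> speech s[n]

source : characteristics of the individual speaker

filter : carries the actual message of the speech

For speech recognition we want the filter as cleanly as possible.

Speaker recognition -> we want the source.

We want to take the speech signal and separate the two,

recovering the original signals.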

h[n] : all-pole filter

Concept

where L represents a linear operator and α a scaling factor.

• A consequence of superposition is the capability of linear systems to separate signals that fall in disjoint frequency bands.

Concept (2)

What we want now: can this still be done when the components are much harder to pull apart?

In general, a homomorphic system can use different operators also in the output space

Concept (3)

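If convolution can be turned into addition, linear filtering can easily pick out a particular component, and the inverse mapping takes it back to the original domain.

Let's try an operation of that kind.

If the operation turns the signal into a sum of an h term and a p term, p looks like straight lines (spikes) and h looks like a hill.

Windowing then keeps only h,

and mapping back to the original domain should give the response of the vocal tract.

Let's try exactly that.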
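A toy version of this plan, under stated assumptions: the source p is an impulse train with an 80-sample period, the "vocal tract" h is a single decaying resonance near 500 Hz, and the operation is log-magnitude followed by an inverse FFT. All signal parameters and the 30-sample window cutoff are made up for illustration.

```python
import numpy as np

fs, P = 8000, 80                                   # assumed: 100 Hz pitch at 8 kHz
n = np.arange(200)
h = 0.95 ** n * np.cos(2 * np.pi * 500 / fs * n)   # toy vocal-tract response
p = np.zeros(2000)
p[::P] = 1.0                                       # impulse-train source
s = np.convolve(p, h)[:2000]                       # s = p * h

frame = s[800:1200] * np.hamming(400)              # one analysis frame
N = 1024
log_mag = np.log(np.abs(np.fft.fft(frame, N)) + 1e-12)
c = np.fft.ifft(log_mag).real                      # h and p terms are now (roughly) added

# p shows up as a spike at the pitch period
pitch_quefrency = 40 + int(np.argmax(c[40:200]))

# Windowing: keep only the low part, where the h "hill" lives
keep = np.zeros(N)
keep[:30] = 1.0
keep[-29:] = 1.0                                   # matching negative indices
envelope = np.exp(np.fft.fft(c * keep).real)       # smoothed spectral envelope

print(pitch_quefrency)
```

The spike lands at the pitch period, and windowing it away leaves a smooth envelope that can be read as an estimate of the vocal-tract magnitude response.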

Homomorphic Systems for Convolution (1)

• The z-transform and logarithm can convert convolution to addition.

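Take the z-transform of the input, then the log.

Then take the inverse z-transform (to compensate for the forward transform).

Up to the Z^-1 stage the two components are added => pick out one with linear filtering, then recover the signal with the inverse operations

Evaluate the z-transform on the unit circle (z = e^jω), i.e. the Fourier transform.

Replace the z-transform with the FT and take the log.

The magnitude and phase come out as a sum: log X(ω) = log|X(ω)| + j∠X(ω)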
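A quick numerical check that the log of the Fourier transform turns convolution into addition, shown here for the magnitude part only. The two random sequences stand in for p and h, and the FFT length is chosen long enough that the circular and linear convolutions agree.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(8)
p = rng.standard_normal(8)
s = np.convolve(p, h)                     # time-domain convolution, length 15

N = 64                                    # N >= len(s), so S = P * H holds exactly
H, P, S = (np.fft.fft(v, N) for v in (h, p, s))

log_S = np.log(np.abs(S))
log_sum = np.log(np.abs(P)) + np.log(np.abs(H))
print(np.max(np.abs(log_S - log_sum)))
```

The difference is at rounding-error level: after the log, the two components are simply added, which is what makes linear filtering applicable.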

Complex Cepstrum

Since x[n] is real,

log|X(ω)| is real and an even function,

the phase is the imaginary part and an odd function,

so x̂[n] comes out real.

x̂[n] = FT of the original signal, then log, then inverse FT

Taking the FT of x̂[n] gives log X(ω).

Real Cepstrum

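So far the phase was carried along the whole way;

in the real cepstrum

we want to throw the phase away

and keep only the magnitude. In practice, time-reversing the complex cepstrum flips the sign of its odd part,

so averaging the two removes the phase and leaves only the magnitude => c[n] = (x̂[n] + x̂[-n]) / 2, i.e. only the even part of the complex cepstrum

(Note)

"Complex" does not mean the cepstrum values are complex; it means X(ω) is kept as a complex quantity.

The real cepstrum

uses only the real part of log X(ω); the phase is discarded.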
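The relation between the two cepstra can be checked on a small example. The sketch assumes a short minimum-phase sequence, x = (1 - 0.2 z^-1)(1 - 0.3 z^-1), so that the phase needs no linear-phase correction before unwrapping; general signals would need that extra step, which is omitted here.

```python
import numpy as np

def complex_cepstrum(x, N):
    """Inverse FFT of log|X| + j*phase (valid as written for minimum-phase input)."""
    X = np.fft.fft(x, N)
    log_X = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))
    return np.fft.ifft(log_X).real

def real_cepstrum(x, N):
    """Inverse FFT of log|X| only; the phase is discarded."""
    X = np.fft.fft(x, N)
    return np.fft.ifft(np.log(np.abs(X))).real

x = np.array([1.0, -0.5, 0.06])
N = 256
x_hat = complex_cepstrum(x, N)
c = real_cepstrum(x, N)

x_hat_reversed = np.roll(x_hat[::-1], 1)   # x_hat[-n], indices taken modulo N
even_part = 0.5 * (x_hat + x_hat_reversed)

print(x_hat[:3], c[:3])
```

Both cepstra are real-valued, and the real cepstrum matches the even part of the complex cepstrum. For this sequence x̂[1] = -(0.2 + 0.3) = -0.5 and c[1] is half of that.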

cepstrum ?

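It started out as "spectrum":

time domain -> frequency domain (spectrum)

Take the log in the frequency domain, then the inverse Fourier transform (cepstrum)

This does go back toward the time domain, but with the log in between it is not really the time domain; it lands in a third domain, the cepstrum domain, rather than the spectrum.

n : time -> frequency -> log -> inverse transform should bring us back to n, but this n is not the time-domain n,

so it is called quefrency ("frequency" with its letters rearranged).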

Cepstrum

(top) complex cepstrum

(bottom) real cepstrum