Self-supervised speech features encode both content and speaker information.
Recent work introduced an SVD-based factorisation that decomposes these features into a shared content matrix capturing temporal variation and speaker-specific transformations capturing static speaker characteristics.
However, how information is organised within these components remains unclear.
In this paper, we investigate how the dimensions of WavLM-factorised content and speaker subspaces correlate with speech characteristics such as pitch, intensity, and voicing.
We find that leading dimensions in the content space primarily capture intensity, higher-order formants, and voicing, while pitch is encoded in a later dimension.
In contrast, the highest-variance speaker dimension is strongly associated with pitch and gender, with later dimensions capturing high-frequency variation.
Intervention experiments show that manipulating these dimensions enables targeted control of speech characteristics for speech synthesis.
Furthermore, modifying the content and speaker representations jointly provides fine-grained control over characteristics such as pitch and intensity.
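The correlation analysis described above can be illustrated with a minimal sketch. It assumes the factorised representation is available as a frames-by-dimensions array and that a characteristic such as pitch has been extracted per frame. The function name `dimension_correlations` and the array layout are hypothetical, and Pearson correlation is used here as one plausible measure rather than as the exact procedure of the paper.

```python
import numpy as np


def dimension_correlations(features: np.ndarray, characteristic: np.ndarray) -> np.ndarray:
    """Pearson correlation between each feature dimension and one characteristic.

    features:       (num_frames, num_dims) array (assumed layout, hypothetical).
    characteristic: (num_frames,) array, e.g. per-frame pitch or intensity.
    """
    # Centre both inputs so the dot product below is a covariance up to scale.
    f = features - features.mean(axis=0)
    c = characteristic - characteristic.mean()
    denom = np.linalg.norm(f, axis=0) * np.linalg.norm(c)
    # Constant dimensions have undefined correlation; report 0 for them.
    return np.divide(f.T @ c, denom, out=np.zeros(f.shape[1]), where=denom > 0)


# Synthetic check: dimension 2 is built as a linear function of "pitch".
rng = np.random.default_rng(0)
pitch = rng.normal(size=500)
feats = rng.normal(size=(500, 4))
feats[:, 2] = 2.0 * pitch + 1.0
r = dimension_correlations(feats, pitch)
print(int(np.argmax(np.abs(r))))  # index of the most pitch-correlated dimension
```

Ranking dimensions by the absolute value of `r` gives a simple way to see which dimension is most associated with a given characteristic.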
For the following demonstrations, we control pitch, intensity, and the second formant (F2). This subset of characteristics is chosen for illustration only.
Content subspace control
We now demonstrate the result of changing a content dimension of an utterance to control specific speech characteristics.
All utterances are from LibriSpeech's dev-clean and test-clean datasets.
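As a rough illustration of the kind of intervention demonstrated in this section, the sketch below offsets a single content dimension by a multiple of its standard deviation. It assumes the content representation is a frames-by-dimensions array, that the offset is applied uniformly to every frame, and that the standard deviation comes from a reference set. The function `shift_dimension`, its parameters, and the mapping of "Content dimension 12" to zero-based index 11 are all hypothetical, and resynthesising audio from the modified features is not covered.

```python
import numpy as np


def shift_dimension(content: np.ndarray, dim: int, num_std: float, std: float) -> np.ndarray:
    """Return a copy of `content` with column `dim` offset by `num_std * std`.

    content: (num_frames, num_dims) array (assumed layout, hypothetical).
    dim:     zero-based index of the dimension to modify.
    num_std: number of standard deviations to shift by (negative lowers the value).
    std:     reference standard deviation of that dimension (assumed precomputed).
    """
    if not 0 <= dim < content.shape[1]:
        raise IndexError(f"dim {dim} out of range for {content.shape[1]} dimensions")
    shifted = content.astype(float, copy=True)  # leave the input untouched
    shifted[:, dim] += num_std * std
    return shifted


# One variant per row of the listings below: -3 to +3 standard deviations.
content = np.zeros((5, 16))  # placeholder features, not real data
variants = {k: shift_dimension(content, dim=11, num_std=k, std=0.5) for k in range(-3, 4)}
```

Here `variants[0]` corresponds to the original utterance, and the remaining entries correspond to the higher and lower settings.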
Pitch modification (Content dimension 12)
Male Speaker
Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower
Intensity modification (Content dimension 2)
Male Speaker
Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower
F2 modification (Content dimension 3)
Male Speaker
Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower
Speaker subspace control
We now demonstrate the result of changing a speaker dimension of an utterance to control specific speech characteristics.
All utterances are from LibriSpeech's dev-clean and test-clean datasets.
Pitch modification (Speaker dimension 1)
Male Speaker
Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower
Intensity modification (Speaker dimension 2)
Male Speaker
Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower
F2 modification (Speaker dimension 7)
Male Speaker
Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower
Simultaneous subspace control
We now demonstrate the result of changing the content and speaker dimensions of an utterance in tandem to control specific speech characteristics.
Note: This is not an exhaustive list of modifications; only a subset for each characteristic is provided for illustration.
All utterances are from LibriSpeech's dev-clean and test-clean datasets.