Interpreting Content and Speaker Characteristics in Factorised Self-Supervised Subspaces

Authors anonymised for the IEEE SLT double-blind review process

Self-supervised speech features encode both content and speaker information. Recent work introduced an SVD-based factorisation that decomposes these features into a shared content matrix capturing temporal variation and speaker-specific transformations capturing static speaker characteristics. However, how information is organised within these components remains unclear. In this paper, we investigate how the dimensions of WavLM-factorised content and speaker subspaces correlate with speech characteristics such as pitch, intensity, and voicing. We find that leading dimensions in the content space primarily capture intensity, higher-order formants, and voicing, while pitch is encoded in a later dimension. In contrast, the highest-variance speaker dimension is strongly associated with pitch and gender, with later dimensions capturing high-frequency variation. Intervention experiments show that manipulating these dimensions enables targeted control of speech characteristics for speech synthesis. Furthermore, modifying the content and speaker representations jointly provides fine-grained control over characteristics such as pitch and intensity.
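The dimension-level analysis described above can be illustrated with a minimal sketch. It assumes Pearson correlation as the measure (the text above says only that dimensions "correlate" with characteristics) and assumes the content representation is available as one list of values per frame, aligned with a frame-level measurement such as pitch. The names `pearson` and `rank_dimensions` are hypothetical and not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_dimensions(frames, characteristic):
    """Rank subspace dimensions by absolute correlation with a characteristic.

    frames: list of per-frame vectors (assumed layout, one value per dimension).
    characteristic: one measurement per frame, e.g. pitch (assumed to be aligned).
    Returns (dimension_index, correlation) pairs, strongest first (0-based index).
    """
    n_dims = len(frames[0])
    scores = [
        (d, pearson([row[d] for row in frames], characteristic))
        for d in range(n_dims)
    ]
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)
```

A ranking of this kind is one way to arrive at statements such as "pitch is encoded in a later content dimension"; the actual analysis may use a different statistic.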

For the following demonstrations, we control pitch, intensity, and the second formant (F2). These characteristics are chosen for illustration only.

Content subspace control

We now demonstrate the effect of changing a single content dimension of an utterance to control a specific speech characteristic. All utterances are from the LibriSpeech dev-clean and test-clean sets.
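The "N standard deviations higher/lower" settings in the tables below can be read as a constant offset added to one content dimension in every frame. The following is a minimal sketch under that reading. It assumes the content representation is a list of per-frame vectors, and, if no standard deviation is supplied, it estimates one from the utterance itself. The page does not state whether the standard deviation is computed per utterance or over a corpus, so treat that choice as an assumption. `shift_dimension` is a hypothetical helper name.

```python
from statistics import pstdev


def shift_dimension(frames, dim, n_std, std=None):
    """Return a copy of `frames` with dimension `dim` shifted by `n_std` standard deviations.

    frames: list of per-frame vectors (assumed layout).
    dim: 0-based index, so "Content dimension 12" on this page would be dim=11.
    std: standard deviation of that dimension; if None, it is estimated from
         `frames` (an assumption, a corpus-level value may be intended).
    """
    if std is None:
        std = pstdev([row[dim] for row in frames])
    offset = n_std * std
    return [
        [value + offset if j == dim else value for j, value in enumerate(row)]
        for row in frames
    ]
```

For example, `shift_dimension(frames, 11, 2)` would correspond to the "2 standard deviations higher" row for pitch under these assumptions, and a negative `n_std` to the "lower" rows. The modified frames would then be passed to the synthesis stage, which is not sketched here.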

Pitch modification (Content dimension 12)

Male Speaker Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower

Intensity modification (Content dimension 2)

Male Speaker Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower

F2 modification (Content dimension 3)

Male Speaker Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower

Speaker subspace control

We now demonstrate the effect of changing a single speaker dimension of an utterance to control a specific speech characteristic. All utterances are from the LibriSpeech dev-clean and test-clean sets.

Pitch modification (Speaker dimension 1)

Male Speaker Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower

Intensity modification (Speaker dimension 2)

Male Speaker Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower

F2 modification (Speaker dimension 7)

Male Speaker Female Speaker
3 Standard deviations higher
2 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
2 Standard deviations lower
3 Standard deviations lower

Simultaneous subspace control

We now demonstrate the effect of changing a content dimension and a speaker dimension of an utterance in tandem to control a specific speech characteristic. Note that this is not an exhaustive list of modifications; only a subset of combinations is provided for each characteristic, for illustration. All utterances are from the LibriSpeech dev-clean and test-clean sets.
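A joint modification of this kind can be sketched as two independent offsets, one applied to a content dimension in every frame and one applied to a speaker dimension. The sketch below assumes the speaker representation can be treated as a single vector per speaker and that the per-dimension standard deviations are supplied by the caller. Both are assumptions made for illustration, and `joint_shift` and `SETTINGS` are hypothetical names.

```python
def joint_shift(content_frames, speaker_vec, content_dim, speaker_dim,
                content_std, speaker_std, content_n, speaker_n):
    """Shift one content dimension and one speaker dimension together.

    content_frames: list of per-frame content vectors (assumed layout).
    speaker_vec: one vector per speaker (assumed layout).
    content_n / speaker_n: signed number of standard deviations to add.
    Returns new (content_frames, speaker_vec); the inputs are left unchanged.
    """
    new_content = [list(row) for row in content_frames]
    for row in new_content:
        row[content_dim] += content_n * content_std
    new_speaker = list(speaker_vec)
    new_speaker[speaker_dim] += speaker_n * speaker_std
    return new_content, new_speaker


# The three (content, speaker) combinations listed in the tables below,
# expressed in signed standard deviations.
SETTINGS = [(2, 1), (1, -2), (-2, 2)]
```

Iterating over `SETTINGS` and calling `joint_shift` once per pair would reproduce the three modified rows of each table under these assumptions; the "Original utterance" row corresponds to the pair `(0, 0)`.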

Pitch modification

Male Speaker Female Speaker
Original utterance
Content dimension: 2 std higher, Speaker dimension: 1 std higher
Content dimension: 1 std higher, Speaker dimension: 2 std lower
Content dimension: 2 std lower, Speaker dimension: 2 std higher

Intensity modification

Male Speaker Female Speaker
Original utterance
Content dimension: 2 std higher, Speaker dimension: 1 std higher
Content dimension: 1 std higher, Speaker dimension: 2 std lower
Content dimension: 2 std lower, Speaker dimension: 2 std higher

F2 modification

Male Speaker Female Speaker
Original utterance
Content dimension: 2 std higher, Speaker dimension: 1 std higher
Content dimension: 1 std higher, Speaker dimension: 2 std lower
Content dimension: 2 std lower, Speaker dimension: 2 std higher