The History of Crossover Art
- NIC2001, Nordic Interactive Conference in Copenhagen

Ekkofisk (2001) Technical description of the technique of sound generation,
kindly provided by Trond Lossius

From: "Trond Lossius" <lossius@bek.no>
To: "fuchs-eckermann" <mfuchs@welho.com>
Sent: Wednesday, November 07, 2001 2:01 PM
Subject: Re: ekko fisk

Hi!
We're pleased that you enjoyed Ekkofisk, and it is very nice of you to let
us know.

The singing voice synthesis was made using Max/MSP, but I did not use the
Ircam Chant program. I'll try to describe briefly how it was done, and
provide some references for articles we based our work on.

Pitch:
------
The pitch at any one time is the sum of four different factors, all
calculated in a continuous (fractional) MIDI pitch domain, and then
translated to frequencies at sampling rate:
- note pitch. This is determined by mapping of vertical coordinates onto a
scale. The scale used is changed every 60 secs. Range is determined
separately for the two fishes, one is mapped onto a male voice range, the
other onto a female voice range. The extreme high and low registers are left
out as the formant synthesis tends to sound strained and unnatural in those
regions.
- portamento: Within a legato phrase, a portamento lasting about 300 ms
is implemented at note shifts. The portamento is implemented as a tanh
(hyperbolic tangent) function in the pitch domain.
- vibrato: For each note a vibrato is implemented. Vibrato depth is delayed
until the transition time of the note (formant adjustment, volume
adjustment, portamento) is done, and is then controlled by a Gaussian curve.
This means that we need to determine the note duration before the note starts.
- 1/f fluctuation. A slow 1/f fluctuation is used to imitate pitch drift as
recommended by John Chowning.
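
As a rough illustration of how these factors combine, here is a minimal
Python sketch (not the original Max/MSP patch). The function names, the
vibrato rate and depth, the tanh steepness and the Gaussian width are all
assumptions chosen for the example; only the 300 ms portamento time comes
from the description above. The 1/f drift is not modelled and is simply
passed in as an offset.

```python
import math

def midi_to_hz(midi_pitch):
    """Convert a fractional MIDI pitch to Hz (MIDI 69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69.0) / 12.0)

def portamento(prev_pitch, next_pitch, t, duration=0.3, steepness=3.0):
    """Tanh-shaped glide between two note pitches over `duration` seconds.

    `steepness` is an assumed shape parameter, not a value from the source.
    """
    x = min(max(t / duration, 0.0), 1.0)  # normalised time, 0..1
    # Rescale tanh so the curve runs exactly from 0 to 1.
    s = 0.5 * (1.0 + math.tanh(steepness * (2.0 * x - 1.0)) / math.tanh(steepness))
    return prev_pitch + (next_pitch - prev_pitch) * s

def vibrato(t, note_duration, transition=0.3, rate_hz=5.5, max_depth=0.3):
    """Vibrato offset in semitones.

    Depth is zero until the note transition is done, then follows a
    Gaussian curve centred on the rest of the note. Rate, depth and the
    Gaussian width are assumed values.
    """
    if t < transition or t > note_duration or note_duration <= transition:
        return 0.0
    centre = 0.5 * (transition + note_duration)
    width = (note_duration - transition) / 4.0
    depth = max_depth * math.exp(-0.5 * ((t - centre) / width) ** 2)
    return depth * math.sin(2.0 * math.pi * rate_hz * (t - transition))

def pitch_at(t, prev_pitch, next_pitch, note_duration, drift=0.0):
    """Sum all factors in the MIDI pitch domain, then convert to Hz."""
    midi = portamento(prev_pitch, next_pitch, t) + vibrato(t, note_duration) + drift
    return midi_to_hz(midi)
```

Note that `vibrato` needs `note_duration` as an argument, which reflects the
point above: the note length has to be known before the note starts.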

Control of harmonic content
---------------------------
Once frequency is determined, it is used to control an oscillator. This
oscillator has a harmonic content such that the upper partials relate to the
fundamental frequency with a -6 dB roll-off per octave. This signal imitates
the audio signal created by the vocal cords.
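
A roll-off of -6 dB per octave corresponds to the k-th harmonic having an
amplitude of 1/k relative to the fundamental. The following Python sketch
shows one way to build such a source by additive synthesis; it is only an
illustration under that assumption, and the function names and default
sample rate are hypothetical, not taken from the installation.

```python
import math

def harmonic_level_db(k):
    """Level of the k-th harmonic relative to the fundamental, for amplitude 1/k."""
    return 20.0 * math.log10(1.0 / k)

def glottal_source(freq_hz, sample_rate=44100, n_samples=441):
    """Band-limited source whose k-th harmonic has amplitude 1/k."""
    # Only include harmonics that stay below the Nyquist frequency.
    n_harmonics = int((sample_rate / 2) // freq_hz)
    out = []
    for n in range(n_samples):
        t = n / sample_rate
        out.append(sum(math.sin(2.0 * math.pi * k * freq_hz * t) / k
                       for k in range(1, n_harmonics + 1)))
    return out
```

Each doubling of the harmonic number (one octave up) halves the amplitude,
which is where the figure of roughly 6 dB per octave comes from.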

Next the signal is filtered by five parallel resonant bandpass filters to
implement formants. For this we use the Max object "resonators~" developed
at CNMAT.

Formant properties are controlled using values from research at Ircam as
part of the Chant program. For the male voice we use the bass and tenor
values, and interpolate between these depending on the actual pitch. For
the female voice we use the alto and soprano values. Interpolation is done
in such a manner that the bass or alto formant values are dominant at low
pitches, the tenor and soprano values at high pitches. This creates a more
convincing voice than using static formant values.
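
The pitch-dependent interpolation can be sketched as follows in Python.
This is a minimal illustration assuming a simple linear blend; the actual
interpolation curve is not specified above. The formant numbers are
placeholders invented for the example and are NOT the Chant values, and the
pitch limits are likewise hypothetical.

```python
def interpolation_weight(midi_pitch, low_pitch, high_pitch):
    """0.0 at or below low_pitch, 1.0 at or above high_pitch, linear in between."""
    w = (midi_pitch - low_pitch) / (high_pitch - low_pitch)
    return min(max(w, 0.0), 1.0)

def interpolate_formants(low_voice, high_voice, weight):
    """Blend two formant tables of (frequency Hz, gain dB, bandwidth Hz) tuples."""
    return [tuple(a + (b - a) * weight for a, b in zip(f_low, f_high))
            for f_low, f_high in zip(low_voice, high_voice)]

# Placeholder tables with five formants each. These are made-up numbers
# for illustration only, not the published bass and tenor values.
BASS_PLACEHOLDER = [(600.0, 0.0, 60.0), (1000.0, -6.0, 70.0),
                    (2400.0, -12.0, 100.0), (2700.0, -12.0, 120.0),
                    (3000.0, -24.0, 130.0)]
TENOR_PLACEHOLDER = [(650.0, 0.0, 80.0), (1100.0, -6.0, 90.0),
                     (2600.0, -10.0, 120.0), (2900.0, -12.0, 130.0),
                     (3200.0, -20.0, 140.0)]

# Hypothetical male range: MIDI 40 (low) to MIDI 67 (high).
w = interpolation_weight(52.0, 40.0, 67.0)
male_formants = interpolate_formants(BASS_PLACEHOLDER, TENOR_PLACEHOLDER, w)
```

With this scheme the low-voice table dominates at low pitches and the
high-voice table at high pitches, as described above.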

Furthermore, the formants change from one note to the next. The CNMAT object
does not enable audio rate interpolation, so we instead do the interpolation
in a series of discrete steps, approximately every 10 ms. This does not sound
entirely convincing, but on the other hand it sometimes has the interesting
side effect of creating what sounds like consonants of a special language
that only fishes speak.
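
The stepwise update can be pictured with a small Python sketch. It assumes
linear steps and a hypothetical transition time; only the 10 ms step
interval comes from the text, and the function name is invented here.

```python
def stepped_values(start, end, transition_ms, step_ms=10):
    """Values sent to the filter at each control step of a transition.

    One value per `step_ms`, moving linearly from `start` to `end`.
    """
    n_steps = max(1, round(transition_ms / step_ms))
    return [start + (end - start) * i / n_steps for i in range(1, n_steps + 1)]

# A formant frequency moving from 600 Hz to 650 Hz over an assumed 50 ms.
steps = stepped_values(600.0, 650.0, 50)
```

Between two steps the filter holds its previous setting, so the formant
moves in small jumps rather than as a smooth glide.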

Control of amplitude and envelope
---------------------------------
The overall amplitude of each note depends on the distance between the
fishes. A transition from one note to the next is implemented using a tanh
curve in the logarithmic (dB) level domain, and then mapped to the linear
amplitude domain. In addition a possible swell is implemented using a
Gaussian curve.
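
A minimal Python sketch of this level handling follows. The transition
time, tanh steepness, swell size and swell width are assumed values, and
adding the swell in the dB domain is also an assumption, since the text
does not say in which domain it is applied.

```python
import math

def db_to_linear(level_db):
    """Convert a level in dB to a linear amplitude factor."""
    return 10.0 ** (level_db / 20.0)

def transition_db(prev_db, next_db, t, duration=0.3, steepness=3.0):
    """Tanh-shaped move from prev_db to next_db, computed in the dB domain."""
    x = min(max(t / duration, 0.0), 1.0)  # normalised time, 0..1
    s = 0.5 * (1.0 + math.tanh(steepness * (2.0 * x - 1.0)) / math.tanh(steepness))
    return prev_db + (next_db - prev_db) * s

def swell_db(t, note_duration, peak_db=3.0, width_fraction=0.25):
    """Gaussian swell in dB, centred on the middle of the note (assumed shape)."""
    if note_duration <= 0.0:
        return 0.0
    centre = 0.5 * note_duration
    sigma = width_fraction * note_duration
    return peak_db * math.exp(-0.5 * ((t - centre) / sigma) ** 2)

def note_amplitude(t, prev_db, next_db, note_duration):
    """Linear amplitude at time t: dB transition plus swell, then converted."""
    return db_to_linear(transition_db(prev_db, next_db, t) + swell_db(t, note_duration))
```

Working in dB first means the transition is perceived as an even change in
loudness rather than an even change in raw amplitude.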

For the first and last notes of the phrases special care has to be taken to
ensure that the phrase starts and ends in a convincing way. The end of the
phrase I believe we were able to handle quite convincingly, but I'm not as
sure about the start of the phrase. I'd probably have to do analysis
(envelope tracking) of recorded singing to improve this.

Control of phrase structure
---------------------------
In order to get convincing phrase structures, I had to implement four
different scenarios of what note to produce:
- first note
- middle note
- last note
- pause

First order Markov chains were used to ensure proper phrasing:
- first can only be succeeded by middle or last
- middle can only be succeeded by middle or last
- last can only be succeeded by pause
- pause can only be succeeded by pause or first
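
These rules can be written as a transition table, as in the Python sketch
below. The allowed successors follow the list above; the probabilities are
placeholder values chosen for the example, and the function names are
hypothetical.

```python
import random

# Allowed successors per state. The probabilities are illustrative only.
TRANSITIONS = {
    "first":  {"middle": 0.8, "last": 0.2},
    "middle": {"middle": 0.7, "last": 0.3},
    "last":   {"pause": 1.0},
    "pause":  {"pause": 0.5, "first": 0.5},
}

def next_state(state, rng=random):
    """Draw the next state from the row of the current state."""
    options = TRANSITIONS[state]
    return rng.choices(list(options), weights=list(options.values()))[0]

def generate(n, state="pause", seed=None):
    """Generate n states, starting from the successor of `state`."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n):
        state = next_state(state, rng)
        sequence.append(state)
    return sequence
```

Because forbidden successors are simply absent from each row, every
generated phrase has the form first, any number of middle notes, last,
followed by one or more pauses.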

The Markov transition tables were modified in real time to impose two
additional conditions:
- if the fishes are close together, they sing; if far apart, they keep quiet
and play bell-like sounds instead.
- phrases shouldn't be "too long". This was implemented using fuzzy logic.
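
One possible way to rebuild the table on the fly is sketched below in
Python. This is a guess at the general idea, not a description of the
actual patch: the linear membership functions, the thresholds, and the way
the two conditions are combined are all assumptions made for illustration.

```python
def ramp(x, low, high):
    """Fuzzy membership rising linearly from 0 at `low` to 1 at `high`."""
    return min(max((x - low) / (high - low), 0.0), 1.0)

def adjusted_table(distance, phrase_length, near=0.2, far=0.8, short=4, long=12):
    """Return a transition table adapted to fish distance and phrase length.

    distance: normalised distance between the fishes (0 = touching).
    phrase_length: number of notes sung so far in the current phrase.
    All thresholds are hypothetical.
    """
    closeness = 1.0 - ramp(distance, near, far)   # 1.0 = close together
    too_long = ramp(phrase_length, short, long)   # 1.0 = definitely too long
    # End the phrase when it is too long or when the fishes drift apart.
    p_last = max(too_long, 1.0 - closeness)
    return {
        "first":  {"middle": 1.0 - p_last, "last": p_last},
        "middle": {"middle": 1.0 - p_last, "last": p_last},
        "last":   {"pause": 1.0},
        "pause":  {"pause": 1.0 - closeness, "first": closeness},
    }
```

With these assumptions a new phrase can only start when the fishes are
near each other, and the chance of ending a phrase grows gradually with its
length instead of switching on at a hard limit.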

Spatialisation:
---------------
This was done according to the research on vector-based amplitude panning by
the Finnish researcher Ville Pulkki.
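
For readers unfamiliar with the technique, here is a minimal Python sketch
of the two-loudspeaker (2D) form of vector-based amplitude panning. It is
the textbook formulation, not the configuration used in the installation;
the function name and the example loudspeaker angles are assumptions.

```python
import math

def vbap_pair_gains(source_deg, spk1_deg, spk2_deg):
    """Gains for a source panned between two loudspeakers (angles in degrees).

    Solves p = g1 * l1 + g2 * l2 for the unit vectors of the source (p)
    and the loudspeakers (l1, l2), then normalises so g1^2 + g2^2 = 1.
    """
    px, py = math.cos(math.radians(source_deg)), math.sin(math.radians(source_deg))
    l1x, l1y = math.cos(math.radians(spk1_deg)), math.sin(math.radians(spk1_deg))
    l2x, l2y = math.cos(math.radians(spk2_deg)), math.sin(math.radians(spk2_deg))
    det = l1x * l2y - l1y * l2x
    if abs(det) < 1e-12:
        raise ValueError("loudspeakers must not lie on the same line")
    g1 = (px * l2y - py * l2x) / det
    g2 = (py * l1x - px * l1y) / det
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm

# Source straight ahead, loudspeakers at +30 and -30 degrees.
left_gain, right_gain = vbap_pair_gains(0.0, 30.0, -30.0)
```

Both gains are non-negative as long as the source direction lies between
the two loudspeakers; for larger setups the same calculation is applied to
whichever loudspeaker pair (or triplet, in 3D) surrounds the source.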

That's about it, I guess. The implementation was based on a number of
articles and books. Here are the more important ones:

C. Dodge & T.A. Jerse (1997): Computer Music. Synthesis, Composition, and
Performance. 2nd edition. Schirmer Books. (Ch. 6 covers filters, ch. 7
covers voice synthesis, ch. 11 Markov Chains)

R. Dobson (2000): Designing Legato Instruments in Csound. In "The Csound
Book" edited by R. Boulanger. Formant values can be found in one of the
appendices.

V. Pulkki (2000): Generic panning tools for MAX/MSP. Proceedings of
International Computer Music Conference 2000. pp. 304-307.

Tristan Jehan, Adrian Freed, Richard Dudas (1999): Musical Applications of
New Filter Extensions to Max/MSP. Proceedings of International Computer
Music Conference 1999.

If there is anything else I can do, please let me know.

Yours,
Trond

______________________________________
From: "Trond Lossius" <lossius@bek.no>
To: "fuchs-eckermann" <mfuchs@welho.com>
Sent: Wednesday, November 07, 2001 2:05 PM
Subject: Re: ekko fisk

...and one last thing: The sound produced is quite dry and synthetic, and
requires some sort of reverb to come alive. In Amsterdam the room is "wet"
enough to produce the desired reverb; in Copenhagen we had to use a reverb
unit.

Trond