Ever wanted to make a voicebank in a certain language, but can't find any reclists? Or maybe you have found some reclists, but you're not pleased with how they're organized.
In this tutorial, I'll be going over some useful techniques for writing your own reclists. This tutorial is meant for intermediate users; I will assume that you at least know the basics of the software. If there are any terms you want defined that aren't in the glossary, let me know.
Let's start by taking a look at a few different types of voicebanks. "C" stands for "consonant", and "V" stands for "vowel".
A language like Japanese that doesn't necessarily need codas can be made CV or VCV, but languages like Spanish which do have them are better off in CVVC, VCCV, or even VCVVC.
- CV -- The most basic type, operating only with onsets^1 and nuclei^2.
- VCV -- A slightly more complex style. Still only uses onsets and nuclei, but also includes the "fade out" from the previous nucleus before the current onset.
- CVVC -- Essentially an expansion of a CV voicebank, with the addition of coda^3 samples.
- VCCV -- An alternate version of CVVC with focus on the transition from coda to onset. Often includes "VCV" style vowels.
- VCVVC -- A VCV voicebank that also includes codas. Only works for a few languages, because many of them would result in voicebanks too large to be used in UTAU.
For the purpose of this tutorial, I'll be using Japanese and Spanish for my examples, as they are not very complicated and are phonetically similar.
Obviously, in order to figure out which voicebank type will be best suited for your reclist, you're going to have to analyze the language you'll be working in, which brings us to our next section.
Basic Phonetics and Phonology
The key to making any voicebank is to understand the phonetics^4 and phonology^5 of the language you're working in.
First, take everything you know about the alphabet, and toss it out the window. Written language is full of all sorts of verbal ambiguity. Take, for example, the words "curtain" and "certain." The "c" makes a completely different sound, but the "ur" and the "er" are pronounced the same. The faster you sever the association between spoken language and written language in your head, the easier it will be to look at things phonetically.
So then, how do we represent the sounds on paper? Well, that's why we have phonetic alphabets, where one symbol correlates to only one sound. Most commonly, we use the International Phonetic Alphabet (IPA).
Some languages, like Japanese, have their own phonetic alphabets, but IPA is still applicable.
Phonetic sounds are typically separated into two categories: consonants, which are produced by obstructing airflow in the mouth or throat, and vowels, which are produced by the vocal cords with the airflow left unobstructed.
Consonants are defined by their place of articulation (lips, teeth, alveolar ridge, etc.) and manner of articulation (is airflow blocked or continuous, do the vocal cords vibrate, etc.). They follow the naming convention of voicing + place + manner. For example, /k/ is a voiceless velar plosive, /n/ is an alveolar nasal, /v/ is a voiced labiodental fricative.
These sound kind of intimidating and scientific, but they're actually pretty basic.
Voicing
- Voiced -- the vocal cords vibrate when the consonant is produced
- Voiceless -- the vocal cords do not vibrate when the consonant is produced
Place of Articulation
- Bilabial -- the two lips are pressed together
- Labiodental -- the top teeth are against bottom lip
- Dental -- the tongue tip is against the top teeth
- Alveolar -- the tongue tip is towards the alveolar ridge
- Postalveolar -- the tongue tip is towards the area right behind the alveolar ridge
- Retroflex -- the tongue tip is slightly further back than above
- Palatal -- the center of the tongue is raised towards the palate
- Velar -- the back of the tongue is raised towards the roof of the mouth
- Uvular -- the tongue is held back towards the uvula
- Pharyngeal -- the tongue is held back and lowered towards the pharynx
- Glottal -- occurs completely in the glottis
Manner of Articulation
*cannot be voiceless
- Nasal* -- air exits through the nose rather than the mouth
- Plosive -- also known as "stops"; airflow is completely obstructed
- Fricative -- airflow is continuous through the mouth
- Affricate -- a plosive which transitions immediately into a fricative
- Trill* -- airflow is rapidly released and obstructed; caused by vibrating part of the mouth
- Tap / flap* -- airflow is quickly obstructed then unobstructed
- Lateral fricative -- a fricative in which the air flows around the sides of the tongue
- Approximant* -- also known as "semi-vowels" or "liquids"; the articulation is minimal
- Lateral approximant* -- an approximant in which the air flows around the sides of the tongue
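The voicing + place + manner naming convention can be sketched in a few lines of Python. This is only an illustration: the function name `describe` and the `ALWAYS_VOICED` set are my own, and the set just mirrors the manners marked with * in the list above.

```python
# Manners marked with * in the list above. The voicing label is left out
# for these, which is why /n/ is simply an "alveolar nasal".
ALWAYS_VOICED = {"nasal", "trill", "tap", "approximant", "lateral approximant"}

def describe(voiced, place, manner):
    """Build a consonant name in voicing + place + manner order."""
    if manner in ALWAYS_VOICED:
        return f"{place} {manner}"
    voicing = "voiced" if voiced else "voiceless"
    return f"{voicing} {place} {manner}"

print(describe(False, "velar", "plosive"))         # /k/
print(describe(True, "alveolar", "nasal"))         # /n/
print(describe(True, "labiodental", "fricative"))  # /v/
```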
Vowels, on the other hand, are defined by the position of the tongue, the openness of the mouth, and the tenseness of the vocal cords. Pretty straightforward.
But individual sounds aren't the only thing to consider. That's where phonology comes in; you have to understand how each sound works with the others.
One part of this is the environment in which a sound occurs. For instance, in Spanish, a plosive like /k/ can be followed by the trill /r/ in the onset, but not the coda, and the fricative /s/ will never be followed by it.
Similarly, some sounds may become others altogether, such as the trill /r/ becoming the tap [ɾ] in rapid speech. This makes [ɾ] an allophone of /r/. The distinction is not important for understanding speech, but it is something that should be included in your voicebank to improve its naturalness.
Now, you don't have to go through and figure out all the sounds used in the language yourself; oftentimes, other people have already done that for you, and a quick Google or Wikipedia search for "[language] phonology" will typically yield a list of phonemes and environments.
Infrequent assimilated sounds and dialectal allophones are ignored for the purpose of simplicity.
Japanese
Vowels: i ɯ e o a
Syllabic consonants^6: n̩
Onsets: m n p b t d k g ɸ s z ɕ h ts tɕ dʑ ɾ j w
Onset clusters^7: mj nj pj bj kj gj hj ɾj
Spanish
Vowels: i u e o a
Diphthongs^8: ej ew oj ow aj aw
Onsets: m n p b t d k g f θ s ʃ x tʃ r ɾ j w l
Onset clusters: mj mw nj nw pr pj pw pl br bj bw bl tr tj tw dr dj dw kr kj kw kl gr gj gw gl fr fj fw fl θj θw sp st sk sj sw sl ʃj ʃw xj xw tʃj tʃw rj rw ɾj ɾw lw lj
Codas: m n p b t d k g f θ s ʃ x tʃ r ɾ j w l
Coda clusters: ps bs ts ds ks gs ms ns rs ɾs ls
Now that we've got our list of sounds, we have a small problem: a lot of the IPA symbols can't be read in UTAU! Not to mention they can be a pain to type without memorizing all of the ALT codes. To solve this, we can use the ASCII-based version of IPA: X-SAMPA.
As I mentioned earlier, some languages have their own phonetic systems, which generally only consist of standard roman characters, so these can also be used. However, unless you're working with Japanese, where romaji is the community standard (aside from hiragana), I recommend using X-SAMPA, because it's pretty universally accepted, and is much more consistent and versatile than a lot of the homebrew phonetic systems I've seen.
For the purpose of this tutorial, I will be using X-SAMPA in place of romaji or hiragana like I normally would.
So, let's take a moment to convert our phoneme lists.
Japanese
Vowels: i M e o a
Syllabic consonants: n=
Onsets: m n p b t d k g p\ s z s\ h ts ts\ dz\ 4 j w
Onset clusters: mj nj pj bj kj gj hj 4j
A few X-SAMPA characters, like \, =, and ?, will cause problems in file names and aliases, so let's swap the affected symbols for their closest equivalents. For the syllabic consonant, we can simply type it twice, which is a common notation for this anyway.
Japanese
Vowels: i M e o a
Syllabic consonants: nn
Onsets: m n p b t d k g f s z S h ts tS dZ 4 j w
Onset clusters: mj nj pj bj kj gj hj 4j
Spanish
Vowels: i u e o a
Diphthongs^8: ej ew oj ow aj aw
Onsets: m n p b t d k g f T s S x tS r 4 j w l
Onset clusters: mj mw nj nw pr pj pw pl br bj bw bl tr tj tw dr dj dw kr kj kw kl gr gj gw gl fr fj fw fl Tj Tw sp st sk sj sw sl Sj Sw xj xw tSj tSw rj rw 4j 4w lw lj
Codas: m n p b t d k g f T s S x tS r 4 j w l
Coda clusters: ps bs ts ds ks gs ms ns rs 4s ls
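If you'd rather not convert by hand, the two steps above (IPA to X-SAMPA, then X-SAMPA to file-safe symbols) can be sketched in Python. The table only covers the symbols used in this tutorial, and the names `IPA_TO_XSAMPA`, `FILE_SAFE`, `to_xsampa`, and `to_file_safe` are just ones I made up for the example.

```python
# Only the IPA symbols from the lists above that differ in X-SAMPA.
IPA_TO_XSAMPA = {
    "ɯ": "M", "ɸ": "p\\", "ɕ": "s\\", "ʑ": "z\\", "ɾ": "4",
    "θ": "T", "ʃ": "S",
    "\u0329": "=",  # combining "syllabic" mark, as in the syllabic n
}

# Replacements for X-SAMPA sequences containing "\" or "=".
FILE_SAFE = [("p\\", "f"), ("s\\", "S"), ("z\\", "Z"), ("n=", "nn")]

def to_xsampa(ipa):
    """Convert one phoneme or cluster, character by character."""
    return "".join(IPA_TO_XSAMPA.get(ch, ch) for ch in ipa)

def to_file_safe(xsampa):
    """Swap out the sequences that are awkward in file names."""
    for old, new in FILE_SAFE:
        xsampa = xsampa.replace(old, new)
    return xsampa

onsets = ["ɸ", "ɕ", "tɕ", "dʑ", "ɾj"]
print([to_file_safe(to_xsampa(p)) for p in onsets])
```

Anything not in the table passes through unchanged, so plain letters like "m" or "k" stay as they are.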
Some symbols are technically repeated, but UTAU will read uppercase and lowercase letters as separate characters, so we do not need to change those. However, if two filenames are the same apart from capitalization (e.g. "sa" and "Sa"), you will need to alter one of them, because, though UTAU treats them as different, File Explorer does not. I typically just add a "-" or "_" to the end (e.g. "sa" and "Sa_"). This way, their alphabetical placement is not tampered with, but saving one will not overwrite the other. This is especially important if you're recording with Oremo, which will overwrite the previous file without giving you a dialog box.
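Checking a long reclist for these clashes by eye gets tedious, so here is a minimal Python sketch that does it for you. The function name `dedupe_case_insensitive` is my own, and it uses the "_" suffix described above.

```python
def dedupe_case_insensitive(names):
    """Append "_" to any name that clashes with an earlier one when case is ignored."""
    seen = set()
    result = []
    for name in names:
        candidate = name
        # Keep adding "_" until the name is unique regardless of case.
        while candidate.lower() in seen:
            candidate += "_"
        seen.add(candidate.lower())
        result.append(candidate)
    return result

print(dedupe_case_insensitive(["sa", "Sa", "ta"]))
```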
Next, it's time to start writing the actual list. It's common practice to put the samples in alphabetical order, but this isn't really necessary as your computer will automatically do that with your files, and UTAU will automatically do that with your oto. I like to do it anyway, though, because it's easier to work with than organizing it by vowel position and consonant type as I did in my phoneme lists.
For Japanese CV, all we need to do is pair off each consonant or consonant cluster with each vowel, and record each syllable as a separate .wav file.
a e i M o
4a 4e 4i 4M 4o
4ja 4je 4jM 4jo
ba be bi bM bo
bja bje bjM bjo
da de di dM do
dZa dZe dZi dZM dZo
Notice that I did not include any /ji/ samples. This is because, while they are phonetically possible, they simply don't exist in the Japanese language.
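The pairing can be sketched in Python like this. The function name `cv_row` is made up for the example, and the rule "skip /i/ after any onset ending in j" is my shorthand for the /ji/ gap described above.

```python
VOWELS = ["a", "e", "i", "M", "o"]

def cv_row(onset):
    """Pair one onset (or "" for bare vowels) with every vowel."""
    row = []
    for vowel in VOWELS:
        # Leave out the /ji/ combinations, which Japanese does not use.
        if onset.endswith("j") and vowel == "i":
            continue
        row.append(onset + vowel)
    return row

for onset in ["", "4", "4j", "b", "bj", "d", "dZ"]:
    print(" ".join(cv_row(onset)))
```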
For Spanish CVVC, it is a very similar process, except now we also include the diphthongs, and add the coda consonant to the end of the syllable.
For the codas and coda clusters that don't have an equivalent onset, I tend to write them as "c_vc", where v is the vowel, c is the consonant or cluster. You don't actually record the "c_"; it's there to keep that sample with the rest of the files for the same consonant. It's just for organization's sake.
a e i o u
aj ej oj
aw ew ow
bab beb bib bob bub
bajb bejb bojb
bawb bewb bowb
bja bje bjo bju
bla ble bli blo blu
blaj blej bloj
blaw blew blow
bs_abs bs_ebs bs_ibs bs_obs bs_ubs
We would oto the sample "bab" under the aliases [ba] and [ab], and the sample "bs_abs" as simply [abs]. Consult a CVVC oto guide for more information on that.
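The naming scheme used in this list can be sketched as a small Python function. The name `sample_name` and the tiny `ONSETS` and `CODAS` sets are only for illustration; in practice you would fill them from the full Spanish lists.

```python
# A small excerpt of the Spanish onset and coda lists.
ONSETS = {"b", "bj", "bl"}
CODAS = {"b", "bs"}

def sample_name(cons, nucleus):
    """Name one CVVC sample for a consonant (or cluster) and a nucleus."""
    if cons in ONSETS and cons in CODAS:
        return cons + nucleus + cons       # e.g. "bab"
    if cons in ONSETS:
        return cons + nucleus              # onset only, e.g. "bla"
    if cons in CODAS:
        return f"{cons}_{nucleus}{cons}"   # coda only, written "c_vc"
    raise ValueError(f"{cons!r} is neither an onset nor a coda")

print(" ".join(sample_name("bs", v) for v in ["a", "e", "i", "o", "u"]))
```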
It is also common to oto the vowel endings. You would handle this in the same manner as any other VC, and would typically note them as [(v) -] or [(v) R]. This is particularly important for diphthongs.
VCV and VCVVC Reclists
Let's say that you want to write a reclist for a VCV or VCVVC bank. You could write it all out as individual files, but that would get really big, really fast, and is overall inefficient, and, personally, I'm all about efficiency!
This brings us to our next topic: strings.
Strings are, in short, lists of syllables that are all recorded into the same .wav file, generally with no pauses between them. Strings help you record larger voicebanks in less time and use less filespace, and help you avoid redundant recordings.
For example, if you are writing a Japanese VCV reclist, rather than recording [- ka] [a ka] and [a ke] as separate files, wouldn't it be easier to combine them all into one "ka-ka-ke" string? This will yield the same end result in UTAU itself, and save you a lot of hassle in the long run.
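To see which aliases a string gives you, here is a rough Python sketch. The function name `vcv_aliases` is my own, and each syllable is passed as an (onset, vowel) pair so the previous vowel is easy to track.

```python
def vcv_aliases(syllables):
    """List the VCV aliases contained in one string of (onset, vowel) pairs."""
    aliases = []
    previous = "-"  # the string starts from silence
    for onset, vowel in syllables:
        aliases.append(f"{previous} {onset}{vowel}")
        previous = vowel
    return aliases

print(vcv_aliases([("k", "a"), ("k", "a"), ("k", "e")]))
```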
So, how does one go about writing VCV strings? Well, first off, you want to make sure you cover all possible nucleus combinations. What I mean is, you'll want a sample for [a a] [a e] [a i] [a o] [a M] [a nn] [e a] [e i]... and so forth.
Rather than going through trial and error and racking up unnecessary frustration for yourself, here's a trick:
1. Count the number of nuclei. In Japanese, there are six.
a e i M o nn
2. Start with one nucleus, and, starting with the next one in the list, alternate back and forth to each of them until you reach the total number of nuclei.
3. Repeat this for all other nuclei. When you reach the end of the list, wrap back around to the front.
a-e-a-i-a-M
4. Double the first nucleus in each string.
And there you have it; all of the nuclei pairs are covered.
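If you want to double-check the trick, or apply it to a language with a different number of nuclei, the four steps can be sketched in Python. The function names `nucleus_strings` and `covered_pairs` are just ones I picked for the example.

```python
def nucleus_strings(nuclei):
    """Build one string per nucleus using the alternating trick."""
    n = len(nuclei)
    strings = []
    for start, first in enumerate(nuclei):
        base = []
        for pos in range(n):
            if pos % 2 == 0:
                base.append(first)  # keep coming back to the starting nucleus
            else:
                # Next partner in the list, wrapping around at the end.
                base.append(nuclei[(start + (pos + 1) // 2) % n])
        strings.append([first] + base)  # step 4: double the first nucleus
    return strings

def covered_pairs(strings):
    """Collect every (previous, next) nucleus pair found in the strings."""
    return {(a, b) for s in strings for a, b in zip(s, s[1:])}

nuclei = ["a", "e", "i", "M", "o", "nn"]
for s in nucleus_strings(nuclei):
    print("-".join(s))
```

Comparing `covered_pairs` against every possible pair is a quick way to confirm nothing was missed.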
Note: it is common when writing strings in roman characters to use "_" to indicate a pause in speech, and "-" to indicate no pause. Some people opt to not use the "-" and instead write it all together like "aaeaiaM". It's up to your discretion.
As for the rest of the strings, you could simply insert the consonant or cluster into the string like so:
But, this results in a few unnecessary samples, and the string that begins with "nn" will be separated from the others in your voicebank. To combat this, we can use the nuclei-pairing trick again, only without the "nn" sample. This gives us:
However, even though we do not need the [(v) nn] samples in these strings, we still need the [nn (cv)] ones, so we can just tack on an "nn-(cv)" to the end of each string.
Even further, for the /ji/ issue, we can do another nuclei-pairing without "nn" or "i", and then add those back to the end of the strings.
bja-bja-bje-bja-bjM-i-bja-nn-bja
We do not oto the [(v) i] or [(v) nn] samples of these strings.
Maximum efficiency, minimum redundancy.
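Here is a rough Python sketch of how these consonant strings could be generated. The function name `consonant_strings` and its two arguments are my own invention: `paired` holds the nuclei that go through the nuclei-pairing trick, and `appended` holds the ones that only get tacked onto the end.

```python
def consonant_strings(onset, paired, appended):
    """Build the VCV strings for one onset or onset cluster."""
    n = len(paired)
    strings = []
    for start, first in enumerate(paired):
        nuclei = [first]  # doubled first nucleus
        for pos in range(n):
            if pos % 2 == 0:
                nuclei.append(first)
            else:
                nuclei.append(paired[(start + (pos + 1) // 2) % n])
        parts = [onset + v for v in nuclei]
        # Tack on "(nucleus)-(cv)" for each nucleus left out of the pairing.
        for extra in appended:
            parts += [extra, onset + first]
        strings.append("-".join(parts))
    return strings

print(consonant_strings("k", ["a", "e", "i", "M", "o"], ["nn"])[0])
print(consonant_strings("bj", ["a", "e", "M", "o"], ["i", "nn"])[0])
```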
Unlike Japanese, Spanish does not have any syllabic consonants, so we don't need to worry about including those. We do, however, need to take codas into account.
We can handle these one of two ways.
First, we can simply add VCs to the end of each applicable string, and include the "c_vc" samples shown earlier:
OR, for more ambitious voicebanks, we can take all possible coda-onset transitions into consideration.
...We would oto these new samples like [ab ba], [ab sa], and [abs dra].
Because of the VV samples, you don't technically have to add the diphthongs, as you could get "ai" by inputting [- a] [a i] on the same note in UTAU. Though, if you wanted to, you could adjust your strings accordingly:
But those strings are awfully long, aren't they? Let's split them in half to make for easier recording. Make sure when you split a string to add the previous nucleus to the new string. In this case, it doesn't split exactly even, but it works out.
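As a sketch of the splitting rule, here is one way to do it in Python for a string made up of bare nuclei. The function name `split_string` is made up, and the example string is only a stand-in, not one of the actual Spanish strings. For strings of full syllables you would carry over just the nucleus of the last syllable instead.

```python
def split_string(parts):
    """Split a string in two, repeating the previous nucleus at the start of the second half."""
    mid = (len(parts) + 1) // 2
    first = parts[:mid]
    # Starting the second half with the last nucleus of the first half
    # keeps the transition across the split point.
    second = [first[-1]] + parts[mid:]
    return first, second

first, second = split_string(["a", "a", "e", "a", "i", "a", "M"])
print("-".join(first))
print("-".join(second))
```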
You would write the rest of the reclist in the same manner as before, treating the diphthongs the same as the monophthongs.
Finally, let's tackle VCCV. It's kind of like a pseudo-VCVVC for languages like English, which are too large to have a proper one.
Oftentimes, the vowels are either oto'd in CV style with the addition of a "transitional" vowel like the [a a] found in VCV, or the reclist simply includes all the VV strings you would find in a VCV bank. Vowel endings are nearly always included as well.
As for consonants, those are handled a bit differently. Rather than including every nucleus-onset or nucleus-coda transition as is typical in VCV and VCVVC, they simply include a CV sample with a "transitional onset", a VC sample with a "transitional coda", and a nucleus-onset transition.
The transitional onsets are created by including a vowel right before them, and the transitional codas are created by including a plosive, typically /k/, right after them. This creates a much smoother sound in contrast to CV and CVVC, which only capture the onsets and codas as they occur in word-initial^9 and word-final^10 position, respectively.
Obviously, a consonant that does not appear in a coda will not need the transitional coda sample, nor will one which does not appear in an onset need the transitional onset sample.
We use /k/ as the placeholder onset because plosives are always preceded by a short moment of silence, and most languages do not have many velar consonants, so there is a more distinct transition. In the case of strings for velar consonants /k/ /g/ /ŋ/ /x/ etc., we often use /t/ as the placeholder instead.
As for the nucleus-onset transition, we use the first (v) and second (c) oto'd as [(v) (c)] in a very similar manner to a typical VCV sample. Consult a VCCV oto guide for more info on that.
You would oto the sample "ba-bab-kab" as [- ba] [a b] [ba] [ab] and [ab -].
I hope you found this helpful, but if there is anything that wasn't clear or that you would like me to elaborate on, please let me know!
1^Onset -- the consonant(s) that occurs at the beginning of a syllable. The onset may be nonexistent, but will never be a vowel. EX: /k/ in /kæt/, /kɹ/ in /kɹæft/
2^Nucleus -- the "center" of the syllable; a syllable cannot exist without one. This is typically a vowel. EX: /æ/ in /kæt/, /n̩/ in /bʌʔn̩/
3^Coda -- the consonant(s) that occurs at the end of a syllable. The coda may be nonexistent, but will never be a vowel. EX: /t/ in /kæt/, /ft/ in /kɹæft/
4^Phonetics -- the study of sound in speech.
5^Phonology -- the study of the relationships between different sounds in speech.
6^Syllabic consonant -- a consonant which acts as the nucleus for a syllable. EX: /n̩/ in /bʌʔn̩/
7^Consonant cluster -- occurs when more than one consonant is present in the onset or the coda. EX: /kɹ/ and /ft/ in /kɹæft/
8^Diphthong -- occurs when one vowel transitions into another within the same syllable. EX: /aj/ in /kajt/ (can also be written as /ai/)
9^Word initial -- the very first sound in a word. I use this to mean one preceded by silence, specifically.
10^Word final -- the very last sound in a word. I use this to mean one followed by silence, specifically.