SOLVED: This is my first UTAU and I'm doing everything "right" but it sounds like a robot... Why?

Kurara

Momo's Minion
Thank you so much for your fast replies ! ^w^
Ok, so here is Circus Monster sung by my Utau~ https://clyp.it/sjc42iov
( you can hear the "shi" "yo" "mu".. They made my ears bleed ;v; )
Oh and, pardon me if I made any grammatical mistakes, I'm French ^^'
 

na4a4a

Outwardly Opinionated and Harshly Critical
Supporter
Defender of Defoko
Utau is very finicky.

The general rundown is this:
Avoid any sort of non-voice ambient noise. Even something extraordinarily low or high in frequency will distort the render, whether or not you can hear it in the original recording.
Reverb and echo from your room will make almost any voicebank sound robotic. Try flipping your bed over and recording behind it, or record inside a fully stocked closet if you have one. Note that even a full closet may not be enough: it can still echo or, even worse, sound boxy.

You will need a semi-decent mic. Even a Logitech headset is an upgrade from a built-in mic, but it will still give you iffy quality. Something like a Nady usb-1c or usb-1cx would be good (and affordable) and pushes you towards the limits of what a USB mic will give quality-wise.

Use a pop filter: any air blowing from consonants will cause distortion. Pop filters are $5.
Don't put a sock over your mic as a pop filter; it won't help and the lint can damage the mic.

Otos will make you cry. I could write pages of text on otoing; it's such a hassle.

There are a lot of things that can be at play. These are all general tips that anyone can use.
 

Milk

i like bad things
Supporter
Defender of Defoko
Hmmmm. Well, it sounds like the oto might be off, first of all. I could fix it for you if you want me to?
Also, it sounds like her .frq files might be a bit wonky, or maybe her samples were recorded as stereo tracks? It could just be the resampler you're using, too. (sorry if none of this makes sense lol)
But the tone is nice! It just needs some work is all.
Btw, your English is good! No complaints here ^^
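On the stereo question: one quick way to check is to read the channel count of each sample file. Below is a minimal Python sketch using only the standard library. The function names are made up for illustration, and it assumes 16-bit PCM WAV files, which may not match every recording.

```python
import array
import sys
import wave


def channel_count(wav_file):
    """Return the channel count (1 = mono, 2 = stereo) of a WAV path or file object."""
    with wave.open(wav_file, "rb") as w:
        return w.getnchannels()


def to_mono_16bit(wav_in, wav_out):
    """Write a mono copy of a 16-bit PCM WAV, averaging left and right if it is stereo."""
    with wave.open(wav_in, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() not in (1, 2):
            raise ValueError("this sketch only handles 16-bit mono or stereo PCM")
        channels, rate = w.getnchannels(), w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))

    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    if channels == 2:
        # Stereo frames are interleaved as L, R, L, R, ...
        samples = array.array(
            "h",
            ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
        )

    if sys.byteorder == "big":
        samples.byteswap()  # back to little-endian before writing

    with wave.open(wav_out, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(samples.tobytes())
```

Running `channel_count` over every .wav in the voicebank folder would show which samples are stereo. Always convert into a copy rather than overwriting the originals.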
 

na4a4a

Outwardly Opinionated and Harshly Critical
Supporter
Defender of Defoko
Also, don't upload to Vocaroo; they compress the audio poorly and the quality is really low.
 

Kurara

Momo's Minion
@na4a4a Ok! So.. the problem is my mic? I'll see if I can afford a better mic and all, thanks!

@Milk The oto? Oh, I thought I did it well x) but it's hard to use ;-; I would love some help, yes! :D
.frq files? Holy crap, I hadn't heard of them ._.
About the stereo tracks, yeah, some of them are actually stereo, and others mono
Ok, cool! :D I really want to improve my Utau so I'll do my best~
Haha, thanks! c:
@na4a4a Ok, thanks for your advice! c;
 

Obakebaka

Momo's Minion
Have you selected all of the notes in the UST and set modulation to 0 in the region properties? That tends to fix most pitch problems, even with lower-quality mics. Like others have said, the otos do need to be corrected so that they are on time. Also, pitchbends: pitchbending in the UST is what really gives a vocal track "life". Don't go overboard; keep it subtle enough to mimic how a human would sound singing it. It would be great to get a better mic, but better mics do tend to pick up more background noise, as I've been finding out while recording my banks with a Yeti instead of a laptop mic. If you want the higher quality, you have to put in more work editing each sample so that it's clean enough.
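The "set every note to 0 modulation" step can also be done outside the editor. Here is a rough Python sketch that works on the text of a UST. It assumes notes live in numbered sections like `[#0000]` and that the field is written `Modulation=` (some files seem to spell it `Moduration=`); both are assumptions from memory, so check them against a real UST first. The function name is made up.

```python
import re

# Assumed layout: note sections are numbered, e.g. [#0000]; other sections
# such as [#SETTING] or [#TRACKEND] are left untouched.
NOTE_HEADER = re.compile(r"^\[#\d+\]$")
MOD_LINE = re.compile(r"^(Modulation|Moduration)=")


def zero_modulation(ust_text):
    """Return the UST text with modulation set to 0 on every note."""
    out = []
    in_note = False
    has_mod = False
    for line in ust_text.splitlines():
        if line.startswith("[#"):
            # A new section starts: finish the previous note if it had no modulation line.
            if in_note and not has_mod:
                out.append("Modulation=0")
            in_note = bool(NOTE_HEADER.match(line))
            has_mod = False
            out.append(line)
        elif in_note and MOD_LINE.match(line):
            out.append(MOD_LINE.match(line).group(1) + "=0")
            has_mod = True
        else:
            out.append(line)
    if in_note and not has_mod:
        out.append("Modulation=0")
    return "\n".join(out) + "\n"
```

USTs are often saved in Shift-JIS, so when reading and writing the file you would probably need to pass a matching `encoding` to `open()`. Work on a copy of the UST.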
 

Kurara

Momo's Minion
No, I didn't do that.. I didn't want to use the tools without knowing what they do 'x'
So, I guess that my oto is the big problem here, ok.. I'll fix it, but I don't really know how to do it properly, same with pitchbending
Ugh.. I hope mics aren't too expensive ;-; (mine cost 15€ [idk how much that is in £/$])
 

na4a4a

Outwardly Opinionated and Harshly Critical
Supporter
Defender of Defoko
A euro is about equal to a US dollar and the British pound is about a quarter higher in value than that.

Mics in Europe can be hard to discuss here since the majority of us are from the Americas (mostly North America, so the US and Canada). I've actually helped several people outside the US pick out a mic, so I can try to help you if you want.

Generally, if you can't afford anything more than 15 euros, you won't find anything to upgrade to. Mics that aren't total garbage start at around $50-$60.
 

Truly

Teto's Territory
Defender of Defoko
I can maybe explain a few things that might help you make better choices when recording/configuring your bank.

The way (I believe) UTAU works is that it takes your recorded syllables and lines them up as you lay them out on the piano roll, then "vocodes" them (as in, uses a vocoder) with the "resampler" as the carrier wave, on the correct pitch.
So there are two sides to the process:
1) One side deals with the samples. This includes recording them, and blending them together in the piano roll ("crossfading," which is what your OTO.INI is for). Manipulating the samples is what will make your bank sound "natural" or not.
2) The other side deals with the vocoder. This includes choosing pitch on the piano roll (as well as using "pitch bend" functions), and the resampler. Since laymen generally can't create their own resampler/carrier wave, and wouldn't know what would be best even if they could, you'll probably just select from what you can find.

I won't go into recording tips, since a dozen and a half people here can give you those if you ask. Both good and bad advice float around these forums, the vocal-synth community, and the population of all musicians. What's important to understand, however, is that any noise that makes it into the recordings will also be picked up and pitch-changed by the resampler/vocoder! So if there is a lot of background noise, then that noise will be played along with your sample at whatever pitch you put into the piano roll. If there is a cat meowing in the background of your sample, it will be pitched and played too.

As for blending them together, I don't have a lot of experience configuring oto.ini files for UTAU, and you're better off asking the people here. But I will say that if your preutterance ("Attack") on a sample is too long, you'll feel some lag on that sample-- that's what I'm hearing in your demonstration linked above, @Kurara.

Your preutterance and (post-utterance? I've forgotten what UTAU calls it) determine how much of a sample crossfades with its neighbors by default. You might even be able to get by without using an oto.ini at all, and just crossfading all syllables by hand, but that kind of defeats the purpose of UTAU, and would be a pain not only for you, but also for anyone else who might deign to use your bank.

Laying notes out on the piano roll is pretty straightforward, whether you're using a prefab UST or creating one yourself. What might be less straightforward is how to use pitch-bends effectively. They aren't needed on every note, and are often best applied as very subtle anticipations right before notes, or little relaxed bits at the end of long notes-- like a pop singer who slides into the pitch of every line instead of properly jumping right onto the "correct" pitch. My voice instructor would have had a stroke if he heard me telling you to slide into pitches! But that's the method for pop music.

As for the resampler, you can think of it as the "vocal cords" of your UTAU. Yes, the inflection and personality all come from you, the recording artist of the voicebank's samples, but the tone of it comes from what resampler you choose. In reality, it is just a waveform that moves to whatever pitches you've laid out on the piano roll. It plays faster for higher notes, and slower for lower notes, just like your real vocal cords vibrate faster or slower depending on pitch. Each resampler has a slightly different composition (which is to say, the carrier wave of the vocoder is a slightly different wave), and in practice this will highlight and draw out different parts of your samples, create different overtones, and just generally have different sounds.

So, if you're just using a glorified vocoder, why does the range of a voicebank matter? Well, a vocoder will resonate with the samples you're using... rather, the farther away the pitch of the vocoder (that is, on the piano roll), from the pitch of your samples (the pitch you recorded at), the less strength the product will have. If you match the vocoder pitch and the recorded sample pitch, it will sound almost natural-- like it's your singing, but through someone else's vocal cords.

(Of course, tell me if I'm dead wrong, I may misunderstand how UTAU works... orz)
 

Kurara

Momo's Minion
Aww.. crap, I can't afford these mics.. Maybe later, then ;3;
 

Tomato Hentai

dont call me a veggie
Defender of Defoko
-snip because just seeing this above where i'm writing is distracting-
I'm sorry that this is very off-topic, but colouring a few words and sentences here and there the way you did is EXTREMELY distracting and is making it very very hard for me to even read what it is you're saying. I did pick up a bit of what you were saying, so I'll try to respond to that.
The UTAU resampler isn't a vocoder, it's just a resampler. A vocoder analyzes speech and can produce more sounds than just what you throw at it, and it does a lot more than just that too. A resampler simply takes what you throw at it and pitches it up or down in a certain way, or changes how breathy it is, et cetera.
But I'm just going off of memory here, so I could be wrong, too.
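To make "pitches it up or down" a bit more concrete, here is a toy Python sketch of the crudest possible kind of resampling: reading through the samples faster or slower. This is not how any actual UTAU resampler is implemented (those keep the note length and do a lot more); the function names are invented, and the only solid fact in it is that one semitone corresponds to a frequency ratio of 2^(1/12).

```python
def pitch_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones (12 = one octave up)."""
    return 2.0 ** (semitones / 12.0)


def naive_repitch(samples, semitones):
    """Tape-style pitch shift: step through the samples at a different speed.

    Uses linear interpolation between neighbouring samples. The result gets
    shorter when pitched up and longer when pitched down.
    """
    ratio = pitch_ratio(semitones)
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
        pos += ratio
    return out
```

Shifting up an octave doubles the reading speed, so the output is half as long. It also shows why anything else in the recording, noise included, gets shifted along with the voice.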
 

na4a4a

Outwardly Opinionated and Harshly Critical
Supporter
Defender of Defoko
Preutterance should always line up with the end of the consonant for CV and VCV.

The preutterance determines what is played before the actual note, hence "pre".

The preutterance also determines the space in which you can place an overlap. The overlap determines how large a crossfade can be.

The rules change for every type of consonant. Here are some general rules.
These will probably make no sense without images, but I'm on a phone right now.

Samples should have free space before them to allow for proper configuration; cutting out this silence will make your voicebank robotic and choppy. If needed, you can use Audacity to add 100 ms of silence before them to give yourself room to work.
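If there are a lot of samples, adding the lead-in silence by hand gets tedious. Here is a minimal Python sketch of the same idea using the standard `wave` module. The function name is made up, and it assumes plain uncompressed PCM WAV files.

```python
import wave


def prepend_silence(wav_in, wav_out, milliseconds=100):
    """Write a copy of a PCM WAV with the given amount of silence added at the start."""
    with wave.open(wav_in, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    frame_size = params.sampwidth * params.nchannels
    silent_frames = params.framerate * milliseconds // 1000
    # 8-bit WAV is unsigned, so its "zero" level is 0x80; wider samples are signed.
    zero = b"\x80" if params.sampwidth == 1 else b"\x00"
    silence = zero * (silent_frames * frame_size)

    with wave.open(wav_out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(silence + frames)
```

Keep in mind that if the bank already has an oto.ini, every position in it is measured from the start of the file, so the existing entries would need to be shifted by the same amount.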

Hard and plosive-like consonants (k, t, p, d, g, b)
overlap: ~35 ms
The overlap should sit BEFORE the consonant, in the middle of the silence/free space.
preutterance:
Place as needed to create proper timing without affecting the overlap.
left cut off/left blank:
Instead of moving the overlap (which should stay around 35), you can move the left cut off around to change its position.

Fricatives and fluids (s, sh, z, m, n, l, f)
For these samples you do not necessarily need the silence before the sample.
overlap: 35-40 ms
Place it INTO the sample, not beforehand. You want to blend the consonant into the previous vowel.
left cut off/left blank:
Cut off the silence before the consonant in most cases. Since you are blending the consonant into the previous note, you don't need it. BE SURE you are not cutting off any of the consonant.

Vowels: try overlap 65 and preutterance 70.

Consonants like "ch" and "ts":
While "ch" is more of a fricative, you will usually want to oto it like a hard consonant/plosive. It depends on how long it was sustained; if it's really long, you can try otoing it like a fricative.
"ts" starts with a "t" for all intents and purposes. Oto it like a hard consonant/plosive.

Overlap should always be placed before the preutterance, never after. This reduces the chance of envelope glitches.
Err... I'm skipping over a lot, but typing on my phone is a hassle.
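The last rule (overlap before preutterance) is easy to check automatically. Here is a small Python sketch that parses oto.ini lines and flags entries that break it. It assumes the usual line layout `file.wav=alias,offset,consonant,cutoff,preutterance,overlap` with values in milliseconds; that field order is from memory, and the class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class OtoEntry:
    filename: str
    alias: str
    offset: float        # left blank
    consonant: float     # fixed region
    cutoff: float        # right blank
    preutterance: float
    overlap: float


def parse_oto_line(line):
    """Parse one oto.ini line (assumed field order, see above). Empty fields count as 0."""
    filename, _, rest = line.strip().partition("=")
    alias, *numbers = rest.split(",")
    offset, consonant, cutoff, preutterance, overlap = (float(n or 0) for n in numbers)
    return OtoEntry(filename, alias, offset, consonant, cutoff, preutterance, overlap)


def overlap_after_preutterance(entries):
    """Return the entries whose overlap sits after the preutterance."""
    return [e for e in entries if e.overlap > e.preutterance]
```

For example, `ka.wav=ka,100,80,-200,60,35` passes, while `sa.wav=sa,50,120,-150,40,90` would be flagged because its overlap (90) comes after its preutterance (40).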
 

Kurara

Momo's Minion
@Truly Okay, so I'll have to work on the blending, the oto.ini and the resampler, right? I hope I'll find some tutorials about it, because I don't really understand how this works.. 8_8
Thank you so much for your constructive reply! I'll keep in mind what you've said~

@na4a4a Wow, now I'm realizing how bad I am at Utau >< I didn't get everything you said, but I'll try fixing my samples, thx c:
 

Truly

Teto's Territory
Defender of Defoko
Don't worry so much that you're "bad," just be cognizant of the fact that there's a lot to learn, and as you come to be competent and eventually master one area, there are still plenty more areas to deal with.
And chances are, you, like the OP, will struggle with your UTAU sounding "like a robot," for a good, long time.

I guess I should add, you shouldn't be discouraged at that. You should recognize it as an early stage and strive to move past it in due time.
 

수연 <Suyeon>

Your friendly neighborhood koreaboo trash
Supporter
Defender of Defoko
I wouldn't recommend seeking out tutorials - if you do, try to find ones that aren't older than a year. A lot of the ones you'll find will be outdated information and/or poorly explained.

A lot of users here who have been around for a good while are often willing to teach you or outright help you fix whatever problems you face in using the software. Don't let this first attempt at a bank discourage you, though, because 10/10 times the average user's first bank (or even first few banks) sucks in all aspects (mic quality, oto, accent), and it may take a good while to see results that you're 100% satisfied with. We've all been there.
 

Kurara

Momo's Minion
Alright! I'll do my best !

I watch some tutorials made by Yuunari; she's pretty good at explaining things, and her tutorials are from this year c:
It's nice to see people on this forum who can help newbies like me! It's really encouraging ! :D
 
