There's already AI singing synthesis in the form of NNSVS. To develop a new voicebank, you'll need to record yourself singing songs, label the phonemes, then train the model. Reusing UTAU recordings isn't ideal here because they're sung in a flat, monotone style, and the model would learn that delivery: it captures your pronunciation and tone of voice as well as your singing style.
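If you haven't dealt with phoneme labels before, here's a minimal sketch of sanity-checking one, assuming the HTS-style mono label format ("start end phoneme" per line, times in 100 ns units) that NNSVS-style pipelines commonly use. The file names are hypothetical; check your recipe's docs for the exact format it expects.

```python
# Parse an HTS-style mono label file and check it against the recording.
import wave

LAB_PATH = "song01.lab"   # hypothetical label file: "start end phoneme" per line
WAV_PATH = "song01.wav"   # the matching recording

# HTS labels count time in units of 100 nanoseconds; convert to seconds.
segments = []
with open(LAB_PATH, encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        start, end, phoneme = line.split()
        segments.append((int(start) / 1e7, int(end) / 1e7, phoneme))

# Compare the last label's end time against the actual audio length to
# catch labels that have drifted out of sync with the recording.
with wave.open(WAV_PATH) as w:
    audio_sec = w.getnframes() / w.getframerate()

label_sec = segments[-1][1]
print(f"{len(segments)} phonemes, labels end at {label_sec:.2f}s, "
      f"audio is {audio_sec:.2f}s")
if abs(label_sec - audio_sec) > 0.1:
    print("warning: label timing does not match audio length")
```

You'll be doing this for every recording, so catching misaligned labels early saves a lot of wasted training runs.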
If you only want the tone of voice, you can use Diff-SVC, which functions more as a voice-changing effect than as a synthesizer. To use a Diff-SVC model, you input a reference a cappella track, which can be in any language. Of course, that means it's also going to copy the pronunciation and accent of the reference track. UTAU recordings are still not ideal as training data because they cover a very limited pitch range. You'll want to sing songs again, or render out various songs with the UTAU bank if you don't mind the resampler sound. No phoneme labelling is required.
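Since pitch range is the main worry with this kind of training data, here's a rough sketch of how you might measure it before committing to a dataset. librosa is assumed to be installed; the file names and note limits are hypothetical, so adjust them to your own recordings.

```python
# Estimate the overall pitch range covered by candidate training audio.
import librosa
import numpy as np

FILES = ["take01.wav", "take02.wav"]  # hypothetical candidate recordings

f0_frames = []
for path in FILES:
    y, sr = librosa.load(path, sr=None, mono=True)
    # pyin gives a per-frame F0 estimate; NaN marks unvoiced frames.
    f0, voiced, _ = librosa.pyin(y,
                                 fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"),
                                 sr=sr)
    f0_frames.append(f0[~np.isnan(f0)])

f0_all = np.concatenate(f0_frames)
lo, hi = np.percentile(f0_all, [5, 95])  # ignore outlier frames
span = 12 * np.log2(hi / lo)             # range in semitones
print(f"pitch range roughly {librosa.hz_to_note(lo)} to "
      f"{librosa.hz_to_note(hi)} ({span:.1f} semitones)")
```

If the span comes out at only a few semitones, which is typical of flat UTAU samples, expect the trained model to sound strained whenever the reference track goes outside that band.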