I don’t think it’s a feature that serves a purpose.
I think it’s a technical limitation of how “Sing” works.
Basically, Apple uses machine learning to isolate the vocals from the rest of the track.
What I think it does, roughly: initialize the Neural Engine (Apple’s ML hardware), load the AI model into memory, feed the music into the model, extract the output data, and play the result through the speaker.
None of this starts instantly; it seems to take a few seconds to “start”.
I think they used the fade-out/fade-in to “hide” that delay as best they could, but I agree it’s a bit unnatural.
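
To make the idea concrete, here is a rough sketch of how a fade-out/fade-in could cover a startup delay. This is only my guess at the general technique, not Apple's actual code: the function names, the linear ramp, and the sample counts are all made up for illustration.

```python
def linear_ramp(n, start, end):
    """Return n gain values moving linearly from start to end."""
    if n <= 0:
        return []
    if n == 1:
        return [end]
    return [start + (end - start) * i / (n - 1) for i in range(n)]


def mask_startup(original, processed, toggle_at, startup, fade):
    """Hypothetical masking of a model's startup delay with fades.

    original  -- samples of the untouched track
    processed -- samples of the vocal-reduced track (same length)
    toggle_at -- sample index where the user turns the feature on
    startup   -- assumed number of samples the model needs before it has output
    fade      -- length of each fade, in samples
    """
    if len(original) != len(processed):
        raise ValueError("both tracks must have the same length")

    # The processed audio cannot start before the model is ready,
    # and it should not start before the fade-out has finished.
    ready = toggle_at + max(startup, fade)
    fade_out = linear_ramp(fade, 1.0, 0.0)
    fade_in = linear_ramp(fade, 0.0, 1.0)

    out = []
    for i in range(len(original)):
        if i < toggle_at:
            out.append(original[i])                           # normal playback
        elif i < toggle_at + fade:
            out.append(original[i] * fade_out[i - toggle_at])  # fade out
        elif i < ready:
            out.append(0.0)                                   # silence while waiting
        elif i < ready + fade:
            out.append(processed[i] * fade_in[i - ready])     # fade in
        else:
            out.append(processed[i])                          # processed playback
    return out
```

For example, `mask_startup([1.0] * 12, [0.5] * 12, toggle_at=2, startup=5, fade=3)` fades the original out over three samples, stays silent until the (assumed) model is ready, then fades the processed track in. The gap between the two fades is exactly the part that feels unnatural.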