Pixel phones already have "interpreter mode" - it translates speech between two languages in real time, like an interpreter would.
I imagine the only additional challenge for what you're asking for is the audio mixing: you would need to take the "background noise" from the original audio and mix the new spoken audio back in on top of it.
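To give a rough idea of what that mixing step could look like, here is a minimal sketch. It assumes you already have the background and the translated speech as two mono tracks of float samples in the range -1.0 to 1.0 at the same sample rate. The function name `mix_tracks` and the gain values are made up for illustration, and getting a clean background track in the first place is not covered here.

```python
def mix_tracks(background, speech, background_gain=0.5, speech_gain=1.0):
    """Overlay speech on top of background, sample by sample.

    Both inputs are sequences of float samples in [-1.0, 1.0].
    The shorter track is treated as silence once it runs out.
    """
    length = max(len(background), len(speech))
    mixed = []
    for i in range(length):
        b = background[i] if i < len(background) else 0.0
        s = speech[i] if i < len(speech) else 0.0
        # Weighted sum: turn the background down so the speech stays audible.
        sample = background_gain * b + speech_gain * s
        # Hard-clip to the valid range so loud passages do not overflow.
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


background = [0.2, 0.2, 0.2]   # hypothetical background samples
speech = [0.5, 1.0]            # hypothetical translated speech samples
print(mix_tracks(background, speech))
```

With real audio you would probably do the same thing on arrays rather than Python lists, and use something gentler than hard clipping, but the idea is just a weighted sum of the two tracks.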