The easiest way is to think of a token as a word. That isn't exact, but it is a good enough framework.

To give you an idea, Llama 3 was trained on more than 15 trillion tokens.

From what I understand, we have to feed it a lot of data.

That is the baseline, if you want to work from it.