The paper that started all this: Attention is All You Need | Google Research

This links to the original Google Research paper that sparked the current revolution in large language models. It introduces the transformer architecture as a better alternative to recurrent sequence transduction models. The paper is a bit dated now, but still relevant if you want to dive deeper, and although somewhat technical, it remains informative for a curious beginner.
Read it here: