The500Feed.Live

Everything going on in AI - updated daily from 500+ sources

Score: 21 · 🌐 News · May 13, 2026

Building an LLM From Scratch: I Trained Word Embeddings on Dostoevsky. Here’s What I Found.

In my previous article, I described how I implemented character-level tokenization over a very small corpus and worked through the most basic, initial phases of NLP and the foundations of LLMs. This time I implemented the next step toward modern NLP and LLM systems: embeddings. I built them from scratch and trained them on my laptop, which took over 30 hours. Let's dive into the concepts, mathematics, and code I worked through while building an LLM from scratch and applying embeddings to my corpus of nearly one million words.

What are embeddings? Embeddings are vector representations of words that let a model understand each word, rather than merely memorize a unique number assigned to every distinct word in the corpus. For a simple breakdown of the concept, take an example:

Text = "dog cat"
dog = (0.01, 0.04, 0.2)
cat = (0.01, 0.03, 0.2)
# dog and cat are given similar vectors because they appear in similar contexts

So, now these two word…
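The excerpt stops before showing how such vectors arise. The core intuition, that words sharing contexts end up with similar vectors, can be sketched with a count-based toy model. This is an illustrative assumption, not the article's actual training method (which, per the excerpt, learns embeddings from scratch over ~1M words); real systems like word2vec learn dense vectors by gradient descent rather than raw counts. The corpus, window size, and function names here are all invented for the demo.

```python
from math import sqrt

def cooccurrence_vectors(tokens, window=2):
    """Give each word a sparse vector of context-word counts.

    Words that occur in similar contexts accumulate similar counts,
    so their vectors point in similar directions.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    vecs = {w: [0.0] * len(vocab) for w in vocab}
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # count every neighbor within the window
                vecs[w][index[tokens[j]]] += 1.0
    return vecs

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Tiny invented corpus where "dog" and "cat" play the same role.
corpus = ("the dog chased the ball . the cat chased the ball . "
          "the dog slept . the cat slept .").split()
vecs = cooccurrence_vectors(corpus)

# "dog" and "cat" share contexts, so their similarity is high;
# "dog" vs. "chased" is noticeably lower.
print(cosine(vecs["dog"], vecs["cat"]))
print(cosine(vecs["dog"], vecs["chased"]))
```

Scaled up, this idea (with learned dense vectors instead of raw counts) is what makes "dog" and "cat" land near each other in embedding space, as in the (0.01, 0.04, 0.2) vs. (0.01, 0.03, 0.2) example above.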

Read Original Article →

Source

https://pub.towardsai.net/building-an-llm-from-scratch-i-trained-word-embeddings-on-dostoevsky-heres-what-i-found-b3169c1ae674?source=rss----98111c9905da---4