{"id":12373,"date":"2025-02-17T10:10:21","date_gmt":"2025-02-17T10:10:21","guid":{"rendered":"https:\/\/metaschool.so\/articles\/?p=12373"},"modified":"2025-02-17T10:10:23","modified_gmt":"2025-02-17T10:10:23","slug":"llm-architecture","status":"publish","type":"post","link":"https:\/\/metaschool.so\/articles\/llm-architecture\/","title":{"rendered":"Understanding the LLM Architecture"},"content":{"rendered":"<p>Large Language Models (LLMs), like ChatGPT or GPT-4, are AI systems designed to understand and generate human-like text. They power chatbots, translation tools, and content creators. At their core, LLMs are neural networks trained on vast amounts of text data to predict the next word in a sequence. Their magic lies in their ability to capture context, grammar, and even creativity\u2014all thanks to a groundbreaking architecture called Transformers.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<h3 class=\"wp-block-heading\">What are LLMs?<\/h3>\n\n\n\n<p>A large language model is designed to understand and generate human-like text. Built using transformer architecture \u2014 a type of neural network that processes language through self-attention mechanisms \u2014 LLMs can handle vast amounts of text data. Unlike traditional models that analyze input sequentially, LLMs process whole sequences in parallel, enabling faster training and the use of high-performance hardware like GPUs.<\/p>\n\n\n\n<p>These models are trained on extensive datasets, often spanning billions of web pages, books, and other textual content. Through this training, LLMs learn the intricacies of language, grammar, and context, allowing them to perform a wide variety of tasks. 
These include generating text, translating languages, summarizing content, answering questions, and even assisting in creative and technical tasks like coding.<\/p>\n\n\n\n<p><em>If you want to learn about the basics of LLMs in more detail, check out this comprehensive <a href=\"https:\/\/metaschool.so\/articles\/what-is-llm-large-language-model\">guide<\/a>.<\/em><\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Architecture_of_a_Transformer\"><\/span>Architecture of a Transformer<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Introduced in the 2017 paper <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener\"><em>Attention Is All You Need<\/em><\/a>, the Transformer replaced older models (like RNNs and LSTMs) that struggled with long sentences and slow training. Transformers use attention mechanisms to process entire sequences of text at once, making them faster and more accurate. And most LLMs are built on the transformer architecture. <\/p>\n\n\n\n<p>Let\u2019s break down how they work! <\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Tokenization: Breaking Text into Pieces<\/strong><\/h3>\n\n\n\n<p>Tokenization is the first step in preparing the input text for an LLM. It converts raw text, such as &#8220;Hello, world!&#8221;, into smaller units called&nbsp;<strong>tokens<\/strong>. These tokens can represent whole words (e.g., &#8220;Hello&#8221;), punctuation marks (&#8220;,&#8221; or &#8220;!&#8221;), subwords (parts of words), or even individual characters. <\/p>\n\n\n\n<p>For example, a complex word like &#8220;unhappiness&#8221; might be split into subword tokens like &#8220;un&#8221; and &#8220;happiness&#8221;. This process simplifies how the model processes language, allowing it to handle rare or unfamiliar words by breaking them into recognizable chunks. 
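To make the subword idea concrete, here is a minimal greedy longest-match lookup in Python. The tiny vocabulary is invented purely for illustration; real tokenizers such as BPE or WordPiece learn their vocabularies from data and behave more subtly than this sketch:

```python
# Toy greedy longest-match subword tokenizer (BPE/WordPiece-style lookup).
# This vocabulary is a made-up illustration, not any real model's vocabulary.
VOCAB = {"un", "happiness", "happy", "ness", "hello", "world"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest substring first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # unknown character: emit it alone
            i += 1
    return tokens

print(tokenize("unhappiness"))  # -> ['un', 'happiness']
```

A rare word the model has never seen whole is still representable as familiar pieces, which is exactly why subword tokenization handles slang and technical terms gracefully.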
Without tokenization, the model would struggle with variations in spelling, slang, or technical terms, making this step foundational to its ability to interpret diverse inputs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Embeddings: Turning Words into Numbers<\/strong><\/h3>\n\n\n\n<p>Once the text is tokenized, the model translates each token into a numerical representation called a&nbsp;<strong>word embedding<\/strong>. These embeddings are dense vectors (lists of numbers) that capture semantic meaning. For instance, words like &#8220;king&#8221; and &#8220;queen&#8221; might have similar vectors because they share contextual relationships (e.g., royalty). Similarly, words like &#8220;apple&#8221; and &#8220;juice&#8221; would have some similarity due to their frequent co-occurrence in sentences.<\/p>\n\n\n\n<p>However, embeddings alone aren\u2019t enough to fully understand language. While they capture the meaning of individual words, they don\u2019t account for the order or structure of words in a sentence. For example, the sentences &#8220;The dog chased the cat&#8221; and &#8220;The cat chased the dog&#8221; use the same words but have entirely different meanings. <\/p>\n\n\n\n<p>This is where\u00a0<strong>positional encoding<\/strong>\u00a0comes in. Positional encoding adds information about the position of each token in a sentence using mathematical patterns, often sine and cosine functions. For instance, the word &#8220;dog&#8221; in position 1 of a sentence gets a different positional vector than &#8220;dog&#8221; in position 5. This helps the model understand relationships like &#8220;Who let the dog out?&#8221; vs. &#8220;The dog chased the cat,&#8221; where the position of &#8220;dog&#8221; changes the meaning.<\/p>\n\n\n\n<p>But even with embeddings and positional encoding, the model still needs a way to understand how words in a sentence relate to each other. 
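The sine-and-cosine scheme just described can be written down directly; the formulas below follow the original Transformer paper (PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos of the same angle), while the dimension sizes in the example are chosen arbitrarily for illustration:

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    even indices get sin(pos / 10000**(2i/d_model)), odd indices get cos."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Position 0 encodes as alternating sin(0)=0 and cos(0)=1.
print(positional_encoding(0, 4))  # -> [0.0, 1.0, 0.0, 1.0]

# The same token ("dog") added to different positional vectors at
# positions 1 and 5 yields different inputs, which is how word order
# survives parallel processing.
assert positional_encoding(1, 8) != positional_encoding(5, 8)
```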
This is where&nbsp;<strong>self-attention<\/strong>&nbsp;comes into play.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Self-Attention: The &#8220;Context Finder&#8221;<\/strong><\/h3>\n\n\n\n<p>The heart of a Transformer is the&nbsp;<strong>self-attention mechanism<\/strong>, which determines how words in a sentence relate to one another. For every token, the model creates three vectors: <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Query<\/strong>&nbsp;(what the token is &#8220;looking for&#8221;)<\/li>\n\n\n\n<li><strong>Key<\/strong>&nbsp;(what it &#8220;offers&#8221; to others)<\/li>\n\n\n\n<li><strong>Value<\/strong>&nbsp;(the token\u2019s core content)<\/li>\n<\/ol>\n\n\n\n<p>The model then calculates&nbsp;<strong>attention scores<\/strong>&nbsp;between every pair of tokens by comparing their Queries and Keys. These scores decide how much &#8220;focus&#8221; each token receives from others.<\/p>\n\n\n\n<p>For example, in the sentence &#8220;She drank apple juice,&#8221; the word &#8220;juice&#8221; might strongly attend to &#8220;apple&#8221; because their relationship is critical to the meaning. The attention scores are normalized using a softmax function, and the resulting weights are applied to the Value vectors. This creates a weighted sum that represents each token in the context of the entire sentence.<\/p>\n\n\n\n<p>In simpler terms, self-attention allows the model to dynamically highlight relevant words. For instance, when thinking about &#8220;juice,&#8221; the model might emphasize &#8220;apple&#8221; because it\u2019s the most relevant word in that context. 
This mechanism is what enables Transformers to understand complex relationships in text, making them far more powerful than older models that processed words one at a time without considering their connections.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Multi-Head Attention: Seeing from Multiple Angles<\/strong><\/h3>\n\n\n\n<p>To capture diverse relationships, Transformers use&nbsp;<strong>multi-head attention<\/strong>. Instead of one set of Query\/Key\/Value vectors, the model creates multiple parallel &#8220;heads,&#8221; each with its own set of vectors. Each head learns to focus on different types of relationships\u2014for example, one head might track grammatical structure, another could focus on semantic meaning, and a third might identify contextual nuances. The outputs from all heads are combined into a single representation, enriching the model\u2019s understanding. This parallel processing allows the model to analyze sentences from multiple perspectives simultaneously, much like how humans consider tone, intent, and literal meaning when interpreting language.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Feed-Forward Networks: Processing the Insights<\/strong><\/h3>\n\n\n\n<p>After attention, each token\u2019s updated vector passes through a&nbsp;<strong>feed-forward neural network<\/strong>. This network applies non-linear transformations to the data, refining features and emphasizing certain patterns. For instance, it might amplify signals related to verbs in a sentence or dampen irrelevant details. Unlike attention, which mixes information across tokens, this step operates independently on each token\u2019s vector. 
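A minimal sketch of this position-wise step, FFN(x) = ReLU(xW1 + b1)W2 + b2, with invented toy weights (real models learn these during training, and use much larger hidden dimensions):

```python
def relu(xs: list[float]) -> list[float]:
    return [max(0.0, x) for x in xs]

def linear(x, W, b):
    """y = xW + b for one vector; W is stored as a list of output columns."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(W, b)]

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN from the Transformer, applied to each token
    independently: FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Toy 2-d weights, invented for illustration only.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]

out = feed_forward([1.0, 2.0], W1, b1, W2, b2)
print(out)  # -> [0.0, 1.5]
```

Note that nothing here looks at any other token: the mixing across positions happened in the attention step, and this stage only transforms each vector on its own.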
Think of it as a final &#8220;polishing&#8221; stage where the model digests the insights gathered from attention and prepares them for deeper layers or output generation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Layer Stacking: Building Depth<\/strong><\/h3>\n\n\n\n<p>Transformers stack multiple layers of attention and feed-forward blocks. Each layer refines the model\u2019s understanding incrementally. In early layers, the model might learn basic grammar or word associations (e.g., &#8220;cat&#8221; and &#8220;dog&#8221; are both animals). Deeper layers handle abstract concepts, like irony or logical reasoning. For example, in the sentence &#8220;The movie was so bad it was good,&#8221; lower layers might parse the literal meaning, while higher layers recognize the sarcasm. This hierarchical processing mimics how humans analyze text\u2014first grasping structure, then nuance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Decoder (for Text Generation)<\/strong><\/h3>\n\n\n\n<p>In models like GPT, the&nbsp;<strong>decoder<\/strong>&nbsp;generates text autoregressively, predicting one token at a time. It uses&nbsp;<strong>masked self-attention<\/strong>, which prevents the model from &#8220;peeking&#8221; at future tokens during training. For instance, when predicting the word &#8220;juice&#8221; in &#8220;She drank apple ___,&#8221; the decoder only attends to &#8220;She,&#8221; &#8220;drank,&#8221; and &#8220;apple&#8221;\u2014not the blank position. This masking ensures the model learns to rely on context from earlier tokens. <\/p>\n\n\n\n<p>During inference, the decoder iterates: it takes the user\u2019s input, generates a token, appends it to the input, and repeats until a complete response is formed. 
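The masking idea fits in a few lines of Python. The boolean matrix below is a simplified stand-in for what real implementations do (they add a large negative number to the masked attention scores before the softmax), but the visibility pattern is the same:

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j:
    each token sees itself and earlier tokens, never the future."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = ["She", "drank", "apple", "juice"]
mask = causal_mask(len(tokens))

# While processing "apple" (position 2) to predict the next token,
# the decoder can see "She", "drank", "apple" -- but not "juice".
visible = [t for t, ok in zip(tokens, mask[2]) if ok]
print(visible)  # -> ['She', 'drank', 'apple']
```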
Techniques like temperature scaling or top-k sampling add randomness to avoid repetitive outputs, balancing creativity and coherence.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Overview\"><\/span>Overview<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>This figure shows the Transformer architecture for sequence-to-sequence tasks. <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"711\" height=\"703\" src=\"https:\/\/metaschool.so\/articles\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-13-162154.png\" alt=\"Transformer Model Architecture\" class=\"wp-image-12375\" style=\"width:436px;height:auto\" srcset=\"https:\/\/metaschool.so\/articles\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-13-162154.png 711w, https:\/\/metaschool.so\/articles\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-13-162154-300x297.png 300w\" sizes=\"auto, (max-width: 711px) 100vw, 711px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>On the left is the encoder, which processes input embeddings (with positional encodings) through stacked layers of multi-head self-attention and feed-forward networks, each followed by Add &amp; Norm steps. <\/p>\n\n\n\n<p>On the right is the decoder, which similarly uses masked multi-head self-attention to handle the partially generated sequence, then attends to the encoder\u2019s output, and finally applies a feed-forward network. The final linear layer and softmax produce output probabilities at each time step. <\/p>\n\n\n\n<p>This design removes recurrence, enabling parallel processing while leveraging attention to capture dependencies across positions. 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Training_an_LLM\"><\/span><strong>Training an LLM<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Training a Large Language Model is a two-phase process:&nbsp;<strong>pre-training<\/strong>&nbsp;and&nbsp;<strong>fine-tuning<\/strong>. During pre-training, the model learns general language patterns by analyzing vast amounts of text data, often sourced from books, websites, and other publicly available content. This phase involves tasks like predicting the next word in a sentence or filling in masked (hidden) words. <\/p>\n\n\n\n<p>For example, given the sentence &#8220;The cat sat on the ___,&#8221; the model might predict &#8220;mat&#8221; or &#8220;floor.&#8221; The model adjusts its internal parameters (weights) using a&nbsp;<strong>loss function<\/strong>, typically cross-entropy loss, which quantifies the difference between its predictions and the actual text. Over time, this iterative process helps the model grasp grammar, facts, and reasoning skills. <\/p>\n\n\n\n<p>After pre-training, the model undergoes&nbsp;<strong>fine-tuning<\/strong>, where it is trained on smaller, task-specific datasets (e.g., question-answering pairs or conversational data) to specialize its knowledge. This step refines its behavior, ensuring it can follow instructions, avoid harmful outputs, or adapt to niche domains like medical or legal text. The combination of broad pre-training and targeted fine-tuning allows LLMs to balance versatility with precision.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Generating_Text_How_LLMs_%E2%80%9CThink%E2%80%9D\"><\/span><strong>Generating Text: How LLMs &#8220;Think&#8221;<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>When generating text, LLMs work&nbsp;<strong>autoregressively<\/strong>, meaning they produce one token at a time, using each new token as input for the next step. 
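A minimal sketch of this loop, with a hard-coded lookup table standing in for the model's actual forward pass (a real model would score every vocabulary token at each step):

```python
# Toy continuation table: maps "everything generated so far" to the next token.
# A real LLM computes this prediction with its Transformer layers.
CONTINUATIONS = {
    (): "Today's",
    ("Today's",): "weather",
    ("Today's", "weather"): "is",
    ("Today's", "weather", "is"): "sunny",
    ("Today's", "weather", "is", "sunny"): "<eos>",
}

def next_token(context: list[str]) -> str:
    return CONTINUATIONS[tuple(context)]

def generate(max_tokens: int = 10) -> str:
    out = []
    while len(out) < max_tokens:
        tok = next_token(out)     # predict one token from everything so far
        if tok == "<eos>":        # end-of-sequence token stops the loop
            break
        out.append(tok)           # append the token and feed it back in
    return " ".join(out)

print(generate())  # -> Today's weather is sunny
```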
For instance, if a user asks, &#8220;What\u2019s the weather like today?&#8221; the model might start with &#8220;Today\u2019s,&#8221; predict &#8220;weather,&#8221; then append &#8220;is,&#8221; and so on, until it forms a complete response. <\/p>\n\n\n\n<p>To balance creativity and coherence, models use&nbsp;<strong>sampling strategies<\/strong>.&nbsp;<strong>Greedy sampling<\/strong>&nbsp;selects the most probable token at each step, but this can lead to repetitive or rigid outputs. To introduce variety,&nbsp;<strong>temperature scaling<\/strong>&nbsp;adjusts the randomness of predictions\u2014lower values make the model conservative (e.g., &#8220;sunny&#8221;), while higher values encourage riskier choices (e.g., &#8220;partly cloudy with a chance of rainbows&#8221;).&nbsp;<\/p>\n\n\n\n<p><strong>Top-k<\/strong>&nbsp;and&nbsp;<strong>top-p<\/strong>&nbsp;sampling restrict choices to the&nbsp;<em>k<\/em>&nbsp;most likely tokens or a cumulative probability threshold (<em>p<\/em>), respectively. For example, top-k=5 might limit the model to selecting from words like &#8220;sunny,&#8221; &#8220;rainy,&#8221; &#8220;cloudy,&#8221; &#8220;windy,&#8221; or &#8220;stormy&#8221; for weather-related contexts. These techniques ensure the output is both relevant and engaging, mimicking human-like variability without sacrificing logic.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Challenges_and_Considerations\"><\/span><strong>Challenges and Considerations<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Despite their capabilities, LLMs face significant challenges.&nbsp;<\/p>\n\n\n\n<p><strong>Computational cost<\/strong>&nbsp;is a major barrier\u2014training a model like GPT-3 requires thousands of specialized processors (GPUs\/TPUs) and millions of dollars, making it inaccessible to most organizations. 
Energy consumption and environmental impact are growing concerns too.&nbsp;<\/p>\n\n\n\n<p><strong>Bias and safety<\/strong>&nbsp;issues arise because models learn from human-generated data, which often contains stereotypes, misinformation, or toxic language. For example, a model might associate certain professions with specific genders due to biased training data. Ensuring outputs are safe and unbiased requires careful filtering and ethical guidelines.&nbsp;<\/p>\n\n\n\n<p><strong>Ethical dilemmas<\/strong>&nbsp;also loom large: LLMs can generate convincing fake news, plagiarize content, or replace jobs in writing and customer service. Addressing these challenges involves technical solutions (e.g., bias mitigation algorithms), policy frameworks (e.g., content moderation), and public awareness to balance innovation with responsibility.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Conclusion: The Power and Promise of LLMs<\/strong><\/h3>\n\n\n\n<p>Large Language Models represent a monumental leap in artificial intelligence, combining the Transformer architecture\u2019s ingenuity with vast computational resources to mimic human language understanding and creativity. At their core, LLMs like GPT-4 rely on <strong>tokenization<\/strong> to break text into manageable pieces, <strong>embeddings<\/strong> to capture meaning, <strong>self-attention<\/strong> to analyze context, and <strong>multi-head attention<\/strong> to explore diverse relationships. Layer stacking and feed-forward networks refine these insights, while the <strong>decoder<\/strong> generates coherent, context-aware text through autoregressive processes.<\/p>\n\n\n\n<p>Training LLMs involves two phases: <strong>pre-training<\/strong> on massive datasets to learn general language patterns, and <strong>fine-tuning<\/strong> to specialize for tasks like answering questions or writing code. 
Despite their capabilities, challenges like <strong>computational costs<\/strong>, <strong>bias<\/strong>, and <strong>ethical risks<\/strong> underscore the need for responsible development.<\/p>\n\n\n\n<p>Ultimately, LLMs are not just tools for generating text\u2014they are gateways to democratizing access to knowledge, creativity, and problem-solving. As these models evolve, balancing innovation with ethical safeguards will be key to unlocking their full potential while ensuring they benefit society as a whole.<\/p>\n\n\n\n<p><strong>Related Reading:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/metaschool.so\/articles\/what-is-llm-large-language-model\">What is LLM in AI? Large Language Models Explained!<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/metaschool.so\/articles\/starcoder\">StarCoder: LLM for Code \u2014 A Comprehensive Guide<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/metaschool.so\/articles\/humaneval-benchmark-for-llm-code-generation\">HumanEval Benchmark: Evaluating LLM Code Generation Capability<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/metaschool.so\/articles\/what-is-langchain-complete-guide-2025\">What is LangChain: Complete Guide 
2025<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":19,"featured_media":12379,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"neve_meta_sidebar":"","neve_meta_container":"","neve_meta_enable_content_width":"","neve_meta_content_width":0,"neve_meta_title_alignment":"","neve_meta_author_avatar":"","neve_post_elements_order":"","neve_meta_disable_header":"","neve_meta_disable_footer":"","neve_meta_disable_title":"","footnotes":""},"categories":[344],"tags":[],"class_list":["post-12373","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence"],"_links":{"self":[{"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/posts\/12373","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/users\/19"}],"replies":[{"embeddable":true,"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/comments?post=12373"}],"version-history":[{"count":11,"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/posts\/12373\/revisions"}],"predecessor-version":[{"id":12462,"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/posts\/12373\/revisions\/12462"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/media\/12379"}],"wp:attachment":[{"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/media?parent=12373"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/categories?post=12373"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/metaschool.so\/articles\/wp-json\/wp\/v2\/tags?post=12373"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}
","templated":true}]}}