Large Language Model (LLM) Application Architecture

A16z published an article this week focusing on the application architecture of large language models (LLMs): https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/

This article provides a detailed analysis of the systems, tools, and design patterns commonly used by startups and large companies when designing and implementing LLM applications. As a non-technical reader, I attempted to understand the architectural design thinking behind these applications.


The architecture diagram in the a16z article is mainly based on the "in-context learning" design pattern.

So, first we need to understand what "in-context learning" is. You can refer to the relevant entry on Wikipedia: https://en.wikipedia.org/wiki/In-context_learning_(natural_language_processing)

The entry mainly describes how a model predicts or generates subsequent text based on the given textual context. In other words, the model uses information from the preceding text to produce relevant and appropriate responses. This method mimics human conversational habits: we always base what we say next on the context of the conversation.

For example, suppose I ask, "How's the weather today?" You might answer based on your location and time, such as "It's sunny today" or "It's raining today." This is using the context (current weather conditions) to respond to my question.

Therefore, "In-context learning" in this context refers to the ability of language models to generate appropriate responses based on the given conversational context.

There are mainly three ways to use large language models (LLMs):

  1. Training your own model from scratch
  2. Fine-tuning based on open-source models
  3. Directly using APIs

Among them, in-context learning mainly applies to the third approach. Although directly using APIs is more convenient than training a model from scratch or fine-tuning one, the cost of API calls grows with the length of the prompt, so using in-context learning efficiently becomes very important. You can refer to another article by a16z for deeper insights: https://a16z.com/2023/05/25/ai-canon/

This article breaks down the entire workflow into three main steps:

  1. Data preprocessing/embedding: In this stage, private data (using legal documents as an example) is processed and stored for subsequent retrieval. This usually involves splitting documents into smaller chunks, running them through an embedding model, and then storing the results in a special database called a vector database. I have explained this part in detail in one of my previous articles.
  2. Prompt construction/retrieval: When a user submits a query, the application assembles the prompts to send to the language model. A compiled prompt typically combines a prompt template written by the developer, few-shot examples, any information fetched from external APIs (if needed), and relevant documents retrieved from the vector database.
  3. Prompt execution/inference: After the prompts are compiled, they are submitted to a pre-trained LLM (large language model) for processing, whether through proprietary model APIs or open-source or self-trained models. Some developers also add operational systems at this stage, such as logging, caching, and validation. I did not cover this part in my previous articles, but I may write a dedicated piece on it later.


Let's analyze each of these three steps in detail:

Data preprocessing/embedding

The source data may include documents in various formats, such as PDFs, CSVs, or SQL structured data. There are many ways to process and transform this data: some people prefer ETL (Extract, Transform, Load) tools like Databricks or Airflow, while others tend to use orchestration frameworks like LangChain and LlamaIndex.
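
As a rough illustration of the chunking step, here is a sketch that uses LangChain's RecursiveCharacterTextSplitter roughly as it looked when this article was written; the file name and chunk sizes are arbitrary choices, and import paths may differ in newer LangChain versions.

```python
# Sketch: split a long document into overlapping chunks before embedding.
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("contract.txt") as f:          # e.g. a legal document
    text = f.read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters per chunk
    chunk_overlap=200,   # overlap so sentences are not cut off at chunk edges
)
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks ready for the embedding model")
```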

Embedding is a method that converts high-dimensional, discrete, or unordered data (such as words, sentences, users, products, etc.) into low-dimensional, continuous, dense vectors. Many developers choose to use the OpenAI API directly, such as the text-embedding-ada-002 model. Some large companies may opt for Cohere, while developers who prefer open source might choose Hugging Face's Sentence Transformers library: https://huggingface.co/sentence-transformers
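
Below is a minimal sketch of the embedding step using the Sentence Transformers library linked above; the model name "all-MiniLM-L6-v2" and the sample sentences are illustrative choices of mine, not recommendations from the a16z article.

```python
# Sketch: turn text chunks into dense embedding vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "The lessee shall pay rent on the first day of each month.",
    "The lease term is twelve months.",
]
embeddings = model.encode(chunks)   # one dense vector per chunk
print(embeddings.shape)             # e.g. (2, 384) for this model
```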

A vector database is a database specifically designed for storing and processing vector data. It stores embedding vectors efficiently and supports fast similarity queries over them. Many people choose Pinecone, while there are also open-source options such as Weaviate, Vespa, and Qdrant, local vector management libraries like Chroma and Faiss, and OLTP extensions like pgvector.
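
Here is a small sketch of the store-and-query pattern using Faiss, one of the local libraries listed above; hosted vector databases such as Pinecone or Weaviate expose similar upsert and query operations through their own APIs. The random vectors stand in for real embeddings.

```python
# Sketch: store embedding vectors and retrieve the nearest neighbours of a query.
import numpy as np
import faiss

dim = 384                                   # must match the embedding model's output
index = faiss.IndexFlatL2(dim)              # exact L2-distance index
doc_vectors = np.random.rand(100, dim).astype("float32")   # stand-in for chunk embeddings
index.add(doc_vectors)

query_vector = np.random.rand(1, dim).astype("float32")    # stand-in for the query embedding
distances, ids = index.search(query_vector, 5)             # 5 most similar chunks
print(ids[0])                               # indices of the chunks to put into the prompt
```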

The context window usually refers to the amount or range of input data that the model can reference when generating predictions. For example, in processing text data, the context window may refer to the number of words surrounding the current word. For language models, a larger context window means the model can consider more historical information when generating predictions.
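
As a quick illustration of budgeting the context window, here is a sketch using OpenAI's tiktoken tokenizer; the 4,096-token limit shown is the figure commonly cited for gpt-3.5-turbo at the time of writing and is only an assumption for the example.

```python
# Sketch: check how much of the model's context window a prompt consumes.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Context: ...retrieved document chunks...\n\nQuestion: How long is the lease?"
n_tokens = len(enc.encode(prompt))

CONTEXT_WINDOW = 4096   # assumed limit for gpt-3.5-turbo at the time of writing
print(f"{n_tokens} tokens used, {CONTEXT_WINDOW - n_tokens} left for the answer")
```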

Regarding data preprocessing and embedding, some believe that as the available context window of large models grows, embeddings will increasingly be folded directly into prompts. Other experts hold the opposite view, arguing that as the context window expands, computational cost also increases, so using embeddings remains the more efficient approach.

Prompt construction/retrieval

Prompt construction usually starts with simple strategies, such as direct instructions or a handful of few-shot examples. In addition, there are more advanced prompt engineering techniques, such as Chain-of-Thought (CoT) and Tree of Thoughts (ToT), which can be referenced at https://www.promptingguide.ai/techniques. I will also provide a detailed analysis of these techniques later. These techniques can be used to build chatbots, perform document-based question answering, and more.
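
As a small illustration, here is what a few-shot Chain-of-Thought prompt might look like; the arithmetic examples are made up purely for demonstration.

```python
# Sketch: a few-shot Chain-of-Thought prompt. The worked example shows the model
# how to reason step by step before it answers the real question.
cot_prompt = """Q: A library had 120 books and bought 3 boxes of 40 books each. How many books does it have now?
A: Let's think step by step. 3 boxes * 40 books = 120 new books. 120 + 120 = 240. The answer is 240.

Q: A train travels 60 km per hour for 2.5 hours. How far does it go?
A: Let's think step by step."""
# The LLM completes the reasoning ("60 * 2.5 = 150 km ...") and gives the answer.
print(cot_prompt)
```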

In the previous step, we mentioned two orchestration frameworks, LangChain and LlamaIndex, which can play important roles in this step. They abstract many details related to prompt chains; define interfaces with external APIs (including determining when API calls are needed); retrieve contextual data from vector databases; and maintain memory across multiple LLM calls. They also provide templates for many common applications. Their output is one or a series of prompts that will be submitted to the language model. I have provided some basic knowledge about LangChain in past articles, and there will be more detailed sharing in the future.
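
Here is a minimal sketch of how such an orchestration framework assembles a prompt and calls a model, using the LangChain API roughly as it looked in mid-2023 (PromptTemplate plus LLMChain); newer versions have reorganised these imports, and an OpenAI API key is assumed to be set in the environment.

```python
# Sketch: fill a prompt template with retrieved context and run it through an LLM.
from langchain import PromptTemplate, LLMChain
from langchain.llms import OpenAI   # expects OPENAI_API_KEY in the environment

template = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question based on the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=template)
answer = chain.run(
    context="The lease term is twelve months.",   # normally retrieved from the vector database
    question="How long is the lease?",
)
print(answer)
```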

Prompt execution/inference

Currently, OpenAI leads in the LLM field. Typically, people start building LLM applications with the gpt-4 or gpt-4-32k models. However, when the product enters the scaling phase, they usually switch to gpt-3.5-turbo, which is less accurate but costs roughly one-fiftieth as much as gpt-4 and runs faster. Anthropic's Claude also provides an API, with a context window of up to 100k tokens. Some developers choose open-source models, and cloud services like Databricks, Anyscale, Mosaic, Modal, and RunPod offer corresponding preset tools and services. Hugging Face and Replicate also provide simple APIs and front-end interaction interfaces for running AI applications.
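
For reference, here is a minimal inference call using the OpenAI Python SDK as it existed when this article was written (the pre-1.0 openai.ChatCompletion interface); newer SDK versions use a different client API, and the API key and messages are placeholders.

```python
# Sketch: submit the compiled prompt to a proprietary model API.
import openai

openai.api_key = "YOUR_API_KEY"   # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You answer questions about legal documents."},
        {"role": "user", "content": "Context: The lease term is twelve months.\n\nHow long is the lease?"},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```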

Compared to proprietary models like OpenAI's, open-source LLMs still lag behind, but the gap is gradually narrowing. Examples include Meta's LLaMA, as well as Together, Mosaic, Falcon, and Mistral. Meta is also preparing to open-source LLaMA 2.

Operational tools usually refer to tools used for monitoring, managing, optimizing, or debugging model operation; they are not yet widely adopted among developers. The following types of tools can all be considered "operational tools":

  • Logging and observability tools, such as Weights & Biases, MLflow, PromptLayer, and Helicone, can record model inputs, outputs, and operating state, helping developers understand model behavior.
  • Validation tools, such as Guardrails and Rebuff, can check model outputs and detect prompt injection, helping developers keep responses safe and well-formed.
  • Caching tools, such as Redis, can store model outputs so they can be quickly retrieved when the same request recurs, improving the application's response speed and cost-effectiveness (see the sketch after this list).
  • Security tools can detect and prevent abuse of or attacks on the model, protecting its security and stability.
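
To illustrate the caching idea from the list above, here is a sketch that reuses an earlier answer when exactly the same prompt comes in again; call_llm() is a hypothetical stand-in for the actual model call, and a local Redis server is assumed.

```python
# Sketch: cache LLM outputs keyed by a hash of the prompt to save cost and latency.
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_completion(prompt: str) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode()                # cache hit: no API call needed
    answer = call_llm(prompt)              # hypothetical function wrapping the model API
    cache.set(key, answer, ex=3600)        # keep the answer for one hour
    return answer
```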

The non-model parts of LLM applications are usually hosted on cloud platforms such as Vercel. Additionally, two new hosting approaches are emerging: Steamship provides end-to-end hosting for developers, including LangChain, multi-tenant data contexts, asynchronous tasks, vector storage, and key management; in the other direction, companies like Anyscale and Modal allow developers to host models and Python code in the same place.

AI agent frameworks

If there is an opportunity, I will demonstrate Reflexion to you in the future. However, most AI agent frameworks are currently at the proof-of-concept stage: although they can produce stunning demos, they cannot yet complete tasks reliably and reproducibly. I will closely monitor their development.

LLMs make it possible to create products that were previously impossible to build. Regardless of the application, AI empowers individual developers to build astonishing results in just a few days, surpassing supervised machine learning projects that used to require large teams months of work. a16z's framework can help us make better use of this powerful tool.