Textual Autograd Mechanics: Computation Graphs in Language Optimization

Wait 5 sec.

Table of LinksAbstract and 1. IntroductionTEXTGRAD: Optimizing AI systems by backpropagating text feedbackResults3.1 Code optimization3.2 Solution optimization by test-time training to improve problem solving3.3 Prompt optimization for reasoning3.4 Molecule optimization3.5 Radiotherapy treatment plan optimization Related work Discussion, Acknowledgements, and References\A. TEXTGRAD Details B. Optimizer Extensions C. Code Optimization D. Solution Optimization E. Prompt Optimization F. Molecule Optimization G. Treatment Plan Optimization2 TEXTGRAD: Optimizing AI systems by backpropagating text feedbackBelow we first describe a example to demonstrate what TEXTGRAD looks like for a system made of two LLM calls, and then give the more general form for arbitrarily complex systems.\Warmup: system with two LLM calls\Example computation graph. In traditional automatic differentiation, we compute gradients that provides a direction that would improve the variable with respect to a downstream loss with the chain-rule. For a simple analog, let us look at the simple system:\ \ \where we denote the free parameter to optimize, the prompt, with green, and we use + to denote concatenation of two strings, and use LLM(x) to give x as a prompt to a language model to collect the response. We will use the chain notation:\ \as another way to denote this system, where we make one LLM call to generate a Prediction for a Question using a Prompt, and another LLM call to evaluate this Prediction.\Example gradient computation. In this example system, to improve the Prompt with respect to the evaluation, we instantiate an analog of the backpropagation [19] algorithm through the following:\ \ \Example optimizer. In standard gradient descent, the current value of the variable is combined with the gradients through subtraction, e.g.:\ \Continuing the gradient-based optimization analogy, we use Textual Gradient Descent (TGD):\ A concrete way to instantiate TGD is the following:\ \The general case\The abstraction readily applies to arbitrarily complex systems. Define a computation graph by\ \where v is a variable in the graph, V is the set of all variables in the graph, and SuccessorsOf returns the successors and PredecessorsOf returns the predecessors of a variable. Generally speaking, the value of v can be unstructured data, such as natural language text or images. For most of the results and exposition in this paper, v is natural language text.\Further, let us have fv as the transformation that consumes a set of variables and produces the variable v. For instance, we can use an LLM or Numerical Simulator as a transformation. Since different functions will have different ways to compute gradients and collect feedback for, we will generally use ∇f to denote the gradient function for a function f . For the sake of exposition, we will omit the subscript when the function is obvious.\The gradients are computed by\ \here we collect the set of gradients from all successors of v. Intuitively, we get feedback from every context in which a variable v was used, and aggregate them.\Equation 11 recursively computes the gradients of the downstream objective with respect to the desired variables v in the graph. The ∇f function takes as input the gradients of L with respect to the successors of a given variable v, the value of the variable v and the successors themselves. Note that the final gradient variable comprises a set of contexts and criticisms for any place a variable was used in.\Finally, to update any desired variable v in the graph, we can use an optimizer:\ \which updates the value of v based on its current value and the gradients. For a computation graph where there are n edges, each iteration of optimization performs at most n additional language model calls to compute gradients (1 call using the gradient operator for each edge in the computation graph). For implementations of operations in TEXTGRAD, see Appendix A.\Objective functions\In numerical optimization and automatic differentiation, the objective function is typically a differentiable function, such as mean squared error or cross entropy. In TEXTGRAD, the objective can be a complex and potentially nondifferentiable function, where the domain and codomain of the function can be unstructured data. This choice adds important generality and flexibility to the framework. For instance, we demonstrate that the objective can be specified in natural language text and computed by prompting a language model (§3.3), an output of a code interpreter running unit tests (§3.1), or outputs of molecular simulation engines (§3.4). For instance, a simple loss function for a code snippet can be the following:\ \where we can use this evaluation signal to optimize the code snippet, powered by the well-documented ability of LLMs to simulate human feedback, self-evaluate, and self-improve [10, 28–33].\Instance vs Prompt Optimization\There are two classes of optimization problems we explore.\In instance optimization, we directly treat a solution to a problem—e.g., a code snippet, the solution to a problem or a molecule—as an optimization variable. For instance, in Equation 13, we have a code instance that we would like to improve at test time. Our framework produces the gradients for and directly optimizes the code variable.\In prompt optimization, the goal is to find a prompt that improves the performance of an LLM across multiple queries for a task. For example, we may want to find a system prompt to an LLM that improve the performance on mathematical reasoning questions (see Section 3.3 for examples). In particular, we want the system prompt to generalize, in contrast to instance optimization where the only goal is to improve the solution for a given query at test time.\Crucially, both types of problems can be solved without hand-crafting the framework.\Optimization Techniques\Automatic differentiation is a strong analogy for TEXTGRAD and provides conceptual support for optimization techniques implemented in the framework. Below, we describe some examples.\Batch Optimization: We implement stochastic minibatch gradient descent [18, 34] for prompt optimization. Specifically, after doing a forward pass with multiple instances in a batch and evaluate individual loss terms, we use the tg.sum function to sum the losses (akin to torch.sum). In the backward pass, gradients on variables through individual loss terms are concatenated, mirroring backpropagating through addition.\Constrained Optimization: We use natural language constraints, building on constrained optimization as an analogy [35]. In particular, we use natural language constraints (e.g., such as ‘The last line of your response should be of the following format: ’Answer: $LETTER’ where LETTER is one of ABCD.”) to guide the behavior of the optimizer. We observe that thanks to instruction-tuning [36, 37], language models can follow these simple constraints, although their reliability can reduce with too many constraints [38, 39].\Momentum: We use the analogy to momentum in gradient descent [40, 41]. When optimizing a variable, the TGD optimizer can optionally see the earlier iterations of the variable when making the update. For more details on optimization techniques that are implemented in TEXTGRAD, see Appendix B.\:::infoAuthors:(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University (merty@stanford.edu);(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University (fede@stanford.edu);(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University (tboen@stanford.edu);(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University (shengl@stanford.edu);(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University (zhihuang@stanford.edu);(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub (guestrin@stanford.edu);(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub (jamesz@stanford.edu).::::::infoThis paper is available on arxiv under CC BY 4.0 license.:::\