Improving Mathematical Accuracy in LLMs—New Monthly Updates Series − 1
“The irrationality of a thing is no argument against its existence, rather a condition of it.”—Nietzsche
Series Introduction
In recent years, the development and deployment of Large Language Models (LLMs) have revolutionized the field of artificial intelligence. These models, such as GPT-3, have shown remarkable capabilities in understanding and generating human-like text across various domains. However, a closer examination reveals that while these models excel at many linguistic tasks, they often struggle with mathematical reasoning and with maintaining a high level of accuracy. Mathematical concepts demand precise logical reasoning, symbol manipulation, and an understanding of complex relationships between numbers and equations. LLMs tend to struggle with these aspects because they are trained to “predict the next word/character” (based on context) with increasing accuracy, which is a different objective from writing rigorous mathematical statements. This seems to be a case of Goodhart’s Law, which states that “when a measure becomes a target, it ceases to be a good measure.” Here the measure is how transformers work, i.e. predicting the next word/character/sentence from the given context, and the target is the ability to mathematically and logically manipulate given symbols/data and/or to employ the right theorem/axiom (keeping in mind that all of its conditions must be satisfied “exactly”) in order to derive information or reach a state that was previously unknown and is now proven.
This necessitates an exploration of “what” “understanding” actually means, or rather “how” “understanding” functions, in the hope of imparting similar “logical” abilities to LLMs. Over the next few months, I will be diving into the details, beginning with a literature review of the various paradigms used to date and a brief discussion of each, and hopefully getting to a point where I can conduct experiments based on ideas gathered along the way. Under each header I will provide a summary, mostly made up of direct quotations from papers/articles, with context and/or commentary as and when needed.
Given the ambiguity and subjectivity in defining what exactly “logic”, “understanding”, and “rationality” mean, I will try to be very specific when using these words.
Lastly, all discussions, reviews and comments are appreciated because, after all, this is but an attempt by a child who never got the answer he wanted to the “why” to at least make sense of the “what”.
Month − 1 : August
Before jumping directly to the SOTA, it’s essential to get a basic idea of the previous paradigms. My journey begins with a basic exploration of such paradigms. It is important to note that, in the time frame discussed here, the world does not yet have the computational power for Deep Learning, and we are in the school of thought called “Symbolic AI”.
Newell and Simon’s Logic Theorist
Newell and Simon’s Logic Theorist was an early computer program developed in the mid-1950s that aimed to simulate human problem-solving and deduction using formal logic.
Representation: Logic Theorist used symbolic logic notation to represent mathematical statements, enabling it to manipulate and infer logical relationships between symbols. (Note that here the “basis” of the entire problem solver is this symbolic representation, which is in the same spirit as First Order Predicate Logic, although Logic Theorist itself worked with the propositional logic of Principia Mathematica.)
Inference Rules: The program employed logical inference rules like modus ponens and modus tollens to draw conclusions from given premises, based on fundamental principles of logic. (Again, an important thing to note here is that “inferences” are drawn based on “assumed truths” (axioms) using methods such as predicate calculus)
Problem Solving: Given axioms and a theorem, Logic Theorist followed a step-by-step process, using logical rules to manipulate symbols and derive conclusions.
Search Strategy: Logic Theorist used a heuristic search algorithm to decide which inference rules to apply, evaluating the usefulness of each rule in determining the logical steps.
Proofs and Learning: Logic Theorist aimed to find a sequence of logical steps from axioms to a theorem, producing a proof. If unsuccessful, it adjusted its strategy based on learned insights. The most interesting thing here is that we have a clear notion of what is true and what is not. That makes such systems essentially useless for NLU, which requires subjectivity and flexibility, but it essentially solves the “mathematical accuracy” issue: the system would never output something “wrong” (inconsistent with the axioms). However, we come back to the very problem that machine learning was invented to address: this is yet another algorithm which CANNOT learn. Give it the rules and it will solve everything within the domain it was fed, but real life is not like that, since the “rules” are unknown. Furthermore, it fails to contribute new advances in mathematics, as the “ideas” it can generate are constrained by what its rules were.
After reading about the symbolic paradigm, the first question that came to me was: how do humans do math? Or, more generally, how do humans decide? Drum roll... We... ehhhh... don’t know. It wasn’t surprising to me that human decision making is, to a very large extent, “paradoxical” and not understood. The following is a famous example from decision theory that tries to demonstrate how two completely different answers can both seem “logical”.
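To make the “never outputs something inconsistent with the axioms” point concrete, here is a minimal sketch of a rule-based prover. It is not Logic Theorist’s actual algorithm (which searched heuristically); it is plain forward chaining with modus ponens over propositional atoms, and the names `forward_chain` and `Rule` are my own, purely for illustration.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

# A rule "premises -> conclusion" over propositional atoms (plain strings).
# Both the type name and the encoding are assumptions made for this sketch.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(axioms: Set[str], rules: List[Rule], goal: str) -> Optional[List[str]]:
    """Apply modus ponens until `goal` is derived or nothing new can be added.

    Returns the inference steps taken (not necessarily a minimal proof),
    or None if the goal cannot be derived from the axioms and rules.
    """
    known = set(axioms)
    steps: List[str] = []
    changed = True
    while changed and goal not in known:
        changed = False
        for premises, conclusion in rules:
            # Modus ponens: all premises are known, so the conclusion follows.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                steps.append(f"{', '.join(sorted(premises))} => {conclusion}")
                changed = True
    return steps if goal in known else None
```

Everything the function ever adds to `known` follows from the axioms by a stated rule, which is the accuracy guarantee discussed above. It also shows the limitation: a goal that needs a rule nobody supplied simply comes back as `None`.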
Newcomb’s Problem
Description
Setup: You are presented with two boxes, A and B. Box A contains a visible $1,000 bill. Box B is opaque and either contains nothing or contains a large sum of money, say $1 million.
Choice Point: The Predictor, a highly accurate being with the ability to predict your decisions, has already made its prediction about your choice:
If the Predictor predicted that you will take only Box B (i.e., you’re a “one-boxer”), then Box B is filled with $1 million.
If the Predictor predicted that you will take both Box A and Box B (i.e., you’re a “two-boxer”), then Box B is empty.
Your Decision: You have to decide whether to be a “one-boxer” (take only Box B) or a “two-boxer” (take both Box A and Box B).
Two sides
One-Boxer’s Argument: If the Predictor is indeed accurate, then your decision has already been predicted. In this case, taking only Box B ensures you get the $1 million. If you take both boxes, you’ll end up with only $1,000 because the Predictor predicted you as a two-boxer, leaving Box B empty. Thus, being a one-boxer maximizes your gain. (Evidential Decision Theory)
Two-Boxer’s Argument: The contents of Box B are already fixed by the time you choose. Regardless of the Predictor’s prediction, if you choose to be a two-boxer, you’re guaranteed to receive $1,000 from Box A and possibly an additional $1 million from Box B. Taking only Box B, in case the Predictor predicted you as a one-boxer, forfeits the certain $1,000 from Box A. (Causal Decision Theory)
I hope you get an idea of the paradoxical nature of “rationality”.
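The two arguments can be written down as two different expected-payoff calculations. The sketch below is only an illustration: the function names and the parameters `accuracy` and `p_full` are my own assumptions, since the problem statement only says the Predictor is “highly accurate”.

```python
SMALL = 1_000        # visible amount in Box A (from the setup above)
BIG = 1_000_000      # possible amount in Box B (from the setup above)


def edt_payoff(action: str, accuracy: float) -> float:
    """Evidential view: your choice is evidence about what was predicted.

    `accuracy` is the assumed probability that the prediction matches your choice.
    """
    if action == "one-box":
        return accuracy * BIG
    # Two-boxing: Box B is full only if the Predictor got it wrong.
    return SMALL + (1.0 - accuracy) * BIG


def cdt_payoff(action: str, p_full: float) -> float:
    """Causal view: Box B is already full with probability `p_full`,
    and your choice now cannot change that."""
    in_box_b = p_full * BIG
    return in_box_b if action == "one-box" else in_box_b + SMALL
```

Under the evidential calculation, one-boxing comes out ahead whenever `accuracy` exceeds 0.5005. Under the causal calculation, two-boxing comes out ahead by exactly $1,000 for every value of `p_full`. Both calculations are internally consistent, which is the paradox.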
Next I started to read about the “successor” to the symbolic paradigm...
Symbolic and Sub-Symbolic Paradigms (Connectionist AI, Symbolic AI, and the Brain P. Smolensky (1987))
Symbolic Paradigm
“What all this means in the practice of symbolic AI is that goals, beliefs, knowledge, and so on are all formalized as symbolic structures, for example, Lisp lists (Singly Linked List), which are built of symbols, Lisp atoms, which are each capable of being semantically interpreted in terms of the ordinary concepts we use to conceptualize the domain. Thus, in a medical expert system, we expect to find structures like (IF FEVER THEN (HYPOTHESIZE INFECTION)). These symbolic structures are operated on by symbol manipulation procedures composed of primitive operations like concatenating lists, and extracting elements from lists. According to the symbolic paradigm, it is in terms of such operations that we are to understand cognitive processes” The idea is that a complete/large enough and detailed DAG of causality and action could help us understand the world, and function as an intelligent agent. “The symbolic level that implements knowledge structures is alleged to be exact and complete. That means that lower levels are unnecessary for accurately describing cognition in terms of the semantically interpretable elements”
“In the symbolic approach, symbols (atoms) are used to denote the semantically interpretable entities (concepts). These same symbols are the objects governed by symbol manipulations in the rules that define the system. The entities which are capable of being semantically interpreted are also the entities governed by the formal laws that define the system”
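As a rough sketch of what this looks like in practice, the rule from the quote can be mirrored with nested Python tuples standing in for Lisp lists. The `apply_rules` interpreter and the `RULES` list are hypothetical and not taken from the paper or from any real expert system.

```python
from typing import List, Set, Tuple

# (IF FEVER THEN (HYPOTHESIZE INFECTION)) written as a nested tuple.
# Every element is a symbol that a human can read as an ordinary concept.
Rule = Tuple[str, str, str, Tuple[str, str]]

RULES: List[Rule] = [
    ("IF", "FEVER", "THEN", ("HYPOTHESIZE", "INFECTION")),
]


def apply_rules(observations: Set[str], rules: List[Rule]) -> List[str]:
    """Return the hypotheses produced by every rule whose condition is observed."""
    hypotheses: List[str] = []
    for _if, condition, _then, (action, target) in rules:
        # Pure symbol manipulation: match a symbol, then extract a list element.
        if condition in observations and action == "HYPOTHESIZE":
            hypotheses.append(target)
    return hypotheses
```

Calling `apply_rules({"FEVER"}, RULES)` yields `["INFECTION"]`. The symbols being manipulated (`FEVER`, `INFECTION`) are the same entities that carry the meaning, which is the point of the second quote.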
Sub-symbolic Paradigm
“The subsymbolic level is an attempt to formalize, at some level of abstraction, the kind of processing which occurs in the nervous system. Many of the details of neural structure and function are absent from the subsymbolic level, and the level of description is higher than the neural level. The precise relationship between the neural and subsymbolic levels is still an open research question; but it seems clear that connectionist systems are much closer to neural systems than are symbolic systems.”
One: a connectionist system, at the risk of oversimplification, is the ancestor of what we now know as Neural Networks, which at the time the paper was written (1987) were not computationally practical at scale. Two: we see that Smolensky here starts to shed light on a possible area of exploration of “reasoning” that works at a more fundamental level.
“Note that the sub-symbolic paradigm gives an essentially different role to the neural part of the story: neural structures provide the basis (in some suitably abstract sense) of the formalism that gives the precise description of intelligence, while mental structures enter only into approximate descriptions”
This line, to a very large extent, forms the basis of what I believe. The idea being discussed here is that the neural part of the story is essentially the quantum physics (Fundamental cause) to what we observe such as Abstractions, Concepts and ultimately Intelligence (paralleled to Newtonian Physics)
“(In sub symbolic) The semantically interpreted entities are patterns of activation over a large number of units in the system, whereas the entities manipulated by formal rules (which was the case in Symbolic) are the individual activations of cells in the network. The rules take the form of activation passing rules, of essentially different character from symbol manipulation rules. This describes the particular kind of connectionist system where patterns of activity represent concepts, instead of the activation of individual elements in the network. Therefore, the subsymbolic paradigm involves connectionist systems using so-called distributed representations, as opposed to local representations”
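The local versus distributed distinction can be illustrated with a toy example. The concepts and the activation values below are invented for illustration only; they do not come from the paper or from any trained network.

```python
from typing import Dict, List

CONCEPTS = ["cat", "dog", "car"]


def local_representation(concept: str) -> List[float]:
    """Local: one dedicated unit per concept (a one-hot vector)."""
    return [1.0 if c == concept else 0.0 for c in CONCEPTS]


# Distributed: each concept is a pattern of activation over ALL units,
# and no single unit means anything on its own. Values are made up.
DISTRIBUTED: Dict[str, List[float]] = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9, 0.8],
}


def dot(u: List[float], v: List[float]) -> float:
    """Overlap between two activation patterns."""
    return sum(a * b for a, b in zip(u, v))
```

With local representations every pair of distinct concepts has zero overlap, so “cat” is no closer to “dog” than to “car”. With the distributed patterns, the overlap between “cat” and “dog” is larger than between “cat” and “car”, so similarity lives in the pattern as a whole.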
“That crucial principle of the sub symbolic level, the Statistical Connection (Best Fit Principle): given an input, connectionist system outputs a set of inferences that, as a whole, give a best fit to the input, in a statistical sense defined by the statistical knowledge stored in the system’s connections. In this vague form, this principle is generally true for connectionist systems. But it is exactly true in a precise sense, at least in an idealized limit, for a certain class of systems in what can be called harmony theory.” :
“While eating at a fancy restaurant, you get a headache. Without effort, you ask the waitress if she could possibly get you an aspirin. How is this plan created? You have never had a headache in a restaurant before.” … What kind of cognitive system is capable of this degree of flexibility?
Suppose that the knowledge base of the system does not consist of a set of scripts like the restaurant script and the headache script. Suppose instead that the knowledge base is a set of knowledge atoms that “configure themselves dynamically in each context” to form tailor-made scripts. This is the fundamental idea formalized in harmony theory.
After gaining brief insights into the symbolic and sub-symbolic paradigms and their respective strengths and weaknesses, it was time to understand their implications for the buzzword floating around these days… Deep learning!
Reconciling deep learning with symbolic artificial intelligence: representing objects and relations (Marta Garnelo and Murray Shanahan)
Compositionality
“In linguistics, the principle of compositionality asserts that the meaning of a sentence is a function of the meaning of its parts and the way those parts are put together. Compositionality tends to go hand-in-hand with combinatorial structure, which in the case of language means combinatorial syntax — infinitely many grammatically correct sentences can be formed by combining syntactic elements according to recursive rules of composition”
“For an agent confronted by a world that itself exhibits combinatorial structure, a compositional system of representation has the potential to confer the ability to form abstractions and to generalise far beyond its own experience, because representations of familiar objects and relations can enter into novel combinations” I certainly smell some basis of causality in the idea of compositionality. On further thought, using the flexible definition of compositionality (to form abstractions and to generalise far beyond one’s own experience), we see that this is how humans solve complicated mathematical questions too! We see a lot of patterns and questions, and upon seeing an entirely novel one, based on how well we have “understood” or rather “internalized” our education, we come up with “intuitions”, which for the most part are along the lines of what we studied: a combinatorial structure of multiple abstractions and learnings trying to fit the pieces of the new problem.
“Once trained, the intermediate layers in a deep learning system can be thought of as representations of the training data. However, compositionality is not an inherent property of these learned representations. On the contrary, if the network architecture does not enforce compositionality, and in the absence of any pressure towards learning compositional structure, a gradient descent learning method (such as backpropagation) will tend to produce distributed representations, whose component parts have little or no meaning in isolation.” The line “compositionality is not an inherent property of these learned representations” is something that I really want to look into further. It gives a pretty solid insight into what might be the underlying difference between the “abstractions” a human forms in their mind and those formed at different stages of a neural network.
This, combined with the ideas from harmony theory, leads us to interesting realms. We start to see that abstractions and concepts in our mind are neurons that “configure themselves dynamically in each context”, which means that it is this configuration whose compositionality leads to the abstractions. That essentially means that what one believes, thinks, or (sometimes) even feels is a set of different abstractions intermingling in a certain way. This leads us to the conclusion that what we call “logic” as a society is simply yet another “learnt” abstraction, wherein the primary thing “learnt” is the abstraction’s 100% accuracy. Mathematics is yet another tool devised by humans that we “learn” to use… We are entering philosophical realms here. We are not here to debate whether there is an underlying mathematics to the universe, but rather to ask: what if teaching logic to computers meant teaching them to learn “abstractions” better?
This concludes my readings for the month. In the upcoming month, I intend to tinker with the idea of compositionality and understand its nature and measurement, and, well, can we enforce it? And if we can, ask the very important question: does it even matter?
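A minimal way to see compositionality at work is a recursive evaluator, where the meaning of an expression is computed only from the meanings of its parts and the way they are combined. The nested-tuple encoding and the names `MEANINGS` and `meaning` are my own assumptions for this sketch; arithmetic is used only because its “meanings” are easy to check.

```python
import operator
from typing import Callable, Dict, Tuple, Union

# An expression is either a number or (operator_name, left_expr, right_expr).
Expr = Union[int, Tuple]

# Meanings of the familiar parts; a hypothetical two-word vocabulary.
MEANINGS: Dict[str, Callable[[int, int], int]] = {
    "add": operator.add,
    "mul": operator.mul,
}


def meaning(expr: Expr) -> int:
    """Meaning of the whole = function of the meanings of the parts."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return MEANINGS[op](meaning(left), meaning(right))
```

For example, `meaning(("add", ("mul", 2, 3), 4))` evaluates to 10. An expression never seen before, such as `("mul", ("add", 1, 2), ("add", 3, 4))`, is handled with no new rules, because the familiar parts simply enter into a novel combination.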
Future Plans
Continuing to read the paper Reconciling deep learning with symbolic artificial intelligence: representing objects and relations by Google DeepMind.
Reading about objective measurement of compositionality in representation learning
Researching the question: are there ways to enforce compositionality to produce disentangled representations?
Can a focus on step-wise finetuning yield better compositionality?
Reading and understanding the recent papers on “Tree of Thought” and “Chain of Thought”
Well, if only we knew where the literature survey would lead us....
On an ending note, and at the risk of being slightly extreme: what if “feelings” are (sometimes) just something that is “contextually learnt”?
I think this lacks justification why the entire approach is a good idea. Improving mathematical accuracy in LLMs seems like a net negative to me for the same reason that generic capability improvements are a net negative.