Tomáš Gavenčiak

Karma: 157

A researcher in CS theory, AI safety, and other topics.

Measuring Beliefs of Language Models During Chain-of-Thought Reasoning

Apr 18, 2025, 10:56 PM
1 point
0 comments · 13 min read · LW link

Announcing Human-aligned AI Summer School

May 22, 2024, 8:55 AM
50 points
0 comments · 1 min read · LW link
(humanaligned.ai)

InterLab – a toolkit for experiments with multi-agent interactions

Jan 22, 2024, 6:23 PM
69 points
0 comments · 8 min read · LW link
(acsresearch.org)

Sparsity and interpretability?

Jun 1, 2020, 1:25 PM
41 points
3 comments · 7 min read · LW link

How can Interpretability help Alignment?

May 23, 2020, 4:16 PM
37 points
3 comments · 9 min read · LW link