I think GPT-4 and friends are missing the cognitive machinery and grid representations to make this work. You’re also making the task harder by giving them a less accessible interface.
My guess is they have pretty well-developed what/where feature detectors for small numbers of objects, but grids and visuospatial problems are not well handled.
The problem interface is also not accessible:
- There's a lot of extra detail to parse
  - the grid is made up of gridlines and colored squares
  - the colored squares of fallen pieces serve no purpose but to confuse the model
- A more accessible interface would have a pixel grid with three colors for empty/filled/falling
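One way to encode that kind of interface, as a minimal sketch (the 0/1/2 coding, grid size, and piece placement are my own illustration, not from the benchmark):

```python
EMPTY, FILLED, FALLING = 0, 1, 2

# 10 rows x 6 columns, row 0 at the top; no gridlines, no per-piece colors
grid = [[EMPTY] * 6 for _ in range(10)]
grid[9][0] = grid[9][1] = grid[8][0] = FILLED   # a few settled squares
grid[2][2] = grid[2][3] = FALLING               # top row of a falling S piece
grid[3][1] = grid[3][2] = FALLING               # bottom row of the S piece
```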
Rather than jump directly to Tetris with extraneous details, you might want to check for relevant skills first:
- predict the grid end state after a piece falls
- model rotation of a piece
Rotation works fine for small grids.
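For reference, rotating a small piece grid is just a reverse plus a transpose; a minimal sketch of the kind of operation being tested (my own framing and representation):

```python
def rotate_cw(piece):
    """Rotate a piece (list of strings, 'x' = filled) 90 degrees clockwise."""
    return ["".join(row) for row in zip(*piece[::-1])]

s_piece = ["_xx",
           "xx_"]
print(rotate_cw(s_piece))   # ['x_', 'xx', '_x']
```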
Predicting drop results:
- Row-first representations give mediocre results
  - GPT-4 can't reliably isolate the Nth token in a line or understand relationships between Nth tokens across lines
  - dropped squares are in the right general area
  - the general area of the drop gets mangled
  - rows do always have 10 cells/row
- Column-first representations worked pretty well.
I’m using a text interface where the grid is represented as 1 token/square. Here’s an example:
0 x _ _ _ _ _
1 x x _ _ _ _
2 x x _ _ _ _
3 x x _ _ _ _
4 _ x _ _ o o
5 _ _ _ o o _
6 _ _ _ _ _ _
7 _ _ _ _ _ _
8 x x _ _ _ _
9 x _ _ _ _ _
GPT-4 can successfully predict the end state after the S piece falls, though it works better if it first isolates the relevant lines, works with those, and then puts everything back together:
Row 4: _ x o o _ _
Row 5: _ o o _ _ _
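For concreteness, a minimal sketch of that drop in the column-first representation (the compact column strings and the drop helper are my own; the piece falls by the gap of its tightest column):

```python
def drop(columns):
    """Settle the falling piece ('o') onto the stack ('x') in a column-first
    grid: each string is one column, index 0 is the bottom of that column."""
    cols = [list(c) for c in columns]
    # The piece falls until its tightest column touches the stack.
    fall = min(
        min(i for i, cell in enumerate(col) if cell == "o")                     # lowest 'o'
        - max((i + 1 for i, cell in enumerate(col) if cell == "x"), default=0)  # stack top
        for col in cols if "o" in col
    )
    for col in cols:
        for i in range(len(col)):
            if col[i] == "o":
                col[i], col[i - fall] = "_", "o"
    return ["".join(col) for col in cols]

before = ["x_____", "xx____", "xx____", "xx____", "_x__oo",
          "___oo_", "______", "______", "xx____", "x_____"]
after = drop(before)
print(after[4], after[5])   # _xoo__ _oo___
```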
Making things easier:
- Columns as lines keeps verticals together
  - important for executing simple strategies
  - gravity acts vertically
- Rows as lines is better for seeing voids blocking lines from being eliminated
  - not required for simple strategies
Row-based representations with rows output from top to bottom suffer from prediction errors when dropping a piece. A common error is predicting a dropped piece's square in a row that is too high and then duplicating such squares. Output that flips the state upside down, with the lower rows first, might help in much the same way that it helps to do addition starting with the least significant digit.
This conflicts with the model's innate tendency to make gravity point downwards on the page.
Possibly adding coordinates to each cell could help.
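A minimal sketch of what such an output format could look like, assuming a row-first grid stored top row first (the labelling scheme is my own choice):

```python
def render_bottom_up(rows):
    """rows: list of strings, top row first.  Print the bottom row first,
    prefixing each row with its height above the floor."""
    for height, row in enumerate(reversed(rows)):
        print(f"{height:2d} {' '.join(row)}")

render_bottom_up(["______",
                  "_xx___",
                  "xxx__x"])
#  0 x x x _ _ x
#  1 _ x x _ _ _
#  2 _ _ _ _ _ _
```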
The easiest route to mediocre performance is likely a 1.5d approach:
- present the game state in column-first form
- find max_height[col] over all columns
- find step[n] = max_height[n+1] - max_height[n]
- pattern-match the step[n] series to find a hole the current piece can fit into
This breaks the task down into subtasks the model can do (string manipulation, string matching, single-digit addition/subtraction), though it isn't very satisfying from a model-competence perspective.
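For concreteness, a minimal sketch of that pipeline (the column strings, the surface_steps/find_slot names, and the flush-fit matching rule are my own choices for illustration):

```python
def surface_steps(columns):
    """columns: column-first grid, one string per column, index 0 = bottom,
    'x' = filled.  Return max_height per column and the step series."""
    max_height = [max((i + 1 for i, c in enumerate(col) if c == "x"), default=0)
                  for col in columns]
    step = [max_height[n + 1] - max_height[n] for n in range(len(max_height) - 1)]
    return max_height, step

def find_slot(step, underside):
    """underside: relative height of the piece's lowest cell in each of its
    columns (a vertically rotated S piece is [1, 0], for example).  The piece
    sits flush where the surface steps match the underside's steps."""
    want = [underside[k + 1] - underside[k] for k in range(len(underside) - 1)]
    w = len(want)
    for n in range(len(step) - w + 1):
        if step[n:n + w] == want:
            return n
    return None

cols = ["x_____", "xx____", "x_____", "______", "______", "x_____"]
heights, steps = surface_steps(cols)
print(heights)                   # [1, 2, 1, 0, 0, 1]
print(steps)                     # [1, -1, -1, 0, 1]
print(find_slot(steps, [1, 0]))  # 1 -> a vertical S sits flush on columns 1-2
```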
Interestingly, the web interface version really wants to use Python instead of solving the problem directly.
Thanks for a lot of great ideas!
We tried cutting out the fluff of many colors and having all tetrominoes be one color, but that didn't seem to help much (though we didn't try making the falling tetromino a different color from the filled squares). We also tried simplifying it to a 10x10 grid rather than 10x20, but that didn't seem to help much either.
We also thought of adding coordinates, but we ran out of the time we had allotted for this project and postponed that indefinitely. As it stands, it is not very likely we'll do further variations on Tetris because we're busy with other things, but we'd certainly appreciate any pull requests, should they come.