Louka Ewington-Pitsos

Karma: 46

Louka Ewington-Pitsos Sep 20, 2024, 1:51 AM
3 points
0
in reply to: Brendan Long’s comment on: Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails
It seems like it’s possible right?

Interestingly a similar tool run by a very smart guy (https://www.creatorml.com/) is apparently about to shut down, so it might not be a financially sustainable thing to try.

Louka Ewington-Pitsos Sep 6, 2024, 11:15 PM
1 point
0
on: Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
I couldn’t find a link to the code in the article so in case anyone else wants to try to replicate I think this is it: https://github.com/HugoFry/mats_sae_training_for_ViTs

Louka Ewington-Pitsos Aug 19, 2024, 4:06 AM
1 point
0
in reply to: Louka Ewington-Pitsos’s comment on: Efficient Dictionary Learning with Switch Sparse Autoencoders
Just to close the loop on this one, the official huggingface transformers library just uses a for-loop to achieve MoE. I also implemented a version myself using a for loop and it’s much more efficient than either vanilla matrix multiplication or that weird batch matmul I write up there for large latent and batch sizes.

Louka Ewington-Pitsos Aug 13, 2024, 9:39 AM

1 point

in reply to: Louka Ewington-Pitsos’s comment on: Efficient Dictionary Learning with Switch Sparse Autoencoders

wait a minute… could you just...

you don’t just literally do this do you?

input = torch.tensor([
    [1, 2],
    [1, 2],
    [1, 2],
]) # (bs, input_dim)


enc_expert_1 = torch.tensor([
    [1, 1, 1, 1],
    [1, 1, 1, 1],

])
enc_expert_2 = torch.tensor([
    [3, 3, 0, 0],
    [0, 0, 2, 0],
])



dec_expert_1 = torch.tensor([
    [ -1, -1],
    [ -1, -1],
    [ -1, -1],
    [ -1, -1],
])

dec_expert_2 = torch.tensor([
    [-10, -10,],
    [-10, -10,],
    [-10, -10,],
    [-10, -10,],

])

def moe(input, enc, dec, nonlinearity):
    input = input.unsqueeze(1)
    latent = torch.bmm(input, enc)

    recon = torch.bmm(nonlinearity(latent, dec))

    return recon.squeeze(1), latent.squeeze(1)


# not this! some kind of actual routing algorithm, but you end up with something similar
enc = torch.stack([enc_expert_1, enc_expert_2, enc_expert_1])
dec = torch.stack([dec_expert_1, dec_expert_2, dec_expert_1])

nonlinearity = torch.nn.ReLU()
recons, latent = moe(input, enc, dec, nonlinearity)

This must in some way be horrifically inefficient, right?

Louka Ewington-Pitsos Aug 13, 2024, 2:15 AM
1 point
0
on: Efficient Dictionary Learning with Switch Sparse Autoencoders
Can I ask what you used to implement the MOE routing? Did you use megablocks? I would love to expand on this research but I can’t find any straightforward implementation of efficient pytorch MOE routing online.
Do you simply iterate over each max probability expert every time you feed in a batch?

Louka Ewington-Pitsos Jun 30, 2024, 12:33 AM
1 point
0
on: Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.
This is dope, thank you for your service. Also, can you hit us with your code on this one? Would love to reproduce.