Louka Ewington-Pitsos Aug 19, 2024, 4:06 AM
1 point
0
in reply to: Louka Ewington-Pitsos’s comment on: Efficient Dictionary Learning with Switch Sparse Autoencoders
Just to close the loop on this one: the official Hugging Face transformers library just uses a for-loop to achieve MoE. I also implemented a version myself using a for-loop, and for large latent and batch sizes it’s much more efficient than either vanilla matrix multiplication or that weird batch matmul I wrote in my earlier comment.
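For anyone wanting something concrete, here is a minimal sketch of what a for-loop version can look like. It is not the transformers code and not the Switch SAE authors' code; the function name `switch_sae_forward`, the linear `router` matrix and the plain argmax (top-1) routing are all assumptions made for illustration, and it skips details such as scaling by router probability.

```python
import torch

def switch_sae_forward(x, enc, dec, router):
    """Top-1 routed autoencoder pass using a for-loop over experts (hypothetical helper).

    x:      (batch, input_dim)
    enc:    (n_experts, input_dim, latent_dim)
    dec:    (n_experts, latent_dim, input_dim)
    router: (input_dim, n_experts), an assumed linear router
    """
    # Pick one expert per sample: the one with the highest router logit.
    expert_idx = (x @ router).argmax(dim=-1)          # (batch,)
    recon = torch.zeros_like(x)
    latent = x.new_zeros(x.shape[0], enc.shape[2])
    for e in range(enc.shape[0]):
        rows = (expert_idx == e).nonzero(as_tuple=True)[0]
        if rows.numel() == 0:
            continue                                   # no sample chose this expert
        h = torch.relu(x[rows] @ enc[e])               # one dense matmul per expert
        latent[rows] = h
        recon[rows] = h @ dec[e]
    return recon, latent, expert_idx

torch.manual_seed(0)
x = torch.randn(8, 2)
enc = torch.randn(3, 2, 4)
dec = torch.randn(3, 4, 2)
router = torch.randn(2, 3)
recon, latent, expert_idx = switch_sae_forward(x, enc, dec, router)
```

Each expert does a single ordinary matmul on only the rows routed to it, so no expert weights get copied per sample.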

Louka Ewington-Pitsos 13 Aug 2024 9:39 UTC

1 point

in reply to: Louka Ewington-Pitsos’s comment on: Efficient Dictionary Learning with Switch Sparse Autoencoders

wait a minute… could you just...

you don’t just literally do this do you?

import torch

input = torch.tensor([
    [1, 2],
    [1, 2],
    [1, 2],
]) # (bs, input_dim)


enc_expert_1 = torch.tensor([
    [1, 1, 1, 1],
    [1, 1, 1, 1],

])
enc_expert_2 = torch.tensor([
    [3, 3, 0, 0],
    [0, 0, 2, 0],
])



dec_expert_1 = torch.tensor([
    [ -1, -1],
    [ -1, -1],
    [ -1, -1],
    [ -1, -1],
])

dec_expert_2 = torch.tensor([
    [-10, -10,],
    [-10, -10,],
    [-10, -10,],
    [-10, -10,],

])

def moe(input, enc, dec, nonlinearity):
    input = input.unsqueeze(1)
    latent = torch.bmm(input, enc)

    recon = torch.bmm(nonlinearity(latent), dec)

    return recon.squeeze(1), latent.squeeze(1)


# not this! some kind of actual routing algorithm, but you end up with something similar
enc = torch.stack([enc_expert_1, enc_expert_2, enc_expert_1])
dec = torch.stack([dec_expert_1, dec_expert_2, dec_expert_1])

nonlinearity = torch.nn.ReLU()
recons, latent = moe(input, enc, dec, nonlinearity)

This must in some way be horrifically inefficient, right?
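A rough sketch of where the cost would come from, assuming the per-sample weights are gathered by indexing a stacked expert tensor with routing indices. The sizes and the random `expert_idx` are made up and stand in for a real router.

```python
import torch

n_experts, input_dim, latent_dim, bs = 4, 8, 16, 32
enc_all = torch.randn(n_experts, input_dim, latent_dim)
x = torch.randn(bs, input_dim)
expert_idx = torch.randint(0, n_experts, (bs,))   # stand-in for a real router

# Advanced indexing materialises one full encoder matrix per sample.
enc_per_sample = enc_all[expert_idx]              # (bs, input_dim, latent_dim)
latent = torch.bmm(x.unsqueeze(1), enc_per_sample).squeeze(1)

# The gathered weights hold bs / n_experts times as many elements as the
# shared expert weights, so memory traffic grows with the batch size.
shared_elems = enc_all.numel()
gathered_elems = enc_per_sample.numel()
print(gathered_elems / shared_elems)
```

So the arithmetic itself is not wasted, but the weight copy grows with batch size times expert size, which is presumably what hurts at large latent and batch sizes.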

Louka Ewington-Pitsos 13 Aug 2024 2:15 UTC
1 point
0
on: Efficient Dictionary Learning with Switch Sparse Autoencoders
Can I ask what you used to implement the MoE routing? Did you use MegaBlocks? I would love to expand on this research, but I can’t find any straightforward implementation of efficient PyTorch MoE routing online.
Do you simply iterate over each max probability expert every time you feed in a batch?

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos 12 Jul 2024 5:37 UTC

2 points

0 comments · 12 min read · LW link

Louka Ewington-Pitsos 30 Jun 2024 0:33 UTC
1 point
0
on: Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.
This is dope, thank you for your service. Also, can you hit us with your code on this one? Would love to reproduce.

Louka Ewington-Pitsos

A suite of Vision Sparse Autoencoders

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Faithful vs Interpretable Sparse Autoencoder Evals