Cool, Facebook is also on this apparently: https://x.com/PicturesFoIder/status/1840677517553791440
This might be worth pinning as a top-level post.
The amount of entropy in a given organism stays about the same, though I guess you could argue it increases as the organism grows in size. Reason: The organism isn’t mutating over time to become made of increasingly high entropy stuff, nor is it heating up. The entropy has to stay within an upper and lower bound. So over time the organism will increase entropy external to itself, while the internal entropy doesn’t change very much, maybe just fluctuates within the bounds a bit.
It’s probably better to talk about entropy per unit mass, rather than entropy density. Though mass conservation isn’t an exact physical law, it’s approximately true for the kinds of stuff that usually happen on Earth. Whereas volume isn’t even approximately conserved. And in those terms, 1kg of gas should have more entropy than 1kg of condensed matter.
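For example (rough numbers from standard thermodynamic tables, so treat them as approximate): liquid water has a standard molar entropy of about 70 J/(mol·K) versus about 189 J/(mol·K) for water vapour, which works out to roughly 3.9 kJ/(kg·K) for the liquid and 10.5 kJ/(kg·K) for the gas.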
Speaking of which, I wonder if multi-modal transformers have started being used by blind people yet. Since we have models that can describe images, I wonder if it would be useful for blind people to have a device with a camera and a microphone and a little button one can press to get it to describe what the camera is seeing. Surely there are startups working on this?
Found this paper on insecticide costs: https://sci-hub.st/http://dx.doi.org/10.1046/j.1365-2915.2000.00262.x
It’s from 2000, so anything listed here would be out of patent today.
hardening voltage transformers against ionising radiation
Is ionization really the mechanism by which transformers fail in a solar storm? I thought it was that changes in the Earth’s magnetic field induced large currents in long transmission lines, overloading the transformers.
Sorry for the self-promotion, but some folks may find this post relevant: https://www.lesswrong.com/posts/uDXRxF9tGqGX5bGT4/logical-share-splitting (ctrl-F for “Application: Conditional prediction markets”)
tldr: Gives a general framework that would allow people to make this kind of trade with only $N in capital, just as a natural consequence of the trading rules of the market.
Anyway, I definitely agree that Manifold should add the feature you describe! (As for general logical share splitting, well, it would be nice, but probably far too much work to convert the existing codebase over.)
IMO, a very good response, which Eliezer doesn’t seem to be interested in making as far as I can tell, is that we should not be making the analogy
natural selection <--> gradient descent
but rather:
human brain learning algorithm <--> gradient descent ; natural selection <--> us trying to build AI
So here, the striking thing is that evolution failed to solve the alignment problem for humans. I.e. we have a prior example of strongish general intelligence being created, but no prior examples of strongish general intelligence being aligned. Evolution was strong enough to do one but not the other. It’s not hopeless, because we should generally consider ourselves smarter than evolution, but on the other hand, evolution has had a very long time to work and it does frequently manage things that we humans have not been able to replicate. And also, it provides a small amount of evidence against “the problem will be solved with minor tweaks to existing algorithms” since generally minor tweaks are easier for evolution to find than ideas that require many changes at once.
Idle Speculations on Pipeline Parallelism
People here might find this post interesting: https://yellow-apartment-148.notion.site/AI-Search-The-Bitter-er-Lesson-44c11acd27294f4495c3de778cd09c8d
The author argues that search algorithms will play a much larger role in AI in the future than they do today.
I remember reading the EJT post and left some comments there. The basic conclusions I arrived at are:
The transitivity property is actually important and necessary: one can construct money-pump-like situations if it isn’t satisfied. See this comment.
If we keep transitivity, but not completeness, and follow a strategy of not making choices inconsistent with our previous choices, as EJT suggests, then we no longer have a single consistent utility function. However, it looks like the behaviour can still be roughly described as “picking a utility function at random, and then acting according to that utility function”. See this comment.
In my current thinking about non-coherent agents, the main toy example I like to think about is the agent that maximizes some combination of the entropy of its actions and their expected utility, i.e. the probability of taking an action $a$ is proportional to $e^{\beta U(a)}$, up to a normalization factor. By tuning $\beta$ we can affect whether the agent cares more about entropy or utility. This has a great resemblance to RLHF-finetuned language models. They’re trained to both achieve a high rating and to not have too great an entropy with respect to the prior implied by pretraining.
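As a concrete illustration (my own sketch; the utility values and $\beta$ settings below are made up for the example), such an agent's action distribution is just a softmax over utilities:

import numpy as np

def entropy_utility_policy(utilities, beta):
    # probability of each action is proportional to exp(beta * U(a))
    logits = beta * np.asarray(utilities, dtype=float)
    logits -= logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()        # normalize

utilities = [1.0, 2.0, 3.0]
print(entropy_utility_policy(utilities, beta=0.1))   # nearly uniform: entropy dominates
print(entropy_utility_policy(utilities, beta=10.0))  # nearly deterministic: utility dominates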
If you’re working with multidimensional tensors (eg. in numpy or pytorch), a helpful pattern is often to use pattern matching to get the sizes of various dimensions. Like this:
batch, chan, w, h = x.shape
And sometimes you already know some of these dimensions, and want to assert that they have the correct values. Here is a convenient way to do that. Define the following class and single instance of it:

class _MustBe:
    """ class for asserting that a dimension must have a certain value.
        the class itself is private, one should import a particular object,
        "must_be" in order to use the functionality. example code:
        `batch, chan, must_be[32], must_be[32] = image.shape` """
    def __setitem__(self, key, value):
        assert key == value, "must_be[%d] does not match dimension %d" % (key, value)

must_be = _MustBe()
This hack overrides index assignment and replaces it with an assertion. To use, import must_be from the file where you defined it. Now you can do stuff like this:

batch, must_be[3] = v.shape
must_be[batch], l, n = A.shape
must_be[batch], must_be[n], m = B.shape
...
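And if one of the asserted sizes doesn't match, the failure is immediate and points at the offending dimension. A quick illustration (the array and sizes here are made up for the example):

import numpy as np
# (with must_be imported from wherever you defined it)

x = np.zeros((8, 3, 32, 32))
batch, must_be[3], must_be[32], must_be[32] = x.shape   # passes, batch == 8
batch, must_be[4], w, h = x.shape                       # AssertionError: must_be[4] does not match dimension 3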
Linkpost for: https://pbement.com/posts/must_be.html
Oh, very cool, thanks! Spoiler tag in markdown is:
:::spoiler text here :::
Heh, sure.
Promote $f$ from a function to a linear operator on the space of functions, $\hat{f}$. The action of this operator is just “multiply by $f$”. We’ll similarly define $\hat{F}_1, \hat{F}_2, \dots$, meaning to multiply by the first, second integral of $f$, etc.

Observe: $\hat{f} = D \hat{F}_1 - \hat{F}_1 D$, where $D = \frac{d}{dx}$ (this is just the product rule applied to $F_1 u$). Rearranging, $D^{-1} \hat{f} = \hat{F}_1 - D^{-1} \hat{F}_1 D$.

Now we can calculate what we get when applying this identity $n$ times. The calculation simplifies when we note that all terms are of the form $\pm \hat{F}_k D^{k-1}$. Result:

$D^{-1} \hat{f} = \hat{F}_1 - \hat{F}_2 D + \hat{F}_3 D^2 - \dots + (-1)^{n-1} \hat{F}_n D^{n-1} + (-1)^n D^{-1} \hat{F}_n D^n$

Now we apply the above operator to the polynomial $p$:

$\int f(x)\, p(x)\, dx = F_1 p - F_2 p' + F_3 p'' - \dots$

The sum terminates because a polynomial can only have finitely many derivatives.
Use integration by parts:

$\int p(x) f(x)\, dx = p(x) F_1(x) - \int p'(x) F_1(x)\, dx$

Then $p'$ is another polynomial (of smaller degree), and $F_1$ is another “nice” function, so we recurse.
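As a quick sanity check (my own example, not from the original comment), take $p(x) = x^2$ and $f(x) = e^x$, so that every iterated integral $F_k$ is just $e^x$. Both methods give $\int x^2 e^x \, dx = x^2 e^x - 2x e^x + 2 e^x + C = (x^2 - 2x + 2) e^x + C$, which you can check by differentiating.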
Other people have mentioned sites like Mechanical Turk. Just to add another thing in the same category, apparently now people will pay you for helping train language models:
https://www.dataannotation.tech/faq?
Haven’t tried it yet myself, but a roommate of mine has and he seems to have had a good experience. He’s mentioned that sometimes people find it hard to get assigned work by their algorithm, though. I did a quick search to see what their reputation was, and it seemed pretty okay:
Linkpost for: https://pbement.com/posts/threads.html
Today’s interesting number is 961.
Say you’re writing a CUDA program and you need to accomplish some task for every element of a long array. Well, the classical way to do this is to divide up the job amongst several different threads and let each thread do a part of the array. (We’ll ignore blocks for simplicity, maybe each block has its own array to work on or something.) The method here is as follows:
for (int i = threadIdx.x; i < array_len; i += 32) {
    arr[i] = ...;
}
So the threads make the following pattern (if there are 8 threads):
⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫
This is for an array of length 26. We can see that the work is split as evenly as possible between the threads, except that threads 0 and 1 (black and brown) have to process the last two elements of the array while the rest of the threads have finished their work and remain idle. This is unavoidable because we can’t guarantee that the length of the array is a multiple of the number of threads. But this only happens at the tail end of the array, and for a large number of elements, the wasted effort becomes a very small fraction of the total. In any case, each thread will loop $\lceil 26/8 \rceil = 4$ times, though it may be idle during the last loop while it waits for the other threads to catch up.
One may be able to spend many happy hours programming the GPU this way before running into a question: What if we want each thread to operate on a contiguous area of memory? (In most cases, we don’t want this.) In the previous method (which is the canonical one), the parts of the array that each thread worked on were interleaved with each other. Now we run into a scenario where, for some reason, the threads must operate on contiguous chunks. “No problem,” you say, “we simply need to break the array into chunks and give a chunk to each thread.”
const int chunksz = (array_len + blockDim.x - 1)/blockDim.x;
for (int i = threadIdx.x*chunksz; i < (threadIdx.x + 1)*chunksz; i++) {
    if (i < array_len) {
        arr[i] = ...;
    }
}
If we size the chunks at 3 items, that won’t be enough ($8 \times 3 = 24 < 26$), so again we need $\lceil 26/8 \rceil = 4$ items per chunk. Here is the result:
⬛⬛⬛⬛🟫🟫🟫🟫🟥🟥🟥🟥🟧🟧🟧🟧🟨🟨🟨🟨🟩🟩🟩🟩🟦🟦
Beautiful. Except you may have noticed something missing. There are no purple squares. Though thread 6 is a little lazy and doing 2 items instead of 4, thread 7 is doing absolutely nothing! It’s somehow managed to fall off the end of the array.
Unavoidably, some threads must be idle for a total of $I = T\lceil N/T \rceil - N$ loops, where $N$ is the array length and $T$ is the number of threads. This is the conserved total amount of idleness. With the first method, the idleness is spread out across the threads. Mathematically, $I$ can be no greater than $T - 1$ regardless of array length and thread number, and so each thread will be idle for at most 1 loop. But in the contiguous method, the idleness is concentrated in the last threads. There is nothing mathematically impossible about having $I$ as big as $\lceil N/T \rceil$ or bigger, and so it’s possible for an entire thread to remain unused. Multiple threads, even. Eg. take $N = 9$ and $T = 8$:
⬛⬛🟫🟫🟥🟥🟧🟧🟨
3 full threads are unused there! Practically, this shouldn’t actually be a problem, though. The number of serial loops is still the same, and the total number of idle loops is still the same. It’s just distributed differently. The reasons to prefer the interleaved method to the contiguous method would be related to memory coalescing or bank conflicts. The issue of unused threads would be unimportant.
We don’t always run into this effect. If $N$ is a multiple of $T$, all threads are fully utilized. Also, we can guarantee that there are no unused threads for $N$ larger than a certain maximal value. Namely, take $N = (T-1)^2$: then the chunk size is $\lceil (T-1)^2/T \rceil = T-1$ and so the idleness is $T(T-1) - (T-1)^2 = T-1$, exactly enough for the last thread to go unused. But if $N$ is larger than this, then one can show that all threads must be used at least a little bit.
So, if we’re using 32 CUDA threads, then when the array size is 961, the contiguous processing method will leave thread 31 idle. And 961 is the largest array size for which that is true.
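A quick brute-force check of that claim (my own sketch in Python, not part of the original post; $N$ and $T$ follow the notation above):

T = 32                          # number of threads
idle_last_thread = []
for N in range(1, 2000):        # array lengths to test
    chunksz = (N + T - 1) // T              # ceil(N / T), same as the CUDA code
    if chunksz * (T - 1) >= N:              # last thread's start index falls at or past the end
        idle_last_thread.append(N)
print(max(idle_last_thread))    # prints 961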
So once that research is finished, assuming it is successful, you’d agree that many worlds would end up using fewer bits in that case? That seems like a reasonable position to me, then! (I find the partial-trace kinds of arguments that people make pretty convincing already, but it’s reasonable not to.)
MW theories have to specify when and how decoherence occurs. Decoherence isn’t simple.
They don’t actually. One could equally well say: “Fundamental theories of physics have to specify when and how increases in entropy occur. Thermal randomness isn’t simple.” This is wrong because, once you’ve described the fundamental laws, if they happen to be reversible and also aren’t too simple, then increasing entropy from a low-entropy initial state is a natural consequence of those laws. Similarly, decoherence is a natural consequence of the laws of quantum mechanics (with a not-too-simple Hamiltonian) applied to a low entropy initial state.
Let’s say we have a bunch of datapoints in $\mathbb{R}^n$ that are expected to lie on some lattice, with some noise in the measured positions. We’d like to fit a lattice to these points that hopefully matches the ground truth lattice well. Since just by choosing a very fine lattice we can get an arbitrarily small error without doing anything interesting, there also needs to be some penalty on excessively fine lattices. This is a bit of a strange problem, and an algorithm for it will be presented here.
method
Since this is a lattice problem, the first question to jump to mind should be if we can use the LLL algorithm in some way.
One application of the LLL algorithm is to find integer relations between a given set of real numbers. [wikipedia] A matrix is formed with those real numbers (scaled up by some factor ζ) making up the bottom row, and an identity matrix sitting on top. The algorithm tries to make the basis vectors (the column vectors of the matrix) short, but it can only do so by making integer combinations of the basis vectors. By trying to make the bottom entry of each basis vector small, the algorithm is able to find an integer combination of real numbers that gives 0 (if one exists).
But there’s no reason the bottom row has to just be real numbers. We can put in vectors instead, filling up several rows with their entries. The concept should work just the same, and now instead of combining real numbers, we’re combining vectors.
For example, say we have 4 datapoints in three-dimensional space, $\mathbf{x}_i = (x_i, y_i, z_i)$. Then we’d use the following matrix as input to the LLL algorithm.
$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \zeta x_1 & \zeta x_2 & \zeta x_3 & \zeta x_4 \\ \zeta y_1 & \zeta y_2 & \zeta y_3 & \zeta y_4 \\ \zeta z_1 & \zeta z_2 & \zeta z_3 & \zeta z_4 \end{pmatrix}$
Here, ζ is a tunable parameter. The larger the value of ζ, the more significant any errors in the lower 3 rows will be. So fits with a large ζ value will be more focused on having a close match to the given datapoints. On the other hand, if the value of ζ is small, then the upper 4 rows are relatively more important, which means the fit will try and interpret the datapoints as small integer combinations of the lattice basis vectors.
The above image shows the algorithm at work. Green dot is the origin. Grey dots are the underlying lattice (ground truth). Blue dots are the noisy data points the algorithm takes as input. Yellow dots are the lattice basis vectors returned by the algorithm.
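For concreteness, here is a minimal numpy sketch of assembling that input matrix (my own illustration; lll_reduce is a hypothetical stand-in for whatever LLL implementation you use, not a specific library's API):

import numpy as np

def build_lll_input(dataset, zeta):
    # dataset has shape (num_points, dim); the columns of the returned matrix
    # are the basis vectors handed to the LLL algorithm
    num_points, dim = dataset.shape
    top = np.eye(num_points)         # identity block: keeps track of the integer combinations taken
    bottom = zeta * dataset.T        # scaled coordinates, one row per spatial dimension
    return np.vstack([top, bottom])  # shape (num_points + dim, num_points)

# M = build_lll_input(dataset, zeta)
# basis = lll_reduce(M)   # hypothetical LLL call; short output columns correspond to lattice relations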
code link
https://github.com/pb1729/latticefit
Run lattice_fit.py to get a quick demo.

API: Import lattice_fit and then call lattice_fit(dataset, zeta) where dataset is a 2d numpy array. The first index into the dataset selects the datapoint, and the second selects a coordinate of that datapoint. zeta is just a float, whose effect was described above. The result will be an array of basis vectors, sorted from longest to shortest. These will approach zero length at some point, and it’s your responsibility as the caller to cut off the list there. (Or perhaps you already know the size of the lattice basis.)

caveats
Admittedly, due to the large (though still polynomial) time complexity of the LLL algorithm, this method scales poorly with the number of data points. The best suggestion I have so far here is just to run the algorithm on manageable subsets of the data, filter out the near-zero vectors, and then run the algorithm again on all the basis vectors found this way.
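Here's a rough sketch of that suggestion in code (my own illustration, calling the lattice_fit function described above; the batch size and the cutoff for "near-zero" are arbitrary choices):

import numpy as np
# assumes lattice_fit has been imported as described above

def lattice_fit_batched(dataset, zeta, batch_size=200, cutoff=1e-6):
    # run lattice_fit on manageable subsets of the data, keep the basis vectors
    # that aren't near-zero, then run once more on everything found
    candidates = []
    for start in range(0, len(dataset), batch_size):
        basis = lattice_fit(dataset[start:start + batch_size], zeta)
        candidates.extend(v for v in basis if np.linalg.norm(v) > cutoff)
    return lattice_fit(np.array(candidates), zeta)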
...
I originally left this as an answer to a stack overflow question that I came across when initially searching for a solution to this problem.
Linkpost for: https://pbement.com/posts/latticefit.html