As we know and you mentioned, humans do learn from small data. We start with priors that are hopefully not too strong and go through the known processes of scientific discovery. NNs do not have that meta-process or any introspection (yet).
“You cannot solve this problem by minimizing your error over historical data. … Big data compensates for weak priors by minimizing an algorithm’s error over historical results. Insofar as this is true, big data cannot reason about small data.”
NNs also do not reduce/idealize/simplify, explicitly generalize, and then run the results as hypothesis forks, or use priors to run checks (BS rejection / specificity). We do.
Maybe there will be an evolutionary process where huge NNs are reduced to do inference at “the edge”, and this turns into human-like learning once feedback from “the edge” is used to select and refine the best nets.
This is very important. I plan to follow up with another post about the necessary role of hypothesis amplification in AGI.
Edit: Done.
I came across this:
The New Dawn of AI: Federated Learning
“This [edge] update is then averaged with other user updates to improve the shared model.”
I do not know exactly how that is meant, but when I hear the word “average” my alarm bells always go off.
Instead of a shared NN, each device should get multiple slightly different NNs/weight sets and report back which set was worst/unfit and which was best/fittest. Each set/model is a hypothesis, and the test in the world is an evolutionary/democratic falsification. Those mutants that fail to satisfy the most customers are dropped.
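A rough sketch in Python of what I have in mind. Everything here (the perturbation helper, the hypothetical device.local_score method, the single-survivor selection) is illustrative only and not any existing federated-learning API:

```python
import random


def perturb(weights, scale=0.01):
    """Create a slightly mutated copy of a weight vector (one hypothesis)."""
    return [w + random.gauss(0.0, scale) for w in weights]


def evolutionary_round(base_weights, devices, n_variants=5):
    # Each variant is one hypothesis about how the shared model should change.
    variants = [perturb(base_weights) for _ in range(n_variants)]

    # Each device evaluates every variant on its own local data and votes
    # for the fittest one (device.local_score is a hypothetical method).
    votes = [0] * n_variants
    for device in devices:
        scores = [device.local_score(v) for v in variants]  # higher = better
        votes[scores.index(max(scores))] += 1

    # Democratic falsification: the variant that satisfied the most devices
    # becomes the next shared model; the losing mutants are dropped.
    return variants[votes.index(max(votes))]
```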
NNs are a big data approach, tuned by gradient descent, so every update is necessarily small (in the mathematical sense of a first-order approximation). When updates are small like this, averaging is fine, especially considering that most neural networks use sigmoid activation functions.
While this averaging approach can’t solve small data problems, it is perfectly suited to today’s NN applications, where things tend to be well-contained, without fat tails. It works fine within the traditional problem domain of neural networks.
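As a minimal sketch of what this kind of averaging looks like (FedAvg-style weight averaging of small local updates); the function names and the plain-list weight representation are illustrative assumptions, not a specific library’s API:

```python
def local_update(weights, grads, lr=0.01):
    # One small gradient-descent step on a client's local data; because the
    # step is first-order and the learning rate is small, the update stays small.
    return [w - lr * g for w, g in zip(weights, grads)]


def federated_average(client_weights, client_sizes):
    # Average the clients' updated weights, weighted by local dataset size.
    total = sum(client_sizes)
    averaged = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            averaged[i] += w * (size / total)
    return averaged
```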