Note that "gradient descent" isn't AI either. It's more computational linear algebra: a heuristic numerical method for solving (usually only to a local extremum) systems of equations that have no direct analytical solution.
This is not abstract math, and the article does explain what it's doing before presenting the code snippets.
How can you explain or implement gradient descent without math? At some point I think you have to accept that this is a topic that involves math, and you're way better off understanding it on those terms rather than trying to avoid it.
I thought gradient descent was mostly calculus, not linear algebra. I was under the impression that linear algebra was used to frame the calculations so that GPUs could be utilized (since GPUs are very good at LA operations).
I'm not an AI expert either, but let me give this a try.
I assume you are vaguely familiar with gradient descent. In gradient descent, we are basically trying to find the sweet spot where the value of a function is minimized. We do this by calculating the derivative of the function at a certain point and then using it to take small steps in the direction where we believe the function will have a lower value.
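To make that concrete, here's a minimal sketch in Python (my own toy example, not the article's snippet): minimizing f(x) = (x - 3)^2 by repeatedly stepping against its derivative.

    # Minimize f(x) = (x - 3)^2; its derivative is f'(x) = 2 * (x - 3),
    # and the true minimum is at x = 3.
    def f(x):
        return (x - 3.0) ** 2

    def df(x):
        return 2.0 * (x - 3.0)

    x = 0.0              # arbitrary starting point
    learning_rate = 0.1  # how small each step is

    for _ in range(100):
        x -= learning_rate * df(x)  # step against the derivative, i.e. downhill

    print(x, f(x))  # x ends up very close to 3, f(x) very close to 0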
Gradient descent usually suffers from a problem where the algorithm gets stuck in local minima if the function is not convex.
However, when people use gradient descent to optimize functions with a very large number of parameters (as is the case in Deep Learning), another problem surfaces: saddle points. Imagine a 3-dimensional plot of the function at different values of its parameters (in reality the plot would be much higher-dimensional). On this plot there will be many regions where the partial derivatives defining the surface all become zero even though we are not at a minimum. This messes with our plan to use derivatives to find the direction in which to move, so we need to come up with strategies to escape saddle points during the gradient descent process.
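A small sketch of what that looks like (again my own toy example, nothing specific to deep learning): f(x, y) = x^2 + y^4 - y^2 has a saddle point at (0, 0), where both partial derivatives vanish, and two minima at (0, ±1/√2). Plain gradient descent started at the saddle never moves; one crude escape strategy is to add a little random noise to each step.

    import numpy as np

    def grad(p):
        x, y = p
        # partial derivatives of f(x, y) = x^2 + y^4 - y^2
        return np.array([2.0 * x, 4.0 * y ** 3 - 2.0 * y])

    def descend(start, lr=0.05, steps=500, noise=0.0, seed=0):
        rng = np.random.default_rng(seed)
        p = np.array(start, dtype=float)
        for _ in range(steps):
            p -= lr * grad(p)
            p += noise * rng.standard_normal(2)  # random kick to get off flat spots
        return p

    print(descend((0.0, 0.0)))              # stays at the saddle: [0. 0.]
    print(descend((0.0, 0.0), noise=1e-3))  # escapes, ends up near y = +/- 0.707

Momentum and the noise inherent in stochastic (mini-batch) gradients serve a similar purpose in practice.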
Gradient descent is a subset of survival of the fittest, described by Darwin in the 1800s, and has been applied in computer science since the '70s. An AGI will probably use some form of gradient descent during its training, yes, but I wouldn't argue that this has brought us even close to an AGI.
I've considered using gradient descent to optimize parameters on toy problems at university a few times. Never actually did it, though; it's a lot of hassle for the benefit of less manual interaction, at the cost of no longer building some intuition.
You are not off base at all. Thanks for clarifying, and sorry for the confusion; I did not mean to say it was using gradient descent. It's been a while. The term I was thinking of was "simulated annealing".
FYI, gradient descent is covered in one of the very first weeks of Andrew Ng's Coursera machine learning class, so perhaps just watch those lessons (free)
Gradient descent is basically the approximate solution, because getting the exact solution requires inverting large matrices, which apparently isn't practical at that scale (it's too slow).
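A simplified illustration of that trade-off (my own sketch, using least-squares linear regression as the example): the "exact" route solves the normal equations A^T A x = A^T b, which means inverting or factoring a matrix whose size grows with the number of parameters (roughly cubic cost), while gradient descent only needs matrix-vector products per step and converges to nearly the same answer.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 5))  # 200 data points, 5 parameters
    true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    b = A @ true_w + 0.01 * rng.standard_normal(200)

    # Exact least-squares solution via the normal equations (A^T A) x = A^T b.
    # Fine for 5 parameters; prohibitive for millions of them.
    x_exact = np.linalg.solve(A.T @ A, A.T @ b)

    # Approximate solution via gradient descent on ||Ax - b||^2,
    # whose gradient is 2 A^T (Ax - b).
    x = np.zeros(5)
    lr = 1e-3
    for _ in range(5000):
        x -= lr * 2 * A.T @ (A @ x - b)

    print(np.allclose(x, x_exact, atol=1e-3))  # True: both routes agree closely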