The possibility that neural networks might aim to achieve internally-represented goals is still not widely understood, which makes it hard to discuss and study why those goals might or might not be aligned with the values of (any given set of) humans.
I agree that this distinction is critical.
My brief attempt to explain why internally-represented goals (nice phrasing, btw!) matter for capabilities and alignment is Steering subsystems: capabilities, agency, and alignment.