Well, my naive first thought was to abuse the OpenCyc engine for a while until it starts making good rough guesses about which particular mathematical concepts, quantities, and sets are being referred to in a given sentence, and then plug it, either directly or by mass download and conversion, into various data sources like WolframAlpha, international health / crime / population / economics databases, or various government services.
But that still means doing math (doing math with linguistics) and tons and tons of programming just to get a working prototype that understands “30% of Americans are older than 30 years old”, plus way more work than I care to visualize just to get the system to respond in a sane manner, rather than explode, when you throw something incongruent at it (“30 of Americans are 30% years old” should not make the system choke, for example), and so on. And then you’ve got to build something usable around that: interfaces, ways to extract and store data, and then probably pack everything together. And once you’re there, you probably want to turn it into a product and sell it, since you might as well cash in some money on all of this work. Then more work.
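To make the “don’t choke” requirement concrete, here is a toy sketch (plain Python, nothing to do with OpenCyc’s actual API) of the kind of quantity extraction and sanity checking the prototype would need; the sentence pattern and the checks are purely illustrative:

```python
# Illustrative toy sketch only -- not OpenCyc, not a real NLP pipeline.
# It just shows the kind of sanity-checking the prototype would need:
# pull out the quantities, decide which slots they could fill, and refuse
# politely when the combination makes no sense.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    percentage: float   # share of the population, 0-100
    age_years: float    # age threshold in years
    population: str     # e.g. "americans"

def parse_claim(sentence: str) -> Optional[Claim]:
    """Try to read a '<X>% of <group> are older than <Y> years old' claim."""
    s = sentence.lower()
    m = re.search(r"(\d+(?:\.\d+)?)\s*% of (\w+) are older than (\d+(?:\.\d+)?) years old", s)
    if not m:
        return None  # shape not recognized -- report that, don't crash
    pct, group, age = float(m.group(1)), m.group(2), float(m.group(3))
    # Basic congruence checks: percentages live in [0, 100], ages are non-negative.
    if not (0 <= pct <= 100) or age < 0:
        return None
    return Claim(percentage=pct, age_years=age, population=group)

for text in ["30% of americans are older than 30 years old",
             "30 of americans are 30% years old"]:
    claim = parse_claim(text)
    print(text, "->", claim if claim else "could not interpret (rejected gracefully)")
```

The real system would obviously need far more than one regex; the point is only that every malformed input has to land in the “rejected gracefully” branch rather than crashing the pipeline.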
The whole prospect looks like a small asteroid rather than a mountain, from where I’m sitting. I am not in the business of climbing, mining, deconstructing and exporting small asteroids. I’ll stick to climbing over mountains until I have a working asteroid-to-computronium converter.
My suggestion would be to go via some sort of meta-analysis or meta-meta-analysis (yes, that’s a thing); if you have, for example, a meta-analysis of all results in a particular field and how often they replicate, you can infer pretty accurately how well a new result in that field will replicate. (An example use: ‘So 90% of all the previous results with this sample size or smaller failed to replicate? Welp, time to ignore this new result until it does replicate.’)
It would of course be a ton of work to compile them all, and for any new result you were interested in, you’d still have to know how to code it up in terms of sample size, which sub-sub-field it was in, what the quantitative measures were, etc., but at least it doesn’t require nigh-magical AI or NLP, just a great deal of human effort.
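For concreteness, a minimal sketch of that base-rate lookup, assuming you have already hand-coded prior results as (sub-field, sample size, replicated?) records; the field names and numbers below are invented for illustration:

```python
# Hedged sketch of the base-rate lookup described above, on a made-up
# hand-coded dataset. Each record is a previously published result:
# (subfield, sample size, whether it later replicated).
from dataclasses import dataclass

@dataclass
class PriorResult:
    subfield: str
    sample_size: int
    replicated: bool

def replication_base_rate(history, subfield, sample_size):
    """Share of earlier results in this subfield, at this sample size or
    smaller, that went on to replicate (with Laplace smoothing so an empty
    bucket doesn't return a hard 0 or 1)."""
    relevant = [r for r in history
                if r.subfield == subfield and r.sample_size <= sample_size]
    successes = sum(r.replicated for r in relevant)
    return (successes + 1) / (len(relevant) + 2)

history = [
    PriorResult("ego-depletion", 40, False),
    PriorResult("ego-depletion", 55, False),
    PriorResult("ego-depletion", 300, True),
    PriorResult("anchoring", 80, True),
]

# A new ego-depletion result with n=60: how seriously should we take it?
print(replication_base_rate(history, "ego-depletion", 60))  # -> 0.25
```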
Nigh-magical is the word indeed. I just realized that if my insane idea in the grandparent were made to work, it could be unleashed upon every research publication everywhere to mine data, figures, estimates, etc., and then output a giant belief network of “this is collective-human-science’s current best guess for fact / figure / value / statistic X”.
That does not sound like something that could be achieved by a developer less than Google-sized. It also fails all of my incredulity and sanity checks.
(it also sounds like an awesome startup idea, whatever that means)
Or IBM-sized. But if you confined your ambitions to analyzing just meta-analyses, it would be much more doable. The narrower the domain, the better AI/NLP works, remember. There are some remarkable examples of what you can do in machine-reading a narrow domain and extracting meaningful scientific data; one of them is ChemicalTagger (demo), which reads chemistry papers describing synthesis processes and extracts the process (although it has serious problems getting papers to use). I bet you could get a lot out of reading meta-analyses; there’s a good summary just in the forest plot used in almost every meta-analysis.
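To illustrate why the forest plot alone is worth extracting: each row is one study’s effect estimate plus its 95% CI, which is enough to re-derive the standard fixed-effect (inverse-variance weighted) summary once the rows are pulled out of the paper. A minimal sketch, with invented numbers:

```python
# Each forest-plot row gives (effect, CI lower, CI upper). From the CI we can
# recover a standard error, weight each study by inverse variance, and pool.
# The three studies below are made-up numbers for illustration.
import math

def pooled_effect(rows):
    """rows: list of (effect, ci_low, ci_high) taken from a forest plot.
    Returns the fixed-effect pooled estimate and its 95% CI."""
    weights, weighted_effects = [], []
    for effect, lo, hi in rows:
        se = (hi - lo) / (2 * 1.96)      # recover the standard error from the CI width
        w = 1.0 / se ** 2                # inverse-variance weight
        weights.append(w)
        weighted_effects.append(w * effect)
    pooled = sum(weighted_effects) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

rows = [(0.30, 0.10, 0.50),   # study A
        (0.45, 0.05, 0.85),   # study B
        (0.20, 0.00, 0.40)]   # study C
print(pooled_effect(rows))    # pooled estimate around 0.27
```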
But it sounds totally awesome. Especially if it can be created once and used over and over for different applications.