Researchers from the US have developed an artificial intelligence (AI) system that surfs the internet, extracts information from the available plain text and organises it for quantitative analysis in very little time. Recently at the Association for Computational Linguistics’ Conference on Empirical Methods in Natural Language Processing, researchers from the Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory won a best-paper award for a new approach to information extraction that turns conventional machine learning on its head.
Most machine-learning systems work by combing through training examples and looking for patterns that correspond to classifications provided by human annotators. In their new paper, the MIT researchers train their system on scanty data, because in the scenario they’re investigating, that’s usually all that’s available. But then they find the problem of limited information an easy one to solve.
“In information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article,” said Regina Barzilay, the Delta Electronics Professor of Electrical Engineering and Computer Science. “That’s very different from what you or I would do. When you are reading an article that you cannot understand, you are going to go on the web and find one that you can understand,” added Barzilay, who is also a senior author of the paper.
A machine-learning system assigns each of its classifications a confidence score — which is a measure of the statistical likelihood that the classification is correct — given the patterns discerned in the training data.
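To make the idea of a confidence score concrete, here is a minimal Python sketch of one common way to produce one: a softmax over a classifier's raw scores. This is an illustration only, not a description of the MIT system. The function names, the example scores and the labels are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw classifier scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels):
    """Return the most likely label and its probability as a confidence score."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical raw scores for three hypothetical labels.
label, confidence = classify([2.0, 0.5, 0.1], ["victim_count", "location", "other"])
print(label, round(confidence, 2))
```

A score close to 1 means the classifier strongly prefers one label, while a score close to an even split across the labels signals that the extraction is uncertain.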
With the researchers’ new system, if the confidence score is too low, the system automatically does a web search to pull up texts likely to contain the data it is trying to extract. It then attempts to extract the relevant data from one of the new texts and reconciles the results with those of its initial extraction. If the confidence score remains too low, it moves on to the next text pulled up by the search string, and so on. Eventually, the system learns how to generate search queries, gauge the likelihood that a new text is relevant to its extraction task, and determine the best strategy for fusing the results of multiple attempts at extraction.
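The loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers' implementation. It uses a fixed confidence threshold and keeps whichever extraction is more confident, whereas the article says the real system learns its query, relevance and fusion strategies. The names `Extraction`, `reconcile`, `extract_with_fallback`, `toy_extract` and `toy_search` are hypothetical, and the toy extractor and search function stand in for a trained model and a real web search.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Extraction:
    value: Optional[str]
    confidence: float  # assumed to lie in [0, 1]


def reconcile(current: Extraction, new: Extraction) -> Extraction:
    """Assumed fusion rule: keep whichever extraction is more confident."""
    return new if new.confidence > current.confidence else current


def extract_with_fallback(
    article: str,
    extract: Callable[[str], Extraction],
    search: Callable[[str], Iterable[str]],
    threshold: float = 0.8,
    max_docs: int = 5,
) -> Extraction:
    """Extract from the article; if unsure, try texts returned by a search."""
    best = extract(article)
    if best.confidence >= threshold:
        return best
    for i, doc in enumerate(search(article)):
        if i >= max_docs:
            break
        best = reconcile(best, extract(doc))
        if best.confidence >= threshold:
            break  # confident enough, stop reading further texts
    return best


def toy_extract(text: str) -> Extraction:
    """Stand-in extractor: confident only when the phrasing is explicit."""
    match = re.search(r"(\d+) people were injured", text)
    if match:
        return Extraction(match.group(1), 0.9)
    match = re.search(r"(\d+)", text)
    if match:
        return Extraction(match.group(1), 0.4)
    return Extraction(None, 0.0)


def toy_search(article: str) -> list:
    """Stand-in for a web search: returns canned texts."""
    return ["Officials gave no figures.", "Police said 12 people were injured."]


result = extract_with_fallback("Several, perhaps 10, hurt in incident.",
                               toy_extract, toy_search)
print(result.value, result.confidence)  # prints: 12 0.9
```

In this toy run the first article yields only a low-confidence guess, the first search result adds nothing, and the second search result states the figure explicitly, so the loop stops there.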