Anupam Joshi and Tim Finin have received a $200,000 research award from the NSF Division of Information and Intelligent Systems for a two-year project focused on extracting information from tables. The project, T2K: From Tables to Knowledge, will explore the feasibility of automatically extracting new knowledge directly from data found in spreadsheets, database relations, and document tables and representing it as highly interoperable linked open data (LOD) in the Semantic Web language RDF. The extraction will be guided by probabilistic graphical models that use statistical information mined from current LOD knowledge resources. To demonstrate the potential payoff of the research, the system will be used to extract knowledge from tables collected from medical journals and from web sites like data.gov.
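To make the idea of representing table data as RDF concrete, here is a minimal sketch in Python. It is not the T2K system: the column-to-property mapping that the project aims to infer with probabilistic graphical models is supplied by hand, and the namespace `http://example.org/`, the function names, and the sample rows are all hypothetical placeholders chosen for illustration.

```python
from urllib.parse import quote

# Hypothetical namespace; a real system would link to existing LOD resources.
BASE = "http://example.org/"


def table_to_triples(header, rows, subject_col, column_props):
    """Turn each table row into (subject, predicate, object) triples.

    column_props maps a column name to a property IRI. Here it is given
    by hand; inferring it automatically is the hard part of the problem.
    """
    s_idx = header.index(subject_col)
    triples = []
    for row in rows:
        # Build a subject IRI from the value in the subject column.
        local_name = quote(row[s_idx].replace(" ", "_"))
        subject = f"<{BASE}resource/{local_name}>"
        for col, value in zip(header, row):
            if col == subject_col or col not in column_props:
                continue
            predicate = f"<{column_props[col]}>"
            # Escape backslashes and double quotes for a plain string literal.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append((subject, predicate, f'"{escaped}"'))
    return triples


def to_ntriples(triples):
    """Serialize triples one per line in N-Triples style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Made-up sample table (placeholder values, not real data).
header = ["City", "State", "Population"]
rows = [
    ["Alphaville", "Stateland", "1000"],
    ["Beta Town", "Stateland", "2000"],
]
column_props = {
    "State": BASE + "prop/state",
    "Population": BASE + "prop/population",
}

triples = table_to_triples(header, rows, "City", column_props)
print(to_ntriples(triples))
```

Each data cell outside the subject column becomes one triple, so the two sample rows yield four statements. The sketch treats every cell as a plain string; deciding which cells denote entities, numbers, or dates is part of what the table-interpretation research addresses.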
While the W3C Semantic Web languages RDF and OWL will be used to represent the knowledge, the results will be applicable to other semantic data frameworks such as Microdata (Search Consortium), Freebase (Google), Probase (Microsoft), and the Open Graph (Facebook). The open-source prototype software will allow other researchers to experiment with automatically producing semantically enriched data from tables for their domains.
If successful, such extraction systems are expected to become part of a new online knowledge ecology, both consuming existing LOD knowledge to understand the intended meaning implicit in a table and producing new facts and knowledge that will become part of the Web. This would represent a dramatic increase in the breadth and depth of public semantic data that can make “big data” analytics more effective.