Mining the Deep Web for Economic Data
What is commonly considered the World Wide Web is in fact a small fraction of the data available on the Internet. The metaphor of a web was motivated by linked textual material, but the volume of hypertext on the Internet is dwarfed by the amount of information made available in networked databases provided by directory services, information portals, government agencies, private companies, scientists, and a host of other providers. Since these data have no static inbound hyperlinks, they are not reached by the webcrawlers of search engines, and hence are largely untapped as a resource for any use other than point lookups. As a result, these data are often called the "deep Web" or the "hidden Web". A recent study estimates the size of the deep Web at 7.5 petabytes, or 400 to 550 times larger than the hypertext indexed by search engines. A large fraction of the data on the deep Web is not full-text documents but quantitative data, much of it potentially of economic interest. There are job ads, housing ads, SEC filings, and up-to-the-minute prices for all sorts of things.
In the private sector, sites such as www.corporatesleuth.com have mined the SEC EDGAR database for financial and accounting data of interest to investors. We know of one publisher that has reverse-engineered Amazon's book rankings in order to infer actual sales from them. This allows the publisher to estimate book sales by other publishers, by category, and by season, making for much better inventory management. No doubt there are many other examples of private firms mining data from the deep Web. The public sector, on the other hand, has not yet made significant use of the data available on the deep Web. We believe that there are many compelling applications, and have created one interesting example of the use of political data at fff.cs.berkeley.edu. But this is just the beginning: we think that there are great opportunities to mine the deep Web for data that will be of use for economic forecasting, particularly regional forecasting. There are several interesting technical and economic challenges in extracting and analyzing these data.
We propose to use screen scraping and related tools to mine the deep Web for data useful for economic forecasting. It is important to start this effort soon, since we hope to be able to develop some leading indicators of economic recovery, particularly in the technology sector. The primary focus of this work is to implement some tools and gather data that can be used in future analysis.
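To make the idea of screen scraping concrete, the following is a minimal sketch in Python, not a description of the tools this proposal would build. It assumes that a database-backed site returns its query results as an HTML table. The sample page, its column names, and the helper names TableScraper and scrape_table are all hypothetical and chosen only for illustration. The sketch turns the table into records and then counts job ads by region, the kind of raw series from which a regional indicator might later be constructed.

```python
from collections import Counter
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def scrape_table(html):
    """Return (header, records); each record is a dict keyed by the header."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return [], []
    header, body = parser.rows[0], parser.rows[1:]
    return header, [dict(zip(header, row)) for row in body]


# Hypothetical result page standing in for a real query response.
# The listings and dates are made up for this example.
SAMPLE_PAGE = """
<html><body><table>
<tr><th>Title</th><th>Region</th><th>Posted</th></tr>
<tr><td>Software Engineer</td><td>San Jose</td><td>2002-03-01</td></tr>
<tr><td>Network Admin</td><td>Oakland</td><td>2002-03-02</td></tr>
<tr><td>Database Analyst</td><td>San Jose</td><td>2002-03-02</td></tr>
</table></body></html>
"""

header, records = scrape_table(SAMPLE_PAGE)

# Count ads per region: a simple aggregate of the scraped records.
ads_by_region = Counter(record["Region"] for record in records)
```

In practice the page would be fetched by submitting a query to the site rather than read from a string, and each site would need its own extraction rules, since the layout of result pages differs from one provider to the next.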