World-Wise Web?

By Richard Waters
Financial Times, March 4, 2008

Edited by Andy Ross

A new wave of AI, drawing on natural language processing, image recognition, and expert systems, may lead to intelligent machines. Thinking Machines founder Danny Hillis: "I had some hope you could just put everything into some big neural network that would just start to think, but it doesn't take long working in AI to realise it's much more complex than that."

The basic building block for this new technology movement is the semantic web, the brainchild of Sir Tim Berners-Lee, who invented the present World Wide Web. Sir Tim imagined a new web formed by linking the data contained inside the documents. That way the data, not just the documents, would become accessible to machines.

This semantic web is the product of a set of core standards promoted by the World Wide Web Consortium, the organisation that Sir Tim leads. Now some supporters say enough pieces are in place to make the first semantic web services a reality.

But there are some big obstacles. At the heart of the problem is the need to make information on the web "understandable" to machines, so that it can be extracted, processed and made useful. To make this possible, machine-readable "tags" need to be attached to each piece of data to describe what type of information it represents.

Attaching these tags to every piece of information on the web is a huge task. Without new semantic services, there is no incentive to undertake the laborious work of tagging data, but creating the services is pointless unless the data exist in the first place. To try to overcome the problem, the semantic web depends on a set of "ontologies", or dictionaries that help to create common definitions that can be universally applied. These are designed to establish a basic common level of understanding about language to allow machines to do their work.
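The idea of machine-readable tags constrained by a shared ontology can be sketched as subject-predicate-object triples, the basic unit of the semantic web. The sketch below is a toy illustration, assuming a hypothetical two-term vocabulary; the `dc:` prefix echoes the real Dublin Core vocabulary, but the code does not use any actual semantic web library.

```python
# A toy ontology: a shared dictionary of predicates with agreed meanings.
ONTOLOGY = {
    "dc:creator": "the author of a resource",
    "dc:date": "a point in time associated with a resource",
}

def tag(subject, predicate, obj, ontology=ONTOLOGY):
    """Attach a machine-readable tag (a subject-predicate-object triple),
    rejecting predicates the shared ontology does not define."""
    if predicate not in ontology:
        raise ValueError(f"unknown predicate: {predicate}")
    return (subject, predicate, obj)

triples = [
    tag("article:42", "dc:creator", "Richard Waters"),
    tag("article:42", "dc:date", "2008-03-04"),
]

# Because the data, not just the document, is tagged, a machine can query it:
authors = [o for s, p, o in triples if p == "dc:creator"]
```

The ontology check is the point: two independent services that agree on the same predicate names can exchange and combine each other's triples without human interpretation.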

Natural language processing is a technology first developed in AI research. Even simple words or concepts can mean very different things to different people, and their meaning changes depending on circumstances. While the human mind can make the necessary adjustments, computers that follow strict rules about language find it hard to grasp the many context-specific meanings.

Companies trying to employ natural language processing maintain that technical advances in recent years have at last made it practical. By using software to "read" text, services such as Powerset aim to add tags to data automatically. The natural language approach also raises the possibility of new applications, for example directly answering questions posed by a user.

Powerset is using technology licensed from Parc to try to solve the problems of natural language processing. The software is based on an idea loosely analogous to superposition in quantum physics: a number of potential meanings for all the elements in the text are allowed to co-exist as equally valid during the "reading", until the most likely answer is singled out at the end.
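The flavour of that approach can be shown with a toy word-sense disambiguator that keeps every candidate reading alive and only scores them at the end. The sense inventory and compatibility scores below are invented for illustration; this is not Parc's or Powerset's actual method.

```python
from itertools import product

# Hypothetical sense inventory for two ambiguous words.
SENSES = {
    "bank": ["riverbank", "financial-bank"],
    "interest": ["curiosity", "loan-interest"],
}

# Hypothetical compatibility scores between adjacent senses.
COMPAT = {
    ("financial-bank", "loan-interest"): 0.9,
    ("financial-bank", "curiosity"): 0.3,
    ("riverbank", "curiosity"): 0.4,
    ("riverbank", "loan-interest"): 0.1,
}

def disambiguate(words):
    """Let all candidate readings co-exist, then pick the best-scoring one."""
    candidates = [SENSES.get(w, [w]) for w in words]
    def score(reading):
        return sum(COMPAT.get(pair, 0) for pair in zip(reading, reading[1:]))
    return max(product(*candidates), key=score)
```

For the phrase "bank interest", the financial reading of both words scores highest and wins only at the final step, after all four combinations have been considered.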

Combining this approach with other techniques of data analysis can lift the accuracy level further. One method relies on predicting the meaning of a word based on the probabilities of its proximity to other words in the text. As words do not appear in random sequences, the fact that one word has been used in a sentence increases the chance that a particular other word will also turn up.
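The co-occurrence method described above amounts to estimating conditional probabilities from adjacent word pairs. A minimal bigram sketch, using an invented ten-word corpus:

```python
from collections import Counter

corpus = "the bank raised interest rates the bank of the river".split()

# Count adjacent word pairs (bigrams) and the words that start them.
pairs = Counter(zip(corpus, corpus[1:]))
starts = Counter(corpus[:-1])

def p_next(word, nxt):
    """Estimated probability that `nxt` follows `word` in the text."""
    return pairs[(word, nxt)] / starts[word]
```

Here "the" is followed by "bank" in two of its three occurrences, so seeing "the" raises the estimated chance of "bank" appearing next to two thirds, which is exactly the kind of non-random sequencing the text exploits.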

Most expect the impact of the technology to be felt in stages. The early advances are likely to be incremental improvements. Search engines should return higher quality results, and services that rely on personalisation should make better guesses about your preferences, while targeted advertising systems should become more accurate.

The Charms of Wikipedia

By Nicholson Baker
The New York Review of Books
Volume 55, Number 4, March 20, 2008

Edited by Andy Ross

Wikipedia: The Missing Manual
By John Broughton
Pogue Press/O'Reilly, 477 pages

Wikipedia is just an incredible thing. It has 2.2 million articles and it's very often the first hit in a Google search. It was constructed, in less than eight years, by strangers who disagreed about all kinds of things but who were drawn to a shared, not-for-profit purpose.

It worked and grew because it tapped into the heretofore unmarshaled energies of the uncredentialed. This was an effort to build something that made sense apart from one's own opinion, something that helped the whole human cause roll forward.

Wikipedia was the point of convergence for the self-taught and the expensively educated. All everyone knew was that the end product had to make legible sense and sound encyclopedic. The need for the outcome of all edits to fit together as readable, unemotional sentences muted natural antagonisms. Wikipedians see vandalism as a problem, but a Diogenes-minded observer would submit that Wikipedia would never have been the prodigious success it has been without its demons.

Co-founder Jimmy "Jimbo" Wales: "The main thing about Wikipedia is that it is fun and addictive."

John Broughton: "This Missing Manual helps you avoid beginners' blunders and gets you sounding like a pro from your first edit."

SAP NetWeaver TREX

Wikipedia: TREX search engine

TREX is a search engine in the SAP NetWeaver integrated technology platform produced by SAP AG. The TREX engine is a standalone component that can be used in a range of system environments but is used primarily as an integral part of such SAP products as Enterprise Portal, Knowledge Warehouse, and Business Intelligence (BI, formerly SAP Business Information Warehouse). In SAP NetWeaver BI, the TREX engine powers the BI Accelerator, which is a plug-in appliance for enhancing the performance of online analytical processing. The name "TREX" stands for Text Retrieval and information EXtraction, but it is not a registered trade mark of SAP and is not used in marketing collateral.
Search functions
TREX supports various kinds of text search, including exact search, boolean search, wildcard search, linguistic search (grammatical variants are normalized for the index search) and fuzzy search (input strings that differ by a few letters from an index term are normalized for the index search). Result sets are ranked using term frequency-inverse document frequency (tf-idf) weighting, and results can include snippets with the search terms highlighted.

TREX supports text mining and classification using a vector space model. Groups of documents can be classified using query based classification, example based classification, or a combination of these plus keyword management.

TREX supports structured data search not only for document metadata but also for mass business data and data in SAP business objects. Indexes for structured data are implemented compactly using data compression and the data can be aggregated in linear time, to enable large volumes of data to be processed entirely in memory.
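The compact indexes and linear-time aggregation described above can be illustrated with dictionary encoding, a standard column-compression technique: distinct values are stored once and rows keep only small integer codes, so an aggregate needs a single pass over the codes. The data and layout below are invented for illustration, not SAP's actual implementation.

```python
# A column of business data: region per sales record, plus amounts.
regions = ["EU", "US", "EU", "EU", "US"]
sales = [10, 20, 30, 5, 15]

# Dictionary compression: store each distinct value once, keep integer codes.
dictionary = sorted(set(regions))
code_of = {value: i for i, value in enumerate(dictionary)}
codes = [code_of[r] for r in regions]

# Aggregate (total sales per region) in one linear pass over the codes.
totals = [0] * len(dictionary)
for code, amount in zip(codes, sales):
    totals[code] += amount

result = dict(zip(dictionary, totals))
```

Because the codes are dense small integers, the running totals fit in a plain array and the whole aggregation stays in memory, which is the property that makes processing large volumes of data feasible.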

Sir Tim: Google could be superseded

By Jonathan Richards
Times Online, March 12, 2008

Edited by Andy Ross

Google may eventually be displaced as the pre-eminent brand on the internet by a company that harnesses the power of next-generation web technology, says Tim Berners-Lee. The web of the future would allow any piece of information — such as a photo or a bank statement — to be linked to any other.

Tim Berners-Lee said that in the same way, the "current craze" for social networking sites would eventually be superseded by networks that connected all types of things. The semantic web will enable direct connectivity between much lower-level pieces of information — a written street address and a map, for instance — which in turn will give rise to new services.

Tim Berners-Lee: "Using the semantic web, you can build applications that are much more powerful than anything on the regular web. Imagine if two completely separate things — your bank statements and your calendar — spoke the same language and could share information with one another. You could drag one on top of the other and a whole bunch of dots would appear showing you when you spent your money."
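The bank-statement-and-calendar example comes down to two datasets "speaking the same language": if both carry a date in the same form, they can be joined mechanically. A minimal sketch with invented records, standing in for the semantic web's shared vocabularies:

```python
from datetime import date

# Hypothetical data sharing one "language": both record types carry a date.
statements = [
    {"date": date(2008, 3, 1), "amount": -42.50, "payee": "bookshop"},
    {"date": date(2008, 3, 4), "amount": -8.00, "payee": "cafe"},
]
calendar = [
    {"date": date(2008, 3, 4), "event": "conference"},
]

def overlay(statements, calendar):
    """Drag one dataset 'on top of' the other: pair each calendar event
    with the spending recorded on the same day."""
    by_day = {}
    for s in statements:
        by_day.setdefault(s["date"], []).append(s)
    return [(e["event"], by_day.get(e["date"], [])) for e in calendar]
```

The "dots" in the quote are exactly the pairs this returns: spending events positioned on the calendar, with no code specific to either banking or scheduling.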

Tim Berners-Lee invented the World Wide Web in 1989 while a fellow at CERN in Switzerland. Asked about the type of application that the Google of the future would develop, he said it would likely be a type of mega-mashup, where information is taken from one place and made useful in another context using the web.

Tim Berners-Lee is now a director of the Web Science Research Initiative, a collaborative project between the Massachusetts Institute of Technology and the University of Southampton.