Rowan's World, et Cetera

The Non-Semantic Web: A blog entry is not a database

by Rowan on 31 January, 2010

There has long been a hope – an expectation, even – that the Web will somehow develop into something “smart”; that it will move from being a mere store-house of information to something that will actually “know the answers”.

But the vision tends to overlook the nature of both computers and humans. On the one hand, humans have a limited memory, and a flawed ability to apply consistent logic; on the other hand, we can interpret knowledge and ideas creatively, in ways that are far beyond the capacity of any computer so far designed.

Big Calculators

Artificial Intelligence is a fascinating area of research – it’s what I studied at University – but it is one that has had something of a reality check over the years. At first, it seemed like only a matter of time before computers of some sort or another would exceed all human abilities, ushering in a technological utopia – or perhaps an apocalypse. But it turns out there are things computers are really good at and things they’re really bad at.

The classic example is chess computers: when researchers started building them, the reasoning was that since it takes an intelligent human to play chess, a computer that could play chess would be intelligent too. Eventually, they built a computer that could consistently beat the best human chess players; unfortunately, they discovered that that’s all they’d built – they hadn’t broken through to a new level of computing, they’d just broken down chess into a very long list of sums. And computers are good at sums.

Another example is neural networks – the wonderfully futuristic sounding “Multi-Layer Perceptron” is a learning system based on simplified mathematical models of neurons, the building blocks of the human brain. You can “teach” such a network to recognise things, and it starts seeming very clever. But researchers realised that you don’t really need the clever model at all, because it’s all just maths – and indeed you can make a so-called “N-tuple Neural Network” that consists entirely of lookup tables, with no real maths involved, and certainly no “brain”. And while there are now very handy machines that can recognise faces, it’s still no use trying to have a conversation with one.
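For the curious, a lookup-table recogniser of this kind can be sketched in a few lines. This is a simplified illustration in the general style of n-tuple classifiers, not a faithful copy of any published system; the class name, the 3×3 “images” and the labels are all made up for the example.

```python
import random


class NTupleClassifier:
    """Remembers which bit patterns it has seen at fixed groups of input positions."""

    def __init__(self, input_size, tuple_size, seed=0):
        positions = list(range(input_size))
        random.Random(seed).shuffle(positions)
        # Split the shuffled positions into fixed groups (the "tuples").
        self.groups = [positions[i:i + tuple_size]
                       for i in range(0, input_size, tuple_size)]
        self.tables = {}  # label -> one set of seen patterns per group

    def _patterns(self, bits):
        return [tuple(bits[p] for p in group) for group in self.groups]

    def train(self, bits, label):
        tables = self.tables.setdefault(label, [set() for _ in self.groups])
        for table, pattern in zip(tables, self._patterns(bits)):
            table.add(pattern)  # "learning" is just storing what was seen

    def classify(self, bits):
        patterns = self._patterns(bits)
        # Score each label by how many of its tables recognise the input.
        scores = {label: sum(p in t for t, p in zip(tables, patterns))
                  for label, tables in self.tables.items()}
        return max(scores, key=scores.get)


vertical = [0, 1, 0,
            0, 1, 0,
            0, 1, 0]
horizontal = [0, 0, 0,
              1, 1, 1,
              0, 0, 0]
smudged = [0, 1, 0,
           0, 1, 0,
           0, 0, 0]  # a vertical bar with one pixel missing

classifier = NTupleClassifier(input_size=9, tuple_size=3)
classifier.train(vertical, "vertical")
classifier.train(horizontal, "horizontal")
print(classifier.classify(smudged))  # vertical
```

The only arithmetic is counting how many tables say “seen it”, and yet it copes with an input it was never shown.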

Whose Markup is it Anyway?

The “Semantic Web”, at the moment, is like that early AI research – it all seems just around the corner, if only we all try a little bit harder. In fact, the Web has already come a long way, and while the basic language of HTML remains the same, its usage has changed dramatically over the years.

At first, HyperText Markup Language was a very simple way of “marking up” documents in the big text repository that was the World Wide Web – separating out paragraphs, marking headlines and important bits, and, of course, turning text into “anchors” which “hyperlinked” documents together. As it grew, the Web demanded more, prettier, pages, and HTML grew into a rich, presentational language – with colours, tables, images, even simple animations. Then, the need to automate complex and interactive systems meant HTML had to become a structural language as well – one that could identify blocks of a page, take them apart, and join them back together in a different order.

At this stage, a question arises: just how much structure does a document need? Two technologies readily associated with structural HTML are XHTML – HTML with the added rigidity and tools of XML – and the DOM – a way of looking at a document as a big tree of “nodes”. This is an extremely useful way of looking at a page with lots going on – you can work dynamically with the navigation controls, or the interactive comment system, without touching the main content area. But how useful is it inside the content? In the sentence “I really like structure”, the word “really” is wrapped in its own HTML element, but not because of any structural significance – it’s just there for presentation.
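To make the “big tree of nodes” concrete, here is a rough sketch that builds such a tree for that one sentence, using Python's standard `html.parser` module. The `Node` and `TreeBuilder` classes are my own invention for illustration, not the real DOM API, and the choice of a `<p>` wrapper and an `<em>` tag is an assumption about how the sentence might be marked up.

```python
from html.parser import HTMLParser


class Node:
    """One node in a very simplified document tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree from well-formed markup (no void tags such as <br>)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent  # climb back up

    def handle_data(self, data):
        self.current.children.append(Node(f"#text {data!r}", self.current))


def outline(node, depth=0):
    """Return the tree as a list of indented lines."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed("<p>I <em>really</em> like structure</p>")
builder.close()
tree_lines = outline(builder.root)
print("\n".join(tree_lines))
# #document
#   p
#     #text 'I '
#     em
#       #text 'really'
#     #text ' like structure'
```

One short sentence becomes six nodes, and the extra branch for “really” tells a program nothing about why the word was singled out.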

Now, we’re told, HTML needs to evolve further, into a semantic markup, which doesn’t just separate out the blocks, it labels them all meaningfully. I’ve used italics as an example deliberately, because a common misconception is that the <i> tag is bad – it’s presentational – and <em> is better – it means emphasise this, rather than draw this in italics: much more semantic, surely? And so the Rich-Text control I’m typing in right now has a button labelled “I” – the recognised icon for italics – which inserts an <em> tag; it even replaces <i> with <em> if you edit the HTML! This is, obviously, a complete misunderstanding – a straight find-and-replace can’t magically gain us meaning – but the fact is, it’s what people are used to, and it’s people that write the content.

So this is the second conflict facing the Web – authors don’t want to add all this information. The real reason <i> is discouraged is that when it’s not being used for emphasis, the machine has no way of knowing why it is being used: it might be to indicate a foreign phrase, like in situ, in which case we could indicate what language it is; or it might be the name of a publication, like The Times, in which case we could label it with some appropriate global identifier. But the fact is, a human reader won’t gain anything from this, and content is written by humans, for humans.
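The find-and-replace approach is easy to sketch, which is part of the problem. The function below is a guess at the kind of blanket substitution an editor might perform, not the actual code of any Rich-Text control, and the sample sentence is invented to combine the three uses of italics mentioned above.

```python
import re


def swap_i_for_em(html):
    """Blindly rename every <i> element to <em>."""
    return re.sub(r"<(/?)i>", r"<\1em>", html)


source = "I <i>really</i> like reading <i>The Times</i> <i>in situ</i>."
swapped = swap_i_for_em(source)
print(swapped)
# I <em>really</em> like reading <em>The Times</em> <em>in situ</em>.
```

After the swap, the emphasis, the publication name and the Latin phrase are still indistinguishable from one another; the tag name has changed, but no information has been added.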

Playing to our Strengths

In the end, the Semantic Web will not be built by insisting that I tell WordPress that in situ is Latin – who would it help if I did? A blog entry is not, and never will be, a database – it’s a big block of text, or perhaps a few medium-sized blocks of text, with some decoration and links thrown in. The structural markup can help us divide up those blocks, and mark off the bits of the page which aren’t part of the blog entry at all; the semantic markup will improve how those blocks get labelled.

If you want to find pages about a particular topic, what you need is a really big index, and that’s a good job to give a computer; if you want to find “the right page” about a particular topic, it’s up to you to decide what “right” means. We can help the computers to help us – by making articles that explicitly label their own topics, for instance (as long as you can trust them, but that’s another issue). We can even decide that sometimes we do want to embed little bits of data in our content – if we know an address might be useful to someone, we could label it with an appropriate microformat, so they can more easily feed it to another piece of software. But if they want to look up what in situ or Perceptron actually means, we can leave it to a human to look it up in an appropriate index.
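As a rough illustration of why indexing is such a good job for a computer, here is a minimal inverted index: a table mapping each word to the pages that contain it. The page addresses and text are hypothetical, and a real search engine does a great deal more (ranking, stemming, and so on) than this sketch attempts.

```python
import re
from collections import defaultdict


def words_in(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each word to the set of page addresses containing it."""
    index = defaultdict(set)
    for address, text in pages.items():
        for word in words_in(text):
            index[word].add(address)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = words_in(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, invented for the example.
pages = {
    "/semantic-web": "The Semantic Web and the meaning of markup",
    "/chess": "Chess computers turn the game into sums",
    "/markup": "Structural markup divides a page into blocks",
}
index = build_index(pages)
print(sorted(search(index, "markup")))             # ['/markup', '/semantic-web']
print(sorted(search(index, "structural markup")))  # ['/markup']
```

The index can say which pages mention “markup”; deciding which of them is the right one is still left to the reader.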

This may be a disappointing prognosis if you were hoping the Web could tell you the answers without you having to read the articles, but think of it as a division of labour: the computers can get better at categorising and filtering the content, and we can carry on being good at understanding it.

3 thoughts on “The Non-Semantic Web: A blog entry is not a database”

  1. Steve Power says:

    Very disappointed that your link to foldoc was not its former address of foldoc.doc.ic.ac.uk

    I still think that was the best website address to have ever existed.

  2. Rowan says:

    Haha, yes, I’d forgotten that, brings back memories… Wouldn’t fit the current craze for Twitter-friendly domain hacks, would it?

    I did make the effort to link somewhere other than the all-conquering Wikipedia, though!

  3. Jorja says:

    Whoa, things just got a whole lot easier.
