{"id":84,"date":"2010-01-31T20:58:38","date_gmt":"2010-01-31T20:58:38","guid":{"rendered":"https:\/\/rwec.co.uk\/blog\/?p=84"},"modified":"2010-01-31T20:59:10","modified_gmt":"2010-01-31T20:59:10","slug":"a-blog-entry-is-not-a-database","status":"publish","type":"post","link":"https:\/\/rwec.co.uk\/blog\/2010\/01\/a-blog-entry-is-not-a-database\/","title":{"rendered":"The Non-Semantic Web: A blog entry is not a database"},"content":{"rendered":"<p>There has long been a hope &#8211; an <em>expectation<\/em>, even &#8211; that the Web will somehow develop into something &#8220;smart&#8221;; that it will move from being a mere store-house of information to something that will actually &#8220;know the answers&#8221;.<\/p>\n<p>But the vision tends to overlook the nature of both computers and humans. On the one hand, humans have a limited memory, and a flawed ability to  apply consistent logic; on the other hand, we have abilities at creatively interpreting knowledge and ideas that are far beyond the capacity  of any computer so far designed.<\/p>\n<p><!--more--><\/p>\n<h2>Big Calculators<\/h2>\n<p>Artificial Intelligence is a fascinating area of research &#8211; it&#8217;s what I studied at University &#8211; but it is one that has had something of a reality check over the years. At first, it seemed like only a matter of time before computers of some sort or another would exceed all human abilities, ushering in a technological utopia &#8211; or perhaps an apocalypse. But it turns out there are things computers are <em>really good at<\/em> and things they&#8217;re <em>really bad at<\/em>.<\/p>\n<p>The classic example is chess computers: when researchers started building them, the reasoning was that since it takes an intelligent human to play chess, a computer that could play chess would be intelligent too. 
Eventually, they built a computer that could consistently beat the best human chess players; unfortunately, they discovered that that&#8217;s <em>all<\/em> they&#8217;d built &#8211; they hadn&#8217;t broken through to a new level of computing, they&#8217;d just broken down chess into a very long list of sums. And <strong>computers are good at sums<\/strong>.<\/p>\n<p>Another example is neural networks &#8211; the wonderfully futuristic-sounding &#8220;Multi-Layer Perceptron&#8221; is a learning system based on simplified mathematical models of neurons, the building blocks of the human brain. You can &#8220;teach&#8221; such a network to recognise things, and it starts seeming very clever. But researchers realised that you don&#8217;t really need the clever model at all, because it&#8217;s all just maths &#8211; and indeed you can make a so-called &#8220;N-tuple Neural Network&#8221; that consists entirely of lookup tables, with no real maths involved, and certainly no &#8220;brain&#8221;. And while there are now very handy machines that can recognise faces, it&#8217;s still <strong>no use trying to have a conversation with one<\/strong>.<\/p>\n<h2>Whose Markup is it Anyway?<\/h2>\n<p>The &#8220;Semantic Web&#8221;, at the moment, is like that early AI research &#8211; it all seems just around the corner, if only we all try a little bit harder. In fact, the Web has already come a long way, and while the basic language of HTML remains the same, its usage has changed dramatically over the years.<\/p>\n<p>At first, <em>HyperText Markup Language<\/em> was a very simple way of &#8220;marking up&#8221; documents in the big text repository that was the World Wide Web &#8211; separating out paragraphs, marking headlines and important bits, and, of course, turning text into &#8220;anchors&#8221; which &#8220;hyperlinked&#8221; documents together. 
As it grew, the Web demanded more and prettier pages, and HTML grew into a rich, <em>presentational<\/em> language &#8211; with colours, tables, images, even simple animations. Then, the need to automate complex and interactive systems meant HTML had to become a <em>structural<\/em> language as well &#8211; one that could identify blocks of a page, take them apart, and join them back together in a different order.<\/p>\n<p>At this stage, a question arises, which is just <strong>how much structure does a document need?<\/strong> Two technologies readily associated with <em>structural<\/em> HTML are XHTML &#8211; HTML with the added rigidity and tools of XML &#8211; and the <abbr title=\"Document Object Model\">DOM<\/abbr> &#8211; a way of looking at a document as a big tree of &#8220;nodes&#8221;. This is an extremely useful way of looking at a page with lots going on &#8211; you can work dynamically with the navigation controls, or the interactive comment system, without touching the main content area. But how useful is it <em>inside<\/em> the content? In the sentence &#8220;I <em>really<\/em> like structure&#8221;, the word &#8220;really&#8221; is wrapped in its own HTML element, but not because of any <em>structural significance<\/em> &#8211; it&#8217;s just there for presentation.<\/p>\n<p>Now, we&#8217;re told, HTML needs to evolve further, into a <em>semantic<\/em> markup, which doesn&#8217;t just separate out the blocks, it labels them all <em>meaningfully<\/em>. I&#8217;ve used italics as an example deliberately, because a common misconception is that the &lt;i&gt; tag is bad &#8211; it&#8217;s <em>presentational<\/em> &#8211; and &lt;em&gt; is better &#8211; it means <em>emphasise this<\/em>, rather than <em>draw this in italics<\/em>: much more <em>semantic<\/em>, surely? 
And so the Rich-Text control I&#8217;m typing in right now has a button labelled &#8220;I&#8221; &#8211; the recognised icon for italics &#8211; which inserts an &lt;em&gt; tag; it even replaces &lt;i&gt; with &lt;em&gt; if you edit the HTML! This is, obviously, a complete misunderstanding &#8211; <strong>a straight find-and-replace can&#8217;t magically gain us meaning<\/strong> &#8211; but the fact is, it&#8217;s what <em>people<\/em> are used to, and it&#8217;s <em>people<\/em> that write the content.<\/p>\n<p>So this is the second conflict facing the Web &#8211; authors don&#8217;t want to add all this information. The real reason &lt;i&gt; is discouraged is that when it&#8217;s <em>not<\/em> being used for emphasis, the machine has no way of knowing why it <em>is<\/em> being used: it might be to indicate a foreign phrase, like <em>in situ<\/em>, in which case we could indicate what language it is; or it might be the name of a publication, like <em>The Times<\/em>, in which case we could label it with some appropriate global identifier. But the fact is, a human reader won&#8217;t gain anything from this, and <strong>content is written by humans, for humans<\/strong>.<\/p>\n<h2>Playing to our Strengths<\/h2>\n<p>In the end, the Semantic Web will not be built by insisting that I tell WordPress that <em>in situ<\/em> is Latin &#8211; who would it help if I did? A blog entry is not, and never will be, a database &#8211; it&#8217;s a big block of text, or perhaps a few medium-sized blocks of text, with some decoration and links thrown in. 
The <em>structural<\/em> markup can help us divide up those blocks, and mark off the bits of the page which aren&#8217;t part of the blog entry at all; the <em>semantic<\/em> markup will improve how those <em>blocks<\/em> get labelled.<\/p>\n<p>If you want to find pages about a particular topic, what you need is a really big index, and that&#8217;s a good job to give a computer; if you want to find &#8220;the right page&#8221; about a particular topic, it&#8217;s up to you to decide what &#8220;right&#8221; means. We can help the computers to help us &#8211; by making articles that explicitly label their own topics, for instance (as long as you can trust them, but that&#8217;s another issue). We can even decide that sometimes we do want to embed little bits of <em>data<\/em> in our <em>content<\/em> &#8211; if we know an address might be useful to someone, we could label it with an appropriate <a href=\"http:\/\/microformats.org\/\">microformat<\/a>, so they can more easily feed it to another piece of software. But if they want to know what <a href=\"http:\/\/en.wiktionary.org\/wiki\/in_situ\"><em>in situ<\/em><\/a> or <a href=\"http:\/\/foldoc.org\/perceptron\">Perceptron<\/a> actually means, we can leave it to a human to look it up in an appropriate index.<\/p>\n<p>This may be a disappointing prognosis if you were hoping the Web could tell you the answers without you having to read the articles, but think of it as a division of labour: the computers can get better at <em>categorising<\/em> and <em>filtering<\/em> the content, while we carry on being good at <em>understanding<\/em> it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There has long been a hope &#8211; an expectation, even &#8211; that the Web will somehow develop into something &#8220;smart&#8221;; that it will move from being a mere store-house of information to something that will actually &#8220;know the answers&#8221;. 
But the vision tends to overlook the nature of both computers and humans. On the one [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[58,59,61,60,62,56,55,57],"class_list":["post-84","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-ai","tag-artificial-intelligence","tag-html","tag-language","tag-markup","tag-semantic","tag-semantic-web","tag-web","post-preview"],"_links":{"self":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts\/84","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/comments?post=84"}],"version-history":[{"count":6,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts\/84\/revisions"}],"predecessor-version":[{"id":90,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts\/84\/revisions\/90"}],"wp:attachment":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/media?parent=84"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/categories?post=84"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/tags?post=84"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}