Archive for the ‘Computers’ Category

Tagging and the Semantic Web

May 20, 2008

A while back I commented on a TechCrunch article quoting my CEO regarding keyword searches in the Semantic Web space. My comment was later quoted on the Faviki blog, a Semantic Web startup focused on tagging web pages with semantic Wikipedia data. Finding this prompted me to start writing more about the Semantic Web on my own blog. This is actually the first time I have ever posted about someone else’s post.

(The following is based on a presentation I gave on the subject in September of 2007.)

Tags as they are implemented today

The way the better Web 2.0 sites implement tags involves faceting. I have discussed this in a previous blog post regarding faceting with Lucene and SOLR, but in a nutshell, it allows you to group together documents or objects based on attributes. For example: give me all documents about ‘George Bush’ and ‘Washington’. (A rough query sketch appears after the list below.) The problem with these attributes is that they have little or no value on their own, and they are certainly not understood by computers. They are just strings denoting some type of concept. Here is a short list of limitations which I feel the Semantic Web will address:


- Tags do not provide enough metadata to make meaningful comparisons
- More information is needed about a tag than just its origin
- Tags are essentially a full text search mechanism, although faceting helps
- More relationships are needed between tags and the objects they pertain to
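
To make the current model concrete, here is a rough sketch of that kind of faceted tag query written against the SolrJ client. The core URL, the field name ‘tag’, and the client setup are assumptions for illustration, not details from our actual system:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TagFacetExample {
    public static void main( String[] args ) throws Exception {
        // Hypothetical Solr core with a multi-valued string field named "tag"
        HttpSolrClient solr = new HttpSolrClient.Builder( "http://localhost:8983/solr/docs" ).build();

        SolrQuery query = new SolrQuery( "*:*" );
        query.addFilterQuery( "tag:\"George Bush\"", "tag:Washington" ); // intersect two tags
        query.setFacet( true );
        query.addFacetField( "tag" );                                    // count co-occurring tags

        QueryResponse rsp = solr.query( query );
        System.out.println( rsp.getResults().getNumFound() + " matching documents" );
    }
}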

The solution: tags as objects

By allowing users to tag an object with another object, we can make extremely interesting comparisons; discerning a lot more information about the original object becomes simple and accurate. With this type of interrelationship we can pivot through the data like never before, not with full text search but with object graph linkages that machines and humans can understand. Let’s go over an example.


Let’s say a user adds a note into our system ranting about a beet farmer who lives in Washington state by the name of William Gates. The user goes on to discuss his beets and farming techniques in great detail, mentioning nothing about software and Windows Vista, of course. In the current Internet model the user would tag this note with strings like ‘William Gates’, ‘Bill Gates’, ‘beets’, etc.


Now another user comes along and starts digging through documents tagged ‘Bill Gates’ to try to find new articles about Vista. Unfortunately, many searches will turn up bad results, especially if the density of the phrase ‘Bill Gates’ is great enough in the document about beets. That being said, the other direction would work more as intended: searching on the tags ‘Bill Gates’ and ‘beets’ would yield more of the expected results.


In the Semantic Web model, the document about William Gates the beet farmer would be tagged with the William Gates object, which could contain a plethora of metadata: his location, occupation, etc. Now when we look at this document there is no guessing as to what it is referring to, especially from a machine’s point of view. This is exactly what the Semantic Web was built for. In this model we are not relying on linguistics, natural language processing, or full text search. We are relying on hard links that machines can understand and relate to.
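
As a rough illustration of what tagging with an object could look like under the hood, here is a minimal sketch using the Apache Jena RDF API. The library choice, the URIs, and the property names are all assumptions made for the example; the post does not prescribe a particular triple store or vocabulary:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TagAsObjectExample {
    public static void main( String[] args ) {
        String ns = "http://example.com/ns#";   // hypothetical namespace
        Model model = ModelFactory.createDefaultModel();

        Property name       = model.createProperty( ns, "name" );
        Property occupation = model.createProperty( ns, "occupation" );
        Property location   = model.createProperty( ns, "location" );
        Property taggedWith = model.createProperty( ns, "taggedWith" );

        // The tag is a first-class object carrying its own metadata...
        Resource gates = model.createResource( "http://example.com/person/william-gates-the-farmer" )
                .addProperty( name, "William Gates" )
                .addProperty( occupation, "Beet farmer" )
                .addProperty( location, "Washington state" );

        // ...and the note simply links to it, so machines can follow the graph
        Resource note = model.createResource( "http://example.com/note/42" )
                .addProperty( taggedWith, gates );

        model.write( System.out, "TURTLE" );
    }
}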

The disambiguation page (formerly the tag page in Web 2.0)

What about regular string tags? The Semantic Web can’t possibly understand everything?! The fact of the matter is, that’s true. We still support regular string tagging. Some things are not proper nouns and are less concrete, like adjectives and verbs. They may not yet deserve their own object; however, let’s think about actual language here for a second: the semantics behind how we describe things.


Take the adjective ‘cool’. Well, first of all, what are you looking for? Nouns? A grouping of multiple nouns? Probably ‘cool’ nouns. A search on this tag could turn up anything and everything from many different levels. It could start by pulling in a definition from Wikipedia. Then it could group together a list of groups tagged ‘cool’, like the ‘Super Cars’ group or the ‘Fast Cars’ group. It would also show you users tagged ‘cool’ and documents tagged ‘cool’. But where it becomes really interesting is when you find the ‘cool’ string tag on a tag object! Now you can find proper noun tags like ‘Ferrari’ as well as ‘Super Cars’ the proper noun.


Joining these tags together in a search would yield detailed results from rich metadata, like a list of Ferraris over the years represented as objects. Each car object would contain detailed specs on engine type, weight, horsepower, etc. Then by examining the ‘Ferrari Enzo’ object we can find all the people who used this tag on their bookmarks, links, documents, or other objects they created. With this information you can connect with these people, join their groups, and further your search for whatever it is you are interested in. The point is, everything is related at many different levels. What links them together are the adjectives and verbs that describe them.

Conclusions

Being able to come at your data from every angle is important. Everyone thinks differently and everyone searches differently. The truth is, I think it’s going to be a while before machines really understand what we humans are talking about. It’s up to us to help organize data in a machine-readable format so the machines can share it, and in return it allows us to perform incredible searches like never before.


There are many common misconceptions around the Semantic Web. It is a very broad term with many facets. We are going after what I feel is an attainable portion of this idea. Our platform may not fully comprehend and reason about what the term ‘cool’ means the way a human does, but you are a human. You, the user, understand what this term means and just how to use it. If not, our platform will definitely try to help.

Maven2, my first impressions

May 6, 2008

Maven has been a far departure from the usual Ant builds that I have become accustomed to. Now, although Ant doesn’t handle dependency management like Maven does, I have used Savant in conjunction with it, which worked quite well. The constructs weren’t complicated; it was getting accustomed to Maven that required me to forget everything I know about Ant. Oh yeah, and a few days of pounding my head.


After the initial mental exercise I really got to liking it. It allowed me to easily handle multi-module projects and all their dependencies. It also has great plugin support for anything you can think of, one of my favorites being Jetty (developing locally with Tomcat is for suckers!). I also really like the pom.xml settings as well as profiles to handle different environments. Anyway, next time I start a project from scratch I will probably take another run at Maven.
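
For reference, a minimal sketch of the two pieces I mention above: the Jetty plugin and an environment profile. This is only a pom.xml fragment, and the plugin coordinates, profile id, and property are illustrative assumptions rather than settings from an actual project:

<build>
  <plugins>
    <!-- `mvn jetty:run` serves the webapp in place instead of deploying to Tomcat -->
    <plugin>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>maven-jetty-plugin</artifactId>
    </plugin>
  </plugins>
</build>

<profiles>
  <!-- `mvn -P dev ...` pulls in environment-specific properties -->
  <profile>
    <id>dev</id>
    <properties>
      <db.url>jdbc:mysql://localhost/dev</db.url>
    </properties>
  </profile>
</profiles>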

PipedInputStream and PipedOutputStream with Java

April 10, 2008

Today I came across an interesting concurrency problem while deleting objects from the Social Graph (Semantic Web, remember?). I have been tasked with mass deletes throughout our system, including exporting the objects in case they ever need to be reassembled again. Since our graph is so large, and we could potentially be deleting tens of thousands of triples at a time, the serialized XML would be about ten times that many lines in the resulting file (roughly ten lines per triple). In order to write the output to a file as fast as possible there was no need to hold the serialized XML in memory. The best thing to do was to pipe the output stream straight to our binary store.


Now, in order to do this, I need two threads: one to write and one to read. If you were to do this with one thread you would most likely run into a nasty deadlock situation. Anyway, here’s what I came up with:


public InputStream openStream() throws IOException {
    final PipedOutputStream out = new PipedOutputStream();

    Runnable exporter = new Runnable() {
        public void run() {
            tupleTransformer.asXML( tuples, out );
            IOUtils.closeQuietly( out );
        }
    };

    executor.submit( exporter );

    return new PipedInputStream( out );
}


Can anyone see the problem? Well, unfortunately I couldn’t either for over an hour. My unit tests would pass sometimes and fail other times, which led me to believe I was dealing with a timing issue. It turns out that sometimes the PipedOutputStream had already been written to and closed before the PipedInputStream was even instantiated, so the reader completely missed the stream, right up to the out.close(). The trick was to instantiate the two streams, in and out, at the same time and then start the output on another thread. Problem solved. Here is what the finished product looks like:


public InputStream openStream() throws IOException {
    // Create and connect both ends of the pipe before any writing can start
    final PipedOutputStream out = new PipedOutputStream();
    PipedInputStream in = new PipedInputStream( out );

    Runnable exporter = new Runnable() {
        public void run() {
            // Serialize the triples onto the already-connected pipe...
            tupleTransformer.asXML( tuples, out );
            // ...then close it so the reader sees end-of-stream
            IOUtils.closeQuietly( out );
        }
    };

    // Write on a background thread while the caller reads from 'in'
    executor.submit( exporter );

    return in;
}

Ajax and IE 7: cache not invalidating

March 21, 2008

For the past day or two I had been struggling at work to figure out why Internet Explorer 7 would not pay attention to the response headers stating not to cache the response. First, I tried setting the date header to expire instantly, with a value of -1. QA confirmed that this solved the problem in some revisions of IE 7 but not all. After digging around the web it turns out that you have to set a few more headers, one of which I had never even heard of. Here’s what solved the problem for me. This snippet is in Java but could apply to any language:

response.setDateHeader( "Expires", -1 );
response.setHeader( "Cache-Control", "private" );
response.setHeader( "Last-Modified", new Date().toString() );
response.setHeader( "Pragma", "no-cache" );
response.setHeader( "Cache-Control", "no-store, 
                                 no-cache, must-revalidate" );


IE, ImageIO, and Microsoft’s ‘image/pjpeg’ prank

February 28, 2008

Today I came across an issue at work regarding the made-up MIME type ‘image/pjpeg’ while uploading images from IE 7. Java’s ImageIO was choking while trying to process this type of file because it is not in the list of supported MIME types. The supported MIME type list can be viewed by invoking ImageIO.getReaderMIMETypes(). After reading many blogs of others pounding their heads, making fun of Microsoft, and reaching for other solutions like ImageMagick, I found a solution: add this type to your own list of accepted types and move on. The format is made up, and the bytes are just a regular JPEG, so it will not cause any problems. Pass the input stream on to your file IO and be done with it. Microsoft owes me half an hour of my life back.
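
Here is a minimal sketch of that workaround: accept ‘image/pjpeg’ alongside the types ImageIO reports, then let ImageIO sniff the actual bytes. The class and method names are hypothetical, not from our upload code:

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.imageio.ImageIO;

public class UploadMimeTypes {

    private static final Set<String> ACCEPTED_TYPES = new HashSet<String>(
            Arrays.asList( ImageIO.getReaderMIMETypes() ) );
    static {
        // IE 7 reports progressive JPEGs as 'image/pjpeg'; the bytes are plain JPEG
        ACCEPTED_TYPES.add( "image/pjpeg" );
    }

    public static BufferedImage readUpload( String contentType, InputStream in ) throws IOException {
        if ( !ACCEPTED_TYPES.contains( contentType ) ) {
            throw new IOException( "Unsupported image type: " + contentType );
        }
        // ImageIO sniffs the actual bytes, so the bogus MIME type never matters here
        return ImageIO.read( in );
    }
}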

Letter to the editor published in the Roundel

July 23, 2007

I recently wrote a letter to the editor of Roundel magazine, published by the BMW Car Club of America, about my friend JP Cadoux at A1 Imports Autoworks, in an effort to get the word out about what he does. I am kind of surprised they actually published it, considering I wrote it in a serious but frustrated manner after reading silly articles about carbon fiber, xenon headlights, and other useless doodads. Nevertheless, here it is.


[Image: article-about-jp-in-the-rou.gif — scan of the letter as published in the Roundel]

Dynamic Proxies in Java

July 13, 2007

Up until recently I never had to write any code that dealt with reflection. Before now I worked for CNET, which is basically a series of different CMSes with one request, one response. My first week at the new job required me to build a JMS system that would accept method invocations across the wire, seamlessly for the producer. java.lang.reflect was the answer.


Once you can wrap your head around the idea it’s pretty simple to write. Essentially what we are attempting to do is shield the caller from having to know anything about the other method that is going to be invoked behind the scenes.


import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;

public class MyConcreteProxyClass implements InvocationHandler {

   private final Object obj;

   public MyConcreteProxyClass( Object obj ) {
      this.obj = obj;
   }

   public Object invoke( Object proxy, Method m, Object[] args ) throws Throwable {
      try {
         // do stuff before the real call (logging, marshalling onto JMS, etc.)
         return m.invoke( obj, args );   // delegate to the wrapped object
      } catch ( Exception e ) {
         throw e;
      }
   }
}

Now we have to create the proxy interface which the caller will use. Any method of this interface that the caller invokes will be routed through the invoke method on the MyConcreteProxyClass object.


public interface ProxyInterface{
    public Object myMethod();
}


Now it’s time to wire everything up. There are several ways to do this that help hide the proxying, like a static factory method, but this simply demonstrates how it works.


ProxyInterface proxy = (ProxyInterface) Proxy.newProxyInstance(
        obj.getClass().getClassLoader(),
        new Class[] { ProxyInterface.class },
        new MyConcreteProxyClass( obj ) );
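
For completeness, a quick usage sketch, assuming obj is some target whose class actually implements ProxyInterface (the names are just for illustration):

// Calls on the proxy are routed through MyConcreteProxyClass.invoke,
// which can log, marshal the call onto JMS, or simply delegate via m.invoke(obj, args)
Object result = proxy.myMethod();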

More press for A1 Imports

July 10, 2007

In an effort to use social media to gain more publicity for the fantastic work that JP does, we created this video and posted it on YouTube. Within the first day it received a few thousand views, and it was featured on the site for two weeks. It generated a lot of buzz on the Internet and was picked up by Jalopnik.com thanks to an anonymous tipster (myself). Here is a link to the post.

Once a URL goes live plan to maintain it forever

June 5, 2007

Yes, it may sound crazy, but once a URL goes live you must maintain it… forever! (If SEO is your thing, that is.) That isn’t to say that you shouldn’t change or remove some URLs; just make sure you create an Apache rewrite rule utilizing a 301 redirect (a quick example follows the search links below). I deal with this on a monthly basis at work and it’s not so bad as long as you track your URL movements. Rewrites are an SEO’s best friend. Here is a good way to check whether Google has any record of a given string in your URLs:


http://www.google.com/search?q=site:johnclarkemills.com+inurl:seo
http://www.google.com/search?q=site:johnclarkemills.com+inurl:2002
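
As for the rewrite rule itself, here is a minimal sketch of a 301 redirect in Apache; the paths and file locations are hypothetical, not from this site:

# In the vhost config or an .htaccess file
RewriteEngine On
RewriteRule ^old-section/(.*)$ /new-section/$1 [R=301,L]

# For a simple one-to-one move, mod_alias works just as well
Redirect 301 /2002/old-post.html /archive/old-post/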

Fizz Buzz: Software engineering interview question

May 31, 2007

Having never heard of this question before, it came as a bit of a shock. It’s so simple that it seems almost like a trick, but it really isn’t. I actually do like the question for that reason. Sometimes in the real world solutions really are that simple, and it’s a good trait to be able to recognize it. I answered the question well (or so I think), but I didn’t pick up on the simple refactor. If I had been coding in front of a computer I would have seen it in about five minutes, but it’s different when you’re at a whiteboard in front of two senior architects. So here’s the question:


Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.


So here’s what I came up with (in pseudo code):


for ( int i = 1 ; i <= 100; i ++ ) {
   if (i % 15 == 0 ) print "FizzBuzz";
   elseif (i % 3 == 0 ) print "Fizz";
   elseif (i % 5 == 0 ) print "Buzz";
   else print i;
}


So? Do you see it? Do you see the 'refactor', if you can call it that (ignoring the ugly switch-style chain of ifs, design patterns, or thread safety for that matter)? Although the change is small, it does make a difference. Keep in mind this is pseudo code, but they wanted me to see that "Fizz" and "Buzz" can be printed from two separate 'if' blocks.


for ( int i = 1 ; i <= 100; i++ ) {
   if (i % 3 == 0 ) print "Fizz";
   if (i % 5 == 0 ) print "Buzz";

   if (i % 3 != 0 && i % 5 != 0 ) print i;
}
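
For the curious, here is the same refactor written out as runnable Java rather than pseudo code, building the line with two independent 'if' blocks and falling back to the number when neither matched (whether the interviewers had exactly this in mind is my own assumption):

public class FizzBuzz {
    public static void main( String[] args ) {
        for ( int i = 1; i <= 100; i++ ) {
            String line = "";
            if ( i % 3 == 0 ) line += "Fizz";
            if ( i % 5 == 0 ) line += "Buzz";
            // Neither rule matched, so print the number itself
            System.out.println( line.isEmpty() ? String.valueOf( i ) : line );
        }
    }
}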
