Tag Archives: hyperlinks

Preview: AutoHyperlinks Plugin For Coda 1.6

The guys at Panic just released an update to Coda which, among other very useful features, includes a new plugin API.

I’ve hacked together a quick plugin around the AutoHyperlinks framework that will find raw URIs and email addresses and link them. URIs in existing markup are skipped, mostly.

get the plugin | get the source

Quality is still a little rough, but it handles simple markup well (and you can always run it on just a selection of text).

I’m releasing the plugin under a 3-clause BSD license. More to come, I hope.

Enjoy.

AutoHyperlinks.framework Has Gone BSD

Check it.

It was a lot of work getting everything in place in order to do that, but it feels good.  I love you, Mac developers.

The AIHyperlinks Framework (or, How Adium Finds Links)

I thought it was about time I wrote up a little something about the way Adium finds hyperlinks in message text. It’s all done inside a nice little OS X framework: the AIHyperlinks Framework.

The framework contains two main parts: a scanner class (SHHyperlinkScanner) and the link validation lexer (SHLinkLexer). A data type for detected hyperlinks (SHMarkedHyperlink) is also included, but is rarely used outside the framework.

The framework was designed to depend only on system-provided frameworks, so that it can be adopted easily in other Cocoa-based projects; it is in no way bound to the Adium code base. This lets other components use it, and allows it to be integrated at a lower level of the application source, or even in another application entirely.

A Brief History:

Before version 0.53 (way back in 2004), all the autolinking code was contained within one component of the Adium source: AIAutoLinkingPlugin. The class did the basics, though extending it was a nightmare. It relied on NSScanner to do the heavy lifting of recognizing the links, and while this is fine if all you’re looking for is “things that end in .com”, links are rarely that nice and neat, so we ended up with code like this:

AIAutoLinkingPlugin.m
    //Recognized URL types
    static int  linkSubStringCount = 8;
    static NSString *linkSubString[] = { //If any of these are found, the string is scanned in detail using the keys below
        // You can find a current list of gTLD's at http://www.icann.org/tlds/
        // You can find a full listing of TLD's at http://www.norid.no/domenenavnbaser/domreg.html
        // This list only includes the gTLD's and some of the more popular TLD's
        @"://", @"www.", @"@",
        @".com", @".edu", @".gov", @".net", @".org", @".us", @".co.uk", @".org.uk", @".museum", @".aero", @".biz", @".coop", @".info", @".mil", @".com.ar", @".pro", @".com.jp"};
    static int  linkDetailStringCount = 13;
    static NSString *linkDetailString[] = { //Anything matching these keys is linked
        @"*://*", @"www.*.*", @"*@*.*",
        @"*.com", @"*.edu", @"*.gov", @"*.net", @"*.org", @"*.us", @"*.co.uk", @"*.org.uk", @"*.museum", @"*.aero", @"*.biz", @"*.coop", @"*.info", @"*.int", @"*.mil", @"*.pro", @"*.com.jp", @"*.com.ar",
        @"*.com/*", @"*.edu/*", @"*.gov/*", @"*.net/*", @"*.org/*", @"*.us/*", @"*.co.uk/*", @"*.org.uk/*", @"*.museum/*", @"*.aero/*", @"*.biz/*", @"*.coop/*", @"*.info/*", @"*.int/*", @"*.mil/*", @"*.pro/*", @"*.com.jp/*", @"*.com.ar/*"};

Note that we needed to declare each TLD twice, just in case there was a path after the domain name. It works, but it’s hardly pretty, and I wouldn’t want to have to add a whole new URI scheme to it. Also, NSScanner has no concept of wildcards, let alone regular expressions. We had to handle all those wildcards above by ourselves. Like here:

AIAutoLinkingPlugin.m
    //Get template up to the next *
    wildRange = [template rangeOfString:@"*"];
    if(wildRange.location != NSNotFound){
        templateSegment = [template substringToIndex:wildRange.location];
        templateIndex = wildRange.location;

        //Scan that string from the suspected URL.  If not found, this URL is invalid.
        if(![urlScanner scanString:templateSegment intoString:nil]){
            URLIsValid = NO; //Didn't find first segment
        }
    }

Here:

AIAutoLinkingPlugin.m
    //Scan the template string after *, up to next * or end
    templateIndex += 1;
    wildRange = [template rangeOfString:@"*"
                                options:0
                                  range:NSMakeRange(templateIndex, [template length] - templateIndex)];

    if(wildRange.location != NSNotFound){
        templateSegment = [template substringWithRange:NSMakeRange(templateIndex, wildRange.location - templateIndex)];
        templateIndex = wildRange.location;
    }else{
        templateSegment = [template substringFromIndex:templateIndex];
        templateIndex = [template length];
    }

and here:

AIAutoLinkingPlugin.m
    //One final check.  Our URL must be complete at the right location (If the template doesn't end with a *)
    if([template characterAtIndex:[template length]-1] != '*'
       && [urlScanner scanLocation] != [urlString length]){
        URLIsValid = NO; //Didn't end
    }

Ouch.
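For contrast, all three snippets together amount to a small wildcard matcher. Here is a minimal sketch of one in plain C rather than the original Objective-C. The function name template_matches is made up for illustration; nothing by that name exists in Adium, and the original code is structured quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not Adium code: returns true when `text` matches
 * `pattern`, where '*' in the pattern stands for any run of characters
 * (including none) and every other character must match literally. */
bool template_matches(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    const char *star = NULL;   /* position of the most recent '*' in pattern */
    const char *retry = NULL;  /* where in text that '*' started matching */

    while (*text != '\0') {
        if (*pattern == '*') {
            /* Remember the wildcard and first try matching it to nothing. */
            star = pattern++;
            retry = text;
        } else if (*pattern == *text) {
            /* Literal characters agree: advance both. */
            pattern++;
            text++;
        } else if (star != NULL) {
            /* Mismatch: let the last '*' swallow one more character. */
            pattern = star + 1;
            text = ++retry;
        } else {
            return false;
        }
    }

    /* Text is used up; only trailing wildcards may remain in the pattern. */
    while (*pattern == '*') {
        pattern++;
    }
    return *pattern == '\0';
}
```

With this, template_matches("*.com/*", "example.com/path") is true and template_matches("*.com", "example.com/path") is false, which is exactly why every TLD had to be listed twice.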

A Replacement:

The replacement for AIAutoLinkingPlugin would have to be much easier to maintain than the existing incarnation while, at the same time, maintaining or improving on its performance.

Overview

The way the AIHyperlinks Framework works is deceptively simple: the SHHyperlinkScanner class takes a string and gobbles it up to the next bit of whitespace. Whatever it takes in up to that point is scanned for linkness and rejected as soon as it can’t be a match; whatever survives is made a link in its attributed string.

One of the nice things about URIs is that the rules to match them don’t (or shouldn’t) change very often. This lends itself very well to the kind of pre-compiled finite state machine that a tool like flex can provide us. Also, flex knows regular expressions, giving us more flexibility with the matching rules and allowing us to reduce the complexity of the scanner class. However, while flex is capable of detecting whitespace and other stop characters, it’s much easier to use a simpler scanner, like NSScanner, to split the string up in a more manageable, and customizable, way.

So, the framework has two main parts inside of it: the flex validation code (SHLinkLexer), and the tokenizing class using NSScanner (SHHyperlinkScanner).
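To make that division of labour concrete, here is a minimal sketch of the split-then-validate idea in plain C. Everything in it is illustrative: find_first_link and looks_like_link are hypothetical names, and the validator is a crude stand-in for the flex-generated lexer (it only checks for “://”).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the flex-generated validator: the real one is far stricter.
 * This hypothetical version only accepts tokens that contain "://". */
static bool looks_like_link(const char *token, size_t len)
{
    for (size_t i = 0; i + 2 < len; i++) {
        if (token[i] == ':' && token[i + 1] == '/' && token[i + 2] == '/') {
            return true;
        }
    }
    return false;
}

/* Walks `text` one whitespace-delimited token at a time and reports the
 * first token the validator accepts. On success, stores the token's offset
 * and length and returns true; otherwise returns false. */
bool find_first_link(const char *text, size_t *offset, size_t *length)
{
    assert(text != NULL && offset != NULL && length != NULL);

    size_t i = 0;
    while (text[i] != '\0') {
        /* Skip the whitespace between tokens. */
        while (text[i] != '\0' && isspace((unsigned char)text[i])) {
            i++;
        }
        /* Gobble up to the next bit of whitespace. */
        size_t start = i;
        while (text[i] != '\0' && !isspace((unsigned char)text[i])) {
            i++;
        }
        if (i > start && looks_like_link(text + start, i - start)) {
            *offset = start;
            *length = i - start;
            return true;
        }
    }
    return false;
}
```

The tokenizer knows nothing about URIs and the validator knows nothing about whitespace, so either half can be swapped out without touching the other.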

SHLinkLexer

Flex actually tidies things up a lot, and recognizing a number of different URL formats becomes a relatively simple task:

SHLinkLexer.l
    urlSpecifier    ([[:alnum:]\x80-\xf4-]+\.)+{domains}(:[0-9]+)?(\/.*)?
    ipURL           ([0-9]{1,3}\.){3}[0-9][0-9]?[0-9]?(:[0-9]+)?(\/.*)?
    singleDomain    [[:alnum:]\x80-\xf4-]+
    mailSpecifier   [^:\/]+\@.+\.{domains}
    jabberSpec      xmpp:.*\@.+\.{domains}(\/.*)?
    aolIMSpec       aim:goim\?screenname=[^\ \t\n&]+(&message=.+)?
    aolChatSpec     aim:gochat\?roomname=[^\ \t\n&]+
    yahooIMSpec     ymsgr:sendim\?.+
    rdarSpec        rdar:\/\/(problems?\/)?[0-9]+(&[0-9]+)*
    spotifySpec     spotify:(track|album|artist|search|playlist|user|radio):[^<>]+

By “simple,” of course, I mean “simple… if you’re comfortable with regular expressions.” But, here we’ve managed to define more match conditions, in less ugly code, than the previous method.

The rules of the link scanner themselves are fairly simple. For example, to detect a link to start an AIM chat, something we couldn’t easily do in AIAutoLinkingPlugin, we just need one line:

SHLinkLexer.l
    {aolChatSpec}            {SHStringOffset += SHleng; return SH_URL_VALID;}

If the aolChatSpec (defined above) is found, then SHStringOffset is advanced (to prevent re-scanning those characters) and a value is returned to indicate the link is valid.

A successful match returns one of two kinds of status: a “valid” status indicates the URI fully matches whatever specification is called for, and a “degenerate” status means we know it’s a link, but it’s not fully proper (usually an email address without the “mailto:” or a URL without the “http://”). More complex rules than the one above may be formed as well, but there’s plenty of good reading available for that.
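The valid/degenerate distinction can be shown with a toy classifier. This is only a sketch in plain C: the LINK_* names and classify_token are invented for illustration (the framework’s own constants are the SH_* values), and the checks are rough string tests, not the flex rules.

```c
#include <assert.h>
#include <string.h>

/* Illustrative status codes; the framework's own constants
 * (SH_URL_VALID and friends) live in its headers. */
typedef enum {
    LINK_INVALID,
    LINK_VALID,             /* complete URI, scheme included */
    LINK_DEGENERATE,        /* a URL missing its "http://" */
    LINK_MAILTO_DEGENERATE  /* an email address missing its "mailto:" */
} link_status;

/* Toy classifier, assuming a token already stripped of whitespace.
 * The real rules are regular expressions compiled by flex. */
link_status classify_token(const char *token)
{
    assert(token != NULL);

    /* A non-empty scheme followed by "://" counts as fully valid. */
    const char *scheme_end = strstr(token, "://");
    if (scheme_end != NULL && scheme_end != token) {
        return LINK_VALID;
    }
    if (strncmp(token, "mailto:", 7) == 0 && token[7] != '\0') {
        return LINK_VALID;
    }

    /* something@host.tld without a scheme: a degenerate mail link. */
    const char *at = strchr(token, '@');
    if (at != NULL && at != token && strchr(at + 1, '.') != NULL) {
        return LINK_MAILTO_DEGENERATE;
    }
    /* www.something without a scheme: a degenerate URL. */
    if (strncmp(token, "www.", 4) == 0 && token[4] != '\0') {
        return LINK_DEGENERATE;
    }
    return LINK_INVALID;
}
```

Here “me@example.org” comes back as LINK_MAILTO_DEGENERATE and “www.example.com” as LINK_DEGENERATE, so the caller knows which scheme still has to be added.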

SHHyperlinkScanner

So, we know how to find out if a given string is a valid URI or not; but how do we find all the URIs in any given string (if any exist at all)? Well, we use NSScanner, but in a much more sensible way.

Rather than scanning for patterns, like the AIAutoLinkingPlugin of old, we only care about certain sets of characters. We have a skip (or stop) set of characters we want to keep from ever being validated, and start and end sets of characters we only want to exclude at the immediate beginning and ending of a string, respectively. This lets us properly find links in strings like “<http://example.com/>” as well as keep sentence punctuation, quotation marks, and the like out of the link.

The heavy lifting of this part looks like the following:

SHHyperlinkScanner.m
    while([preScanner scanUpToCharactersFromSet:skipSet intoString:&scanString]) {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

        unsigned int localStringLen = [scanString length];
        unsigned int finalStringLen;

So far, we’ve started a loop that will scan all the characters from its current point up to the next whitespace into the scanString.

SHHyperlinkScanner.m
        while (localStringLen > 2 && [startSet characterIsMember:[scanString characterAtIndex:0]]) {
            scanString = [scanString substringFromIndex:1];
            localStringLen--;
        }

        finalStringLen = localStringLen;

        while (finalStringLen > 2 && [endSet characterIsMember:[scanString characterAtIndex:finalStringLen - 1]]) {
            scanString = [scanString substringToIndex:finalStringLen - 1];
            finalStringLen--;
        }

        SHStringOffset = [preScanner scanLocation] - finalStringLen;

Now, we’ve adjusted the scanString for leading and trailing punctuation. Currently, we exclude the following characters in this step: “‘-,:;><()[]{}.?!
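The same trimming step can be sketched in plain C on a character buffer. This is an approximation under stated assumptions: trim_token and TRIM_SET are hypothetical names, one ASCII set is used for both ends (the framework keeps separate start and end sets), and curly quotes are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters stripped from the start and end of a token. Assumption: one
 * shared ASCII set for both ends; the framework keeps separate start and
 * end sets. */
static const char TRIM_SET[] = "\"'-,:;><()[]{}.?!";

/* Narrows the token `s[0..len)` so that it neither starts nor ends with a
 * character from TRIM_SET. Like the loops above, it stops once only two
 * characters remain. `s` must hold at least `len` non-NUL characters.
 * Writes the new start offset and returns the new length. */
size_t trim_token(const char *s, size_t len, size_t *start)
{
    assert(s != NULL && start != NULL);

    size_t begin = 0;
    /* Drop leading punctuation. */
    while (len > 2 && strchr(TRIM_SET, s[begin]) != NULL) {
        begin++;
        len--;
    }
    /* Drop trailing punctuation. */
    while (len > 2 && strchr(TRIM_SET, s[begin + len - 1]) != NULL) {
        len--;
    }
    *start = begin;
    return len;
}
```

Given “<http://example.com/>”, this leaves the 19 characters starting at offset 1, which is the bare URL without the angle brackets.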

SHHyperlinkScanner.m
        // if we have a valid URL then save the scanned string, and make a SHMarkedHyperlink out of it.
        // this way, we can preserve things like the matched string (to be converted to a NSURL),
        // parent string, its validation status (valid, file, degenerate, etc), and its range in the parent string
        if((finalStringLen > 0) && [self isStringValidURL:scanString]){
            SHMarkedHyperlink *markedLink;
            NSRange    urlRange;

            urlRange = NSMakeRange([preScanner scanLocation] - localStringLen, finalStringLen);

That’s where we check if it’s a valid URI. Right there, in that “[self isStringValidURL:scanString]” bit. That sends it to a method which bridges the Obj-C class into flex’s generated FSA. Runs fast, returns a boolean.

Now, all that’s left is to handle the detected link properly, with some extra code that determines the kind of URL it is if it’s degenerate (http, ftp, or mailto):

SHHyperlinkScanner.m
            //insert typical specifiers if the URL is degenerate
            switch(validStatus){
                case SH_URL_DEGENERATE:
                {
                    NSString *scheme = DEFAULT_URL_SCHEME;
                    NSScanner *dotScanner = [[NSScanner alloc] initWithString:scanString];

                    NSString *firstComponent = nil;
                    [dotScanner scanUpToCharactersFromSet:hostnameComponentSeparatorSet
                                               intoString:&firstComponent];

                    if(firstComponent) {
                        NSString *hostnameScheme = [urlSchemes objectForKey:firstComponent];
                        if(hostnameScheme) scheme = hostnameScheme;
                    }

                    scanString = [scheme stringByAppendingString:scanString];

                    [dotScanner release];

                    break;
                }

                case SH_MAILTO_DEGENERATE:
                    scanString = [@"mailto:" stringByAppendingString:scanString];
                    break;
                default:
                    break;
            }
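Stripped of the Cocoa plumbing, the degenerate-URL case boils down to “look at the first hostname component, pick a scheme, prepend it.” A minimal sketch in plain C follows. Its assumptions: the post does not show what urlSchemes or hostnameComponentSeparatorSet contain, so the “ftp” entry and the “./:” separator set are guesses, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed mapping from a hostname's first component to a scheme. The
 * framework looks this up in its urlSchemes dictionary, whose contents are
 * not shown in the post; "ftp" is a guess at a typical entry. */
static const char *scheme_for_host(const char *host)
{
    size_t first_len = strcspn(host, "./:");  /* length of first component */
    if (first_len == 3 && strncmp(host, "ftp", 3) == 0) {
        return "ftp://";
    }
    return "http://";  /* stands in for DEFAULT_URL_SCHEME */
}

/* Writes `link` with a guessed scheme in front into `out`. Returns 0 on
 * success, or -1 if the buffer is too small. */
int complete_degenerate_url(const char *link, char *out, size_t out_size)
{
    assert(link != NULL && out != NULL);

    int written = snprintf(out, out_size, "%s%s", scheme_for_host(link), link);
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}
```

So “ftp.example.com/pub” becomes “ftp://ftp.example.com/pub”, while “www.example.com” falls through to the default and becomes “http://www.example.com”.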

Here, we make a “marked link” (SHMarkedHyperlink), which is a nice and convenient data object that holds a bunch of info about the link: the URI, the parent string, and where in its parent it can be found, among other things.

SHHyperlinkScanner.m
  1.             //make a marked link
  2.             markedLink = [[SHMarkedHyperlink alloc] initWithString:scanString
  3.              withValidationStatus:validStatus
  4.                parentString:inString
  5.                 andRange:urlRange];
  6.             return [markedLink autorelease];
  7.         }

All that’s left to do is clean up before continuing the loop:

SHHyperlinkScanner.m
  1.         //step location after scanning a string
  2.         location = SHStringOffset;
  3.  
  4.   [pool release];
  5.     }

This is a much more sophisticated approach than the previous scanning method; however, even with the added complexity we retain the performance while making the link detection much more flexible. The hope is that the complexity goes completely unnoticed by the end user, who just gets consistently, properly linked links in the UI.

After all, who likes to edit a link just because someone put it at the end of a sentence and then decided to punctuate?