Yet another regular expression to parse a URL

I needed a regular expression that splits a URL into its directory and file parts. Here it is:

^(https?)://([.0-9a-zA-Z-]+)(/?.*?)([^/]*)$

It captures four parts of the URL, laid out as protocol://domain/directory/fileWithParams. The query string stays attached to the file part.

It can be used in Perl this way:

my $url = "http://example.com/directory/file?parameters";
my ($proto, $domain, $dir, $file) = ($url =~ m{^(https?)://([.0-9a-zA-Z-]+)(/?.*?)([^/]*)$});
print "$proto|$domain|$dir|$file\n";

This prints: http|example.com|/directory/|file?parameters

Here is the same in C++ using Boost.Regex. The ^ and $ anchors are dropped because regex_match already requires the whole string to match:

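#include <boost/regex.hpp>
#include <iostream>
#include <string>
using namespace std;

int main() {
    string url = "http://example.com/directory/file?parameters";
    boost::regex expr("(https?)://([.0-9a-zA-Z-]+)(/?.*?)([^/]*)");
    boost::smatch match;
    if (boost::regex_match(url, match, expr)) {  // true only if the whole URL matches
        string proto = match[1], domain = match[2], dir = match[3], file = match[4];
        cout << proto << '|' << domain << '|' << dir << '|' << file << endl;
    }
}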
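
If Boost is not available, the same pattern should also work with std::regex from the C++11 standard library, which uses a compatible ECMAScript syntax by default. The following is only a sketch under that assumption: the UrlParts struct and the splitUrl function are hypothetical names introduced for illustration and are not part of the code above.

```cpp
#include <regex>
#include <string>

// Hypothetical container for the four captured parts (illustration only).
struct UrlParts {
    std::string proto, domain, dir, file;
};

// Hypothetical helper: fills `out` and returns true when `url` matches the
// pattern from this post, and returns false otherwise.
// std::regex is swapped in for boost::regex here.
bool splitUrl(const std::string& url, UrlParts& out) {
    // Same expression as above; regex_match anchors at both ends,
    // so ^ and $ are not needed.
    static const std::regex expr("(https?)://([.0-9a-zA-Z-]+)(/?.*?)([^/]*)");
    std::smatch match;
    if (!std::regex_match(url, match, expr)) {
        return false;
    }
    out.proto = match[1].str();
    out.domain = match[2].str();
    out.dir = match[3].str();
    out.file = match[4].str();
    return true;
}
```

Calling splitUrl with "http://example.com/directory/file?parameters" yields the same four values as the Perl example. A URL with no path, such as "https://example.com", still matches and leaves both the directory and the file empty, while a scheme other than http or https makes the function return false.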