URL Shorteners Explained: A Comprehensive Deep-Dive

date: May 18, 2024
slug: url-shorteners
status: Published
summary: Discover how URL shorteners transform lengthy links into sleek, manageable versions, ensuring uniqueness and preventing duplication, while supporting vast digital ecosystems.
type: SystemGuide
systemType: url-shorteners
In 1990, the introduction of the first browser with a user interface, initially named WorldWideWeb and later renamed Nexus, marked a pivotal moment in the history of the internet. While the fundamental method of accessing websites, typing a URL and pressing enter, has remained constant over the past three decades, the issues associated with URLs have persisted as well. Typographical errors can render URLs useless, and the ongoing preference for short, memorable URLs highlights an inherent design challenge.
Consider the complexity of a typical URL intended for a specific resource, such as a blog article or a page with extensive query parameters, for example something like:
https://example.com/blog/2024/05/understanding-url-shorteners?utm_source=newsletter&utm_medium=email&session=8f3a2c1d
These lengthy URLs often contain detailed descriptive attributes to manage data hierarchies, command structures, transaction paths, or session information, resulting in URLs that can stretch to hundreds of characters with complex patterns. Such URLs are not only difficult to memorize but also challenging to share, particularly in printed formats like magazines or books, or on platforms with strict character limitations, such as Twitter.
The Solution: URL Shorteners
To address these challenges, URL shorteners were developed. These services convert long, cumbersome URLs into succinct, manageable versions. They not only enhance the aesthetic appeal but also add practical value by making links easier to share and remember.
This deep-dive will explore how URL shorteners function internally and scale to accommodate millions of users. It will delve into the encoding techniques these services employ to ensure that each shortened URL is unique and does not duplicate others previously created.
As we continue, we will provide a comprehensive look into the mechanisms that allow URL shorteners to support vast online ecosystems, demonstrating their critical role in the digital landscape.

How to generate short but unique keys?

Converting a unique long URL into a short version while maintaining its uniqueness is technically challenging. Generating a concise key that also avoids collisions with existing keys adds to the complexity.
In order to craft short yet distinct keys, we have control over two parameters: the key's length and the set of characters permissible for each position.
Consider a scenario where the key is just one character long and only the digits 0-9 are permitted. This would yield a mere 10 unique keys.
Given that our objective is the shortest possible key, our latitude in key length is limited. Hence, expanding the set of allowable characters becomes imperative!
However, a caveat here is that these characters must be URL-friendly. If browsers cannot decipher the URLs generated, then the utility of our system is diminished.
Conventionally, URL encodings often adhere to one of the following schemes:
Base36 encompasses the digits 0-9 plus the entire lowercase alphabet, leading to a decent range for crafting distinctive URLs. Yet broader character sets are accessible.
Base62, for instance, incorporates all alphabetic characters, both lowercase and uppercase, in addition to the digits 0-9. Base64 goes a step further, including two additional characters; since the standard '+' and '/' are not URL-safe, the URL-safe variant substitutes '-' and '_'.
Since Base64 includes the widest range of allowed characters, it's the best choice for our encoding scheme: every extra character per position multiplies the number of possible keys, giving us the largest keyspace for a given key length.
With a key length of 6 characters we can already generate roughly 68 billion (64^6) unique keys.
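The keyspace sizes for the three encodings are easy to verify with a quick sketch:

```python
# Compare how many unique keys each encoding yields per key length.
for name, base in [("Base36", 36), ("Base62", 62), ("Base64", 64)]:
    for length in (5, 6):
        print(f"{name}, {length} chars: {base ** length:,} unique keys")
# Base64 at 6 characters: 68,719,476,736 unique keys
```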
A naive approach would be to assume that simply encoding the original long URL with Base64 would do the job. Unfortunately, Base64 encoding turns every 3 bytes of data into 4 encoded characters.
That would make the URL longer, not shorter, and the result's length would depend on the original URL, which isn't ideal either.
So we are clearly missing an in-between step: we need to translate our long URL into a smaller, fixed-length format before Base64-encoding it.

MD5 Hashing

MD5 is a hashing algorithm that can swiftly, with minimal computational power, generate a digest for any given input string.
For any original URL, it returns a fixed-length 128-bit digest, regardless of the length of the input.
But this digest is far too long for our needs, as our goal is a 6-character key. Given that each Base64 character represents 6 bits, we could use just the first 36 bits of the MD5 hash as our key:
6 Base64 characters × 6 bits per character = 36 bits
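A minimal sketch of this idea in Python, using the standard library's `hashlib` and the URL-safe Base64 variant (the function name and example URL are illustrative):

```python
import base64
import hashlib

def short_key_md5(url: str) -> str:
    """Return the first 6 URL-safe Base64 characters of the URL's MD5 digest.

    The first 6 Base64 characters encode exactly the first 36 bits of the
    128-bit digest (6 characters x 6 bits each).
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes = 128 bits
    return base64.urlsafe_b64encode(digest)[:6].decode("ascii")

key = short_key_md5("https://example.com/blog/some-very-long-article?ref=home")
print(key)  # always 6 characters, deterministic for the same input URL
```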
However, the uniqueness is questionable. The MD5 algorithm has known weaknesses, and two distinct input strings can be deliberately crafted to produce the same hash. By only using the first 36 bits, the risk of accidental collisions becomes far greater: by the birthday bound, a collision becomes likely after only about 2^18 (roughly 260,000) generated keys.
To counteract this, we could add logic around our core encoding function that checks our database to ensure the generated key hasn't been used before. But this introduces more complexity, and with millions of users, collisions and the resulting retries become increasingly frequent. This raises an alternate thought: we don't actually need to derive the long URL from the short key, so why not generate a key that meets our needs but isn't tied to the original URL at all, using any method that guarantees uniqueness?

Counter

Picture a counter that supplies values as input into a "black box" labeled "base64", which produces the short key as output.
We're aware that decimal numbers can undergo base64 encoding.
Why not employ a straightforward counter mechanism that increments every time a new short link is required, and then base64-encode this number to produce a unique key?
This method results in a unique and concise output, at least for a significant duration before we encounter exceptionally large numbers. It's highly scalable as well, since we don't need extra database queries to verify the key's uniqueness.
Yet, there's a drawback. The key lengths won't remain constant but will incrementally increase over time. This gradual lengthening undermines the fundamental value proposition of the URL shortener as time progresses. An intriguing proposition, but it's not without its flaws.
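A sketch of the counter idea, assuming a URL-safe 64-character alphabet ('-' and '_' in place of '+' and '/'); it also makes the growing-length drawback visible:

```python
# URL-safe 64-character alphabet (an assumption: '-' and '_' replace '+'/'/').
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def encode_counter(n: int) -> str:
    """Convert a non-negative counter value into its base-64 key."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 64)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

# The drawback in action: key length grows as the counter climbs.
print(len(encode_counter(63)))       # 1 character
print(len(encode_counter(64)))       # 2 characters
print(len(encode_counter(64 ** 6)))  # 7 characters
```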

Key Range

Another approach would be to generate independent keys beforehand and simply assign long URLs to them. We would generate all keys ahead of time and store them in a database until they are used. This would allow for truly unique keys and avoid collisions.
Unfortunately, in the case of 6-character keys, this would mean storing all 68 billion keys in a database, which is not a very thoughtful use of resources. So let's be smart about it and remember the core idea of the counter approach: generate keys on the fly!
What if we pre-generated keys of, let's say, 5 characters and stored them in a database? 64 to the power of five is still about 1 billion, but that's 64 times fewer keys than storing all 6-character keys.
Once a user requests a URL to be shortened, the system would retrieve a unique key from the database and add the last character.
Since Base64 offers a range of 64 unique characters to pick from, the system can create 64 unique keys from each 5-character prefix without any risk of collisions.
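A sketch of that key-range mechanic, with hypothetical pre-generated 5-character prefixes standing in for the key database:

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

class KeyGenerator:
    """Hold one pre-generated 5-character prefix at a time and hand out
    its 64 possible final characters before fetching the next prefix."""

    def __init__(self, prefixes):
        self._prefixes = iter(prefixes)  # stands in for the key database
        self._suffixes = iter(())        # exhausted until the first fetch
        self._prefix = ""

    def next_key(self) -> str:
        suffix = next(self._suffixes, None)
        if suffix is None:               # prefix used up: fetch a new one
            self._prefix = next(self._prefixes)
            self._suffixes = iter(ALPHABET)
            suffix = next(self._suffixes)
        return self._prefix + suffix

gen = KeyGenerator(["Ab3xZ", "Qm9rT"])  # hypothetical prefixes from the DB
keys = [gen.next_key() for _ in range(65)]
print(keys[0], keys[63], keys[64])      # Ab3xZA Ab3xZ_ Qm9rTA
```

Note that the database is only touched when a prefix is exhausted, once every 64 keys.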

Advantages

That way, we would be able to create unique keys of uniform length.
The option is also highly scalable and can efficiently be adapted to a growing need for unique IDs by adding additional characters to the key.
Next we talk about how to translate this concept into a viable system architecture.

System Architecture

Now, as we know that we want to use predefined keys and extend them on the fly, we should think about how to translate this idea into a viable architecture.
We definitely know that we need a database to store our pre-generated key ranges. We also have a service that interacts with this database, called the Key Generator Service. That service requests a range from the DB and updates the status of the entry to "taken".
That's the only interaction the Key Generator Service has with the database. (Don't worry about which kind of database would be best; we will get to that.) Next, we define the API the service exposes. Other services can request a key, and the key generator encapsulates all the steps required to generate one: it fetches a new key range, appends the last character from the Base64 range, and returns the result. If the service has already pulled a key range and hasn't run through all 64 final characters yet, it doesn't even need to hit the database.
Last but not least, we also need a service that accepts incoming long URLs and handles the whole creation of short URLs, including persisting them in a separate database for retrieval once someone hits the short URL.
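A toy, in-memory sketch of that second service; the domain, the key values, and the function names are all hypothetical, and a real deployment would call the Key Generator Service and persist to a database instead:

```python
def make_shortener(next_key, base_url="https://sho.rt/"):
    """Build shorten/resolve functions backed by an in-memory dict,
    which stands in for the short-URL database."""
    store = {}

    def shorten(long_url: str) -> str:
        key = next_key()         # delegated to the key generator service
        store[key] = long_url
        return base_url + key

    def resolve(key: str) -> str:
        return store[key]        # looked up on every redirect

    return shorten, resolve

keys = iter(["Ab3xZA", "Ab3xZB"])  # pretend output of the key generator
shorten, resolve = make_shortener(lambda: next(keys))
short = shorten("https://example.com/very/long/article?id=42")
print(short)                       # https://sho.rt/Ab3xZA
```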

Summary

The technical problem we solved in this deep-dive was how to create unique short URL keys from unique long ones.
The challenge is that the shorter the keys, the harder it becomes to guarantee their uniqueness while still being able to create billions of them. On top of this, the short keys can't include any characters that aren't URL-safe.
Based on these requirements, it's most reasonable to use (URL-safe) Base64 to encode the short keys.
After looking into a couple of methods to generate those keys, it became clear that it's most economical to pre-generate 5-character keys and store them in a database. Only once they are retrieved do we add a last character from the 64-character-wide Base64 range.
Last but not least, we talked through what a system architecture around our approach would look like. It's essential to highlight that we decided to use two separate databases: one for the key ranges and one for the final short URLs.