Some of My Fine Wares
You know what? I have been writing these scribbles for over a year now and not once have I discussed what actually is my bread and butter: programming. Crazy, I know. What makes it even crazier is that the platform that I publish my scribbles on, meaning this very website, I programmed from scratch with my own ten fingers. As a professional I find talking about these simple hobbyist web applications a bit tedious, though, especially those of my own creation. It’s like writing a cookbook consisting of my own recipes for myself. What really is the point? But, as an exercise in creative writing, I will pull back the curtain a bit and let you glimpse a fresh modification to the website’s source code that I found mildly fascinating to implement.
I admit I behave like I don’t care about the traffic on this website with my too-cool-for-school attitude, but if there’s anything I do care about, it’s where visitors to my website originate and what they have decided to visit. Collecting data on visitors and processing it is called analytics, and almost every website on the internet does it. Think of your website as a Facebook post; you send it out there and after a while you can come back to it and see a number implying how many people liked it. Similar data can be gathered on a website. When you load a page, it creates a record in the website logs showing what you accessed, when and from where. At the time of writing I have links to this website on my LinkedIn and Instagram profiles. If someone clicks a link on one of my profiles and lands on this website, I will know. There are many third-party services that one could use to get even more intricate data on their website’s traffic.
The thing about collecting this kind of data is that you really have to track individual visitors to get a clear picture of who accesses which page. If you want to do that then things can get a bit tricky. You have probably heard of a thing called GDPR, or the General Data Protection Regulation. Websites that serve visitors located in the European Union have to comply with this regulation, meaning that the website’s owner can’t just collect any kind of data without consent. You are probably aware that any time you visit a website for the first time, the first thing you see is a cookie consent banner or popup of some sort that asks for your permission to collect data about your visit. Collection of any data that can be traced back to an individual requires consent.
In addition to that, most websites utilize a third-party analytics service, which only muddles the issue. Not only do you need consent for collecting data, but you also need to ask for consent for sharing that data with a third party. One of the most popular third-party services is Google Analytics, whose use has infamously been ruled unlawful by data protection authorities in several EU countries because of how it transfers and stores data. So, when I first wanted to get a better picture of how many visits my website gets and where my visitors generally come from, I also used a third-party service. This is where I will get a bit more technical.
I admit, at first I did use Google Analytics before I learned about its status in the EU. Oops. The way Google Analytics works is that a cookie is generated once you give consent. This cookie is user-specific and is attached to your browser. Using scripts on this website, Google then sends data to its servers every time you do something on the page, be it loading a page or clicking a button. After I learned about Google Analytics and its issues, I switched to a GDPR-compliant service called PanelBear. After using PanelBear for about a week, I received an email about the service being shut down. I was pretty much out of third-party options.
The thing about third-party analytics services is that the tracking they conduct happens on the client side, aka in the browser. They generally communicate with the analytics servers using JavaScript scripts that can be blocked with common advertisement-blocking browser extensions, which is very unfortunate for tracking purposes.
I was facing two challenges at once: finding an alternate solution for website analytics and finding a solution for script-blocking extensions that shrouded my view. I decided to program my own solution.
In web applications there is a thing called middleware. When you make a request to a website, e.g. load a page, the request passes through middleware before it reaches the user interface layer, and each piece of middleware implements some logic for processing the request. A simple example would be authentication middleware. A user requests a profile page that is only available to the user themselves. Before returning the page to the browser, the authentication middleware checks that the sender of the request has permission to access that page. If not, the user is not granted access and the request stops at the middleware. If so, the request goes through to the interface layer, which returns the profile page.
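To make the idea concrete, here is a minimal sketch in Python. It is not the code that runs this website; the Request and Response classes, the /profile/<name> URL scheme and the function names are all made up for illustration, and a real web framework would supply its own versions of them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    path: str
    user: Optional[str] = None  # None means the visitor is not logged in


@dataclass
class Response:
    status: int
    body: str


Handler = Callable[[Request], Response]


def profile_page(request: Request) -> Response:
    """The interface layer: renders the profile page."""
    return Response(200, f"Profile of {request.user}")


def require_profile_owner(next_handler: Handler) -> Handler:
    """Hypothetical authentication middleware wrapping a page handler."""

    def middleware(request: Request) -> Response:
        # Assumed URL scheme: /profile/<name>, so the last segment is the owner.
        owner = request.path.rsplit("/", 1)[-1]
        if request.user != owner:
            # The request stops here and never reaches the page handler.
            return Response(403, "Forbidden")
        return next_handler(request)

    return middleware


# The page is only reachable through the middleware.
app = require_profile_owner(profile_page)

if __name__ == "__main__":
    print(app(Request("/profile/alice", user="alice")))
    print(app(Request("/profile/alice", user="bob")))
```

The page handler itself knows nothing about permissions. The check lives entirely in the wrapper, which is what makes middleware easy to reuse across pages.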
Not only does middleware give me access to intact requests and responses, but all of its logic lives on the server side. By writing a global middleware, meaning one that is applied to all requests rather than to specific pages, I could funnel all visitors through a pipeline that records and stores data on the visitor without any meddling ad blockers. However, there was one more problem to tackle.
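A global middleware of this kind could look roughly like the following Python sketch. Again, this is an illustration under assumptions rather than my production code: the Request and Response classes, the visit_log list standing in for a database table, and the choice to record the Referer header are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


Handler = Callable[[Request], Response]

# Stand-in for a database table of visits (assumption for this sketch).
visit_log: List[dict] = []


def record_visits(next_handler: Handler) -> Handler:
    """Wraps the whole application, so every request passes through it."""

    def middleware(request: Request) -> Response:
        response = next_handler(request)
        # Runs on the server, so a browser extension cannot block it.
        visit_log.append(
            {
                "path": request.path,
                "status": response.status,
                "referer": request.headers.get("Referer", ""),
            }
        )
        return response

    return middleware


def router(request: Request) -> Response:
    """Toy router with two made-up pages."""
    pages = {"/": "Home", "/scribbles": "Scribbles"}
    if request.path in pages:
        return Response(200, pages[request.path])
    return Response(404, "Not found")


# Wrapping the router, not a single page, is what makes the middleware global.
app = record_visits(router)

if __name__ == "__main__":
    app(Request("/", {"Referer": "https://www.linkedin.com/"}))
    app(Request("/missing"))
    for entry in visit_log:
        print(entry)
```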
As I said, collecting, processing or storing data on visitors requires consent from them, as regulated by the GDPR. However, this applies to data that is considered personal, i.e. data that can be used to identify the visitor. I am only interested in which pages get visited and where the requests originate at the country level, so I didn’t even need to worry about personal data. I started writing a middleware that would get information from the incoming request, but only the kind of information that would let me stay on the legal side of the GDPR.
A new problem then arose: if I couldn’t collect data that could be used to identify a visitor, how could I then tell which requests came from the same source? This is where I came closest to the gray sidelines of the GDPR. Every request that comes along carries data on the visitor’s system: information on the browser and the operating system. This information is stored in the request in the form of a user-agent header. The user-agent header tells the website which browser and device the visitor is using, be it a Chrome browser on Windows or Firefox on Android. This is not data that can be traced back to an individual user, but it can help tell different visitors apart.
The other piece of the puzzle is the IP address, which is basically your computer’s identification address. The request carries the IP address so that the requested resource can be returned to the correct destination. As you can imagine, the IP address is definitely personal data. However, a common IPv4 address has two main parts: the network ID part and the host ID part. When put together, they can be traced back to an individual device. On its own, the network ID can only identify the network the device operates in, and the host ID is meaningless without it.
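Stripping the host part off an address takes only a few lines with Python’s standard ipaddress module. Where the network part ends depends on the network, so the 24-bit prefix below is purely an assumption for the sake of the example, as is the function name network_id.

```python
import ipaddress


def network_id(ip: str, prefix_length: int = 24) -> str:
    """Return the network part of an IPv4 address, with the host bits zeroed.

    The default prefix length of 24 is an assumption for this sketch;
    real networks use many different prefix lengths.
    """
    # strict=False lets us pass a host address and have the host bits masked.
    network = ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    # Two devices in the same /24 network collapse to the same value.
    print(network_id("203.0.113.57"))   # 203.0.113.0
    print(network_id("203.0.113.200"))  # 203.0.113.0
```

The addresses in the example come from a range reserved for documentation, so they do not belong to anyone.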
As I was also curious about where visitors to this website are located at the country level, I could achieve geolocation by network ID alone, since networks rarely cross national borders. Also, to geolocate a visitor, I have to use an external geolocation service that I need to share the visitor’s IP address with. By sending them only the network part rather than the whole address, I can avoid sharing personal data with a third party, which would create a handful of new problems in itself.
Now that I had the user-agent header and the network ID filters set, I could fairly reliably tell requests originating from different sources apart. To also avoid problems with storing said data in an irresponsible way, I added a hashing function to the mix; by combining the user-agent header, the network ID and, in addition, the current date, I could create a hash string that would be unique to each request source each day, but would also be absolute gibberish to the naked eye in case someone with nefarious intentions got access to my database. There is no way anyone could use it to trace a record back to a person.
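The hashing step can be sketched like this in Python. The choice of SHA-256, the "|" separator and the function name visitor_hash are assumptions made for the example; the idea is simply that the same three ingredients always produce the same string, and a new date produces a new one.

```python
import hashlib
from datetime import date


def visitor_hash(user_agent: str, network_id: str, day: date) -> str:
    """Combine user agent, network ID and date into one opaque string."""
    # Join the three ingredients with a separator so they cannot run together.
    material = "|".join([user_agent, network_id, day.isoformat()])
    # SHA-256 is an assumed choice here; the digest is 64 hex characters.
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    agent = "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"
    monday = visitor_hash(agent, "203.0.113.0", date(2024, 1, 1))
    tuesday = visitor_hash(agent, "203.0.113.0", date(2024, 1, 2))
    print(monday)
    print(monday == tuesday)  # False: the same source gets a new hash each day
```

Including the date means that records from the same source can be linked within one day but not across days.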
As you clicked the button that took you onto this page, chances are my middleware collected your country of origin and stored it in my database to satisfy my curiosity. Anytime you load a new page a record will be created, and I can see a log of visitors jumping from page to page. Not you, mind, since there is no way I can trace the records back to a user, only to a country where a user is located. And unless you are the Pope, chances are you’ll be untraceable.
In the future, if I come up with anything new and interesting to add to this website, I might reconsider my former stance of not writing anything related to my career interests. All in all, be safe, dear visitor, for I will neither collect nor sell your information to the Cambridge Analyticas of the world.