Observability for Frontend Developers

with Honeycomb and Netlify Functions

swyx 2020-03-17

Observability is a hotly debated topic in the backend world, and I accidentally got involved when I tried to match up my knowledge of open source JS tools to the Monitoring pillars of Metrics, Logs and Traces in a prior blogpost. Charity Majors did everyone a great service by schooling me on what Observability really meant and Chase Adams had the awesome idea that she and I should have a Real Chat™ about how frontend developers can embrace Observability too.

We did chat (recorded screenshare here!), and this post is the result!

Bottom Line Up Front

I see 2 opportunities for Frontend Developers to instrument for Observability - their clientside apps, and their developer tooling. The former has much more established thinking than the latter.

In this proof of concept, I have added Honeycomb to Excalidraw, a popular open source React sketch drawing app, so you can envision what the code might look like in practice (of course in real life it would be much more heavily instrumented).

The key here is to shield the Honeycomb API key in a serverless function (here we use a simple Netlify Function), and to pipe events through it. In production, you might add a check for authentication before sending over the events. You can also append serverside information to your clientside event (including user metadata, for debugging) before sending it in to your event store.
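To make the shape of that proxy concrete, here is a minimal sketch. It is not the code from the proof of concept. The names `ProxyConfig`, `buildHoneycombRequest` and `makeHandler` are made up for illustration, the `meta.received_at` field is just an example of serverside information, and the network call is injected as `send` so the logic can be read (and tested) without a real API key. It assumes Honeycomb's HTTP Events API, which takes one JSON event per POST with the key in an `X-Honeycomb-Team` header.

```typescript
// Sketch of a serverless proxy that keeps the Honeycomb key off the client.
// All names here are hypothetical; adapt them to your own function.

type EventFields = Record<string, string | number | boolean | null>;

interface ProxyConfig {
  apiKey: string;  // Honeycomb key, only ever present on the server
  dataset: string; // dataset the events are written to
}

interface OutgoingRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Merge serverside fields into the client event and describe the HTTP call.
function buildHoneycombRequest(
  clientEvent: EventFields,
  serverFields: EventFields,
  config: ProxyConfig
): OutgoingRequest {
  return {
    url: `https://api.honeycomb.io/1/events/${encodeURIComponent(config.dataset)}`,
    method: "POST",
    headers: {
      "X-Honeycomb-Team": config.apiKey,
      "Content-Type": "application/json",
    },
    // Server fields win on conflict, so a client cannot spoof them.
    body: JSON.stringify({ ...clientEvent, ...serverFields }),
  };
}

// Netlify-style handler: takes { httpMethod, body }, returns { statusCode, body }.
function makeHandler(
  config: ProxyConfig,
  send: (req: OutgoingRequest) => Promise<{ ok: boolean }>
) {
  return async (request: { httpMethod: string; body: string | null }) => {
    if (request.httpMethod !== "POST" || !request.body) {
      return { statusCode: 400, body: "Expected a POST with a JSON body" };
    }
    let clientEvent: EventFields;
    try {
      clientEvent = JSON.parse(request.body) as EventFields;
    } catch {
      return { statusCode: 400, body: "Invalid JSON" };
    }
    // In production, an authentication check would go here.
    const outgoing = buildHoneycombRequest(
      clientEvent,
      { "meta.received_at": new Date().toISOString() },
      config
    );
    const result = await send(outgoing);
    return { statusCode: result.ok ? 202 : 502, body: "" };
  };
}
```

In a real function you would read `apiKey` from an environment variable and pass a `send` that wraps `fetch`. The client only ever talks to your function's URL, never to Honeycomb directly.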

But this only tells you the how - let's talk about the why.

The Core Problem

What do we want with observability, and why should we frontend developers care?

I'll quote from Charity's tweetstorm:

Can you understand what is happening inside the system, can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?

At its core, observability is about these unknown-unknowns.

Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance. That’s the easy part. “What’s the error rate?” “What is the 99th percentile latency for each service?” “How many READ queries are taking longer than 30 seconds?”

Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.

Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.

APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.

But if you can’t predict all the questions you’ll need to ask in advance, or if you don’t know what you’re looking for, then you’re in o11y territory.

In my own words, it's about giving yourself/your team the power to answer open-ended questions about how your app behaves in production. Not just obvious stuff like when errors occur, but also subtle UX-research-ey stuff like "How do employees and users use our tool differently and why?" - a great story covered in Emily Nakashima's talk at O11yCon. I won't go further into how Observability differs from Monitoring and the Metrics/Logs/Traces data types - Charity definitively covered that in her post.

Observability, as described, is a high bar - higher than we're used to. But notice none of these issues are technically restricted to the backend.

Tooling

Probably some tools come to mind - LogRocket, Sentry and BugSnag mainly because of their generous sponsorship of frontend dev podcasts. For other frontend devs, maybe you're told to send events and logs to Google Analytics or Mixpanel or report metrics to Datadog or Splunk or New Relic. For UX research you might use FullStory or Heap or Hotjar.

Maybe you pay for several of these, but that's kind of the problem - you're paying to store disparate bits of the same data in multiple places, all disconnected from each other and unhelpfully siloed.

Honeycomb doesn't have a monopoly on Observability, but it certainly is investing heavily in customer education around the idea that you should have a single source of truth for all these events, and to aggregate these events into metrics, logs or traces on demand instead of pre-aggregating and losing all granularity.

It's difficult for me to comment on all these tools mainly because I don't have experience with most of them. It really depends on what kind of company you work at and what you already use.

But we want to go beyond tools.

Fight Log-o-phobia

Observability is more about the mindset that instead of setting alerts to tell you when stuff you expect to go wrong goes wrong, you should instead be in constant conversation with your code, getting a good gut feel for how it works in production, and instrumenting so that you can answer questions you don't even have yet.

If you've ever felt Log-o-phobia (the fear of looking at logs, which I just totally made up), it could be because your systems don't help give you enough information to reconstruct the sequence of events after the fact. What if instead of being afraid of your tools, you got a dopamine rush from having everything you need to figure it out fast?

But enough about tools - what concrete things can we as frontend engineers report?

Opportunity 1: Instrumenting Frontends for Observability

Most of this is straight up verbatim from Emily's talk at O11yCon, which you should definitely go watch, but I'm just listing them here for the sake of completeness.

You could send an event per thing of interest, for example:

On page load (Emily has a great blogpost on this)
On SPA navigation
On significant user actions
On error
On page unload

I think there is room to evolve here. As we discussed in our chat, Frontend Developers think more in terms of Sessions. We had some debate over what a session is - in my prior experience I've defined them as contiguous 30 minute blocks, whereas Charity was keen on much more tightly scoped sessions, between 1 and 10 minutes, since a lot of tracing happens during that session and it gets very wide.
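As a rough illustration of how the session length changes the grouping, here is a small sketch. It assumes one particular reading of "session": a block that starts at the first event and lasts at most a fixed number of minutes, after which the next event opens a new block. The function name `assignSessions` is hypothetical, and an inactivity timeout would be an equally reasonable definition.

```typescript
// Assign a session index to each event timestamp using fixed-length blocks.
// Assumption: a session starts at its first event and lasts at most
// `maxSessionMinutes`; the first event after that opens the next session.
function assignSessions(timestampsMs: number[], maxSessionMinutes: number): number[] {
  const maxMs = maxSessionMinutes * 60 * 1000;
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  const sessions: number[] = [];
  let sessionIndex = -1;
  let sessionStart = Number.NEGATIVE_INFINITY;

  for (const t of sorted) {
    if (t - sessionStart >= maxMs) {
      // This event falls outside the current block, so start a new session.
      sessionIndex += 1;
      sessionStart = t;
    }
    sessions.push(sessionIndex);
  }
  return sessions; // one index per event, in chronological order
}
```

With events at minutes 0, 5, 29, 31 and 70, a 30 minute block puts the first three events into one session, while a 10 minute block already splits them after the second event.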

There's no hard and fast rule for what you put in an event - at the end of the day you're just trying to give your future self/your team the best possible chance of figuring out the internal state of your app when the observed behavior is going on. But, for example, you might imagine it is useful to tie together page load and unload events, or to track what A/B test bucket the user is in, what permissions they have, what browser they use, and so on.

Here's a full list so you get an idea:

App specific
- Page type
- User id (is employee?)
- A/B testing groups
Performance/environment
- Load time
- Resource count
- Asset version
- Request id
Capabilities
- Screen height/width
- User agent
- Window height/width
- Color depth
- Feature support
Others?
- installed fonts
- browser language
- online/offline status
- Geolocation?
- Page visibility?
- Zoom level?
- Font size?
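To give a feel for what one of these events might look like in code, here is a sketch that flattens a subset of the fields above into a single wide event. The interfaces `BrowserEnv` and `AppContext`, the function `buildPageLoadEvent`, and the dotted field names are all assumptions made for illustration. The browser environment is passed in as a plain object rather than read from globals, which keeps the function easy to test.

```typescript
// The slice of the browser environment this sketch reads. In a browser,
// `window` has all of these properties.
interface BrowserEnv {
  screen: { width: number; height: number; colorDepth: number };
  innerWidth: number;
  innerHeight: number;
  navigator: { userAgent: string; language: string; onLine: boolean };
  document: { visibilityState: string };
}

// App-specific and performance values that your own code would supply.
interface AppContext {
  pageType: string;
  userId: string;
  isEmployee: boolean;
  abGroups: string[];
  assetVersion: string;
  loadTimeMs: number;
  resourceCount: number;
}

type PageLoadEvent = Record<string, string | number | boolean>;

// Flatten everything into one wide event with namespaced field names.
function buildPageLoadEvent(env: BrowserEnv, app: AppContext): PageLoadEvent {
  return {
    "event.type": "page-load",
    "app.page_type": app.pageType,
    "app.user_id": app.userId,
    "app.is_employee": app.isEmployee,
    "app.ab_groups": app.abGroups.join(","), // kept flat as a single string
    "perf.load_time_ms": app.loadTimeMs,
    "perf.resource_count": app.resourceCount,
    "perf.asset_version": app.assetVersion,
    "cap.screen_width": env.screen.width,
    "cap.screen_height": env.screen.height,
    "cap.window_width": env.innerWidth,
    "cap.window_height": env.innerHeight,
    "cap.color_depth": env.screen.colorDepth,
    "cap.user_agent": env.navigator.userAgent,
    "env.language": env.navigator.language,
    "env.online": env.navigator.onLine,
    "env.page_visibility": env.document.visibilityState,
  };
}
```

In the browser you would call something like `buildPageLoadEvent(window, appContext)` and POST the result to your serverless function.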

That's just the baseline - you can go on to create custom derived metrics out of your events to figure out things like how long it takes your users to do certain key actions and why they fail (see Rachel Fong's great story in the O11yCon talk). You can create great metrics like detecting Refreshes or Rage clicks.

I will say this is where Honeycomb doesn't do a great job yet - it is common for frontend-focused tools like LogRocket to offer Rage click detection out of the box, whereas in Honeycomb you'll have to both instrument your app and devise a query for detecting this. But once you make a query, it's trivial to share and reuse it, and I'm sure Honeycomb will look at building it in someday.
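If you wanted to derive a rage click signal yourself, the logic might look something like the sketch below. The definition used here (at least 3 clicks on the same target within 1 second) is an assumption, not an industry standard, and `ClickRecord` and `findRageClicks` are hypothetical names. You could run this clientside and send the result as its own event, or express the same idea as a query over raw click events.

```typescript
interface ClickRecord {
  targetId: string; // some stable identifier for the clicked element
  timeMs: number;   // timestamp of the click in milliseconds
}

// Return the targets that received at least `minClicks` clicks
// within any span of `windowMs` milliseconds.
function findRageClicks(
  clicks: ClickRecord[],
  minClicks = 3,
  windowMs = 1000
): string[] {
  // Group click times by target.
  const byTarget = new Map<string, number[]>();
  for (const click of clicks) {
    const times = byTarget.get(click.targetId) ?? [];
    times.push(click.timeMs);
    byTarget.set(click.targetId, times);
  }

  const raged: string[] = [];
  byTarget.forEach((times, targetId) => {
    times.sort((a, b) => a - b);
    // Sliding window: is there a run of `minClicks` clicks that fits in windowMs?
    const isRage = times.some((start, i) => {
      const end = times[i + minClicks - 1];
      return end !== undefined && end - start <= windowMs;
    });
    if (isRage) raged.push(targetId);
  });
  return raged;
}
```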

See this in action

Here is where you should really check out my Proof of Concept if you'd like to see what the code looks like. All this work takes place in two commits:

You can read a detailed code discussion in the README.

A quick note on Privacy

Of course, with this much data collection, we need to be careful about privacy, especially with Personally Identifiable Information or data covered by HIPAA, PCI or GDPR. The practice here seems to be to hash/encode that data before sending it over to your event store. Honeycomb lets you run a proxy server that they interface with, so that sensitive info only lives on a server you control.

Personally I like the idea that most of the time you just work with an anonymized/double blind user ID to do your work, and all the deanonymizing information is kept far away under someone else's lock and key.
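Here is a sketch of what that hashing step could look like inside the serverless function, before the event leaves your control. It uses a keyed hash (HMAC-SHA256 from Node's built-in `crypto` module) so that the same user always maps to the same opaque ID, but nobody without the secret can recompute it. The function name `pseudonymize` and the choice of which fields count as sensitive are assumptions for illustration, and this is not a substitute for a proper compliance review.

```typescript
import { createHmac } from "crypto";

type EventFields = Record<string, string | number | boolean | null>;

// Replace the values of sensitive fields with a keyed hash.
// `secret` should live only on the server (e.g. in an environment variable).
function pseudonymize(
  event: EventFields,
  sensitiveKeys: string[],
  secret: string
): EventFields {
  const out: EventFields = {};
  for (const key of Object.keys(event)) {
    const value = event[key];
    if (value === undefined) continue;
    const isSensitive = sensitiveKeys.indexOf(key) !== -1 && value !== null;
    out[key] = isSensitive
      ? createHmac("sha256", secret).update(String(value)).digest("hex")
      : value;
  }
  return out;
}
```

Because the hash is deterministic for a given secret, you can still group and filter by user in your event store without ever storing the raw identifier there.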

Opportunity 2: Instrumenting JavaScript Developer Tools

As someone who works on and uses JS tools, and thinks JavaScript Tooling Sucks, this one really strikes home as an area of opportunity.

How many times have we had a build go bad and had no idea why? How many times does Gatsby or Sapper or Docusaurus or Webpack or Parcel or whatever else we use plain not work the way we wanted it to, whether it was through a misconfigured plugin or a misnamed file or (yes, this just happened to me) a corrupted binary? How many times have we had to close issues because we "cannot reproduce" them, or had trouble getting users to report the right information for us to help them?

I think we could do a lot to make our CLIs and browser extensions and language servers and dev servers and everything else more observable. The key difference is we're exposing this data to the users, rather than just the developers of the tools. They can then choose to send this data to us alongside bug reports, or we can provide them tools to diagnose their own problems from the events logged therein.
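One way this could look in a CLI is a small in-process recorder that times each step and can dump everything as newline-delimited JSON for the user to inspect or attach to an issue. This is purely a sketch of the idea. `EventRecorder`, its methods and the field names are all hypothetical, and no existing tool is being described here.

```typescript
type FieldValue = string | number | boolean;

interface ToolEvent {
  step: string;
  timeMs: number;
  fields: Record<string, FieldValue>;
}

class EventRecorder {
  private events: ToolEvent[] = [];
  private now: () => number;

  // The clock is injectable so the recorder is easy to test.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  record(step: string, fields: Record<string, FieldValue> = {}): void {
    this.events.push({ step, timeMs: this.now(), fields });
  }

  // Run a step, recording its duration and whether it threw.
  time<T>(step: string, fn: () => T): T {
    const start = this.now();
    try {
      const result = fn();
      this.record(step, { ok: true, duration_ms: this.now() - start });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      this.record(step, { ok: false, duration_ms: this.now() - start, error: message });
      throw err; // the tool still fails, but now with a trail behind it
    }
  }

  // One JSON object per line, ready to write to a local log file.
  toNdjson(): string {
    return this.events.map((e) => JSON.stringify(e)).join("\n");
  }
}
```

A build tool could wrap steps like "load config" or "run plugin" in `time(...)` and write `toNdjson()` to a file on exit, so that a "cannot reproduce" issue comes with the sequence of events that led up to it.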

In my chat with Charity, she mentioned that Intercom's engineers had instrumented their own CI/CD pipeline this way, to figure out why their builds can go wrong. I think this is an excellent idea and we'll likely need different thinking and data structures to tackle this user-facing observability.

Watch our discussion

If you'd like to hear me grill Charity more on Frontend Observability, have a watch!