Buying Kolide vs Building Your Own Osquery Solution

When you’re making the case for your company to buy Kolide, executives and technical procurement managers will inevitably ask if you’ve considered alternatives. And since Kolide uses open-source software, you should expect that one of those proposed alternatives will be building it yourself on top of tools like osquery.

This should be no surprise; osquery is the most popular open-source endpoint security project on GitHub. Amidst all the fanfare, it’s reasonable to ask: how much value is Kolide really providing on top of it, and how much can I get by just rolling my own solution?

In this article, we are going to cover the differences between Kolide and vanilla osquery on its own, and explore what it would take to replicate some of the features Kolide provides.

Why Does Kolide Use Osquery?

Before we get into build vs buy, let’s take a moment to explain why Kolide uses osquery in the first place.

At its core, Kolide’s product is intended to cover three primary use-cases:

Achieve Compliance - Measure, achieve, and maintain your compliance goals
Obtain Visibility - Obtain complete fleet visibility across Mac, Windows, and Linux endpoints
Implement Honest Security - Make security a core value in your company’s culture

To help your organization accomplish each of these use-cases, Kolide needs an endpoint agent that can collect the necessary telemetry required across Mac, Windows, and Linux devices without hurting performance. On top of that, as part of our commitment to end user privacy and transparency, we wanted to ensure all the source code, for all the binaries we ship to the endpoint, could be scrutinized by our customers and even the end users themselves. Given all of these requirements, osquery is the only open-source tool out there that fits the bill.

With that said, it’s easy to forget that osquery is a means to an end, not a complete solution itself.

While osquery is a great fit for our use-case, there are a few things it doesn’t do out of the box, which are prerequisites needed by any organization before they can roll it out and manage it competently. This is why we created Kolide Launcher, our own agent that wraps around osquery, extending its data collection capabilities, providing native installation packages, and most importantly, solving the problem of automatic updates.

Kolide vs Osquery: Architecture & Deployment

For your in-house solution to work at all, you need to reason about its architecture and how to deploy it across your fleet (the technical act of getting the agent installed on the devices).

You need to deploy more than just osquery to achieve any meaningful use-case.

The above diagram shows an example of what a completely free (more accurately freemium) set of tools looks like. The main thing to take away is that osquery at its core is a simple producer of telemetry. You need tools to get it onto devices, to extend its data gathering capabilities to your needs, to keep it up-to-date, and to forward the data to high performance storage tools that eventually feed into reporting and data visualization software.

This is in stark contrast to Kolide’s product, where what we ship has all of the batteries included.

How Kolide Handles Deployment

As you are likely already aware, Kolide is a SaaS product. In this context, that means we host all of the centralized infrastructure that your endpoints send telemetry to on a regular basis. Hosting isn’t just turning on a web server; it includes:

Regularly applying security patches
Automatically scaling the environment
Ensuring the service remains highly available
Safely storing the data devices send to us

When it comes to rolling out the agent to devices, Kolide is incentivized to make this as simple for you as possible. This is why we pre-build native installation packages on Mac (.pkg), Windows (.msi), and Linux (.deb and .rpm).

Kolide pre-builds install packages that are signed, notarized, and “just work.”

These native packages are perfect for distribution via MDM software such as Jamf and Microsoft Intune. Since they are signed and notarized by Kolide, end users won’t see any worrying prompts about untrusted code. As soon as the package is run, the agent automatically connects to the Kolide application without any further action needed on your part.

If you don’t want to use MDM software to distribute Kolide, you can use our onboarding feature. It can reach out to end users directly on Slack and guide them step-by-step through the process of installing the agent. You can even automatically message new employees on their first day to self-install the agent.

Onboarding lets you reach out to users via Slack to guide them to self-install the agent.

How To Deploy Osquery On Your Own

Osquery is just a standalone tool and the official documentation doesn’t really offer any specific advice or guidance on how to deploy it across your fleet.

As an aside, if you want to get a sense of how osquery works on just a handful of devices, I highly recommend this tutorial series by longtime osquery community member @securityclippy:

While undocumented, most typical production deployments follow one of these two strategies:

Strategy #1: Use DevOps Tooling (Ideal For Servers)

In this deployment strategy, you use tools like Chef, Puppet, or Ansible to distribute not only the osquery binary itself, but also its configuration file and any future updates to that file.

In this strategy, osquery either emits logs locally (with a tool like Splunk’s Universal Forwarder shipping them somewhere useful) or connects directly to popular data streaming services like AWS Kinesis Streams.

For example, here is a configuration file that emits the current time to AWS Kinesis and Firehose.

{
  "options": {
    "host_identifier": "hostname",
    "schedule_splay_percent": 10,
    "logger_plugin": "aws_kinesis,aws_firehose",
    "aws_kinesis_stream": "foo_stream",
    "aws_firehose_stream": "bar_delivery_stream",
    "aws_access_key_id": "ACCESS_KEY",
    "aws_secret_access_key": "SECRET_KEY",
    "aws_region": "us-east-1"
  },
  "schedule": {
    "time": {
      "query": "SELECT * FROM time;",
      "interval": 2,
      "removed": false
    }
  }
}

The advantage of this approach is that you don’t need any additional infrastructure. Osquery is doing all the heavy lifting and sending logs directly to the services in question. This approach is also ideal if you are comfortable using DevOps tools, have the ability (and are allowed) to push sensitive credentials directly to those endpoints, and are already sending logs from these devices.

In practice, end user devices hardly ever meet these requirements, and dealing with them this way introduces serious security risks. All it takes is a curious end user to find those credentials and suddenly they have programmatic access to portions of the production system. That’s why this approach is typically only used for production servers.
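To make that risk concrete, here is a minimal Python sketch of a pre-deployment check that flags an osquery configuration with credentials embedded in it. The option names come from the configuration example above; the function name and the idea of running this as a lint step are my own assumptions, not something osquery or Kolide ships.

```python
import json

# Option names taken from the configuration example above.
SENSITIVE_OPTIONS = ("aws_access_key_id", "aws_secret_access_key")


def find_embedded_credentials(config_text: str) -> list[str]:
    """Return the names of sensitive options set in an osquery config."""
    config = json.loads(config_text)
    options = config.get("options", {})
    return [name for name in SENSITIVE_OPTIONS if options.get(name)]


# Trimmed copy of the example config, for illustration only.
EXAMPLE_CONFIG = """
{
  "options": {
    "logger_plugin": "aws_kinesis,aws_firehose",
    "aws_access_key_id": "ACCESS_KEY",
    "aws_secret_access_key": "SECRET_KEY",
    "aws_region": "us-east-1"
  }
}
"""

if __name__ == "__main__":
    for name in find_embedded_credentials(EXAMPLE_CONFIG):
        print(f"config embeds a credential: {name}")
```

Anything this check can find, a curious end user with read access to the file can find too.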

Strategy #2: Use A Management Server

If you want to avoid shipping a brand new configuration file to osquery every time you want to adjust the query schedule, or you don’t want to ship production credentials in plain text to end users, you will need to use an osquery management server.

An osquery management server is simply any web server that implements osquery’s remote TLS API. There are many paid management servers that others host for you, including Kolide’s product. There are also free (and freemium) servers that you can host on your own infrastructure.
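To give a feel for what implementing that API involves, here is a rough Python sketch of the three core exchanges: enrollment, configuration, and logging. The JSON field names (`enroll_secret`, `host_identifier`, `node_key`, `node_invalid`, `log_type`, `data`) follow my reading of osquery's remote API documentation; the class itself, its in-memory storage, and the example values are hypothetical and nowhere near production-ready.

```python
import secrets


class ManagementServer:
    """In-memory sketch of the request handling behind a TLS endpoint."""

    def __init__(self, enroll_secret: str, config: dict):
        self.enroll_secret = enroll_secret
        self.config = config
        self.node_keys = {}  # node_key -> host_identifier
        self.logs = []       # (host_identifier, log_type, entry)

    def enroll(self, body: dict) -> dict:
        # A host proves it belongs to the fleet with a shared enroll secret.
        if body.get("enroll_secret") != self.enroll_secret:
            return {"node_invalid": True}
        node_key = secrets.token_hex(16)
        self.node_keys[node_key] = body.get("host_identifier", "")
        return {"node_key": node_key, "node_invalid": False}

    def get_config(self, body: dict) -> dict:
        # Hosts fetch their query schedule using the node key from enroll.
        if body.get("node_key") not in self.node_keys:
            return {"node_invalid": True}
        return dict(self.config, node_invalid=False)

    def log(self, body: dict) -> dict:
        # Hosts post batches of result or status logs.
        host = self.node_keys.get(body.get("node_key"))
        if host is None:
            return {"node_invalid": True}
        for entry in body.get("data", []):
            self.logs.append((host, body.get("log_type"), entry))
        return {"node_invalid": False}


if __name__ == "__main__":
    schedule = {"time": {"query": "SELECT * FROM time;", "interval": 60}}
    server = ManagementServer("example-secret", {"schedule": schedule})
    node = server.enroll(
        {"enroll_secret": "example-secret", "host_identifier": "laptop-1"}
    )
    print(server.get_config({"node_key": node["node_key"]}))
```

The sketch deliberately leaves out the HTTP routing, TLS termination, persistent storage, and authentication hardening that a real server needs, which is where most of the hosting work comes from.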

Part of your deployment will be standing up this infrastructure (which will likely include managing dependencies like MySQL and Redis), keeping it highly available, applying security patches regularly, and making it accessible from the public internet so that devices can continuously check in even when they aren’t on a privileged network.

An architectural diagram for self-hosted deployment of Fleet (a freemium self-hosted Osquery Manager) Source: Hosting Fleet on AWS EKS

Taking the above architecture diagram and putting it through the AWS Calculator can get you in the range of $10k to $20k a year on AWS costs alone. That doesn’t account for the Site Reliability Engineering (SRE) time/headcount needed to manage and maintain those servers.

While we are on the subject, this tutorial that the fine folks at Segment created (and where the above AWS diagram is sourced from) is actually a great guide to setting up the infrastructure of an osquery management server. Be warned though: the management server they use, FleetDM, is not free if you want things like official support or to use enterprise features like RBAC.

It’s important that you plan your deployment in advance when trying to estimate hosting costs. Even with conservative choices, cloud hosting platforms like AWS can get expensive really fast.

Even with a management server up and running, you will need to distribute the agent to the endpoints. Each management server will come with its own recommendations and tools, but all of them put the onus on you to generate and, most importantly, sign these packages before distributing them through an existing MDM tool.

If you must build your own solution, management servers are preferable to the DevOps strategy we discussed earlier. To deploy them correctly, you must divide responsibilities like so:

SREs in your company can handle the complexity of hosting the management server and its dependencies.
IT/Security practitioners can focus exclusively on developing the SQL queries and building the reports in downstream tools like Splunk that you will need to solve your compliance and device visibility goals.

To expand on that last point, keep in mind that the data osquery provides isn’t automatically neatly arranged. Below is an example result for a single device from a query run in snapshot mode, where each run of the query emits the full result set to a log along with some additional metadata about the device.

{
  "action": "snapshot",
  "snapshot": [
    {
      "parent": "0",
      "path": "/sbin/launchd",
      "pid": "1"
    },
    {
      "parent": "1",
      "path": "/usr/sbin/syslogd",
      "pid": "51"
    },
    {
      "parent": "1",
      "path": "/usr/libexec/UserEventAgent",
      "pid": "52"
    },
    {
      "parent": "1",
      "path": "/usr/libexec/kextd",
      "pid": "54"
    }
  ],
  "name": "process_snapshot",
  "hostIdentifier": "hostname.local",
  "calendarTime": "Mon May  2 22:27:32 2016 UTC",
  "unixTime": "1462228052",
  "epoch": "314159265",
  "counter": "1",
  "numerics": false
}
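Before a snapshot entry like the one above is useful in a reporting tool, you typically have to reshape it. As a minimal sketch (the function name and the output field names `host`, `query`, and `unix_time` are my own choices), this Python snippet flattens one snapshot log line into one record per row, each tagged with the device and query it came from.

```python
import json


def flatten_snapshot(log_line: str) -> list[dict]:
    """Turn one snapshot log entry into one flat record per result row."""
    entry = json.loads(log_line)
    if entry.get("action") != "snapshot":
        raise ValueError("not a snapshot log entry")
    return [
        {
            "host": entry.get("hostIdentifier"),
            "query": entry["name"],
            "unix_time": int(entry["unixTime"]),
            **row,  # the columns returned by the query itself
        }
        for row in entry["snapshot"]
    ]
```

Each record can then be loaded into whatever storage sits downstream, for example one table row per process per device.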

In contrast, the differential log format shows only the differences between the current query run and the previous one rather than the full result set. When logging results in the differential format, it’s up to you to assemble the final state.

{
  "action": "added",
  "columns": {
    "name": "osqueryd",
    "path": "/opt/osquery/bin/osqueryd",
    "pid": "97830"
  },
  "name": "processes",
  "hostname": "hostname.local",
  "calendarTime": "Tue Sep 30 17:37:30 2014",
  "unixTime": "1412123850",
  "epoch": "314159265",
  "counter": "1",
  "numerics": false
}
{
  "action": "removed",
  "columns": {
    "name": "osqueryd",
    "path": "/opt/osquery/bin/osqueryd",
    "pid": "97650"
  },
  "name": "processes",
  "hostname": "hostname.local",
  "calendarTime": "Tue Sep 30 17:37:30 2014",
  "unixTime": "1412123850",
  "epoch": "314159265",
  "counter": "1",
  "numerics": false
}
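Assembling that final state means replaying every added and removed event in order. Here is a minimal Python sketch of the idea, assuming events arrive in order and that a row is identified by its full set of columns; the function names and the in-memory dictionary are illustrative only, and a real pipeline would need durable storage and handling for lost or reordered logs.

```python
import json


def apply_differential(state: dict, log_lines: list[str]) -> dict:
    """Replay added/removed events onto state, keyed by (host, query name)."""
    for line in log_lines:
        entry = json.loads(line)
        # The examples above use both "hostIdentifier" and "hostname".
        host = entry.get("hostIdentifier") or entry.get("hostname")
        rows = state.setdefault((host, entry["name"]), set())
        row = frozenset(entry["columns"].items())
        if entry["action"] == "added":
            rows.add(row)
        elif entry["action"] == "removed":
            rows.discard(row)
    return state


def current_rows(state: dict, host: str, query: str) -> list[dict]:
    """Return the reassembled result set for one query on one host."""
    return [dict(row) for row in state.get((host, query), set())]
```

Fed the two events above, and starting from a state that already contains the process with pid 97650, the replay leaves only the row for pid 97830.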

With Kolide, you can still view and export the logs (if you want), but we also give you intuitive visualizations that show what’s going on inside a device and call out the things that matter.

Kolide automatically turns osquery data into clear visuals describing every device.

No matter if your use-case is fleet visibility or compliance, there are several ongoing costs you will need to consider if you plan on building your own product. These costs are non-obvious for newcomers to osquery, but become relevant quickly when you try to use osquery to solve an actual use-case.

Dealing With Annual OS Changes

Given Apple’s success with shipping iterative OS releases annually, even OS vendors like Microsoft (a company famous for changing Windows at a glacial pace) are doing everything they can to substantively change their OS at a faster rate.

If you are building your own solution, these annual releases (and the shorter and shorter beta cycles that precede them) are extremely disruptive to tools like osquery which rely on private and undocumented APIs to get the critical data you need.

Because those APIs are unsupported, many queries that work on one OS version and CPU architecture can suddenly and inexplicably stop working after even a minor upgrade. There is no better example of this than the macOS screenlock feature that was completely rewritten in macOS 10.13 and required reverse engineering an undocumented API and the development of a new osquery capability to fix.

To resolve these situations, it will be imperative that the operators of your solution regularly test your osquery SQL queries against the latest development and beta releases as soon as they are made available.
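One way to approach that testing is a small regression harness you run on a machine with the beta OS installed. The sketch below is an assumption about how you might structure it, not an official tool: it shells out to `osqueryi --json` for each query and reports whether the query errored, returned nothing, or returned rows. The function names and the "ok"/"empty"/"error" labels are hypothetical.

```python
import json
import subprocess
from typing import Callable


def run_osqueryi(sql: str) -> list[dict]:
    """Run one query with the local osqueryi binary and parse its JSON."""
    result = subprocess.run(
        ["osqueryi", "--json", sql],
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return json.loads(result.stdout)


def check_queries(
    queries: dict[str, str],
    runner: Callable[[str], list[dict]] = run_osqueryi,
) -> dict[str, str]:
    """Return a status per query name: 'ok', 'empty', or 'error: ...'."""
    report = {}
    for name, sql in queries.items():
        try:
            rows = runner(sql)
        except Exception as exc:  # missing table, timeout, malformed output
            report[name] = f"error: {exc}"
            continue
        report[name] = "ok" if rows else "empty"
    return report


if __name__ == "__main__":
    print(check_queries({"time": "SELECT * FROM time;"}))
```

An "empty" status is not always a failure, but a query that returned rows on the previous OS version and returns none on the beta is worth a closer look.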

In the upcoming release of macOS 13 (Ventura), Apple completely redesigned the System Preferences app. Source: AppleInsider

If you are also attempting to replicate Kolide’s self-remediation features, you will need to contend with differentiating instructions between OS versions as features are removed, added, and altered throughout the OS.

# Fix Instructions

<% capture catalina %>

1. With the Application selected, make sure the checkbox next to **'Show notification preview'** is checked. Click the dropdown next to it and make sure it is set to **'when unlocked'**
   <% endcapture %>

<% capture big_sur %>

1. With the Application selected, click the dropdown in the lower right labeled **'Show previews'** `when...` and make sure it is set to **'when unlocked'**
   <% endcapture %>

_Estimated time to fix: 3-5 minutes_

For each application, perform the following:

1. Make sure you are logged into the correct user account for the app in question.
1. Click the Apple logo at the top left of your screen and select System Preferences from the dropdown.
1. Click the item marked **'Notifications'**
1. In the left-hand sidebar find the Application(s) listed below and click to select it.

<% if issue.build_prefix <= 19 %>
<%= catalina %>
<% elsif issue.build_prefix >= 20 %>
<%= big_sur %>
<% endif %>

1. Repeat as necessary for each failing application.
1. When you are done, you can click "I've fixed it. Check again" to verify your settings are now properly configured.

An example of the branching logic Kolide uses for the remediation text we send end users for the notification preferences check.
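If you were replicating this outside of a template, the same branching could look something like the following Python sketch. I am assuming `build_prefix` is the leading number of the macOS build string (a build starting with 19 for Catalina, 20 for Big Sur); the function names and the sample build strings are illustrative, and the step text is copied from the template above.

```python
import re

CATALINA_STEP = (
    "With the Application selected, make sure the checkbox next to "
    "'Show notification preview' is checked. Click the dropdown next to it "
    "and make sure it is set to 'when unlocked'"
)
BIG_SUR_STEP = (
    "With the Application selected, click the dropdown in the lower right "
    "labeled 'Show previews' and make sure it is set to 'when unlocked'"
)


def build_prefix(build: str) -> int:
    """Extract the leading number from a build string such as '19H2'."""
    match = re.match(r"\d+", build)
    if match is None:
        raise ValueError(f"unrecognized build string: {build!r}")
    return int(match.group())


def notification_step(build: str) -> str:
    """Pick the OS-specific instruction, mirroring the template's if/elsif."""
    return CATALINA_STEP if build_prefix(build) <= 19 else BIG_SUR_STEP
```

Every redesign of a settings screen adds another branch like this, for every check you send instructions for.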

This is something Kolide has dealt with since its inception and we have a dedicated team of engineers who think about this problem. They participate vigorously in the osquery open-source project and collaborate with the OS vendors to ensure our app works correctly on day one of an OS’s official release.

Ongoing Research & Development To Meet Data Collection Requirements

One of the largest misconceptions about vanilla osquery is that once you’ve rolled it out, it will automatically measure your devices’ compliance. The truth is, osquery is only as good as the queries you run and the tools you use to analyze the data it produces.

In reality, with vanilla osquery, it’s on you to:

Develop/find SQL queries that will help you measure your compliance objectives
Add those queries to the osquery schedule in the configuration
Aggregate the data to produce a meaningful report
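The aggregation step is the one people tend to underestimate. As a very small sketch of what it involves, assume you have already reduced each query result to a pass/fail verdict per device (that reduction is the hard part and is not shown). The tuple layout, function name, and the device and check names in the usage example are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def compliance_report(
    results: Iterable[tuple[str, str, bool]],
) -> dict[str, float]:
    """Compute the fraction of devices passing each check.

    Each result is (device_id, check_name, passed).
    """
    passing = defaultdict(int)
    total = defaultdict(int)
    for _device, check, passed in results:
        total[check] += 1
        if passed:
            passing[check] += 1
    return {check: passing[check] / total[check] for check in total}


if __name__ == "__main__":
    sample = [
        ("mac-1", "screenlock", True),
        ("mac-2", "screenlock", False),
        ("mac-1", "firewall", True),
    ]
    for check, rate in compliance_report(sample).items():
        print(f"{check}: {rate:.0%} of devices passing")
```

A real report also has to deal with devices that have not checked in recently, which this sketch ignores.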

While the official osquery project does provide example query packs, most of the packs have not been updated for years. For example, the vast majority of threats in the Unwanted Chrome Extensions Pack are from 4+ years ago and reference extensions that are no longer distributed in the Chrome Web Store.

Even seemingly trivial asks can often balloon into multi-year consulting engagements with experts like Trail of Bits if they require custom extensions to osquery. For insight into the incredible amount of skill and work that goes into developing some of these queries, check out our article on how Kolide built our macOS Screenlock Check.

In contrast, Kolide comes prepackaged with dozens of Checks (specially written osquery SQL queries) that, when run on a device, produce an accurate attestation of that device’s state. Kolide aggregates these check results in a dashboard so you can track how well you are doing across your entire fleet. If there isn’t a check available for your needs Kolide will help you bring it to life, even if that means R&D and agent development on our side.

Additionally, Kolide’s most distinctive feature is its ability to aggregate, persistently store, meticulously document, and make programmatically available thousands of data points about each enrolled device. Not only do we do this at no additional charge, we allow you to query the database this data is stored in with SQL.

VS Code extensions are an example of data osquery doesn’t currently collect. Kolide collects, centralizes, augments, and documents this data so you can understand it and take appropriate action, without the need for external tools and data storage.

Beware of Open-Source Tools That Aren’t Actually Free

Kolide’s business model is simple and straightforward. You pay us to use our product for each device that enrolls and stays active. Every feature is available and support is included. There are no hidden fees, no penalty for using enterprise features or integrations; what you see is truly what you get.

As you do your research, you may encounter what appear to be free and open-source tools that work with osquery and extend its capabilities.

While there are several out there, you should keep in mind that some are actually paid products with a free tier. This business model, known as Open Core (popularized by companies like GitLab), encourages practitioners who don’t have access to budget today to invest their time and energy into standing up the free version. Then, when you inevitably need an enterprise feature (or anything more than basic support), you are prompted to upgrade to the paid version that requires an annual subscription based on your usage. Now, not only are you managing the deployment yourself, you are also paying a SaaS-like subscription.

One such Open Core solution is FleetDM, a monetized fork of the open-source Kolide Fleet osquery management server that we retired a few years ago.

Open Core software like FleetDM posts its source code on GitHub and starts off free but requires a utilization-based subscription for enterprise support and security features like RBAC. Source: fleetdm.com/pricing (Captured Aug 23, 2022)

As you think through alternatives, it’s important that you consider their true costs, not just during the experimentation phase, but when they are fully deployed in a production setting. Is it reasonable to assume you won’t ever need Role-Based Access Control (RBAC) in a production system that houses data from your endpoints? Probably not. Expect to build that feature yourself or pay for it in the form of a subscription.

Building Your Own Slack App

At Kolide, end user-driven remediation of nuanced problems is our core use-case. To enable it, we’ve built an amazing Slack application.

If you are interested in replicating Kolide’s Slack app experience, Slack has made a great and comprehensive tutorial on how to build an app that can publish interactive notifications to end users.

In addition to installing an osquery management server to house some of this data, you will need to reason about the following:

Standing up infrastructure to receive the Slack app webhooks when users interact with the bot
Procuring persistent storage to hold the state of the interactions
Dynamically issuing queries to devices via the osquery management server to recheck compliance after the user fixes problems
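Even the first item on that list has security details you need to get right. As one example, an endpoint that receives Slack interactions should confirm each request really came from Slack. The Python sketch below follows Slack's documented signing scheme as I understand it (an HMAC-SHA256 over `v0:<timestamp>:<body>` using your app's signing secret, compared against the `X-Slack-Signature` header); the function name, parameters, and five-minute window are my own choices.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,   # value of the X-Slack-Request-Timestamp header
    body: bytes,      # raw, unparsed request body
    signature: str,   # value of the X-Slack-Signature header
    now: Optional[float] = None,
    max_age_seconds: int = 300,
) -> bool:
    """Return True only for a fresh request with a matching signature."""
    current = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except ValueError:
        return False
    # Reject old requests to limit replay attacks.
    if abs(current - request_time) > max_age_seconds:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256)
    expected = "v0=" + digest.hexdigest()
    # Constant-time comparison avoids leaking how much of it matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

This covers only request authentication. Storing interaction state and triggering a recheck on the device are separate pieces you would still have to build.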

For this to work, you also have to ensure you can associate end users with their osquery installation. This sounds easy, but often requires some sleuthing and automatic assignment based on evidence collected from the device itself. Without an osquery management server that is end user aware, you will need to build this understanding yourself in order for the Slack interactions to work as intended.

If you’re thinking, “that’s okay, the Slack app is a nice-to-have,” you should reconsider that position. While it’s great to unearth a lot of scary problems in your environment, the other half of that equation is having a way to remediate them, and there are plenty that cannot be fixed with automated tools like MDM. Those tricky issues can quickly spiral into an emergency situation where end user remediation will be an essential feature.

Key Takeaways

This was a lot to absorb, but my hope is you take away the following key points when comparing the cost of buying Kolide to building everything yourself with osquery:

Understand that you cannot do much with osquery without other supporting tools. Among these tools will be a management server which you will need to learn, install, maintain, and host using your own infrastructure. These management servers typically need to be accessible on the public internet, so it’s imperative that you ensure qualified Site Reliability Engineers (SREs) and Security Engineers are continuously involved in the deployment and ongoing maintenance of the solution. Even with the management server, you should expect to need to send data to tools like Splunk to do meaningful analysis of the data osquery collects.
Even when everything is up and running, beware of the hidden costs. For example, you will need to maintain your own set of queries that will need regular changes and maintenance when new versions of macOS and Windows are released on an annual basis. Additionally, you should expect some percentage of facts you need to collect about a device (that are available in Kolide) to not be available in vanilla osquery. Expect to invest time in osquery SQL query R&D and potentially a consulting engagement with experts who can extend the agent’s capabilities to your specifications.
Remember that self-service remediation via Slack is a feature that is unique to Kolide, not something available or even easily achievable with vanilla osquery. To replicate it, you will need to spend software engineering time building the bot, hosting it, and integrating it with any open-source management server you choose.

In closing, it’s important to remember that Kolide built its product because we saw so much promise in osquery, but found the only organizations who could enjoy it were the ones who had the technical know-how and enormous budgets to operationalize it safely and effectively. We made a big bet that if we built software to dramatically reduce the complexity, risk, and toil needed to tap into that promise, more organizations would be able to experience the value.

We strongly believe we’ve built something compelling, and when analyzed honestly against building it yourself, Kolide is the obvious choice for nearly all organizations.