In the depths of my hard drive, I recently stumbled upon an interesting project I started working a year ago. But never really took off at that time. After working with various SIEM tooling for several years, I had become a big fan of Microsoft Sentinel. The fact that it was cloud-based and from Microsoft was less appealing (yes, I am still an open-source and Linux fan), but working with KQL was a breath of fresh air compared to the regex hell and GUI systems of other SIEMs. But what if I (hypothetically) wanted to run KQL on local data using open-source tools? Just some random thoughts, but it was enough to get me started. So, I began with what you start with for any language implementation, writing a parser.
A year ago, it never progressed much beyond a Proof of Concept (PoC), but by now I’ve dusted off the code, neatly processed it in Git, and uploaded the code online. So for everyone who is interested, you can find the code on GitHub.
What is it exactly that you made?
For the people who are not familiar with KQL, it is a powerful query language created by Microsoft for querying big data. It took a lot of inspiration from SPL, the query language used in Splunk. According to some its similar to SQL, but I think the two beside being query languages take a very different approach. It is the query language used in Azure Data Explorer, Azure Monitor, and Application Insights. But it is also by Microsoft in their security products for quering logs, such as Azure Sentinel and Microsoft Defender for Endpoint. Unsurprisingly, this is the main reason I use KQL a lot in my day-to-day work.
Microsoft already provides a way to run KQL queries locally by providing a Docker image for running Kusto locally. But it’s only meant for development and testing purposes, and not for production use. No support provided, and far from open-source.
Beside a parser, I also needed a query engine to run the parsed KQL queries. Writing a query engine from scratch is a lot of work, and the performance would propably be horrible. Luckily, there is an open-source project by Apache called Datafusion that could help me with this. For the people who are not familiar with Datafusion, it is a Query Engine for columnar data that uses the Apache Arrow memory model written in Rust. But if you want learn more about Datafusion, there are other blog posts that explain it better than I can.
I’ve splitted my work into three components:
- kqlparser: A parser for the KQL language
- datafusion-kql: Connects the parsed KQL query to the Datafusion engine
- kq: A command line tool combining the above two to run KQL queries against various files
Kqlparser
An open source parser from Microsoft for KQL is already available, but it is written in C# and therefore not very usable for this project. Also, I only came across the parser after I had already started writing my own parser. The parser is still in an early stage, but it can already parse a lot of KQL queries.
I’ve build the parser using Nom, a parser combinators library. It differs from the common aproach of using a lexer and a parser. Instead, it uses a single pass parser, which should be easier to write and understand.
This is for example the function to parse the project
operator:
fn project_operator(i: &str) -> IResult<&str, Vec<(Option<String>, Expr)>> {
preceded(terminated(tag_no_case("project"), multispace1), separated_list0(
tag(","),
trim(alt((
map(separated_pair(identifier, trim(tag("=")), expr), |(n, e)| (Some(n), e)),
map(expr, |e| (None, e))
))),
))(i)
}
There are still some challenges to overcome, since the KQL language is quite complex and not always consistent. For example, boolean values are case sensitive depending on the context in KQL:
bool(True) # valid
bool(false) # valid
True # invalid
false # valid
The official documentation doesn’t define the grammer that well, so it’s mostly trial and error.
There is also no support for comments and no support for for multiple statements.
Defining variables using the let
operator are therefore not yet supported.
The more complex operators can not yet be parsed, but it will also take a lot of time to get them working in DataFusion.
Although Nom makes it easy to write a parser, it’s not always easy to understand the error messages. So there is some work to do to improve the error messages and see if that’s actually possible with Nom.
Maybe oneday I will write a test suite to compare the output of my parser with the official parser, but for now I’m happy with the current state.
datafusion-kql
To be able to execute the parsed KQL queries, the parsed query needs to be converted to a Datafusion plan. The datafusion-kql crate contains the planner to create a logical plan from the parsed KQL query. The way logical plans are created in DataFusion matches quite well with the way KQL queries are parsed.
The most importent part of the planner is a simple for loop that iterates over the operators from the query. See the simplified version of the function below:
fn query_statement_to_plan(&self, ctx: &mut PlannerContext, query: Query) -> Result<LogicalPlan> {
let mut builder = [...]
for op in query.operators.into_iter() {
builder = match op {
Operator::MvExpand(x) => builder.unnest_column(Column::from(x))?,
Operator::Extend(x) => [...],
Operator::Join(_, x, y) => {
let keys: Vec<&str> = y.iter().map(|s| s.as_ref()).collect();
builder.join(self.query_statement_to_plan(ctx, x)?, JoinType::Inner, (keys.clone(), keys), Option::None)?
},
Operator::Project(x) => builder.project(x.iter().map(|(a, b)| [...]))?,
Operator::Where(x) => builder.filter(self.ast_to_expr(ctx, &x)?)?,
Operator::Summarize(x, y) => {
let mut ctx1 = ctx.clone();
builder.aggregate(y.iter().map(|z| self.ast_to_expr(&mut ctx1, z).unwrap()), x.iter().map(|z| self.ast_to_expr(ctx, z).unwrap()))?
},
Operator::Sort(o) => o.iter().fold(builder, |b, c| b.sort(iter::once(Expr::Sort(Sort::new(Box::new(col(c)), false, false)))).unwrap()),
Operator::Take(x) => builder.limit(0, Some(x.try_into().unwrap()))?,
_ => return Err(DataFusionError::NotImplemented("Operator not implemented".to_string())),
};
}
}
builder.build()
}
For the supported operators in DataFusion, it’s quite easy to map them to the KQL operators. The other ones are gonna be a bit more challenging. But hopefully I can implement some of them in the future. In comparison to the parser, the planner is still in a very early stage.
Other obstacles are for example the field types. The timespan type in KQL works with ticks (100 ns), while DataFusion uses nanoseconds, milliseconds, seconds etc… So there is some conversion loss. Also the dynamic field type in KQL is not directly mappable to DataFusion. While DataFusion supports arrays and maps, they require a static schema unlike the dynamic field type in KQL. I have some ideas to for example use a binary representation of the dynamic field type, but implementing this would be a lot of work. And most likely not that performant. Although I have no idea how this is actual implemented in Kusto itself, so maybe the performance difference is not that big.
A big difference between KQL and DataFusion is that DataFusion doesn’t use indexes, while KQL does use indexes for strings. For example when using the ‘has’ operator in KQL, it will use an index to search for the string. DataFusion doesn’t have this feature, so it will be a lot slower for these kind of queries.
Operators like ‘mv-expand’ and ‘scan’ are not native to DataFusion, so I will have to implement them myself. I actually find them very useful, so I propably will at least try to implement them.
While in early stage it can do project, extend, filter using the where operator, do joins and summarize. Expressions are mostly supported, but only a very limited set of functions are implemented.
The first next step will propably to expand the current small set of supported functions and fix some behaviroal differences. Unnamed columns currently gets very different names in DataFusion by default.
kq
Kq is a simple command line tool to run KQL queries against various data sources.
For now, it’s easiest way to test the parser and planner.
It can read from arrow, avro, csv, ndjson and parquet files.
Every file you reference as argument to the kq
command will be available as a table in the KQL query.
A query can be passed as an argument to the kq
command, or you can pass a file with a KQL query.
Example
kq -f users.csv 'users | where name == "iwan" and age > 30'
kq -f logins.csv 'logins | summarize count(name) by name'
kq -f users.csv -f logins.csv 'logins | join (users) on name | project name, age, login_time'
To build the kq
command you need rust installed and a clone of the rust-kql repository.
cargo build --release
./target/release/kq
Future
I don’t have a real usage for this project yet. For now, it’s just a fun project to work on. And maybe down the road I can think of a use case for it. Maybe there is a far more superiour query language which would make this project completely useless ;)
At some point I still have to decide if I want to replicate the behavior of Kusto, or if I want to make it more compatible with DataFusion.