Creek Service, write business logic, not boilerplate

Evolving JSON Schemas - Part II

2024-01-09T00:00:00+00:00

In the previous article we looked at how the compatibility checks that Confluent’s Schema Registry applies when evolving JSON schemas are so limiting as to be basically unusable, requiring the use of verbose partially-open content models to map property names to specific types. In this second and final part we’ll look at leveraging Confluent’s Schema Registry to build a more useful set of compatibility checks, leading to a cleaner, more user-friendly evolution model, free from the noise of a partially-open content model.

Requirements for JSON schema evolution

How should JSON Schema evolution work? Which operations must be supported for schemas to evolve in a useful way with full compatibility?

What we’ve come to expect from other schema types, for example Avro, is that required properties can’t be removed if we want forwards compatibility, or added if we want backwards compatibility. Confluent’s checks already cover this.

It’s the handling of optional properties that needs to change: adding and removing optional properties should be fully compatible changes, but neither is supported by Confluent’s checks.

This gives us the following requirements in table form:

Change	Forward Compatible (old schema / new data)	Backwards Compatible (new schema / old data)	Fully Compatible
Add required	:heavy_check_mark:	:x:	:x:
Add optional	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:
Remove required	:x:	:heavy_check_mark:	:x:
Remove optional	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:
Optional -> required	:heavy_check_mark:	:x:	:x:
Required -> Optional	:x:	:heavy_check_mark:	:x:

If JSON Schema compatibility checks supported these operations, they would be user-friendly and applicable to real-world use cases.

Splitting readers and writers

So how can we achieve full compatibility when adding and removing optional fields?

Simple. We differentiate the schemas used to produce the data from those used to consume it.

Because producing schemas are never used to consume data, there is no requirement for producing schemas to be compatible with each other. Likewise, there is no requirement for consuming schemas to be compatible with each other, as they never produce data. All that matters is compatibility between producing and consuming schemas.

The figure below shows how this would work when adding a new consuming schema C2 and a new producing schema P3.

To maintain backwards compatibility, new consuming schemas must be backwards compatible with data produced by all the existing producing schemas. When C2 is added, it must be backwards compatible with P1 and P2.

To maintain forwards compatibility, new producing schemas must be forward compatible with all the existing consuming schemas. When P3 is added, it must be forwards compatible with C1 and C2. To put this another way, C1 and C2 must be backwards compatible with P3.

To maintain full compatibility, we ensure every consuming schema is backwards compatible with every producing schema (both sets of arrows in the diagram above), i.e. all consuming schemas can consume the data produced using any producing schema.

We know a system has fully compatible schema changes if every consuming schema is backwards compatible with every producing schema.

Hopefully this makes sense, and is even intuitive. The next question is what kind of schemas should these new producing and consuming schemas be if we’re to meet our requirements? Should they use an open, closed or partially-open content model?

Producers of data control the schema of the data. They know the exact set of properties, with no ambiguity. This is a great match for a JSON Schema with a closed content model.

Consumers of data don’t control the schema of the data, but do know the set of properties they read from the data. They can ignore any additional properties. This is a great match for a JSON Schema with an open content model.

Producing schemas should use a closed content model. Consuming schemas should use an open content model.
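
To make the rule concrete, here is a minimal sketch of the pairwise check using a toy model of flat schemas. The `ToySchema` class and its `canRead` and `fullyCompatible` methods are hypothetical names invented for this illustration; they are not part of Confluent’s API, and the model ignores nested objects, enums and pattern properties.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a flat JSON object schema (hypothetical, for illustration only). */
final class ToySchema {
  final Map<String, String> properties; // property name -> type name
  final Set<String> required;           // names of required properties
  final boolean open;                   // true = "additionalProperties": true

  ToySchema(Map<String, String> properties, Set<String> required, boolean open) {
    this.properties = properties;
    this.required = required;
    this.open = open;
  }

  /** The open consuming schema that mirrors this producing schema. */
  ToySchema asConsumer() {
    return new ToySchema(properties, required, true);
  }

  /** True if every document valid under the producer is also valid under the consumer. */
  static boolean canRead(ToySchema consumer, ToySchema producer) {
    // Everything the consumer requires must be guaranteed by the producer.
    if (!producer.required.containsAll(consumer.required)) {
      return false;
    }
    for (Map.Entry<String, String> p : consumer.properties.entrySet()) {
      String producedType = producer.properties.get(p.getKey());
      if (producedType == null) {
        // Not declared by the producer: only safe if the producer can never emit it.
        if (producer.open) {
          return false;
        }
      } else if (!producedType.equals(p.getValue())) {
        return false; // same name, different type
      }
    }
    // A closed consumer rejects any property it does not declare.
    if (!consumer.open) {
      if (producer.open) {
        return false;
      }
      return consumer.properties.keySet().containsAll(producer.properties.keySet());
    }
    return true;
  }

  /** Full compatibility: every consuming schema can read every producing schema. */
  static boolean fullyCompatible(List<ToySchema> consumers, List<ToySchema> producers) {
    for (ToySchema consumer : consumers) {
      for (ToySchema producer : producers) {
        if (!canRead(consumer, producer)) {
          return false;
        }
      }
    }
    return true;
  }
}
```

With closed producers and open consumers, adding an optional property to a producer leaves every pair readable in this model, whereas using the same closed schema on both sides does not.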

How does this work in practice?

Let’s walk through the evolution of a JSON Schema using this new way of working.

Let’s start with v1 of the producing application. It produces data that conforms to the following closed schema:

{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" }
  },
  "required": [ "id", "name" ]
}

…and v1 of one of the consuming applications requires data that conforms to the same schema, only with an open content model:

{
  "type": "object",
  "additionalProperties": true,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" }
  },
  "required": [ "id", "name" ]
}

This consuming schema is backwards compatible with the producing schema, so we know we are maintaining full compatibility.

Evolving the producing schema

So far so good, but what happens if we want to deploy v2 of the producing application with an evolved schema?

The new v2 producing schema contains a new optional checked property:

{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" },
    "checked": { "type": "boolean" }
  },
  "required": [ "id", "name" ]
}

Because the consuming v1 schema is open, it is backwards compatible with this new producing schema, so we know we are maintaining full compatibility.

Evolving the consuming schema

Next, we want to deploy v2 of the consuming application to take advantage of the new checked property.

The new v2 consuming schema is:

{
  "type": "object",
  "additionalProperties": true,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" },
    "checked": { "type": "boolean" }
  },
  "required": [ "id", "name" ]
}

Both v1 and v2 of the consuming schema are backwards compatible with v1 and v2 of the producing schema, so we know we are maintaining full compatibility.

Now let’s say we realise that v2 of the consuming app is not fit for purpose, and we’d like to roll back the deployment to v1. Is it safe to do so? As we’ve maintained full compatibility we know we’re good to roll back.

After investigation into the issues with v2, we’re soon ready to deploy v3 of the consuming application, which will take advantage of an upcoming enhancement to the producing application. It turns out the issue was that the recently added checked property wasn’t fit for purpose, and a new status enum will be added upstream as its replacement. The new consuming app contains logic to take advantage of the new status property if it’s present.

The new v3 consuming schema, with the upcoming status property, is:

{
  "type": "object",
  "additionalProperties": true,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" },
    "status": { 
      "type": "string", 
      "enum": ["pending", "passed", "failed"]
    }
  },
  "required": [ "id", "name" ]
}

As the v3 consuming schema is backwards compatible with the v1 and v2 producing schemas, we know we are maintaining full compatibility.

Evolving the producing schema late

After the new v3 consuming application is deployed we want to deploy v3 of the producing application, with the new status property. Normally, we’d probably release a version that produced data with both the old checked and the new status properties for a while. But, in this instance we know there is only one downstream consumer, which is already prepped to handle status.

The new v3 producing schema, without checked and with status, is:

{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" },
    "status": { 
      "type": "string", 
      "enum": ["pending", "passed", "failed"]
    }
  },
  "required": [ "id", "name" ]
}

All known consuming schemas are backwards compatible with this new producing schema, so we know we are still maintaining full compatibility.

Although all the examples above were checking for full compatibility, this design supports checking for just backwards, or just forwards, compatibility. Not that we recommend you do, mind. If you had, you might have found yourself in a hole, unable to roll back the bad consumer app.

Negative examples

The above walkthrough was all ‘happy path’. Does the proposed pattern of checks capture incompatible changes as well? Yes!

Consider what would have happened if a new junior developer had jumped in and tried to change v2 of the producing application to fix the issue with the checked property. Rather than remove the old checked property and add a new enum type, the junior developer might just change checked to an enum:

{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" },
    "checked":  {
      "type": "string",
      "enum": ["pending", "passed", "failed"]
    }
  },
  "required": [ "id", "name" ]
}

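As the v2 consumer schema isn’t backwards compatible with this producing schema, we know such a change isn’t compatible.

Likewise, adding or removing required properties breaks full compatibility: existing consumers can’t read new data that lacks a property they require, and a new consumer schema that requires a new property can’t read old data that lacks it.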

Capturing schemas in a schema registry

What schemas do we need to capture to make these proposed evolvability checks work?

Encourage ownership to decouple teams

Before we get to that, let’s first discuss one additional requirement around ownership.

In larger organisations it is often the case that data produced by one team is consumed by applications written and maintained by different teams, potentially in different departments. The use of fully compatible schema evolution can go a long way to removing the need for costly “onboarding processes” and aligned release dates etc. Data becomes more self-service. This is a good thing!

In such an operational model, the producing team owns the data products it publishes for other teams to consume. This model would break if consuming teams were free to register any consuming schema they liked.

Consider what would have happened in the walkthrough above if v3 of the consuming app had published a consuming schema with the new status property as an integer rather than an enum, maybe because the team left the design meeting thinking that’s what had been agreed. The v3 consuming schema would then be:

{
  "type": "object",
  "additionalProperties": true,
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" },
    "status": { "type": "integer" }
  },
  "required": [ "id", "name" ]
}

Now, when the producing team tries to release v3 of their app, it will fail as the v3 consuming schema is not backwards compatible with the v3 producing schema as they disagree on the type of status. This consuming schema is now dictating the type of status. The producing team can either switch to using an integer or rename their property, and are forever restricted on the type of any future status property they want to add.

Allowing any consuming schema to be registered by consuming applications removes control of the data’s schema from the team that owns the data. This is not a good thing!

Evolving producing schemas

Keeping control of the schema with the team that owns the data is achieved by something potentially unintuitive: not registering the consuming schema in the schema registry.

Yes, you read that right :)

Let’s look at how this can work:

It’s pretty easy to write code to create an open consuming schema from a closed producing schema. This means we can capture the producing schemas, and synthesise the consuming schemas as needed, i.e. when performing compatibility checks.

We keep control of the schema with the data product owner by registering only closed producing schemas in the Schema Registry.

Checking consuming schema compatibility

The eagle-eyed among you may have already noticed in the walkthrough that each consuming schema matched a producing schema, except that it used an open, rather than closed, content model.

The simplest process for checking consuming schema compatibility is to convert the open consuming schema to a closed producing schema, and then confirm the closed producing schema is already registered. If it is, then the consuming schema has already been checked for compatibility.

This simple one-to-one mapping between producer and consumer schemas is efficient, as it only requires a single look-up in the Schema Registry when a service starts up.

Having a consuming schema derived from a producing schema also often follows the development and release process of organisations, as downstream teams will often use the latest schema of the data when developing their consuming application.

However, it is not a strict requirement that the consuming schema exactly matches the properties defined in a registered producing schema. It is also possible for a consuming schema to contain a subset of the properties defined in a registered producing schema. More accurately:

To maintain full compatibility a consuming schema must be backwards compatible with at least one open schema synthesised from a registered, closed, producing schema.

Using a smaller ‘view’ schema containing only the minimal subset of properties the consuming app reads will decrease the time the consuming app spends validating and deserializing incoming data. But this comes at the cost of service start up time, as the service may need to check multiple schema versions before finding one the view schema is compatible with.

The increased start up costs can be avoided if the consuming application knows the exact producing schema to look up.
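
As a rough sketch of what the view check amounts to for flat schemas, the hypothetical `ViewSchemaCheck.isViewOf` helper below treats a schema as a map of property names to type names plus a set of required names. These names and this representation are assumptions made for illustration, not the Creek or Confluent implementation.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: can an open "view" read data described by a producer schema? */
final class ViewSchemaCheck {

  static boolean isViewOf(
      Map<String, String> viewProperties,
      Set<String> viewRequired,
      Map<String, String> producerProperties,
      Set<String> producerRequired) {

    // The view may only require what the producer guarantees to write.
    if (!producerRequired.containsAll(viewRequired)) {
      return false;
    }

    // Each property the view reads must be declared by the producer with the same type.
    for (Map.Entry<String, String> property : viewProperties.entrySet()) {
      String producedType = producerProperties.get(property.getKey());
      if (!property.getValue().equals(producedType)) {
        return false;
      }
    }
    return true;
  }
}
```

A service that is not told which producing schema to use could run such a check against each registered version until one passes, which is where the extra start up cost comes from.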

What does the implementation look like?

The implementation involves two parts.

Synthesising consumer schemas

The default value for additionalProperties is true, i.e. an open-content model. This means a closed-content producer schema will contain explicit "additionalProperties": false entries. The closed-content producer schema can be converted to an open-content consumer schema by simply exchanging those false values for true, e.g.

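import io.confluent.kafka.schemaregistry.json.JsonSchema;

class SchemaConverter {
  /**
   * Convert a closed-content producer schema to its open-content consumer schema,
   * by flipping every explicit "additionalProperties": false to true.
   */
  public static JsonSchema toConsumerSchema(final JsonSchema producerSchema) {
    final String schemaText = producerSchema.canonicalString();
    return new JsonSchema(
        schemaText.replaceAll(
            "\"additionalProperties\":\\s*false",
            "\"additionalProperties\": true"));
  }
}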
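
The substitution itself can be tried on plain schema text without any Schema Registry classes. The sketch below uses a hypothetical `OpenContentModel.toOpen` helper, named only for this illustration; it also tolerates whitespace before the colon, which is an addition to the pattern used above.

```java
/** Hypothetical helper showing the text substitution on its own. */
final class OpenContentModel {

  /** Flip every explicit "additionalProperties": false to true, including nested ones. */
  static String toOpen(String closedSchemaJson) {
    return closedSchemaJson.replaceAll(
        "\"additionalProperties\"\\s*:\\s*false",
        "\"additionalProperties\": true");
  }
}
```

Because this is a textual replacement, it would also rewrite a matching fragment that happened to appear inside a string value such as a description, so a real implementation may prefer to walk the parsed JSON instead.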

Compatibility checks

The example code below doesn’t bother trying to implement non-transitive FORWARD, BACKWARD or FULL checks as, in our opinion, they are not much use given the long-lived nature of Kafka data and distributed nature of modern systems. Instead, it focuses on checks that test all versions are compatible, i.e. equivalent to the Schema Registry’s FORWARD_TRANSITIVE, BACKWARD_TRANSITIVE and FULL_TRANSITIVE.

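import io.confluent.kafka.schemaregistry.ParsedSchema;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.rest.entities.Schema;
import io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException;
import io.confluent.kafka.schemaregistry.json.JsonSchema;
import java.util.List;

// Note: handling of the checked exceptions thrown by the client is omitted for brevity.
class Example {
  private final SchemaRegistryClient srClient;

  Example(SchemaRegistryClient srClient) {
    this.srClient = srClient;
  }

  /**
   * Check it's safe to consume with a consumer schema
   * derived from the supplied producerSchema.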
   * 
   * @param subject the Schema Registry subject
   * @param producerSchema the producer schema that the consumer schema is derived from.
   * @return id of registered schema.
   */  
  int ensureConsumerSchema(
          String subject,
          JsonSchema producerSchema) {
    // If the producer schema is registered, we can safely consume with the derived consumer schema.
    return srClient.getId(subject, producerSchema.normalize(), false);
  }

  /**
   * Check it's safe to consume with a reduced-view consumer schema.
   * 
   * @param subject the Schema Registry subject
   * @param producerSchema the closed-content producer schema that the view is checked against.
   * @param consumerViewSchema the open-content view schema the consumer reads with.
   * @return id of registered schema.
   */
  int ensureConsumerViewSchema(
          String subject,
          JsonSchema producerSchema,
          JsonSchema consumerViewSchema) {

    JsonSchema consumerSchema = SchemaConverter.toConsumerSchema(producerSchema);

    // The reduced-view schema must be backwards compatible with the full consumer schema:
    List<String> issues = consumerViewSchema.isBackwardCompatible(consumerSchema);
    if (!issues.isEmpty()) {
        throw new IncompatibleSchemaException(consumerSchema, consumerViewSchema, issues);
    }
      
    // And the associated producer schema must be registered:
    return ensureConsumerSchema(subject, producerSchema);
  }

  /**
   * Ensure a producer schema is registered.
   * 
   * If it is not, check compatibility and register it.
   * @param subject the Schema Registry subject
   * @param producerSchema the producer schema to ensure registered.
   * @param backwards check backwards compatibility?
   * @param forwards check forwards compatibility?
   * @return id of registered schema.
   */
  int ensureProducerSchema(
          String subject, 
          JsonSchema producerSchema, 
          boolean backwards, 
          boolean forwards) {
    
    JsonSchema normalized = producerSchema.normalize();

    try {
      // Early out if schema already registered:
      return srClient.getId(subject, normalized, false);
    } catch (RestClientException e) {
      // If not already registered, register:
      return registerWriter(subject, normalized, backwards, forwards);
    }
  }

  private int registerWriter(
          String subject, 
          JsonSchema producerSchema, 
          boolean backwards, 
          boolean forwards) {
    
    JsonSchema consumerSchema = SchemaConverter.toConsumerSchema(producerSchema);
      
    // If known subject, i.e. not v1, check compatibility:
    if (srClient.getAllSubjects().contains(subject)) {
      if (backwards) {
        checkCompatability(subject, producerSchema, consumerSchema, false);
      }
          
      if (forwards) {
        checkCompatability(subject, producerSchema, consumerSchema, true);
      }
    }

    // Ensure server-side compatibility checks are disabled:
    srClient.updateCompatibility(subject, "NONE");
    
    // Register normalized producer schema in the Schema Registry:
    return srClient.register(subject, producerSchema);
  }

  private void checkCompatability(
          String subject, 
          JsonSchema newProducer, 
          JsonSchema newConsumer, 
          boolean forwards)  {
    
    // For each registered producer schema:
    for (Integer version : srClient.getAllVersions(subject)) {
      Schema existing = srClient.getByVersion(subject, version, false);
      if (!existing.getSchemaType().equals(JsonSchema.TYPE)) {
        throw new IllegalArgumentException("Existing schema is not JSON");
      }

      JsonSchema oldProducer = (JsonSchema) srClient.parseSchema(existing)
              .orElseThrow();

      List<String> issues;
      if (forwards) {
        // Forward: old schemas reading new data.
        //   all data that conforms to the new (producer) schema 
        //   can be read by the old (consumer) schema:
        ParsedSchema oldConsumer = SchemaConverter.toConsumerSchema(oldProducer);
        issues = oldConsumer.isBackwardCompatible(newProducer);
      } else {
        // Backwards: new schema reading old data.
        //   all data that conforms to the old (producer) schema 
        //   can be read by the new (consumer) schema:
        issues = newConsumer.isBackwardCompatible(oldProducer);
      }

      if (!issues.isEmpty()) {
        throw new IncompatibleSchemaException(newProducer, newConsumer, issues);
      }
    }
  }
}

Presently, these evolution checks are implemented client side in the Creek JSON serde under development. Server-side checks are set to NONE. This does introduce race conditions when registering new schemas.

We’ve raised Issue #2927 in the Schema Registry GitHub repo to hopefully get the improved algorithm into the Schema Registry :crossed_fingers:.

The above code, combined with appropriate calls to ensureProducerSchema and ensureConsumerSchema when creating serializers and deserializers, respectively, results in appropriate schema compatibility checks to ensure system integrity, without any need for convoluted patternProperties.

Et voilà, no more PROPERTY_ADDED_TO_OPEN_CONTENT_MODEL or PROPERTY_REMOVED_FROM_CLOSED_CONTENT_MODEL errors from the Schema Registry!

Evolving JSON Schemas - Part I

2024-01-08T00:00:00+00:00

Confluent’s Schema Registry’s rules for evolving JSON schemas are so limiting as to be basically unusable. In this two-part series we’ll look at why it’s unusable and then, in the second part, how we can leverage Confluent’s JSON schema registry extension to build a more useful evolution model.

A brief history of evolution

No, not the Darwinian sort of evolution. Here, we’re talking about schema evolution and JSON schema evolution in particular.

Recommended reading before this article would be our article on JSON Schema Validators, which gives some good background on schemas in general, Robert Yokota’s article on Understanding JSON Schema Compatibility, which goes in-depth into the specifics of how JSON schema compatibility works, and maybe Confluent’s own documentation on JSON Schema compatibility rules.

If that seems like a lot of reading, or if you’ve previously read these and just need a refresher, then the gist of all of the above can be boiled down to the following:

- Backward compatibility means that readers with a newer schema can correctly parse data written using an older schema, i.e. new schemas can read old data.
- Forward compatibility means that readers with an older schema can correctly parse data written using a newer schema, i.e. old schemas can read new data.
- Full compatibility means being both forward and backward compatible.
- Confluent’s Schema Registry differentiates between a schema being forwards or backwards compatible with its neighbours, or transitively compatible with all schema versions that come before it or after it. (The rest of this article will discuss transitively compatible schema changes.)
- We recommend that all the schemas used to describe data in a Kafka topic should be fully compatible, or FULL_TRANSITIVE in Schema Registry terminology, as data in Kafka topics can be around for a long time.
  - Transitive backward compatibility allows consumers to read data produced with an older schema, either because they were updated before the producing app(s), or because they were lagging during deployment, or because they need to be able to rewind and reprocess old data in the topic, etc.
  - Transitive forward compatibility allows consumers to read data produced with a newer schema, either because they were updated after the producing app(s), or because you want the ability to roll back a bad deployment, which can leave data in topics produced by newer schemas, etc.
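
The definitions above all reduce to one question, asked in different directions: can this reader schema read data written with that writer schema? The sketch below expresses the three transitive modes in those terms. The `CompatibilityModes` class, its method names and the toy `toyCanRead` rule (a schema is just its set of required property names) are assumptions made for illustration, not Schema Registry code.

```java
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

/** Hypothetical sketch: compatibility modes built from a single "reader can read writer" test. */
final class CompatibilityModes {

  /** Backward: the new schema can read data written with every existing schema. */
  static <S> boolean backwardTransitive(S candidate, List<S> existing, BiPredicate<S, S> canRead) {
    return existing.stream().allMatch(old -> canRead.test(candidate, old));
  }

  /** Forward: every existing schema can read data written with the new schema. */
  static <S> boolean forwardTransitive(S candidate, List<S> existing, BiPredicate<S, S> canRead) {
    return existing.stream().allMatch(old -> canRead.test(old, candidate));
  }

  /** Full: both directions hold. */
  static <S> boolean fullTransitive(S candidate, List<S> existing, BiPredicate<S, S> canRead) {
    return backwardTransitive(candidate, existing, canRead)
        && forwardTransitive(candidate, existing, canRead);
  }

  /** Toy rule: a reader can read a writer's data if the writer supplies all the reader requires. */
  static boolean toyCanRead(Set<String> reader, Set<String> writer) {
    return writer.containsAll(reader);
  }
}
```

Under the toy rule, evolving a schema that requires only id into one that requires id and name passes the forward check but fails the backward one, matching the usual rule for adding a required property.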

Is Confluent’s JSON Schema evolution fit for purpose?

Looking at the posts on StackOverflow and GitHub it seems there is some confusion. There’s lots of talk about not being able to evolve schemas in a meaningful way, especially if your aim is full compatibility. People are running into PROPERTY_ADDED_TO_OPEN_CONTENT_MODEL and PROPERTY_REMOVED_FROM_CLOSED_CONTENT_MODEL errors even when performing changes they expect to be compatible.

We’re likely all familiar and comfortable with the standard schema evolution rules for required properties seen with other schema types, e.g.

- Not being able to remove required properties in a forwards compatible way: remember, that’s old schemas reading new data. Old schemas that still require the property can’t read new data that may not contain the property.
- Not being able to add required properties in a backwards compatible way: remember, that’s new schemas reading old data. A new schema requiring a new property can’t read old data that may not contain it.
- Combining the previous two means that, with full compatibility, required properties can neither be added nor removed.

However, we also intuitively expect adding and removing optional properties to be fully compatible. After all, they’re optional, right? Optional properties can be added and removed in any other schema type I can think of.

Unfortunately, this is not how Confluent has implemented its JSON Schema compatibility checks in the Schema Registry.

It’s this inability to add and remove optional properties, when looking for full compatibility, that’s causing people so much confusion. So let’s look into what the Schema Registry is doing and why that results in this unintuitive functionality.

The diagram below shows how the schema registry performs compatibility checks when a new schema version v4 is being added.

- FORWARD_TRANSITIVE checks each existing schema can read data produced by the new schema.
- BACKWARD_TRANSITIVE checks the new schema can read data produced by each old schema.
- FULL_TRANSITIVE compatibility performs both checks.

While this pattern seems sensible and matches that used with other schema types, it causes a problem with JSON Schema, due to how JSON Schema compatibility works, and specifically due to what JSON Schema calls content models. Yokota’s article goes into some detail on JSON Schema compatibility and content models.

Let’s look at each content model and its suitability to the above pattern of compatibility checks.

Evolving closed content model

A closed content model, i.e. one with additionalProperties set to false and no patternProperties, means the data can only contain the properties defined in the Schema; no additional properties are allowed.

{
  "type": "object",
  "properties": {
    "foo": { "type": "integer" },
    "bar": { "type": "string" }
  },
  "additionalProperties": false
}

If we evolve a closed schema by adding a new optional property, then new data could have this new field. The old schema would reject this, breaking forwards compatibility. However, the new schema can read all the old data, so the change is backwards compatible.

If we evolve a closed schema by removing an existing optional property, then old data could still have this property. The new schema would reject this, breaking backwards compatibility. However, the old schema can read any new data, so the change is forwards compatible.

Adding and removing required properties always breaks forwards and backwards compatibility for closed models.

With this model it’s also forward compatible to change an optional property to required, and backwards compatible to change a required property to optional.

So, for a closed content model the following table summarizes valid changes:

Change	Forward Compatible (old schema / new data)	Backwards Compatible (new schema / old data)	Fully Compatible
Add required	:x:	:x:	:x:
Add optional	:x:	:heavy_check_mark:	:x:
Remove required	:x:	:x:	:x:
Remove optional	:heavy_check_mark:	:x:	:x:
Optional -> required	:heavy_check_mark:	:x:	:x:
Required -> Optional	:x:	:heavy_check_mark:	:x:

As you can see, the full compatibility column is all :x:’s, as an operation must have a :heavy_check_mark: in both the forward and backwards compatibility columns to be fully compatible. As the closed-content model doesn’t allow any operations under full compatibility, we can say:

A closed content model is too restrictive and cannot be used to evolve JSON schemas in the Confluent Schema Registry in a fully compatible way.

Evolving open content model

An open content model, i.e. one with additionalProperties set to true, but still no patternProperties, means the data can contain the properties defined in the Schema, and any additional properties of any type.

{
  "type": "object",
  "properties": {
    "foo": { "type": "integer" },
    "bar": { "type": "string" }
  },
  "additionalProperties": true
}

If we evolve an open schema by adding a new required or optional property, then, because an open model allows the data to contain additional properties, it could be possible that there is existing data containing a property with the same name, but a different type, to the new property. The new schema wouldn’t be able to read such old data, breaking backwards compatibility. However, old schemas can read new data, as they will ignore the new property, so long as the old schema does not itself contain a property with the same name and different type, so the change is forward compatible.

If we evolve an open schema by removing an existing required or optional property, then the new data could contain a property with the same name as the removed property, but with a different type. The old schemas wouldn’t be able to read this new data, breaking forwards compatibility. However, the new schemas can read the old data, so the change is backwards compatible.

Like with closed content models, for open models it’s also forward compatible to change an optional property to required, and backwards compatible to change a required property to optional.

For an open content model the following table summarizes valid changes:

Change	Forward Compatible (old schema / new data)	Backwards Compatible (new schema / old data)	Fully Compatible
Add required	:heavy_check_mark:	:x:	:x:
Add optional	:heavy_check_mark:	:x:	:x:
Remove required	:x:	:heavy_check_mark:	:x:
Remove optional	:x:	:heavy_check_mark:	:x:
Optional -> required	:heavy_check_mark:	:x:	:x:
Required -> Optional	:x:	:heavy_check_mark:	:x:

More green ticks here than with the closed model. However, again, if we require full compatibility, then there are no valid operations. Leading us to the conclusion:

An open content model is too open and cannot be used to evolve JSON schemas in the Confluent Schema Registry in a fully compatible way.
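
Both failure modes can be reproduced with a tiny validator for flat documents. The `ToyValidator.accepts` method below is a hypothetical sketch written for this illustration: it maps property names to Java classes rather than JSON Schema types and treats every property as optional, mirroring the foo and bar examples.

```java
import java.util.Map;

/** Hypothetical validator for flat documents where every declared property is optional. */
final class ToyValidator {

  static boolean accepts(
      Map<String, Class<?>> properties,
      boolean additionalProperties,
      Map<String, Object> document) {

    for (Map.Entry<String, Object> field : document.entrySet()) {
      Class<?> expected = properties.get(field.getKey());
      if (expected == null) {
        // Undeclared property: only an open content model lets it through.
        if (!additionalProperties) {
          return false;
        }
      } else if (!expected.isInstance(field.getValue())) {
        return false; // declared property with the wrong type
      }
    }
    return true;
  }
}
```

With a closed model, an old schema declaring foo and bar rejects new data carrying an added baz. With an open model, the old schema accepts a document whose baz is a string, but a new schema that declares baz as an integer rejects that same old document.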

Evolving partially-open content models

If neither closed nor open content models offer us a way to evolve JSON schemas, then that only leaves partially-open content models. A partially-open model either has a more complex schema for additionalProperties, or uses patternProperties, to restrict the schema of additional properties.

The following schema restricts additional properties to being of type string:

{
  "type": "object",
  "properties": {
    "foo": { "type": "integer" },
    "bar": { "type": "string" }
  },
  "additionalProperties": { "type": "string" }
}

While this can allow optional fields of a matching type to be added and removed in a fully compatible way, it restricts the type of those properties to a single schema type, making it impractical.

The following schema restricts additional properties to specific types based on the name of the property:

{
  "type": "object",
  "properties": {
    "i_foo": { "type": "integer" },
    "s_bar": { "type": "string" }
  },
  "patternProperties": {
    "^i_": { "type": "integer" },
    "^s_": { "type": "string" }
  },
  "additionalProperties": false
}

Surely this must allow full compatibility?

Yokota’s article goes into this in more detail and seems to be suggesting this is the way to build a chain of fully compatible schema changes.
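
To see what the two-pattern schema above enforces, here is a rough sketch of its name-to-type rule. The `PatternPropertyTypes` class is a hypothetical illustration only: it covers just the two patterns shown, maps JSON types to Java classes (so an integer must be a Java Integer), and rejects any name that matches no pattern, as "additionalProperties": false would.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of the name-prefix to type rule set up by patternProperties. */
final class PatternPropertyTypes {

  // Mirrors the patternProperties in the schema above: name pattern -> expected Java type.
  private static final Map<Pattern, Class<?>> RULES = new LinkedHashMap<>();

  static {
    RULES.put(Pattern.compile("^i_"), Integer.class);
    RULES.put(Pattern.compile("^s_"), String.class);
  }

  /** True if a property with this name and value is allowed by the rules. */
  static boolean allows(String name, Object value) {
    boolean matched = false;
    for (Map.Entry<Pattern, Class<?>> rule : RULES.entrySet()) {
      // JSON Schema patterns are unanchored searches, hence find() rather than matches().
      if (rule.getKey().matcher(name).find()) {
        matched = true;
        if (!rule.getValue().isInstance(value)) {
          return false;
        }
      }
    }
    // With "additionalProperties": false, a name matching no pattern is rejected.
    return matched;
  }
}
```

Any future property is therefore accepted only if its name carries the right prefix for its type, which is what lets optional properties be added and removed without clashes, at the cost of dictating every property name.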

To our mind, this solution is just too clunky, restrictive and verbose. Not only would patternProperties need to include elements for each type supported by JSON Schema, it would also need to restrict properties on any nested object properties and handle arrays. Our best stab at such a schema would be:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Verbose and restrictive partially open content model",
  "$ref": "#/definitions/obj",
  "definitions": {
    "obj": {
      "type": "object",
      "additionalProperties": false,
      "patternProperties": {
        "^i_": { "type": "integer" },
        "^n_": { "type": "number" },
        "^s_": { "type": "string" },
        "^b_": { "type": "boolean" },
        "^o_": { "$ref": "#/definitions/obj" },
        "^ai_": { "type": "array", "items": { "type": "integer" } },
        "^an_": { "type": "array", "items": { "type": "number" } },
        "^as_": { "type": "array", "items": { "type": "string" } },
        "^ab_": { "type": "array", "items": { "type": "boolean" } },
        "^ao_": { "type": "array", "items": { "$ref": "#/definitions/obj" } }
      }
    }
  }
}

The above doesn’t actually define any properties. It just sets up the rules for mapping property names to types. If you make a mistake in setting this up… you can’t go back later and fix it, as that would break compatibility.
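For a feel of what these rules accept and reject, the name-to-type mapping can be mimicked in a few lines of Python. This is only a sketch under our own assumptions: `RULES`, `is_obj` and `array_of` are hypothetical names, and the type checks are simplified compared to a real JSON Schema validator.

```python
import re


def is_int(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, int) and not isinstance(v, bool)


def is_num(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def is_str(v):
    return isinstance(v, str)


def is_bool(v):
    return isinstance(v, bool)


def array_of(check):
    """Build a check for arrays whose items all pass the given check."""
    return lambda v: isinstance(v, list) and all(check(i) for i in v)


def is_obj(value):
    """Every property name must match a prefix rule and pass its check."""
    if not isinstance(value, dict):
        return False
    for name, v in value.items():
        matched = [check for pattern, check in RULES if re.search(pattern, name)]
        if not matched or not all(check(v) for check in matched):
            return False
    return True


# Mirrors the patternProperties of the schema above.
RULES = [
    ("^i_", is_int),
    ("^n_", is_num),
    ("^s_", is_str),
    ("^b_", is_bool),
    ("^o_", is_obj),
    ("^ai_", array_of(is_int)),
    ("^an_", array_of(is_num)),
    ("^as_", array_of(is_str)),
    ("^ab_", array_of(is_bool)),
    ("^ao_", array_of(is_obj)),
]
```

Note how the rules apply recursively: a nested object under `o_address` must itself only use prefixed property names.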

Even if you can live with such a verbose schema, there are additional issues to consider:

the solution puts restrictions on the names of properties. This isn’t going to work for projects where you’re not in full control of the names of properties.
the solution would not be able to take advantage of any new types added to the JSON Schema standard in the future, as they wouldn’t have appropriate mappings in patternProperties.
the solution probably falls foul of other edge cases, such as a property changing to use a format.

Strictly speaking, we think it may be possible to produce a compatible timeline of schema changes using the partially-open content model, for use-cases where you control the names of properties. But, it wouldn’t be pretty and with all these issues combined, as far as we are concerned:

A partially-open content model is too unwieldy & restrictive to be used to evolve JSON schemas in the Confluent schema registry in a fully compatible way.

Summary

Hopefully, this article has gone some way to explain why using strict JSON Schema compatibility checks, with either closed, open or partially-open content models, doesn’t result in a workable solution for evolving the JSON schemas used to describe the data in your Kafka topics.

Unfortunately, Confluent’s current JSON Schema compatibility checks in its Schema Registry, v7.3.1 at the time of writing, use these strict rules, which makes them, in our honest opinion, unusable.

Primarily, they are unusable because they only allow the addition and removal of optional properties through a verbose and restrictive mapping of property name patterns to property types.

This is the key issue. Confluent’s model requires the forward planning to add property mappings that map any name to a specific type. This trick allows new properties to be added later without fear that they are clashing with existing data that uses the same property name, but with a different property type.
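The clash described above is easy to reproduce. The following Python sketch assumes a toy open content model, where only declared properties are type-checked; `conforms`, `V1` and `V2` are illustrative names of our own.

```python
def conforms(schema_props, doc):
    """Open content model: only declared properties are type-checked."""
    return all(
        isinstance(doc[name], expected_type)
        for name, expected_type in schema_props.items()
        if name in doc
    )


# Version 1 declares a single optional integer property.
V1 = {"foo": int}

# Version 2 adds an optional string property called "baz".
V2 = {"foo": int, "baz": str}

# Valid under v1: the open content model lets "baz" be anything.
old_doc = {"foo": 1, "baz": 42}

valid_under_v1 = conforms(V1, old_doc)
valid_under_v2 = conforms(V2, old_doc)
```

Here `valid_under_v1` is true but `valid_under_v2` is false: data that was perfectly legal when it was written can no longer be read with the new schema, which is why the strict checks reject the change.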

In the second part of this topic, we will look at how we can leverage a mixed-mode approach to JSON Schema compatibility checking that provides a much more user-friendly and clean solution.

Comparison of JSON schema validator implementations

2023-11-14T00:00:00+00:00

One of the big ticket items remaining before Creek can leave alpha is support for serializing complex objects. The first object based serialization format will be JSON, as it’s easy to view and debug messages with standard tooling, and compresses well. Yes, it’s not as efficient as Protocol Buffers or Avro or any number of binary serialization formats. But in our experience, it’s efficient enough for all but the most high-throughput ‘firehose’ applications, and its ease of use outweighs the performance implications.

The importance of schemas

Perhaps the biggest challenge when deploying any highly distributed architecture is having confidence that deploying a new version of one part isn’t going to break other parts of the system.

In a Kafka based microservice architecture all communication between different services is accomplished by sending data to Kafka. Without suitable guardrails in place, deploying an updated service can easily cause catastrophic failures and issues downstream, e.g. the new version of the service might remove a field required by a downstream service.

Schema compatibility

The common solution to this problem is to capture the schemas of the data the service is producing and ensuring any new version of the service has a compatible schema.

Schemas can be backwards compatible, forwards compatible, or both. Briefly, forwards compatibility means data written with one schema version can be read by applications using previous versions of a schema. Conversely, backwards compatibility means data written with one schema version can be read by applications using a new version of the schema.

Backward compatibility means that readers with a newer schema can correctly parse data written using an older schema, i.e. new schemas can read old data.

Forwards compatibility means that readers with an older schema can correctly parse data written using a newer schema, i.e. old schemas can read new data.
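As a rough illustration of the two directions, consider a change that adds a required property. The Python sketch below uses a deliberately simplified model, in which a reader can parse a document if every property it requires is present; `can_read` and the two schema versions are hypothetical.

```python
def can_read(reader_required, doc):
    """A reader can parse a document if every property it requires is present."""
    return set(reader_required).issubset(doc)


V1_REQUIRED = {"id"}
V2_REQUIRED = {"id", "email"}  # v2 adds a required property

old_data = {"id": 1}                            # written with v1
new_data = {"id": 2, "email": "a@example.com"}  # written with v2

forwards_compatible = can_read(V1_REQUIRED, new_data)   # old schema, new data
backwards_compatible = can_read(V2_REQUIRED, old_data)  # new schema, old data
```

In this model the change is forwards compatible, as the old reader simply ignores `email`, but not backwards compatible, as the new reader cannot find `email` in old data.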

Given that data can live in Kafka topics for a long time, e.g. key compacted changelog topics or topics with long, or even no, deletion policies, it is common for Kafka based microservices to encounter data written with both older and newer versions of a schema, regardless of the timing of the release of producer and consumer services. For this reason, it is strongly recommended that you default to ensuring schema changes are both forward and backwards compatible over all versions of the schema.

Any change that breaks compatibility needs to be carefully managed to ensure the roll-out does not break the platform and, in our experience, is often better achieved by producing data to a new topic in tandem with the old for a period of time, then turning off and deleting the old topic once all consumers have migrated.

See the follow-on post series Evolving JSON Schemas for more info on the specifics of evolving JSON Schemas.

Schema registries

The requirement for schemas to be transitively forwards and backwards compatible, i.e. compatible with all previous and future schemas, necessitates the storing of each version of a schema. This is normally achieved through the use of a Schema Registry of some kind: a service that stores the versions of a schema and often both links those schemas to the resources that use them, such as a Kafka topic, and offers the ability to enforce compatibility between versions.
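The core of such a registry can be sketched in a few lines of Python. This is an in-memory toy under our own assumptions, not how any real Schema Registry is implemented: `InMemorySchemaRegistry` and the `required_unchanged` rule are hypothetical, and the rule only looks at required property names.

```python
class InMemorySchemaRegistry:
    """Stores every version of a schema and enforces transitive compatibility."""

    def __init__(self, is_compatible):
        self._is_compatible = is_compatible
        self._versions = {}  # subject name -> list of registered schemas

    def register(self, subject, schema):
        history = self._versions.setdefault(subject, [])
        # Transitive: check against every earlier version, not just the latest.
        for version, existing in enumerate(history, start=1):
            if not self._is_compatible(existing, schema):
                raise ValueError(f"incompatible with version {version} of {subject}")
        history.append(schema)
        return len(history)  # the new version number


def required_unchanged(old, new):
    # Toy rule: a change is fully compatible only if the set of required
    # properties is unchanged. Optional properties may come and go.
    return old["required"] == new["required"]


registry = InMemorySchemaRegistry(required_unchanged)
v1 = registry.register("orders-value", {"required": {"id"}, "optional": {"note"}})
v2 = registry.register("orders-value", {"required": {"id"}, "optional": set()})
```

Attempting to register a third version that adds a required property would raise a `ValueError`, leaving the stored history untouched.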

Schema validation

Having a schema for the data a service is producing, that is known to be compatible, removes the risk of deployments breaking down-stream systems, right? Well… no, not quite. A schema is useless unless there is confidence the data being produced matches the schema. We’ve seen systems with handwritten schemas that differ greatly from the JSON payloads being produced.

It is important that each JSON object being produced to Kafka aligns with the known forward and backwards compatible schema.

In our experience, the best way to achieve this is to build the schema from the code, or the code from the schema, and then to validate each JSON object before producing it to Kafka. Yes, this is relatively expensive. Yes, there is an argument that with perfect testing before deployment this validation step is superfluous. But let’s be honest, how many projects have you worked on with perfect testing?

By validating each and every message before producing to Kafka, you can have confidence your service isn’t going to adversely affect downstream services.

What about validating when reading messages? Surely, as each message is validated before being produced to Kafka there is no need, right? In an ideal world, this would be the case. In the real world, unless your topics are locked down tight so that no person or tool can produce to them without schema validation, then there’s the chance there could be bad data on the topic.

By validating each and every message being consumed from Kafka, bad data is detected before it hits the business logic of a service and can’t contaminate downstream systems.
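The pattern of validating on both sides can be sketched as a pair of serialise and deserialise functions. The Python below is illustrative only: `validate_order` is a hypothetical stand-in for a real JSON Schema validator, and no Kafka client is involved.

```python
import json


class InvalidMessage(Exception):
    """Raised when a message does not match the expected shape."""


def validate_order(doc):
    # Hypothetical stand-in for a JSON Schema validator: requires an integer
    # "id" and, if present, a string "note".
    if not isinstance(doc, dict):
        raise InvalidMessage("expected a JSON object")
    if isinstance(doc.get("id"), bool) or not isinstance(doc.get("id"), int):
        raise InvalidMessage("'id' must be an integer")
    if "note" in doc and not isinstance(doc["note"], str):
        raise InvalidMessage("'note' must be a string")


def serialize(doc, validator):
    """Validate before producing, so bad data never leaves the service."""
    validator(doc)
    return json.dumps(doc).encode("utf-8")


def deserialize(payload, validator):
    """Validate after consuming, so bad data never reaches business logic."""
    try:
        doc = json.loads(payload.decode("utf-8"))
    except ValueError as e:
        raise InvalidMessage(f"not valid JSON: {e}") from e
    validator(doc)
    return doc
```

A payload written to the topic by some other tool, e.g. `b'{"id": 1, "note": 5}'`, is rejected by `deserialize` before any business logic sees it.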

JSON schema validator libraries

Given the importance of validating JSON data against a JSON Schema, our first step to implementing a JSON serialiser for Creek was to determine which validator implementation to use, and there are many.

When our search for functional and performance comparisons of these different implementations drew a blank, we simply wrote our own to test JVM based implementations, and as we’re nice people we open sourced the code and published the results in a microsite.

The functional comparison is achieved by running each implementation through the standard set of test cases. This covers core required functionality and optional features.

The performance comparison is achieved by benchmarking each implementation using the Java Microbenchmark Harness (JMH).

The site auto-updates as new versions of the libraries under test are released, and we’re actively encouraging new validator implementations to be added to the test.

The site is linked to from the implementations page on the JSON Schema website.

Note: Project Bowtie is looking to provide functional comparison of all validator implementations, not just JVM based ones. Bowtie was unknown to us when we started writing our own comparison and, at the time of writing, doesn’t cover the optional functional tests.

Comparison conclusions

Feature comparison

The latest functional results can be viewed on the microsite.

The two graphs visualise the overall number of tests each implementation successfully handles in the draft versions it supports.

At the time of writing, the top three implementations for required functionality are DevHarrel, Medeia and SchemaFriend.

DevHarrel only supports the latest two schema drafts, DRAFT_2020_12 and DRAFT_2019_09, and doesn’t score so well for optional features.
Medeia only supports older schema drafts, up to DRAFT_7.
SchemaFriend supports all versions of the JSON Schema and scores well in both required and optional functionality.

To our mind, SchemaFriend wins in the feature comparison.

Performance comparison

The latest performance results can be viewed on the microsite.

The performance comparison benchmarks two different use-cases.

The first validate benchmark runs each implementation through the functional test suite.
The second serde benchmark runs each implementation through serialising a simple Java object to JSON and back, validating the JSON.

The graphs below capture the essence of the results, covering the latest and an older draft specification. More information is available on the microsite.

At the time of writing, benchmarking of older schema drafts highlighted Medeia and Everit as clear winners. For the more up-to-date schema drafts, Skema, DevHarrel and SchemaFriend lead the pack.

Interestingly, the general cost of validation seems to have increased as the JSON schema specification has evolved. This is likely due to more things being possible, but is a slightly worrying trend as it looks to have increased the cost even for the same simple use-case.

To our mind, for pure speed Medeia is hard to beat, and indeed we have used it successfully in previous companies. Unfortunately, it looks to be an inactive project and only supports up to DRAFT_7.

For newer draft versions, the winners would be Skema, DevHarrel and SchemaFriend.

Conclusions

Hopefully this comparison is useful. The intended use-case will likely dictate which implementation(s) are suitable for you.

For its wide-ranging schema draft version support and being near the top in both functional and performance comparisons, SchemaFriend looks to be a great general-purpose validator library.

If your use-case requires ultimate speed, doesn’t require advanced features or support for the later draft specifications, and you’re happy with the maintenance risk associated with them, then either Medeia or Everit may be the implementation for you.

It’s worth pointing out that Confluent’s own JSON serde internally uses Everit, which may mean they’ll be helping to support it going forward, and may mean this is the best choice for you if other parts of your system already use Confluent’s serialisers and hence compatibility with Everit’s functionality is key.

Note: The author of this post and the repository is not affiliated with any of the implementations covered.

v0.4.1 preview release is available

2023-04-22T00:00:00+00:00

The v0.4.1 patch release of Creek is now publicly available on Maven Central and the Gradle plugin portal.

Outside the usual dependency updates, the reason for the release was to publish enhancements to our Gradle plugins to support Gradle 8, and to fix an issue in the JSON schema plugin that was causing it to generate duplicate schemas.

Fixes and improvements:

(Json Schema: Gradle): 🎉 Gradle 8.x support
(Json Schema: Gradle): :beetle: Fix module whitelisting
(System Test: Gradle): 🎉 Gradle 8.x support.

Release dependency updates:

Bump Slf4j from 2.0.6 to 2.0.7.
Bump TestContainers from 1.17.6 to 1.18.0.
Bump info.picocli:picocli from 4.7.1 to 4.7.3.

Outside of doing this release, time is being spent investigating and comparing the different JVM-based JSON Schema validator libraries. This will drive the decision on which validator library to use for the new JSON SerDe, which is also being worked on.

We’ll let you know when the comparison is complete and share the results.

New tutorial: Kafka Streams - Aggregate APIs

2023-03-21T00:00:00+00:00

It gives me great pleasure to announce that the third, and final, tutorial in the quick-start series is now live :tada:.

The Kafka Streams aggregate API tutorial builds upon the work done in the first Basic Kafka Streams tutorial to walk users through defining the API of an aggregate, wrapping parts of a system that don’t use Creek in an aggregate, and how to integrate one aggregate with another.

Combined, it’s hoped the quick-start tutorial series will provide a great introduction to the power of Creek and how to use it to build a tested, reliable microservice architecture quickly.

I’m very happy to announce this tutorial because it completes the series, but mainly because it means I can stop working on documentation and tutorials for a moment and pivot to coding :smiley:!

Next on the list of tasks is adding JSON support to Creek. This is a biggie in terms of effort and impact. Creek’s not much use in a real-world situation until it’s done.

Once JSON support is complete, Creek will be close to moving from alpha to beta release status. Feel free to view the MVP project board to see what’s remaining.

It’s worth noting that, while it isn’t documented yet, the serialisation formats used by Creek Kafka are totally customisable. JSON support is the first on the cards, but Avro, Protobuf, and others, including organisation-specific serialisation formats, are easily supportable.

I’ll update you once JSON support is out…

v0.4.0 preview release is available

2023-03-14T00:00:00+00:00

The v0.4.0 minor release of Creek is now publicly available on Maven Central and the Gradle plugin portal.

The highlights of this minor release are:

Fixes and improvements:

(System Tests: Gradle): :beetle: Fix around debugging services during system testing, where more than one service is defined.
(System Tests): 🎉 Enhance system test executor options to allow caller to supply env vars for debugging to support the above bug fix.
(System Tests): :beetle: Ensure Docker container logs are captured on error.

Dependency updates:

Bump log4j from v2.19.0 to v2.20.0.
Bump io.github.classgraph:classgraph from v4.8.154 to v4.8.157.

Work has started on the third tutorial in the quick-start series, which covers connecting aggregates. We’ll let you know when it is ready.

New tutorial: Kafka Streams - Connected Services

2023-03-11T00:00:00+00:00

After a long wait, due to other commitments, I’m happy to announce the release of the second tutorial in the quick-start series: the Kafka Streams connected services tutorial is now live!

This follows on from the basics covered in the Basic Kafka Streams tutorial. Work will now start on the third, and final, part of the quick-start tutorials. This work is tracked under issue-259.

Combined, it’s hoped these three tutorials will provide a great introduction to the power of Creek and how to use it to build a tested, reliable microservice architecture quickly.

I’ll let you know once the third quick-start tutorial is up…

v0.3.2 preview release is available

2023-02-16T00:00:00+00:00

The v0.3.2 patch release of Creek is now publicly available on Maven Central.

This small patch contains a few dependency updates to fix some security vulnerabilities in dependencies. Nothing really worth calling out as being fixed, as it’s mostly stuff that wouldn’t affect their use in Creek.

The same vulnerabilities still exist in Snake YAML and Jackson core as for the 0.3.1 release. Creek will be updated once there are patches available for these. Neither is of real concern to Creek due to the way the libraries are used in Creek.

Work has started on the next tutorial, which covers how to connect services together within the same aggregate. We’ll let you know when it is ready.

v0.3.1 preview release is available

2023-01-30T00:00:00+00:00

The v0.3.1 patch release of Creek is now publicly available on Maven Central.

However, I will call out a few of the remaining ‘vulnerabilities’ in Creek dependencies.

Snake YAML’s Deserialization of Untrusted Data

See CVE-2022-1471 & GHSA-mjmj-j48q-9wg2.

At the time of writing, this was marked with High / Critical priority. However, if you read up on the vulnerability, you’ll see the vulnerability is that the deserializer allows instantiation of arbitrary types, and this can lead to remote code execution if you’re parsing YAML from an untrustworthy source, e.g. text submitted from a form on a website.

This is not an issue for Creek, as all YAML being deserialized is from a trusted source, i.e. you, the user, running Creek system tests written in YAML.

SnakeYaml isn’t used directly by Creek. Creek makes use of it via Jackson. Fixing this non-issue in Creek is not currently possible.

Jackson core’s Uncontrolled Resource Consumption

See sonatype-2022-6438.

At the time of writing, this is marked with High priority. However, if you read up on this vulnerability, this is also about parsing data from an untrustworthy source.

This is not an issue for Creek, as all data being deserialized is from a trusted source, i.e. you, the user, running Creek system tests written in YAML.

There is already a fix in Jackson. Creek will update to 2.15.0 of Jackson when it is released.

Kafka Streams’ divide by zero

See sonatype-2019-0422.

This seems to be a vulnerability detected by SonaType OSS Index scanning a PR that fixed a potential divide-by-zero issue. The PR was never merged, hence the vulnerability report. However, from the PR comments it looks as though this issue is unlikely, or even impossible, to be hit.

An issue has been raised to track a potential fix.
Creek will be updated should a fix become available.

Sign off

As for continuing work on Creek: my focus is currently elsewhere for the next month or two, but I will get to those tutorials soon, and will post when I do!

v0.3.0 preview release is available

2023-01-12T00:00:00+00:00

It’s been somewhat delayed by Christmas holidays, but we’re happy to announce the v0.3.0 release of Creek is now publicly available on Maven Central!

The new feature in this release is code coverage analysis while running system tests. As identified when writing the first tutorial, code coverage analysis wasn’t happening during system tests. This meant it was hard to know what code was and, more importantly, was not being exercised by the system tests. As we feel the system tests are such a powerful feature of Creek, this was obviously a hole that needed filling.

With this release, code coverage metrics are now captured for system tests! Creek is capturing code coverage data from your service code running inside Docker containers. This is achieved by mounting a directory into the service container that contains the JaCoCo coverage agent and starting services with the JaCoCo agent set in the JAVA_TOOL_OPTIONS environment variable, so the services pick it up.
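To illustrate the mechanism, the Python sketch below builds an equivalent `docker run` command line by hand. This is not Creek’s actual implementation: the function names, the `/opt/coverage` mount point and the `jacoco.exec` file name are our own assumptions, though `-javaagent` and JaCoCo’s `destfile` agent option are standard.

```python
def jacoco_tool_options(agent_jar, dest_file):
    """Build the JVM option that attaches the JaCoCo agent."""
    return f"-javaagent:{agent_jar}=destfile={dest_file}"


def docker_run_args(image, host_dir, mount_point="/opt/coverage"):
    """Build a docker command that mounts the agent and enables coverage."""
    # Assumes host_dir contains jacocoagent.jar; the coverage data is
    # written back to the same mounted directory.
    options = jacoco_tool_options(
        f"{mount_point}/jacocoagent.jar",
        f"{mount_point}/jacoco.exec",
    )
    return [
        "docker", "run",
        "-v", f"{host_dir}:{mount_point}",
        "-e", f"JAVA_TOOL_OPTIONS={options}",
        image,
    ]


# Hypothetical image name and host directory, for illustration only.
args = docker_run_args("example/my-service:latest", "/tmp/coverage")
```

Because the JVM reads JAVA_TOOL_OPTIONS on startup, the service image itself needs no changes to be instrumented.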

With that out of the way, focus can return to knocking out a few more tutorials :)

We’ll post again when those tutorials go live…