How to automatically generate rich descriptions using Droughty, dbt, and OpenAI

Lewischarlesbaker
Rittman Analytics Blog
5 min read · Jan 3, 2023


A data warehouse should do more than model and test data for the purpose of descriptive analytics, only to be exposed in a visualisation tool. As this article explores, a data model organises ontologies and embeds semantic conventions, providing a unique view of curated information that takes on a life of its own.

In this article, we will look at how the structure we impose allows us to do exciting things that we could not otherwise do, such as droughty's new feature, droughty docs.

The problem to be solved

When building a data warehouse, many small things turn it from good to great. The problem is that, done manually, these small things take a lot of time… and can be pretty dull.

One of these tasks is writing field descriptions. Often an afterthought, field descriptions in some warehouses are never populated at all. Even when they are, they're often little more than the field name with some token verbiage.

This is where dbt, OpenAI and Droughty's new feature, droughty docs, come in. Together, they aim to save engineers days of work whilst improving the richness and detail of warehouse descriptions.

To explore this new feature, we’ll cover the following sections:

  1. Feature summary
  2. Implementation
  3. The future vision

Feature summary

droughty docs scans your warehouse's columns and, for columns that don't have descriptions, generates candidate descriptions using OpenAI's text-davinci-003 model.

This feature saves hours by inferring detailed descriptions for thousands of columns in a single pass. These descriptions add detail to warehouse entries that improves the data's discoverability and helps end-users contextualise the information they are served. An example of an inferred description is:

{% docs jira_issue_key %}

A Jira issue key is a unique identifier used to identify a specific issue
or task in a Jira system. They typically consist of a project abbreviation
followed by a hyphen and a numerical value. For example, ACC-123
is an example of a Jira issue key.

{% enddocs %}

This automatically generated description provides a much-needed explanation for the field and is far richer than the terse description most engineers would write by hand.

You can then ingest these descriptions into the warehouse using dbt and propagate them to downstream exposures using droughty cube, droughty lookml or droughty dbml.

[Screenshot: the generated description surfaced in BigQuery]
[Screenshot: the generated description in a Looker data dictionary]

These descriptions are far more detailed than any I would write myself. They are propagated throughout downstream tools by droughty's automatically generated code; they also come from a canonical source and require no additional work, irrespective of where they are consumed downstream.
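Under the hood, the idea is simple: fetch the columns that lack descriptions, ask the model to describe each one, and write the results out as dbt doc blocks. The sketch below is my own illustration of that loop, not droughty's actual implementation; the column list, prompt wording and output path are all assumptions, and it uses the openai Python library's Completion API as it existed when text-davinci-003 was current.

import openai

openai.api_key = "sk-...."  # the same OpenAI secret used in your droughty profile

# Assumption: in practice these would be fetched from the warehouse's
# information schema; hard-coded here for illustration.
undocumented_columns = ["jira_issue_key", "harvest_project_bill_by"]

blocks = []
for column in undocumented_columns:
    # The verbose column naming convention (see "Additional considerations")
    # is what gives the model enough context to produce a useful answer.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Describe the data warehouse column '{column}' in two sentences.",
        max_tokens=100,
        temperature=0,
    )
    description = response["choices"][0]["text"].strip()
    blocks.append(f"{{% docs {column} %}}\n\n{description}\n\n{{% enddocs %}}")

# Write the candidate descriptions as a dbt docs file
# (file name assumed; see the project parameters below).
with open("warehouse_docs/openai_field_descriptions.md", "w") as f:
    f.write("\n\n".join(blocks))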

Implementation

droughty docs is simple to set up and can be split into seven steps:

  1. Install or update to at least droughty 0.9.1
  2. Create an OpenAI secret
  3. Configure profile parameters
  4. Configure droughty docs project parameters
  5. Build dbt
  6. Run droughty for downstream exposures
  7. Additional considerations

Install or update to at least droughty 0.9.1

Run pip install --upgrade droughty in your terminal or follow the installation guide here.
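If you want to double-check the installed version from Python, the standard library can read the installed package metadata. This check is my own convenience, not part of droughty:

from importlib.metadata import version  # Python 3.8+

# Should print 0.9.1 or later.
print(version("droughty"))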

Create an OpenAI secret

OpenAI API keys are created from your OpenAI account settings, and there are plenty of good guides online covering the process.

Configure droughty docs profile parameters

If you already have droughty installed and set up, this step takes moments. Add the openai_secret parameter to your profile.yaml file.

If you’re new to droughty, follow the configuration documentation here. An example profile.yaml file can be found below:

droughty_demo:
  host:
  key_file: /Users/droughty_user/[key_file]
  password:
  port:
  project_name: example-project
  schema_name: analytics_qa
  user:
  warehouse_name: big_query
  openai_secret: sk-....
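As a quick sanity check, you can confirm that the secret parses correctly out of the profile with PyYAML. This snippet is my own, not part of droughty, and assumes the file name and profile name from the example above:

import yaml  # pip install pyyaml

with open("profile.yaml") as f:
    profiles = yaml.safe_load(f)

# Fails loudly if the new parameter is missing from the profile.
assert "openai_secret" in profiles["droughty_demo"], "openai_secret is missing"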

Configure droughty docs project parameters

Once you have your OpenAI secret in your droughty profile.yaml file, all you have to do is set up two parameters in your droughty_project.yaml.

openai_field_descriptions_path: warehouse_docs
openai_field_descriptions_filename: openai_field_descriptions

The openai_field_descriptions_path parameter tells droughty where to write the generated descriptions, and openai_field_descriptions_filename lets you override the default output filename.

Find a full droughty_project.yaml example below:

profile: example_project

dimensional_inference: enabled

field_description_path: warehouse_docs
field_description_file_name: field_descriptions.md

openai_field_descriptions_path: warehouse_docs
openai_field_descriptions_filename: openai_field_descriptions

test_schemas:
  - lewis_analytics_dev_staging
  - lewis_analytics_dev_integration
  - lewis_analytics_dev

test_overwrite:
  models:
    wh_marketing__web_event_items_fact:
      web_event_item_pk:
        - not_null
        - dbt_utils.at_least_one
        - unique
      web_event_parameter_float_value:
        - dbt_utils.at_least_one

test_ignore:
  models:
    - base_backend__web_events
    - base_ga4__web_events

dbml_schemas:
  - lewis_analytics_dev_staging
  - lewis_analytics_dev_integration
  - lewis_analytics_dev

dbml_filenames:
  - test_10
  - test_11
  - test_12

Build dbt

Run dbt build to ingest the field descriptions into the warehouse (if your dbt project enables dbt's persist_docs config, the column descriptions are written to the warehouse as part of the run).
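Before building, it can be worth listing the doc blocks droughty generated so you know what dbt is about to pick up. This small helper is my own, and assumes the output path and filename from the example configuration above (the .md extension is assumed):

import re

with open("warehouse_docs/openai_field_descriptions.md") as f:
    content = f.read()

# dbt doc blocks have the form: {% docs column_name %} ... {% enddocs %}
for name in re.findall(r"\{%\s*docs\s+(\w+)\s*%\}", content):
    print(name)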

Run droughty for downstream exposures

Run any of droughty lookml, droughty cube or droughty dbml to produce code for downstream tooling.

Additional considerations

There are a few other considerations.

  1. Droughty will check for a field_descriptions.md file to see which descriptions already exist. It will look either in the default docs directory (/warehouse_docs/field_descriptions.md) or in the path specified within the droughty_project.yaml file:

field_description_path: warehouse_docs
field_description_file_name: field_descriptions.md

  2. In our warehouses, we use verbose naming conventions. Suppose we have a column called name in a project table from a Jira source: we would name it jira_project_name. This takes a few moments more when staging the data but significantly decreases ambiguity and improves discoverability. droughty docs depends on this convention, as the verbose names give the OpenAI model much-needed context.

  3. The output descriptions require sanity checking, as the results can exceed expectations but can also fall short of them. For example, the following column pertains to the Harvest SaaS platform, yet returns the following result:


{% docs harvest_project_bill_by %}

The Harvest Project bill is supported and funded by the United States Department of Agriculture's Farm Service Agency and the Natural Resources Conservation Service.

{% enddocs %}

This shows some of the current limitations. A feature on the roadmap will allow end-users to provide additional context to the text-davinci-003 model through the droughty_project.yaml file.

The future vision

The detailed column descriptions open up opportunities for future analytics features that let end-users ask complex questions in natural language and receive rich semantic responses, returning facts, dimensions, aggregates and their long-form contextual descriptions without any manual input.

droughty docs not only automates the repetitive task of writing field descriptions for end-users but can also improve the quality of those descriptions. The future ambition is for droughty to use these field descriptions when activating the data, using the context they provide to add richer meaning to the data end-users explore.

What is droughty?

Droughty is an open-source semantic modelling tool; it takes warehouse metadata and outputs semantic files in different languages for downstream consumption.

For more information, please refer to https://droughty.readthedocs.io/en/latest/index.html or https://github.com/LewisCharlesBaker/droughty

For help or assistance on anything referenced in this article, or to help build out your analytics capabilities, contact us at Rittman Analytics now to organise a 100% free, no-obligation call — we'd love to hear from you!
