Data Modeling an Alcohol Product Catalog

A headshot of Jay.

Jay Sobel

October 5, 2024

Today I’ll outline a data model for alcohol products, and at the end write down every nuance I know in the space.

This is not meant to be a thought piece. I haven’t used ChatGPT to fill in fluffy details. Let's get into it.

Data modeling, like many engineering tasks, is a balance of capabilities and costs, and of complexity and practicality.

Ultimately the worth of a data model is what you actually do with it. Not just the infinite reconfigurability. Drizly made a $1.1B exit despite never properly modeling vintages. Did customers complain sometimes, yes, but it just wasn’t that important of a feature.

What is an Alcohol Product, exactly?

Let's define an alcohol product as an item you can find on a retailer’s shelf. Alcohol products are a nice little category because they share many characteristics, and their differences are not so vast that they can’t be lumped together. 

On the Unknown

Data Modeling encourages a completeness of though; let’s enumerate every possible feature of a product. In real world information is never fully available, and not every feature is going to create value for your business. If you’re trying to sell products online, you might get better results putting all your effort into photos, and very little into ABVs and vintages.

Mapping the whole space is a waste of time. Unknowns need to be prioritized.

I will add however, that we increasingly live in a landscape of free, automatic digital omniscience. In 2022 if you wanted to increase the coverage of details in your catalog, you likely had to pay a real person to spend time running Google Searches and typing into spreadsheets. This can now be automated with AI for several orders of magnitude less cost, and likely with more accurate results.

If the answer exists digitally, then accessing it is really just a dollar cost.

A Data Model

Liquid

Bud Light is Bud Light. Many of the most interesting consumer-oriented dimensions of a product live at this grain. The liquid has a category, a producer, a region of origin, an ABV and so-on. We could say It has a density, but not a volume. It’s the idea of the liquid.

Vintage

A liquid has many liquid-vintages.

Wine liquids (and some spirits and beers) are further distinguished by the year they were produced, in some cases. 

A liquid can come in many vintages. It can also come in NULL vintages; Bud Light is just Bud Light. Or even a mix of unspecified and specified. 

This is a dimensionless mapping table. There’s not much to say about 2021 the year that applies to all associated liquids. But some sites do model this out into regional seasonal dimensions so they can say that X year in Y region had a, b, c characteristics. You could do the same for specific producers, but this is beyond the scope of the “product catalog” described here..

Container

A liquid-vintage-container has many containers.

A liquid can be sold in many different containers. Bud Light comes in a 12 oz can and a 16oz can and a 12oz bottle and even a 12oz bottle with a regional sports team themed label…

The key dimensions of a container are:

  • Volume
  • Volume units
  • Type (can, bottle, keg)
  • Material (glass, plastic)

You could even go farther, but there are diminishing returns to delineating different types of plastic bottles, no matter who the catalog’s intended audience is. It may make sense to maintain containers as a separate table unlike vintages. There are several dimensions to track that would be ugly to denormalize.

And excitingly for COLA Cloud, a container has (many) label image(s).

Packaging

A “six pack” is a type of package that contains six containers. A liquid-vinatage-container has many packagings. A single bottle of wine is a 1-pack. And a four pack can be in a box, or joined using can-rings, etc. 

The key dimensions of a package are: 

  • number of containers
  • Type (box, can rings, carrier)
  • Images
  • Sales data?

With Package we have finally reached an item you find on a retailer shelf; “a product”.

Something that has not been handled here is the variety-pack case. Can a packaging have multiplel liquid-vintage-conatainers? This increases complexity considerably, but the only alternative is to call “Variety Pack Seltzer” a liquid, which seems similarly incorrect. I won’t go a detailed solution here -  I doubt it wouldn’t be worth the benefit. I consider the “Variety Liquid” to be a relatively acceptable compromise.

Barcodes

This is the first real foray from “a catalog of products” toward a component of a larger system that needs to integrate with other systems.

Ideally, one or more barcodes relate to a combination of liquid-vintage-container-packages, and uniquely identify them. That would be a great world to live in. The key dimensions of a barcode are its type (UPC is one of several) and value. 

QR codes could arguably fall under this same category, as they can appear on a container’s label or a packaging, and typically evaluate to a URL, sometimes shortened. QR codes can also contain a data payload, like a name and phone number.

Unfortunately, UPC barcodes are a privately organized system - literally a company called GS1 “allocates” you a number for $30 for a year.

The biggest enforcers of sanitary UPC practices are major retailers like Costco who will not stock products that do not follow sane UPC practices, because it would be too much of an operational headache to try and sort them out.

Here are several UPC bad-practices in descending order of infamy.

  1. Seasonal beers sharing a UPC year-round; different liquids, same UPC.
  2. Vintages sharing the same UPC across years; different vintage, same UPC
  3. Cans joined together with can rings with the UPC on each individual can. A packaging UPC matching a container UPC.
  4. Truncation and/or padding of leading and trailing 0’s in PoS systems.
  5. Truncation of leading zeros by Microsoft Excel automatic type detection (an absolute classic).
  6. Libertarian producers (or data-integrity extremists) making up their own UPC.
  7. Europe using different–likely superior–formats.

Product Photos

In ecommerce, photos are the most valuable single piece of information on a page. They communicate to the user that this product really does exist in a way that ABV and a description fail to.

As mentioned in the packaging section, a package has photos. A photo might depict solely a container, but that can be considered the NULL package. And a container also has label images which can serve as an image of last-resort, and with an understanding of legal rights to imagery.

They feel a little like barcodes. And ultimately the access pattern for images may involve fallbacks from the specific package, to any container, to a label, to anything associated with the liquid + a little notice that the depiction may be inaccurate. Photos are valuable enough to warrant such treatment.

Brand and Company

This is the least fleshed out area of this post. This could be structured in terms of brands belonging to companies belonging to conglomerates, and cope with brands changing hands as they often do in this industry.

The minimal implementation of this would be to just cover brands, and relate liquids to brands. A nice simple relationship, as a liquid cannot realistically change brands. Several dimensions of a liquid could be relocated to the brand. A brand has a geographic relationship that generally supersedes that of a liquid. A brand also has a year when it was established, and a nice collection of products that a consumer might be interested in.

If you're building a B2B service that services supplier companies, then a company model becomes very important, perhaps even more so than liquids. Useful dimensions of a company might be their number of brands, number of products, age, ownership.

How COLAs Fit In

If “a product” is a combination of entity relationships, then where does a COLA fit in?

The data model described above is the scaffolding of a product catalog. Once the model is defined, the actual data needs to be filled-in, and this is where COLAs come into play.

The Registry

The COLA Registry contains over 2.5M label approvals for alcohol products that are sold in the US. It grows by ~3,000 approvals each week. Every COLA contains a product label, which fits nicely into this model at the granularity of “a container” and the label contains information about the liquid, vintage, as well as the brand and company. Labels often contain UPC and QR barcodes. These are incidental, not required by the TTB, but can provide a high degree of confidence for catalog integration. 

Catalog Integration

The most basic way to integrate COLA data into this model would be to join COLA records into existing records using exact equality of the UPCs. The product details structured by COLA Cloud can be easily mapped into whatever fields and taxonomies used in the catalog. This process can run once, and continuously to automate detail capture into the future.

A more ambitious integration would use COLA data to insert new records into a catalog. Newly approved COLAs have probably never been added to your catalog before because they haven’t reached the market yet. They could be inserted with relative confidence. For wine and liquor, you can assume packaging is irrelevant and every label corresponds to a single bottle. In this case, a COLA is almost 1:1 with a product.

COLA Caveats

There are a couple other caveats to be aware of when likening COLAs to products.

First, not every product requires a COLA. There are law practices that specialize in these details, so take my list as the biggest generalities. This is data advice, not legal advice!

  1. Non-material changes to the label. For example graphics, colors, vintage years, or UPCs.
  2. Slightly different container sizes do not need to apply for separate COLAs.
  3. Products sold only within one US State. The TTB is a federal agency, after all.

Second, imported products are submitted to the TTB by their US-based importers. Duplicate COLAs can be created when different importers submit applications for the same imported product’s label. Imported products make up a large portion of the dataset (and the market itself), and they need to be considered with extra care.

Lastly, COLAs can be submitted by mail! Around 0.3% of approvals will display as large JPG scans of physical documents, rather than HTML pages. In these scans, the “label images” are often physically attached to the sheet of paper. If you were trying to launch a product quietly, this is how you’d do it! At least until COLA Cloud starts paying for GPT-4 parsing of these rare records…

Get the data

Thank you! We'll get back to you as soon as possible.
Hmm something about that didn't work out. You can also email me directly at jay@colacloud.us