What is a Data Dictionary?

2021年4月27日

Organizations are collecting, generating, and storing more data than ever before. While it is fantastic to have an abundance of information to drive evidence-based decisions, it won’t do any good unless people in the company have the resources to fully understand what is available.

This is where a data dictionary comes in. A data dictionary provides guidance about specific datasets so that people who may not be familiar with the underlying data are still able to explore and make connections with confidence.

Searching for data

Let’s say your manager asks you to retrieve some market data. The business goal is to decide in which country should the company open its first global office. Senior leadership wants a stable operating environment that also has an attractive domestic market for new sales opportunities.

You were fortunate enough to find a global data set posted by an economics graduate student on her website. It includes several potentially useful variables. Score!

You open the Excel file and are pleased with the country coverage. The top of the dataset looks something like this:

country	region	college_share	gdp_billions	pop_millions	gini
Angola	Middle East & Africa	9.3	99.0	31.8	51.3
Albania	Europe	59.8	14.9	2.9	33.2
Argentina	Americas	90.0	437.8	44.9	41.4
Armenia	Europe	51.5	14.0	3.0	34.4
Australia	Asia Pacific	107.8	1450.5	25.4	34.4
Austria	Europe	86.7	448.8	8.9	29.7
Azerbaijan	Asia Pacific	31.5	58.9	10.0	26.6
Burundi	Middle East & Africa	4.1	2.4	11.5	38.6
Belgium	Europe	78.9	546.9	11.5	27.4
Benin	Middle East & Africa	12.5	14.9	11.8	47.8

The good news is that there doesn’t appear to be any missing data and that the table is already well-organized for data analysis, consisting of:

Rows

Observations from a single object of interest (e.g., Argentina)

Columns

Variables that describe or measure something for each object (e.g., Population)

Cells

The value of a specific variable (e.g., 44.9 million)

The more you look at it, the more you realize that just because you have data doesn’t necessarily mean that you’re ready to start making sense of it.

Some of the column names seemingly make sense such as region and pop_millions. Others, like gini and college_share, are rather ambiguous. So, what do you do?

This lack of clarity is a real problem. Here we are only talking about one small dataset of which you are personally invested, but imagine all the data tables that your organization dumps into SQL Server with minimal documentation.

Making sense of new data

We need to answer several questions before we have the confidence to perform meaningful analysis.

What is our unit of analysis?

Generally, what do we find each row to represent? Is our unit of analysis a person, a place, and product?

What do the column names mean?

Variable names can be messy and it is not uncommon to find unknown abbreviations or random characters such as ISO3c, XRTY79, REG2 at the top of each column.

When building or cleaning your own datasets, make column names as descriptive as possible. Also, avoid white space that could turn into a headache during subsequent analysis. I prefer using all lowercase with the underscore character as needed. For example, instead of Population in 2020, use population_2020. It is important to pick an approach and stick with it.

What are the variable types?

Are the variables text (e.g., country) or numeric (e.g., gdp_billions). Are the numeric values scaled in some way? Are percentages reported in decimal form (e.g., 0.4 vs. 40)? Do date variables have the format mm-dd-yyyy, dd-mm-yyyy, or something else?

This is helpful for anyone trying to use your new-found data or understand your analysis. It is critical if you plan to load the data points into company systems or incorporate assets into Business Intelligence (BI) tools.

Who collected the data?

Where did the data come from? Did they collect it themselves or compile it from secondary sources? Is it available online? Are there terms of use associated with it? Is the source considerable reputable? What biases might exist?

How was it collected?

Did the results come from a survey? How many people responded? Was it representative of a wider group of interest or a convenient sample based on internet traffic?

How often is the data updated?

One of the most common follow-up questions you’ll hear is how has the data changed over time? Knowing when the next data refresh will occur or if historic data exists is therefore very helpful.

Who will be responsible for making updates and ensuring accuracy?

Many companies have difficulty assigning ownership to datasets. If the data is truly an organizational asset, someone needs to guide the collection, cleaning, and storing process.

A formal data dictionary

All of this information should be documented somewhere in a data dictionary or data glossary. This can be as simple as a shared word doc or as formal as a dedicated tool that connects directly to a company’s data storage systems.

What should you do about the data provided in our example above?

You decide to reach out to the graduate student and ask these clarifying questions. You’ll thrilled when she responds that the data came from the World Bank’s World Development Indicators. The resources pulls from many original sources and the economist included only the most recent year available in her spreadsheet.

You’re able to construct the following, which you add to the first page of your data spreadsheet. This act of data kindness positions those removed from the data retrieval process to have a fighting chance in understanding and using the information.

variable	full_name	description	type	source	updated
country	Country name	Full country name based on World Bank list.	Text	World Bank
region	Region name	Region name. Aggregated from several World Bank subregions.	Text	World Bank
college_share	Share of student-age population going to college	The number of people enrolled in tertiary education as a proportion of a country’s student age population. Note: It can be higher than 100 due to inbound student mobility.	Number	UNESCO	Sep
gdp_billions	Gross Domestic Product (GDP) in billions	The total amount of annual economic output as measured by GDP in billions of US dollars.	Number	OECD National Accounts	Feb
pop_millions	Population in millions	The total population size of a country in millions.	Number	UN Population Division	Jun
gini	Gini Coefficient	A measure of income inequality that ranges from 0 to 100 with higher values indicating greater levels of income inequality.	Number	World Bank	Mar

Taking ownership

Documentation, include data dictionary creation, doesn’t always sound like a fun task. And if you were the one who found the data and are using it for a specific analysis, it might not even seem necessary.

However, until organizations are able to bring their disparate data assets into a connected environment with resources to make sense of what’s available, it is unlikely that anyone will extract full value from the collective efforts.

Just as clean data is better than messy data, understandable data is better than cryptic data. We are only fully confident to use data when it is both understandable and clean.

Data Usability Matrix

	Understandable	Cryptic
Clean	I’m fully confident in knowing what I’m working with and doing analysis on what’s included.	I can analyze the data, but unsure how to interpret the results.
Messy	I know what the data is, but it is hard to perform analysis in its current form.	I don’t know what I’m looking at and can’t do much with it.