Organizations are collecting, generating, and storing more data than ever before. While it is fantastic to have an abundance of information to drive evidence-based decisions, it won’t do any good unless people in the company have the resources to fully understand what is available.
This is where a data dictionary comes in. A data dictionary provides guidance about specific datasets so that people who may not be familiar with the underlying data are still able to explore and make connections with confidence.
Let’s say your manager asks you to retrieve some market data. The business goal is to decide in which country should the company open its first global office. Senior leadership wants a stable operating environment that also has an attractive domestic market for new sales opportunities.
You were fortunate enough to find a global data set posted by an economics graduate student on her website. It includes several potentially useful variables. Score!
You open the Excel file and are pleased with the country coverage. The top of the dataset looks something like this:
country | region | college_share | gdp_billions | pop_millions | gini |
---|---|---|---|---|---|
Angola | Middle East & Africa | 9.3 | 99.0 | 31.8 | 51.3 |
Albania | Europe | 59.8 | 14.9 | 2.9 | 33.2 |
Argentina | Americas | 90.0 | 437.8 | 44.9 | 41.4 |
Armenia | Europe | 51.5 | 14.0 | 3.0 | 34.4 |
Australia | Asia Pacific | 107.8 | 1450.5 | 25.4 | 34.4 |
Austria | Europe | 86.7 | 448.8 | 8.9 | 29.7 |
Azerbaijan | Asia Pacific | 31.5 | 58.9 | 10.0 | 26.6 |
Burundi | Middle East & Africa | 4.1 | 2.4 | 11.5 | 38.6 |
Belgium | Europe | 78.9 | 546.9 | 11.5 | 27.4 |
Benin | Middle East & Africa | 12.5 | 14.9 | 11.8 | 47.8 |
The good news is that there doesn’t appear to be any missing data and that the table is already well-organized for data analysis, consisting of:
Observations from a single object of interest (e.g., Argentina)
Variables that describe or measure something for each object (e.g., Population)
The value of a specific variable (e.g., 44.9 million)
The more you look at it, the more you realize that just because you have data doesn’t necessarily mean that you’re ready to start making sense of it.
Some of the column names seemingly make sense such as region
and pop_millions
. Others, like gini
and college_share
, are rather ambiguous. So, what do you do?
This lack of clarity is a real problem. Here we are only talking about one small dataset of which you are personally invested, but imagine all the data tables that your organization dumps into SQL Server with minimal documentation.
We need to answer several questions before we have the confidence to perform meaningful analysis.
Generally, what do we find each row to represent? Is our unit of analysis a person, a place, and product?
Variable names can be messy and it is not uncommon to find unknown abbreviations or random characters such as ISO3c
, XRTY79
, REG2
at the top of each column.
When building or cleaning your own datasets, make column names as descriptive as possible. Also, avoid white space that could turn into a headache during subsequent analysis. I prefer using all lowercase with the underscore character as needed. For example, instead of Population in 2020
, use population_2020
. It is important to pick an approach and stick with it.
Are the variables text (e.g., country
) or numeric (e.g., gdp_billions
). Are the numeric values scaled in some way? Are percentages reported in decimal form (e.g., 0.4 vs. 40)? Do date variables have the format mm-dd-yyyy
, dd-mm-yyyy
, or something else?
This is helpful for anyone trying to use your new-found data or understand your analysis. It is critical if you plan to load the data points into company systems or incorporate assets into Business Intelligence (BI) tools.
Where did the data come from? Did they collect it themselves or compile it from secondary sources? Is it available online? Are there terms of use associated with it? Is the source considerable reputable? What biases might exist?
Did the results come from a survey? How many people responded? Was it representative of a wider group of interest or a convenient sample based on internet traffic?
One of the most common follow-up questions you’ll hear is how has the data changed over time
? Knowing when the next data refresh will occur or if historic data exists is therefore very helpful.
Many companies have difficulty assigning ownership to datasets. If the data is truly an organizational asset, someone needs to guide the collection, cleaning, and storing process.
All of this information should be documented somewhere in a data dictionary
or data glossary
. This can be as simple as a shared word doc or as formal as a dedicated tool that connects directly to a company’s data storage systems.
You decide to reach out to the graduate student and ask these clarifying questions. You’ll thrilled when she responds that the data came from the World Bank’s World Development Indicators. The resources pulls from many original sources and the economist included only the most recent year available in her spreadsheet.
You’re able to construct the following, which you add to the first page of your data spreadsheet. This act of data kindness positions those removed from the data retrieval process to have a fighting chance in understanding and using the information.
variable | full_name | description | type | source | updated |
---|---|---|---|---|---|
country | Country name | Full country name based on World Bank list. | Text | World Bank | |
region | Region name | Region name. Aggregated from several World Bank subregions. | Text | World Bank | |
college_share | Share of student-age population going to college | The number of people enrolled in tertiary education as a proportion of a country’s student age population. Note: It can be higher than 100 due to inbound student mobility. | Number | UNESCO | Sep |
gdp_billions | Gross Domestic Product (GDP) in billions | The total amount of annual economic output as measured by GDP in billions of US dollars. | Number | OECD National Accounts | Feb |
pop_millions | Population in millions | The total population size of a country in millions. | Number | UN Population Division | Jun |
gini | Gini Coefficient | A measure of income inequality that ranges from 0 to 100 with higher values indicating greater levels of income inequality. | Number | World Bank | Mar |
Taking ownership
Documentation, include data dictionary creation, doesn’t always sound like a fun task. And if you were the one who found the data and are using it for a specific analysis, it might not even seem necessary.
However, until organizations are able to bring their disparate data assets into a connected environment with resources to make sense of what’s available, it is unlikely that anyone will extract full value from the collective efforts.
Just as clean data is better than messy data, understandable data is better than cryptic data. We are only fully confident to use data when it is both understandable and clean.
Understandable | Cryptic | |
---|---|---|
Clean | I’m fully confident in knowing what I’m working with and doing analysis on what’s included. | I can analyze the data, but unsure how to interpret the results. |
Messy | I know what the data is, but it is hard to perform analysis in its current form. | I don’t know what I’m looking at and can’t do much with it. |