As a data analyst, nothing can be worse than producing a data product for an executive to make critical business decisions and finding that your assumption was utterly wrong, leading the company in the wrong direction.
The main role of a data analyst is to produce relevant insights from business data to help make decisions. When an executive urgently requests a report, it’s easy to accept the assumptions you’ve made about the data and run with them to get the analysis done.
This is why data discovery is integral to data analysis and should not be skipped due to time constraints. This article will delve into what data discovery is, what the benefits of data discovery are, and why data discovery is a cycle rather than a procedure.
What is Data Discovery?
Data discovery is a framework for learning and building trust with business data before you use it for a specific use case. It’s the process of analyzing data across your many different sources and understanding how it needs to be structured, transformed, and modeled for your downstream use cases before it arrives in your data warehouse.
The problem with data discovery is that it can be perceived as valueless. Creating a report for an executive produces an output, while data discovery is only a step of data analysis and doesn’t produce a tangible outcome that directly provides value.
It’s similar to a data engineer writing documentation to explain code. Sure, that data engineer could be doing something that has an immediate impact on your organization, but writing documentation will save them a huge amount of time down the line as you scale up and bring in new team members. Instead of the engineer answering each question individually, the documentation provides the answers, allowing them to focus on higher-priority tasks that drive business value.
Why is Data Discovery Important?
Data discovery is important because it helps you understand your entire data flow from the point your data is generated to how it’s used in your downstream use cases. Effective data discovery enables you to understand how your data is structured in your sources and how it needs to be structured in your warehouse before you can actually start leveraging it to power your analytics and activation use cases. The entire purpose of data discovery is to help you logically think through your data so you have a greater understanding.
The Data Discovery Cycle
Data discovery has been compared to the children’s game Chutes and Ladders. You try to work your way to the top using the ladders scattered across the board, but along the way you land on chutes that send you back down.
Data discovery is similar because it isn’t a linear process. It’s more of a cycle: you can work through parts of it, only to go back to previous steps before continuing.
Uncover the Business Processes
One of the important parts of the data discovery cycle is truly understanding the business processes. Often there is a stark difference between a company’s documented process and what is actually being done.
Let’s be honest, there are some people out there who, if they can find a way to make their life easier and get their job done quicker, will take it. Maybe this is good for the company because productivity is increasing. However, altering the business processes can impact other areas, such as company data.
Part of the data discovery cycle is to try to uncover these situations. If you can find out how data is captured and where it deviates from the intended process, you can avoid pulling results that could be wrong, either by excluding data that is completely wrong or by amending it before it reaches your data warehouse.
Understanding Your Data
It’s easy to quickly look at some of the data you need to work with and call it a day. Or, even worse, make assumptions that don’t have facts to back them up. If you don’t truly understand the data you’re working on, how can you be confident in your analysis?
Remember, business leaders use data analysis to make company decisions affecting the organization’s future. So being a data analyst is a high-pressure role. Why would you skim over the data you need, knowing how important your work is?
It pays to take the time to understand the data so you can be confident your results are as accurate as possible. That might mean understanding how the source data is captured, how it’s transferred from the source to the data warehouse, or performing basic analysis to understand the relationships between tables and the data in them.
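As a rough illustration of that kind of basic relationship analysis, here is a minimal Python sketch that looks for rows in one table that point at a record missing from another. The `customers` and `orders` tables, their columns, and the `orphaned_keys` helper are all hypothetical, and an in-memory SQLite database stands in for your warehouse.

```python
import sqlite3


def orphaned_keys(conn, child_table, child_key, parent_table, parent_key):
    """Return distinct child key values that have no matching parent row."""
    # Table and column names are interpolated directly into the SQL,
    # so only pass identifiers you trust.
    query = f"""
        SELECT DISTINCT c.{child_key}
        FROM {child_table} AS c
        LEFT JOIN {parent_table} AS p ON c.{child_key} = p.{parent_key}
        WHERE p.{parent_key} IS NULL
          AND c.{child_key} IS NOT NULL
        ORDER BY c.{child_key}
    """
    return [row[0] for row in conn.execute(query)]


# Hypothetical sample data standing in for real source tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 2, 90.0), (12, 3, 40.0), (13, NULL, 15.0);
""")

orphans = orphaned_keys(conn, "orders", "customer_id", "customers", "customer_id")
print(orphans)
```

Orders that reference a customer who doesn’t exist are worth a question to whoever owns the source system before you build anything on top of them.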
Testing
Another important part of the data discovery cycle is testing. Testing helps ensure that the insights you have produced in reports or dashboards stay accurate.
Inaccurate data can result from:
- Source system changes: where an engineer changes something at the source, such as a data type or data format.
- Data collection failure: where a source application is down or bugs have been introduced into the code.
- Data ingestion failure: where a data pipeline has failed, so data hasn’t been transferred and becomes outdated.
- Human errors: where someone has manually entered data incorrectly.
Implementing testing alerts you to these situations to ensure the reports stay accurate. Testing is also an important element of data observability.
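To make a couple of these failure modes concrete, here is a minimal sketch in plain Python that flags unexpected data types (a possible source system change) and stale loads (a possible ingestion failure). The column names, the `loaded_at` timestamp field, and the 24-hour freshness window are assumptions chosen for illustration, not a prescribed setup.

```python
from datetime import datetime, timedelta

# Hypothetical expectations for one batch of rows.
EXPECTED_TYPES = {"order_id": int, "amount": float, "loaded_at": datetime}


def check_batch(rows, expected_types, max_age, now):
    """Return a list of human-readable issues found in a batch of rows."""
    issues = []

    # Schema check: every row has every expected column with the expected type.
    for i, row in enumerate(rows):
        for column, expected in expected_types.items():
            if column not in row:
                issues.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected):
                actual = type(row[column]).__name__
                issues.append(
                    f"row {i}: '{column}' is {actual}, expected {expected.__name__}"
                )

    # Freshness check: the most recent load must fall within max_age.
    loaded = [r["loaded_at"] for r in rows if isinstance(r.get("loaded_at"), datetime)]
    if not loaded:
        issues.append("no load timestamps found")
    elif now - max(loaded) > max_age:
        issues.append(f"data is stale: last load was {max(loaded).isoformat()}")

    return issues


# Hypothetical batch: one amount arrives as text, and the load is two days old.
rows = [
    {"order_id": 1, "amount": 19.99, "loaded_at": datetime(2024, 1, 8, 9, 0)},
    {"order_id": 2, "amount": "5.00", "loaded_at": datetime(2024, 1, 8, 9, 5)},
]
issues = check_batch(
    rows, EXPECTED_TYPES, max_age=timedelta(hours=24), now=datetime(2024, 1, 10, 9, 0)
)
for issue in issues:
    print(issue)
```

Passing `now` in explicitly keeps the check deterministic, which makes it easier to test than one that reads the system clock.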
You could build the tests within your own code, but they take extra time to create and are often less effective. Alerts like these also tend to get switched off because they generate too many notifications for errors that aren’t that important. But what’s the point of creating tests if you’re just turning them off?
A more common and better solution is dbt. dbt allows you to build tests on the data behind the reports you produce, alerting you to any changes that may affect your results. You can configure tests in dbt to behave the way you want.
Take the example of sales data being entered into Salesforce. You know that, on a daily basis, the sales team makes up to ten errors across the thousands of records entered, and you’re happy with a daily error count anywhere up to ten. Within dbt you can set different tolerances, so if the count ever goes over the threshold, you’re alerted to investigate.
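The idea behind a tolerance can be sketched in a few lines of plain Python. This is not dbt syntax, just the underlying logic: count the errors per day and flag only the days that exceed the threshold. The dates and the `days_over_tolerance` helper are hypothetical.

```python
from collections import Counter


def days_over_tolerance(error_dates, max_errors_per_day=10):
    """Return {day: error_count} for days whose error count exceeds the tolerance."""
    counts = Counter(error_dates)
    return {day: n for day, n in sorted(counts.items()) if n > max_errors_per_day}


# Hypothetical input: one date string per erroneous record found that day.
error_dates = ["2024-03-01"] * 7 + ["2024-03-02"] * 12 + ["2024-03-03"] * 10

flagged = days_over_tolerance(error_dates, max_errors_per_day=10)
for day, count in flagged.items():
    print(f"{day}: {count} errors exceeds the tolerance, investigate")
```

A day with exactly ten errors stays quiet here, which matches the "anywhere up to ten" tolerance and keeps the alert from firing on noise you’ve already decided to accept.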
Ongoing Maintenance
There could be occasions when new datasets have been introduced into the company that could benefit the reports you’ve already created, or maybe a business process has changed how data is captured, skewing the results.
Whatever it is, sometimes it pays to get back into the datasets and ensure everything is correct. This could be reviewing the graphs you’ve created or having a 15-minute chat with the executive you’ve made the reports for to determine if they have noticed any differences. With data analysis, you want to be confident of your results, regardless of whether the report is a day old or a year old.
What are the Benefits of Data Discovery?
There are many clear benefits to carrying out data discovery, but ultimately it helps ensure you improve your data understanding, data quality, and confidence.
Better Data Understanding
Whether you’re working on fresh datasets or ones you’re already familiar with, it takes time to truly understand your data. Data discovery helps you to know how the data is collected, stored, and structured at the source so you can be aware of any quirky human or system behavior that could produce inconsistent results.
After all, every database has its own unique schema structure that you need to be aware of. The more you understand the data you’re working on, the more accurate your assumptions will be.
Better Data Quality
Understanding the data you’re using helps to create better quality data. If you discover duplicates or missing data in your data discovery process, you can take action to exclude them or attempt to fill in any missing data.
Conducting data discovery will reveal areas that may not be correct beneath the surface, so you can create a plan of action to remedy them and produce as accurate an analysis as possible.
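A first pass at spotting duplicates and missing data doesn’t need much. The sketch below assumes rows arrive as dictionaries and uses hypothetical field names (`invoice_id`, `customer`, `amount`). It reports which keys appear more than once and how many rows are missing each required field.

```python
from collections import Counter


def profile_quality(rows, key, required):
    """Return (duplicated key values, missing-value count per required column)."""
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if k is not None and n > 1)

    # Treat None and empty strings as missing; zero is a legitimate value.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in required
    }
    return duplicates, missing


# Hypothetical invoice rows with one repeated key and two gaps.
rows = [
    {"invoice_id": "A1", "customer": "Acme", "amount": 100.0},
    {"invoice_id": "A2", "customer": "", "amount": 50.0},
    {"invoice_id": "A1", "customer": "Acme", "amount": 100.0},
    {"invoice_id": "A3", "customer": "Globex", "amount": None},
]

duplicates, missing = profile_quality(rows, key="invoice_id", required=["customer", "amount"])
print("duplicated keys:", duplicates)
print("missing values:", missing)
```

Whether you then exclude, deduplicate, or fill in those rows depends on what the business process says they should have been.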
Builds Confidence
The work you do as a data analyst is complex. You need to ensure your data product is as accurate as possible, as it will be used to make informed decisions. Data discovery helps you to understand what could go wrong and enables you to identify discrepancies in your data so you can confidently trust the data you’re using.
Data Discovery Examples
Hearing some data discovery examples from real-life data practitioners can be helpful.
I spoke to Meredith Alder, who’s been a data practitioner since 2003, about where data discovery helped her uncover data that was being modified outside of a business process.
Meredith was working on a subscription model to identify the likelihood of customer renewals. In part of her research, she took information from the invoicing system to build a formula that determined how likely a specific user was to renew.
Her results could have been totally wrong if she hadn’t carried out any data discovery. In understanding the business process, she discovered that if a customer called to cancel, people would go around the system to prolong the renewal by reworking the invoice. If Meredith had never found this out, she would have presented incorrect insights that influenced the decisions taken by the actual business users.
Erik Edelmann, who’s been a data practitioner since 2010, shared insights about his process for encountering new datasets and understanding them. Erik runs queries on the dataset to look for uniqueness and cardinality. This way, he can discover what makes each record unique and its identifying attributes. He analyzes the distributions of complex data sets to identify outliers and fields of interest.
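Checks in this spirit can be sketched with Python’s standard library. To be clear, this is not Erik’s actual code. It is a minimal illustration using a hypothetical `subscriptions` table in an in-memory SQLite database, with a common interquartile-range rule swapped in as one possible way to flag outliers.

```python
import sqlite3
import statistics


def column_cardinality(conn, table, columns):
    """Return the row count and, per column, its distinct count and uniqueness."""
    # Identifiers are interpolated directly, so only pass trusted names.
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    profile = {}
    for col in columns:
        # COUNT(DISTINCT ...) ignores NULLs, so a column with NULLs is not unique here.
        distinct = conn.execute(
            f"SELECT COUNT(DISTINCT {col}) FROM {table}"
        ).fetchone()[0]
        profile[col] = {"distinct": distinct, "is_unique": distinct == total}
    return total, profile


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscriptions (subscription_id INTEGER, plan TEXT, monthly_fee REAL);
    INSERT INTO subscriptions VALUES
        (1, 'basic', 10), (2, 'basic', 10), (3, 'pro', 30), (4, 'pro', 30),
        (5, 'basic', 10), (6, 'pro', 30), (7, 'basic', 12), (8, 'enterprise', 900);
""")

total, profile = column_cardinality(conn, "subscriptions", ["subscription_id", "plan"])
fees = [row[0] for row in conn.execute("SELECT monthly_fee FROM subscriptions")]
outliers = iqr_outliers(fees)

print(total, profile)
print(outliers)
```

A column whose distinct count equals the row count is a candidate identifier, while a low-cardinality column such as `plan` is usually a category worth grouping by.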
Final Thoughts
When you’re under pressure to deliver the reports the executive team wanted yesterday, it’s tempting to take shortcuts, and data discovery is often what gets cut. Hopefully, you now understand that if you want to be confident in the data you’re working with and trust the data product you’ve produced, data discovery is an important component of data analysis.