Data Quality
Introduction to the Pentaho Data Quality framework
The Pentaho Data Quality (PDQ) framework is built around three personas:
Foundational Health - Data Engineers and Ops focus on data availability and reliability. The data is continuously monitored against Rule Sets - triggering alerts - for schema or table changes, duplicate records, NULL values, volume and frequency anomalies, and more, to ensure the datasets meet business needs.
Semantic Layer - Here the Data Analyst or Data Scientist asks the question: is the data fit for purpose for the business initiatives? The Semantic Layer - Domains, Tags, Terms - plays a crucial role in making data more accessible and meaningful to business users while maintaining Consistency, Validity, and Accuracy. It acts as a bridge between the technical complexity of data storage and the business need for data analysis and decision-making.
Strategic Layer - Aimed at C-level decision makers. Metrics - KPIs - are aligned to corporate objectives to give a 360-degree view of the data quality affecting each business unit.
Want to dive a little deeper?
Once the connection to the data source has been established, PDQ runs a number of Jobs that profile the data based on over 50 out-of-the-box rule sets. Any anomalies that occur are automatically captured.
The Dimensions map to the six pillars of data quality and run SQL queries at the back end, enabling you to fine-tune them or create your own custom Dimensions (a sketch of what such a check might look like in SQL follows the dimension lists below).
Key Dimensions of Data Quality:
Accuracy: The degree to which data correctly represents the real-world entity or event it describes.
Completeness: The extent to which all required data is present and not missing.
Consistency: The degree to which data is uniform across different datasets or systems.
Timeliness: Whether the data is up-to-date and available when needed.
Validity: The extent to which data conforms to defined business rules or constraints.
Uniqueness: Ensuring there are no duplicate records or redundant data.
You can also add your own Custom Dimensions, for example:
Integrity: The accuracy and consistency of data over its lifecycle.
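Because Dimension checks resolve to SQL at the back end, a custom check can often be expressed as a plain query against the monitored table. The sketch below shows what completeness- and uniqueness-style checks might look like; the customers table and email column are hypothetical, and the exact SQL that PDQ generates may differ.

```sql
-- Minimal sketch of dimension-style checks; table and column names are hypothetical.

-- Completeness: how many rows are missing an email value?
SELECT
    COUNT(*)                                               AS total_rows,
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)         AS null_emails,
    100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)
          / COUNT(*)                                       AS pct_incomplete
FROM customers;

-- Uniqueness: which email values occur more than once?
SELECT email, COUNT(*) AS occurrences
FROM customers
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1;
```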
Checks are performed at the column level, enabling enhanced profile metrics for a richer business user experience.
Frequency checks are run on a number of Attributes. For example, the frequency with which a regular expression pattern is found in a birthdate column reveals the formats and enumerations present in the data set.
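As an illustration, a frequency check of this kind could be approximated with a query that buckets a text birthdate column by the pattern it matches. The regex operator below is PostgreSQL-style and the customers.birth_date column is assumed; PDQ's own rule sets may implement this differently.

```sql
-- Hypothetical frequency check: how often does each date format appear?
-- '~' is the PostgreSQL regex-match operator; other engines use REGEXP / REGEXP_LIKE.
SELECT
    CASE
        WHEN birth_date ~ '^\d{4}-\d{2}-\d{2}$' THEN 'yyyy-mm-dd'
        WHEN birth_date ~ '^\d{2}/\d{2}/\d{4}$' THEN 'dd/mm/yyyy'
        ELSE 'other / unparsed'
    END        AS detected_format,
    COUNT(*)   AS frequency
FROM customers
GROUP BY 1
ORDER BY frequency DESC;
```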
Statistical checks are also run to determine the 'Accuracy' of the data set: how suitable is the data for the tasks I want to perform?
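A simple statistical accuracy check might compare column values against an expected business range. In the sketch below, the orders table, the order_amount column, and the 0-10,000 range are all assumptions for illustration only.

```sql
-- Hypothetical accuracy check: basic statistics plus an out-of-range count.
SELECT
    MIN(order_amount)                                 AS min_value,
    MAX(order_amount)                                 AS max_value,
    AVG(order_amount)                                 AS mean_value,
    SUM(CASE WHEN order_amount < 0 OR order_amount > 10000
             THEN 1 ELSE 0 END)                       AS out_of_range_rows
FROM orders;
```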
Data engineers play a crucial role in implementing data quality checks and metrics to assess the integrity of incoming and processed data. To proactively manage data quality, engineers set up alerts that trigger notifications when predefined thresholds are breached, such as unexpected data volumes, schema changes, or processing delays.
Additionally, they create and maintain dashboards that provide real-time visibility into key data quality indicators, enabling stakeholders to quickly identify and address issues, ensuring that downstream analytics and decision-making processes rely on trustworthy data.
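As a rough illustration of the volume-threshold idea, the query below flags any load day that falls short of an expected row count. The orders table, load_date column, and the 1,000-row threshold are hypothetical; in practice PDQ rule sets and alert settings drive the actual notification.

```sql
-- Hypothetical volume check: flag days whose row count drops below an assumed threshold.
SELECT
    load_date,
    COUNT(*) AS row_count,
    CASE WHEN COUNT(*) < 1000
         THEN 'ALERT: volume below threshold'
         ELSE 'OK'
    END      AS status
FROM orders
GROUP BY load_date
ORDER BY load_date DESC;
```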