Data Architect Guiding Principles

What I rely on when evaluating Data Architectures and Solutions

  1. Minimize Data Duplication
  2. Minimize Data Movement
  3. Maximize Automation
I arrived at these three guiding principles over many years of developing and supporting data architectures in a variety of industries and platforms.  Experience taught me that for most situations simpler is better and less is more.
I didn’t know it at the time, but I had stumbled upon Occam’s Razor.  Occam’s Razor is a principle in science and philosophy stating that, among competing hypotheses, the one that makes the fewest assumptions should be selected.  In Data Architecture, the Occam’s Razor principle can be applied to guide the design of data models, data storage technologies, and data pipelines.
When designing data models, Occam’s Razor suggests that one should choose the simplest possible data model that can represent the data, as this will make it easier to understand, maintain, and improve over time.  For example, a simple data model with a few well-defined tables and relationships is generally easier to understand and maintain than a complex data model with many tables and relationships.
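As an illustration, here is a minimal sketch of such a model expressed as Python dataclasses; the entities, fields, and values are hypothetical, not taken from any particular system:

```python
from dataclasses import dataclass
from datetime import date


# A small, normalized model: a few well-defined entities and one explicit
# relationship, rather than many overlapping tables.
@dataclass
class Customer:
    customer_id: int
    name: str
    email: str


@dataclass
class Order:
    order_id: int
    customer_id: int  # references Customer.customer_id
    order_date: date
    total_amount: float


# The relationship is easy to follow, maintain, and extend over time.
alice = Customer(customer_id=1, name="Alice", email="alice@example.com")
order = Order(order_id=100, customer_id=alice.customer_id,
              order_date=date(2024, 1, 15), total_amount=42.50)
```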
Similarly, when selecting data storage technologies, Occam’s Razor suggests that one should choose the simplest and most straightforward technology that can meet the needs of the data architecture.  For example, a simple file system may be sufficient for storing small amounts of data, while a more complex distributed file system may be required for storing large amounts of data.
In terms of data pipelines, Occam’s Razor suggests that one should choose the simplest and most straightforward approach to moving and transforming data, as this will make it easier to understand, maintain, and improve over time.
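For instance, the simplest workable pipeline is often just a few small, composable steps.  The sketch below assumes a single CSV source and target with an invented “amount” column; it illustrates the principle and is not a prescribed implementation:

```python
import csv
from pathlib import Path


def extract(path: Path) -> list[dict]:
    """Read source rows; no staging hops or intermediate copies."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[dict]:
    """Apply only the transformation the business actually needs."""
    return [{**row, "amount": round(float(row["amount"]), 2)}
            for row in rows if row.get("amount")]


def load(rows: list[dict], path: Path) -> None:
    """Write directly to the single target."""
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)


# One source, one transformation, one target – easy to understand and maintain.
load(transform(extract(Path("sales.csv"))), Path("sales_clean.csv"))
```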

Minimize Data Duplication

  1. Each copy can become a data silo
  2. Each copy consumes resources – Storage, Compute, and Memory
  3. Each copy must be kept in sync with the data source / system of record
    • Schema Changes and Key Changes must be accounted for in all copies (a minimal drift check is sketched after this list)
  4. Each copy must be secured, monitored, and governed
    • Multiple security models are difficult to maintain and keep in sync with security providers
      • Access Requests become more complex and less timely
      • Access Removal (e.g., termination requests) becomes more complex and less timely
      • Troubleshooting access becomes more complex and time-consuming
    • Multiple access and usage logs are difficult to aggregate to answer simple questions like “Who has access to my data?” or “Who accessed this data?”
  5. Opportunities for Data Loss and Exfiltration increase exponentially as the number of data copies increases.
  6. Each copy of the data increases the complexity of data management and security
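To make point 3 concrete, here is a minimal sketch of the kind of drift check that every additional copy obliges someone to build and maintain.  The schemas are hypothetical:

```python
# Hypothetical schemas: the system of record and two downstream copies.
SOURCE_SCHEMA = {"customer_id": "int", "name": "text", "email": "text"}

COPIES = {
    "reporting_copy": {"customer_id": "int", "name": "text", "email": "text"},
    "sandbox_copy": {"customer_id": "int", "name": "text"},  # missing a column
}


def schema_drift(source: dict, copy: dict) -> dict:
    """Report columns missing from, added to, or retyped in a copy."""
    shared = set(source) & set(copy)
    return {
        "missing": sorted(set(source) - set(copy)),
        "extra": sorted(set(copy) - set(source)),
        "type_mismatch": sorted(c for c in shared if source[c] != copy[c]),
    }


# Every copy adds another comparison to keep in sync, monitor, and govern.
for name, schema in COPIES.items():
    print(name, schema_drift(SOURCE_SCHEMA, schema))
```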

Minimize Data Movement

  1. Data Inconsistencies – Every time data is moved, it can be corrupted, dropped, or misaligned
    • Reconciliation must be performed to ensure all records made it from source to target and that every copy returns the same results regardless of which one is used (see the reconciliation sketch after this list).
  2. Increased Latency – Data takes time to move from one place to another
  3. Data has Gravity – Moving data consumes resources
    • Server compute and memory resources are consumed during the move for both Source and Target servers as well as during reconciliation
    • Network resources are consumed during the move from Source to Target and during reconciliation
    • Someone must create, and remain responsible for maintaining, the movement process between Source and Target
  4. Increased Data Management Complexity – Each move increases data management complexity
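As a rough illustration of the reconciliation burden mentioned above, a minimal count-and-checksum comparison might look like the sketch below; the rows are invented:

```python
import hashlib


def checksum(rows: list[tuple]) -> str:
    """Order-independent fingerprint of a row set."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode())
    return digest.hexdigest()


def reconcile(source_rows: list[tuple], target_rows: list[tuple]) -> dict:
    """Confirm every record made it across and the contents still match."""
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "checksum_match": checksum(source_rows) == checksum(target_rows),
    }


# Hypothetical data: one row was dropped during the move.
source = [(1, "Alice", 42.50), (2, "Bob", 17.25)]
target = [(1, "Alice", 42.50)]
print(reconcile(source, target))  # both checks fail, so the move must be investigated
```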

Maximize Automation

  1. Human Intervention is Inconsistent and Error Prone
    • Processes are not always started at the scheduled time, as vacations, illness, job changes, and personal availability affect whether anyone is there to kick off the process
    • Humans often miss key values or make data entry mistakes when choosing date ranges, account ranges, well identifiers, etc. (see the sketch after this list)
    • Humans may choose the wrong source or target by mistake
  2. Human Intervention is expensive – An automated process can complete the same work in a fraction of the time a human operator requires
  3. Automation is consistent, reliable, and cost-effective
  4. Dependencies are easier to build into an automated process than they are for a human operator to understand, maintain, and perform consistently
  5. Automation avoids problems caused by people who forget, or are too busy, to handle a repetitive manual task
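Here is a minimal sketch of points 1 and 2: the load window is derived by the job itself rather than typed in by a person, and the job is meant to be triggered by a scheduler rather than started by hand.  The job name and steps are invented for illustration:

```python
from datetime import date, timedelta
from typing import Optional


def previous_day_window(today: date) -> tuple:
    """Derive the load window instead of asking a person to enter it."""
    return today - timedelta(days=1), today


def run_daily_load(today: Optional[date] = None) -> None:
    """Hypothetical nightly job, intended to be kicked off by a scheduler
    (cron, an orchestrator, etc.), not by someone remembering to run it."""
    start, end = previous_day_window(today or date.today())
    print(f"Loading records from {start} to {end}")
    # extract / transform / load steps would go here


if __name__ == "__main__":
    run_daily_load()
```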

 
