← Tilbage til koncepter

Data Warehousing

Arkitektur

Central repository designet til at konsolidere og analysere data fra multiple kilder for business intelligence.

Beskrivelse

Et Data Warehouse er et stort, centraliseret data repository optimeret til analytical queries og reporting i stedet for transactional operations. I modsætning til operationelle databaser (OLTP) der håndterer daglige transactions, er data warehouses designet til OLAP (Online Analytical Processing) med komplekse queries på tværs af store datasets. Data warehouses bruger typisk dimensional modeling med fact tables (målbare events som sales) og dimension tables (beskrivende attributter som customer, product, time). Dette star schema eller snowflake schema design gør analytical queries hurtige og intuitive. Data kommer ind via ETL pipelines fra multiple sources og organiseres historisk - data slettes sjældent, så man kan analysere trends over tid. Moderne cloud data warehouses som Snowflake, BigQuery og Redshift tilbyder massive skalering og pay-per-use modeller. Data warehouses er fundamentet for business intelligence, reporting dashboards, og data-driven decision making.

Problem

Organisationer har operationelle databaser optimeret til transactions - hurtige writes, normalized schema, current state. Men analytical queries (rapporter, aggregations, trends) er langsomme og belaster production systemer. Hvordan analyserer man data effektivt uden at påvirke operations?

Løsning

Et separat Data Warehouse optimeret til queries. Data kopieres fra OLTP systemer via ETL, denormaliseres til analytical schema, og gemmes historisk. Analysts kan lave komplekse queries uden at påvirke production databases.

Eksempel

-- OLTP Schema (Operationel Database)
-- Normalized for data integrity
CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100)
);

CREATE TABLE products (
  product_id INT PRIMARY KEY,
  name VARCHAR(100),
  category_id INT
);

CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  order_date DATE
);

CREATE TABLE order_items (
  order_id INT,
  product_id INT,
  quantity INT,
  price DECIMAL(10,2)
);
-- Problem: Complex joins for analytics

-----------------------------------------------------

-- OLAP Schema (Data Warehouse)
-- Denormalized Star Schema for fast queries

-- Fact Table (målbare events)
CREATE TABLE fact_sales (
  sale_id BIGINT PRIMARY KEY,
  date_key INT,           -- Foreign key til dim_date
  customer_key INT,       -- Foreign key til dim_customer
  product_key INT,        -- Foreign key til dim_product
  store_key INT,          -- Foreign key til dim_store
  quantity INT,
  unit_price DECIMAL(10,2),
  total_amount DECIMAL(10,2),
  discount_amount DECIMAL(10,2),
  tax_amount DECIMAL(10,2)
);

-- Dimension Tables (beskrivende context)
CREATE TABLE dim_date (
  date_key INT PRIMARY KEY,
  full_date DATE,
  day_of_week VARCHAR(10),
  day_of_month INT,
  month_name VARCHAR(10),
  month_number INT,
  quarter INT,
  year INT,
  is_weekend BOOLEAN,
  is_holiday BOOLEAN
);

CREATE TABLE dim_customer (
  customer_key INT PRIMARY KEY,
  customer_id INT,          -- Original ID fra OLTP
  customer_name VARCHAR(100),
  email VARCHAR(100),
  city VARCHAR(50),
  country VARCHAR(50),
  age_group VARCHAR(20),
  customer_segment VARCHAR(30)
);

CREATE TABLE dim_product (
  product_key INT PRIMARY KEY,
  product_id INT,
  product_name VARCHAR(100),
  category VARCHAR(50),
  subcategory VARCHAR(50),
  brand VARCHAR(50),
  price_range VARCHAR(20)
);

-----------------------------------------------------

-- Analytical Queries er nu simple og hurtige

-- Sales by month and category
SELECT 
  d.year,
  d.month_name,
  p.category,
  SUM(f.total_amount) as total_sales,
  COUNT(DISTINCT f.customer_key) as unique_customers
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, d.month_name, p.category
ORDER BY d.month_number, total_sales DESC;

-- Customer segmentation analysis
SELECT 
  c.customer_segment,
  c.age_group,
  COUNT(DISTINCT c.customer_key) as customers,
  AVG(f.total_amount) as avg_order_value,
  SUM(f.total_amount) as total_revenue
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2024 AND d.quarter = 1
GROUP BY c.customer_segment, c.age_group
ORDER BY total_revenue DESC;

-- Year-over-year comparison
SELECT 
  d.month_name,
  SUM(CASE WHEN d.year = 2024 THEN f.total_amount ELSE 0 END) as sales_2024,
  SUM(CASE WHEN d.year = 2023 THEN f.total_amount ELSE 0 END) as sales_2023,
  (SUM(CASE WHEN d.year = 2024 THEN f.total_amount ELSE 0 END) - 
   SUM(CASE WHEN d.year = 2023 THEN f.total_amount ELSE 0 END)) / 
  NULLIF(SUM(CASE WHEN d.year = 2023 THEN f.total_amount ELSE 0 END), 0) * 100 as growth_pct
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year IN (2023, 2024)
GROUP BY d.month_name, d.month_number
ORDER BY d.month_number;

Fordele

  • Optimeret til komplekse analytical queries
  • Historisk data tracking (ikke bare current state)
  • Ingen impact på production OLTP systems
  • Konsolideret data fra multiple sources
  • Foundation for BI og data science

Udfordringer

  • Høje setup og maintenance costs
  • ETL complexity ved mange sources
  • Data latency (ikke real-time)
  • Storage costs for historisk data
  • Schema changes er dyre

Anvendelsesområder

  • Executive dashboards og KPI tracking
  • Sales og revenue analytics
  • Customer behavior analysis
  • Financial reporting og forecasting
  • Supply chain optimization

Eksempler fra den virkelige verden

  • Retail: Analyse af sales trends, inventory, customer segments
  • Banking: Risk analysis, fraud detection, customer profitability
  • Healthcare: Patient outcomes, resource utilization, cost analysis
  • E-commerce: Conversion funnels, product performance, marketing ROI
  • Telecom: Customer churn prediction, network performance analysis