Companion Notes

Data Infrastructure Audit Results — Source Tracing

This companion document either traces each detail in the Data Infrastructure Audit Results to a specific location on the Cloudcore website or flags it as an assumption invented for the brief.

Part 1: Facts Sourced from the Cloudcore Website

Data Source Identification

All seven data sources are real systems confirmed in the repo:

Data Source | System | Confirmed In
Infrastructure telemetry | Prometheus + Grafana | chatbots/_backstories/mark_gonzalez_cto.md
Security event logs | Splunk SIEM | Same file; also docs/logs/
Support tickets | Internal ticketing system | chatbots/_backstories/customer_support_lead_samantha_wong.md
Customer records | HubSpot CRM | chatbots/_backstories/tom_bradley_marketing_manager.md
Billing and financial | Internal billing system | chatbots/_backstories/aisha_rahman_cfo.md
Sales performance | HubSpot + spreadsheets | tom_bradley_marketing_manager.md, data/cloudcore-sales-data.csv
HR and access control | Auth0 + Active Directory | docs/policies/access_control.qmd, chatbots/_backstories/karen_lee_hr_manager.md

Data Owners

All data owner assignments match the organisational structure in backstory files:

Owner | Role | Source
Martin Nguyen | Cloud Service Operations Manager | chatbots/_backstories/cloud_service_operations_manager_martin_nugyen.md
Sophia Martines | CISO | chatbots/_backstories/sophia_martines_ciso.md
Samantha Wong | Customer Support Lead | chatbots/_backstories/customer_support_lead_samantha_wong.md
Lisa Chen | CMO | chatbots/_backstories/lisa_chen_cmo.md
Aisha Rahman | CFO | chatbots/_backstories/aisha_rahman_cfo.md
Karen Lee | HR Manager | chatbots/_backstories/karen_lee_hr_manager.md
Raj Patel | IT Manager | chatbots/_backstories/raj_patel_it_manager.md

Data Quality — Sourced Issues

Issue | Source
Data siloed across systems, no unified platform | chatbots/_backstories/jamal_al_sayed_data_analyst.md, mark_gonzalez_cto.md
CTO rates data readiness 2/5 | mark_gonzalez_cto.md
Data “acceptable” for operations but not assessed for ML | Same file
No data warehouse or data lake | jamal_al_sayed_data_analyst.md
Basic BI tools only (Power BI, Excel) | Same file
No formal data governance | Same file
Data definitions vary across systems | Same file
3-4 years historical data (completeness varies) | Same file
Data team: 2 people (stretched thin) | Same file
6-12 months data prep needed before ML | Same file; also mark_gonzalez_cto.md
CRM has duplicate records, inconsistent formatting, missing fields | jamal_al_sayed_data_analyst.md (general); CRM migration issues implied
~40% employees have broader access than required | karen_lee_hr_manager.md
RBAC definitions incomplete | Same file
Account termination: policy says 24hrs, HR says 2hrs | docs/policies/access_control.qmd vs HR backstory
Auth0 migration left policies referencing Okta | docs/policies/access_control.qmd
500-800 daily security alerts | docs/policies/ (incident response notes)
Products: CloudSync, DataVault, SecureLink, Analytics Pro | data/cloudcore-sales-data.csv
Sales regions: North, South, East, West, Central, Metro | Same file
Customer segments vary (Small, Medium, Large vs Enterprise, SME) | data/cloudcore-customer-data.csv vs data/cloudcore-sales-data.csv
No AI governance framework | mark_gonzalez_cto.md, sophia_martines_ciso.md
Data classification policy still in DRAFT | docs/policies/data_classification.qmd (POL-DATA-001 v1.2 DRAFT)
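
The customer segment mismatch listed above can be checked mechanically. The following is a minimal sketch, not the audit's actual method: the column name "segment", the ID columns, and the inline sample rows are hypothetical stand-ins, and the assignment of Small/Medium/Large to the customer file and Enterprise/SME to the sales file is assumed from the order in which the labels are listed. Only the segment labels themselves come from the sourced issue.

```python
import csv
import io


def distinct_values(csv_text: str, column: str) -> set[str]:
    """Return the set of non-empty values found in one column of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}


# Hypothetical excerpts standing in for data/cloudcore-customer-data.csv
# and data/cloudcore-sales-data.csv. Column names and rows are assumptions.
customer_csv = "customer_id,segment\nC001,Small\nC002,Medium\nC003,Large\n"
sales_csv = "sale_id,segment\nS001,Enterprise\nS002,SME\nS003,SME\n"

customer_segments = distinct_values(customer_csv, "segment")
sales_segments = distinct_values(sales_csv, "segment")

# Labels that appear in one file but not the other signal a vocabulary mismatch.
only_in_customer = sorted(customer_segments - sales_segments)
only_in_sales = sorted(sales_segments - customer_segments)
shared = sorted(customer_segments & sales_segments)

print("Only in customer data:", only_in_customer)
print("Only in sales data:", only_in_sales)
print("Shared labels:", shared)
```

An empty shared list means the two files cannot be joined on segment without a mapping table, which is the kind of definitional inconsistency the sourced issue describes.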

Compliance Posture

Framework | Status | Source
ISO 27001 certified | Confirmed | sophia_martines_ciso.md, cloudcore_company_overview.md
SOC 2 Type II | Confirmed | Same files
Australian Privacy Act compliant | Confirmed | cloudcore_company_overview.md
NDB scheme compliant | Confirmed | chatbots/_backstories/security_compliance_officer_samuel_torres.md
GDPR compliant (EU data) | Confirmed | emily_chen_head_of_compliance.md
HIPAA in progress (partial) | Confirmed | Same file
No AI-specific data impact assessment | Confirmed | mark_gonzalez_cto.md, sophia_martines_ciso.md, emily_chen_head_of_compliance.md

Compliance Framework Details

Detail | Source
ISO 27001 controls A.12.1.2 and A.14.2.2 | docs/policies/change_management.qmd
GDPR 72-hour breach notification | docs/policies/breach_notification.qmd
GDPR penalties up to 4% revenue or EUR 20M | docs/articles/ (risk analysis article)
Privacy Act fines up to $2.2M per violation | chatbots/_backstories/data_breach_overview.md
HIPAA penalties $100-$50K per violation | docs/policies/breach_notification.qmd
7-year audit trail retention | docs/policies/data_management.qmd

Integration Architecture — Confirmed Gaps

Gap | Source
No data warehouse | jamal_al_sayed_data_analyst.md, mark_gonzalez_cto.md
No unified analytics platform | Same files
No real-time analytics pipeline | Same files
No ML infrastructure | mark_gonzalez_cto.md
No GPU instances | david_wilson_cloud_infrastructure_architect.md

Breach Incident Data

The reference to the September 2024 breach as a documented dataset for model training is supported by extensive log files in docs/logs/ including VPN, database, firewall, EDR, SIEM, and application server entries with full timestamps.

Cross-References

All website URLs reference real pages on the Cloudcore site, including the risk assessment frameworks document at docs/support/risk_assessment_frameworks.md.


Part 2: Assumptions and Invented Details

All Data Volume Figures

No data volume figures exist anywhere in the repo. Every volume number was invented:

Data Source | Invented Volume | Reasoning
Infrastructure telemetry | ~2.1M data points/day | Plausible for Prometheus monitoring ~2,500 VMs with standard exporter intervals
Security logs | ~12 GB/day | Plausible for Splunk ingestion across multiple log sources at this scale
Support tickets | ~45,000 historical; ~1,200/month | The sample CSV has 100 tickets; 500+ clients generating ~2-3 tickets each per month is plausible
CRM contacts | ~85,000 records | 500+ active clients plus historical contacts, prospects, and marketing list
CRM companies | ~4,200 records | Includes prospects, former clients, and partners
Billing invoices | ~6,000/year | 500+ clients, monthly billing cycles
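
The reasoning for the ticket and invoice volumes is simple arithmetic and can be reproduced directly. The sketch below assumes exactly 500 clients (the lower bound of "500+") and one invoice per client per month; both are simplifications made for illustration, not sourced facts.

```python
# Assumption: use the lower bound of "500+ clients" as the client count.
clients = 500

# Support tickets: ~2-3 tickets per client per month.
tickets_low = clients * 2
tickets_high = clients * 3

# Billing: one invoice per client per monthly billing cycle (assumed).
invoices_per_year = clients * 12

# How many months of history the invented ticket totals imply.
months_of_history = 45_000 / 1_200

print(f"Tickets per month: {tickets_low}-{tickets_high}")
print(f"Invoices per year: {invoices_per_year}")
print(f"Implied ticket history: {months_of_history} months")
```

Under these assumptions the invented ~1,200 tickets/month falls inside the 1,000-1,500 range, the ~6,000 invoices/year matches exactly, and the implied 37.5 months of ticket history is consistent with the 3-4 years of historical data sourced in Part 1.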

All Quality Scores (1-5 Scale)

The quality scores were invented to follow the design principle that infrastructure/operational data should be high quality and customer-facing data should be messy:

Data Source | Score | Design Reasoning
Infrastructure telemetry | 4.5 | Machine-generated, automated, minimal human intervention
Security logs | 4.0 | Machine-generated but alert classification has human elements
Billing/financial | 3.5 | Audit requirements enforce some discipline
Support tickets | 3.0 | Semi-structured; human-entered data with quality variance
HR/access | 3.0 | HR data accurate; access control data has known drift
Sales performance | 2.5 | Dual systems, no single source of truth
CRM | 2.0 | Migration-damaged, never cleaned, poorly adopted

All Specific Quality Percentages

Every percentage figure describing data quality issues was invented and does not appear in the repo, with the exception of the HR/access items marked SOURCED:

Infrastructure telemetry:

  • 98.5% data completeness — invented
  • 1.5% gaps during maintenance — invented
  • 8% of metrics lacking standardised client attribution labels — invented
  • Inconsistent naming (cpu_usage_percent vs node_cpu_utilisation) — invented example

Security logs:

  • 72% alert classification accuracy — invented
  • 6:1 false positive to true positive ratio — invented
  • Non-standard timestamp formats for pre-2021 sources — invented
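
As a rough consistency check on the security log figures, the invented 6:1 false positive ratio can be combined with the sourced 500-800 daily alerts from Part 1. This is a sketch under a strong assumption: that the ratio applies uniformly to every alert on every day. The derived numbers are illustrative only and appear nowhere in the repo or the brief.

```python
# Invented in the brief: 6 false positives for every 1 true positive.
false_positives_per_true_positive = 6

# Share of alerts that are true positives (precision) under that ratio.
precision = 1 / (1 + false_positives_per_true_positive)

# Sourced in Part 1: 500-800 daily security alerts.
daily_alerts = (500, 800)

# Assumption: the ratio holds uniformly across all daily alerts.
true_positives_per_day = tuple(round(n * precision) for n in daily_alerts)

print(f"Precision implied by 6:1 ratio: {precision:.1%}")
print(f"True positives per day: {true_positives_per_day[0]}-{true_positives_per_day[1]}")
```

The ratio describes how many raised alerts turn out to be real, while the 72% classification accuracy is a separate invented figure, so the two should not be read as measuring the same thing.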

Support tickets:

  • 18% miscategorised priority levels — invented
  • 12% descriptions under 20 words — invented
  • 22% use account numbers, 31% use email addresses for customer ID — invented
  • Historical tickets pre-mid-2022 lack satisfaction scores — invented

CRM:

  • 15% confirmed duplicate contacts — invented
  • 8% probable duplicates — invented
  • 26% incomplete industry classification — invented
  • 34% missing job title/role — invented
  • 11.2% email bounce rate — invented
  • 41% missing lead source attribution (pre-migration) — invented

Billing:

  • 6% monthly billing cycle discrepancies — invented
  • Product codes changed twice in three years — invented (products are real)
  • Contracts stored as PDFs — invented

Sales:

  • 35% closed-lost opportunities missing loss reason — invented
  • Regional reporting format inconsistencies — invented

HR/access:

  • ~40% over-provisioned — SOURCED (Karen Lee backstory)
  • Account termination timing inconsistency — SOURCED (policy conflict)
  • Auth0/Okta gap — SOURCED

Data Value Pyramid Assessment

The entire data value pyramid section was composed for the brief. The repo does not contain an analytics maturity assessment. The placement decisions:

  • Descriptive: “partially achieved” — based on Power BI and Excel usage confirmed in Jamal’s backstory
  • Diagnostic: “minimal” — based on manual root cause analysis described in operations backstories
  • Predictive: “not attempted” — confirmed by CTO (“no predictive models exist” equivalent in backstory)
  • Prescriptive: “not attempted” — no evidence of automated decision support anywhere in repo

Integration Architecture Details

Sourced elements:

  • Point-to-point integrations exist (Prometheus to Grafana, CrowdStrike to Splunk, etc.)
  • No integration middleware confirmed
  • Python scripts for data movement — inferred from dev team tech stack

Invented elements:

  • Characterisation as “point-to-point with manual bridges”
  • “Scheduled Python scripts for billing data aggregation” — invented (Python is confirmed as the dev stack; its use for ETL is assumed)
  • “Splunk built-in log collection and normalisation” as the only formal ETL — framing invented
  • “Prometheus federation for metrics consolidation” — standard Prometheus capability, assumed in use
  • “Loss of Jamal or junior analyst would create immediate knowledge gaps” — invented but strongly implied by backstory

Missing capabilities table:

  • All items listed as missing are confirmed absent in backstory files
  • The descriptions of impact (e.g., “every analysis requires manual data assembly”) are framing, not direct quotes

Compliance — Healthcare and Finance Contract Requirements

Entirely invented. The repo confirms that healthcare and finance are key client sectors with strict compliance requirements, but no specific contract terms appear anywhere. The invented terms are plausible for Australian healthcare and finance cloud contracts:

  • Healthcare: data residency, audit logging, client notification for new systems, annual security assessments, 24-hour breach notification
  • Finance: data classification documentation, third-party access approval, pen test sharing, data retention schedules, real-time access monitoring

Australian Privacy Act — AI-Specific Obligations

The Privacy Act and APPs are referenced in compliance backstories. The specific mapping of APPs 1, 3, 6, and 11 to AI use cases was composed for the brief. The APPs are represented accurately, but their application to Cloudcore’s AI plans is analysis, not sourced fact.

AI-Specific Data Impact Assessment Gap

The gap itself is sourced (multiple backstories confirm no AI governance framework). The specific list of AI concerns that the gap leaves unaddressed (training data consent, model bias, explainability, re-identification risk, etc.) was composed for the brief.

All Infrastructure Cost Benchmarks

Every cost figure in the benchmarks section was invented based on typical Australian market rates in 2024-2025, except the ML engineer salary, which is sourced:

Component | Invented Range | Basis
Cloud data warehouse | $36-72K/yr | Typical Snowflake/BigQuery pricing at moderate scale
ETL platform | $18-36K/yr | Fivetran/dbt Cloud mid-tier pricing
Data catalogue | $12-24K/yr | Commercial tooling; open-source alternative noted
MDM | $24-48K/yr | Implementation-heavy; conservative estimate
ML platform | $36-96K/yr | SageMaker/Azure ML compute costs are highly variable
ML engineer salary | $180-250K/yr | SOURCED from Karen Lee backstory
AI contractor | $2-3.5K/day | Typical Australian specialist consulting rates
MLOps tooling | $6-18K/yr | Commercial options; MLflow as free alternative noted
API gateway | $6-24K/yr | AWS API Gateway available through existing partnership
Event streaming | $18-48K/yr | Managed Kafka pricing at modest scale
Integration middleware | $24-60K/yr | Significant implementation cost acknowledged

Sensitivity Classifications

The data classification levels (PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED) are sourced from the draft data classification policy (POL-DATA-001). The assignment of specific data sources to these levels was invented, though informed by the policy’s descriptions of each tier.

Retention Periods

Detail | Status
Security logs: 12 months online, 7 years archived | 7-year audit trail retention is sourced; 12-month online is invented
Infrastructure telemetry: 90 days full, 1-year downsampled | Entirely invented
Support tickets: indefinite, no formal policy | Invented; plausible given no data retention schedule is documented for tickets
CRM: indefinite, no hygiene schedule | Invented
Billing: 7 years | Sourced (financial compliance requirement in data management policy)
HR: 7 years post-departure, access logs 12 months | 7-year retention sourced; 12-month access log period invented

This companion document is for instructor reference. It is not intended for student distribution unless adapted.