Companion Notes
Data Infrastructure Audit Results — Source Tracing
This companion document traces every detail in the Data Infrastructure Audit Results to a specific location on the Cloudcore website, or flags it as an assumption invented for the brief.
Part 1: Facts Sourced from the Cloudcore Website
Data Source Identification
All seven data sources are real systems confirmed in the repo:
| Data Source | System | Confirmed In |
|---|---|---|
| Infrastructure telemetry | Prometheus + Grafana | chatbots/_backstories/mark_gonzalez_cto.md |
| Security event logs | Splunk SIEM | Same file; also docs/logs/ |
| Support tickets | Internal ticketing system | chatbots/_backstories/customer_support_lead_samantha_wong.md |
| Customer records | HubSpot CRM | chatbots/_backstories/tom_bradley_marketing_manager.md |
| Billing and financial | Internal billing system | chatbots/_backstories/aisha_rahman_cfo.md |
| Sales performance | HubSpot + spreadsheets | tom_bradley_marketing_manager.md, data/cloudcore-sales-data.csv |
| HR and access control | Auth0 + Active Directory | docs/policies/access_control.qmd, chatbots/_backstories/karen_lee_hr_manager.md |
Data Owners
All data owner assignments match the organisational structure in backstory files:
| Owner | Role | Source |
|---|---|---|
| Martin Nguyen | Cloud Service Operations Manager | chatbots/_backstories/cloud_service_operations_manager_martin_nugyen.md |
| Sophia Martines | CISO | chatbots/_backstories/sophia_martines_ciso.md |
| Samantha Wong | Customer Support Lead | chatbots/_backstories/customer_support_lead_samantha_wong.md |
| Lisa Chen | CMO | chatbots/_backstories/lisa_chen_cmo.md |
| Aisha Rahman | CFO | chatbots/_backstories/aisha_rahman_cfo.md |
| Karen Lee | HR Manager | chatbots/_backstories/karen_lee_hr_manager.md |
| Raj Patel | IT Manager | chatbots/_backstories/raj_patel_it_manager.md |
Data Quality — Sourced Issues
| Issue | Source |
|---|---|
| Data siloed across systems, no unified platform | chatbots/_backstories/jamal_al_sayed_data_analyst.md, mark_gonzalez_cto.md |
| CTO rates data readiness 2/5 | mark_gonzalez_cto.md |
| Data “acceptable” for operations but not assessed for ML | Same file |
| No data warehouse or data lake | jamal_al_sayed_data_analyst.md |
| Basic BI tools only (Power BI, Excel) | Same file |
| No formal data governance | Same file |
| Data definitions vary across systems | Same file |
| 3-4 years of historical data (completeness varies) | Same file |
| Data team: 2 people (stretched thin) | Same file |
| 6-12 months data prep needed before ML | Same file; also mark_gonzalez_cto.md |
| CRM has duplicate records, inconsistent formatting, missing fields | jamal_al_sayed_data_analyst.md (general); CRM migration issues implied |
| ~40% employees have broader access than required | karen_lee_hr_manager.md |
| RBAC definitions incomplete | Same file |
| Account termination: policy says 24hrs, HR says 2hrs | docs/policies/access_control.qmd vs HR backstory |
| Auth0 migration left policies referencing Okta | docs/policies/access_control.qmd |
| 500-800 daily security alerts | docs/policies/ (incident response notes) |
| Products: CloudSync, DataVault, SecureLink, Analytics Pro | data/cloudcore-sales-data.csv |
| Sales regions: North, South, East, West, Central, Metro | Same file |
| Customer segments vary (Small, Medium, Large vs Enterprise, SME) | data/cloudcore-customer-data.csv vs data/cloudcore-sales-data.csv |
| No AI governance framework | mark_gonzalez_cto.md, sophia_martines_ciso.md |
| Data classification policy still in DRAFT | docs/policies/data_classification.qmd (POL-DATA-001 v1.2 DRAFT) |
Compliance Posture
| Framework | Status | Source |
|---|---|---|
| ISO 27001 certified | Confirmed | sophia_martines_ciso.md, cloudcore_company_overview.md |
| SOC 2 Type II | Confirmed | Same files |
| Australian Privacy Act compliant | Confirmed | cloudcore_company_overview.md |
| NDB scheme compliant | Confirmed | chatbots/_backstories/security_compliance_officer_samuel_torres.md |
| GDPR compliant (EU data) | Confirmed | emily_chen_head_of_compliance.md |
| HIPAA in progress (partial) | Confirmed | Same file |
| No AI-specific data impact assessment | Confirmed | mark_gonzalez_cto.md, sophia_martines_ciso.md, emily_chen_head_of_compliance.md |
Compliance Framework Details
| Detail | Source |
|---|---|
| ISO 27001 controls A.12.1.2 and A.14.2.2 | docs/policies/change_management.qmd |
| GDPR 72-hour breach notification | docs/policies/breach_notification.qmd |
| GDPR penalties up to 4% revenue or EUR 20M | docs/articles/ (risk analysis article) |
| Privacy Act fines up to $2.2M per violation | chatbots/_backstories/data_breach_overview.md |
| HIPAA penalties $100-$50K per violation | docs/policies/breach_notification.qmd |
| 7-year audit trail retention | docs/policies/data_management.qmd |
Integration Architecture — Confirmed Gaps
| Gap | Source |
|---|---|
| No data warehouse | jamal_al_sayed_data_analyst.md, mark_gonzalez_cto.md |
| No unified analytics platform | Same files |
| No real-time analytics pipeline | Same files |
| No ML infrastructure | mark_gonzalez_cto.md |
| No GPU instances | david_wilson_cloud_infrastructure_architect.md |
Breach Incident Data
The reference to the September 2024 breach as a documented dataset for model training is supported by extensive log files in docs/logs/, including VPN, database, firewall, EDR, SIEM, and application server entries, all with full timestamps.
Cross-References
All website URLs reference real pages on the Cloudcore site, including the risk assessment frameworks document at docs/support/risk_assessment_frameworks.md.
Part 2: Assumptions and Invented Details
All Data Volume Figures
No data volume figures exist anywhere in the repo. Every volume number was invented:
| Data Source | Invented Volume | Reasoning |
|---|---|---|
| Infrastructure telemetry | ~2.1M data points/day | Plausible for Prometheus monitoring ~2,500 VMs with standard exporter intervals |
| Security logs | ~12 GB/day | Plausible for Splunk ingestion across multiple log sources at this scale |
| Support tickets | ~45,000 historical; ~1,200/month | The sample CSV has 100 tickets; 500+ clients generating ~2-3 tickets each per month is plausible |
| CRM contacts | ~85,000 records | 500+ active clients plus historical contacts, prospects, and marketing list |
| CRM companies | ~4,200 records | Includes prospects, former clients, and partners |
| Billing invoices | ~6,000/year | 500+ clients, monthly billing cycles |
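The invented volumes above follow simple back-of-envelope arithmetic. A sketch of the reasoning for two of them; inputs marked "assumed" are inventions for the brief, not sourced figures:

```python
# Back-of-envelope arithmetic behind two of the invented volume figures.
# Inputs marked "assumed" are inventions for the brief, not sourced numbers.

clients = 500                    # sourced: "500+ active clients"
tickets_per_client_month = 2.5   # assumed: ~2-3 tickets per client per month
years_of_history = 3             # sourced: 3-4 years of historical data

tickets_per_month = clients * tickets_per_client_month
historical_tickets = tickets_per_month * 12 * years_of_history
invoices_per_year = clients * 12  # assumed monthly billing cycle

print(f"tickets/month: ~{tickets_per_month:,.0f}")        # ~1,250
print(f"historical tickets: ~{historical_tickets:,.0f}")  # ~45,000
print(f"invoices/year: ~{invoices_per_year:,}")           # 6,000
```

The point for instructors is that each invented number is internally consistent with the sourced client count and history window, not pulled from the repo.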
All Quality Scores (1-5 Scale)
The quality scores were invented to follow the design principle that infrastructure/operational data should be high quality and customer-facing data should be messy:
| Data Source | Score | Design Reasoning |
|---|---|---|
| Infrastructure telemetry | 4.5 | Machine-generated, automated, minimal human intervention |
| Security logs | 4.0 | Machine-generated but alert classification has human elements |
| Billing/financial | 3.5 | Audit requirements enforce some discipline |
| Support tickets | 3.0 | Semi-structured; human-entered data with quality variance |
| HR/access | 3.0 | HR data accurate; access control data has known drift |
| Sales performance | 2.5 | Dual systems, no single source of truth |
| CRM | 2.0 | Migration-damaged, never cleaned, poorly adopted |
All Specific Quality Percentages
Every percentage figure describing data quality issues was invented. None appear in the repo:
Infrastructure telemetry:
- 98.5% data completeness — invented
- 1.5% gaps during maintenance — invented
- 8% of metrics lacking standardised client attribution labels — invented
- Inconsistent naming (cpu_usage_percent vs node_cpu_utilisation) — invented example
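The naming inconsistency flagged above is the kind of issue a small alias map would address. A minimal sketch; the canonical form and the extra variant are illustrative, not taken from the repo:

```python
# Map inconsistent metric names onto one canonical name.
# The alias table is illustrative; the repo documents no such mapping.
ALIASES = {
    "cpu_usage_percent": "node_cpu_utilisation",
    "cpuUtilPct": "node_cpu_utilisation",  # hypothetical third variant
}

def canonical(metric_name: str) -> str:
    """Return the canonical name, or the input unchanged if unmapped."""
    return ALIASES.get(metric_name, metric_name)

print(canonical("cpu_usage_percent"))  # node_cpu_utilisation
print(canonical("node_memory_bytes"))  # node_memory_bytes (pass-through)
```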
Security logs:
- 72% alert classification accuracy — invented
- 6:1 false positive to true positive ratio — invented
- Non-standard timestamp formats for pre-2021 sources — invented
Support tickets:
- 18% miscategorised priority levels — invented
- 12% descriptions under 20 words — invented
- 22% use account numbers, 31% use email addresses for customer ID — invented
- Historical tickets pre-mid-2022 lack satisfaction scores — invented
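The 22%/31% split above could be measured with a trivial format classifier. A sketch; the account-number pattern (e.g. "ACC-12345") is an assumption, since the repo specifies no ID format:

```python
import re

# Classify how a ticket identifies its customer. The account-number
# pattern ("ACC-" plus digits) is an assumption; the repo specifies none.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ACCOUNT_RE = re.compile(r"^ACC-\d+$")

def id_kind(value: str) -> str:
    if EMAIL_RE.match(value):
        return "email"
    if ACCOUNT_RE.match(value):
        return "account_number"
    return "other"

sample = ["jo@example.com", "ACC-10482", "Jane Doe"]  # hypothetical values
counts = {k: sum(id_kind(v) == k for v in sample)
          for k in ("email", "account_number", "other")}
print(counts)  # {'email': 1, 'account_number': 1, 'other': 1}
```

Running this over the ticket export would yield the percentage split the brief invents.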
CRM:
- 15% confirmed duplicate contacts — invented
- 8% probable duplicates — invented
- 26% incomplete industry classification — invented
- 34% missing job title/role — invented
- 11.2% email bounce rate — invented
- 41% missing lead source attribution (pre-migration) — invented
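Duplicate rates like the ones above are typically measured by grouping contacts on a normalised key. A minimal sketch; the records and field names are hypothetical:

```python
from collections import defaultdict

# Group CRM contacts by a normalised email key to find likely duplicates.
# Records and field names are hypothetical illustrations.
contacts = [
    {"id": 1, "email": "Jo.Smith@Example.com"},
    {"id": 2, "email": "jo.smith@example.com "},
    {"id": 3, "email": "a.lee@example.com"},
]

groups = defaultdict(list)
for c in contacts:
    groups[c["email"].strip().lower()].append(c["id"])

duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)  # {'jo.smith@example.com': [1, 2]}
```

The "confirmed" vs "probable" duplicate split in the brief assumes exact-key matches like this for the former and fuzzier matching (e.g. name similarity) for the latter.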
Billing:
- 6% monthly billing cycle discrepancies — invented
- Product codes changed twice in three years — invented (products are real)
- Contracts stored as PDFs — invented
Sales:
- 35% closed-lost opportunities missing loss reason — invented
- Regional reporting format inconsistencies — invented
HR/access:
- ~40% over-provisioned — SOURCED (Karen Lee backstory)
- Account termination timing inconsistency — SOURCED (policy conflict)
- Auth0/Okta gap — SOURCED
Data Value Pyramid Assessment
The entire data value pyramid section was composed for the brief; the repo does not contain an analytics maturity assessment. The placement decisions were made as follows:
- Descriptive: “partially achieved” — based on Power BI and Excel usage confirmed in Jamal’s backstory
- Diagnostic: “minimal” — based on manual root cause analysis described in operations backstories
- Predictive: “not attempted” — confirmed by the CTO backstory, which states the equivalent of “no predictive models exist”
- Prescriptive: “not attempted” — no evidence of automated decision support anywhere in repo
Integration Architecture Details
Sourced elements:
- Point-to-point integrations exist (Prometheus to Grafana, CrowdStrike to Splunk, etc.)
- No integration middleware confirmed
- Python scripts for data movement — inferred from dev team tech stack
Invented elements:
- Characterisation as “point-to-point with manual bridges”
- “Scheduled Python scripts for billing data aggregation” — invented (Python is confirmed as the dev stack; its use for ETL is assumed)
- “Splunk built-in log collection and normalisation” as the only formal ETL — framing invented
- “Prometheus federation for metrics consolidation” — standard Prometheus capability, assumed in use
- “Loss of Jamal or junior analyst would create immediate knowledge gaps” — invented but strongly implied by backstory
Missing capabilities table:
- All items listed as missing are confirmed absent in backstory files
- The descriptions of impact (e.g., “every analysis requires manual data assembly”) are framing, not direct quotes
Compliance — Healthcare and Finance Contract Requirements
Entirely invented. The repo confirms that healthcare and finance are key client sectors with strict compliance requirements, but no specific contract terms appear anywhere. The invented terms are plausible for Australian healthcare and finance cloud contracts:
- Healthcare: data residency, audit logging, client notification for new systems, annual security assessments, 24-hour breach notification
- Finance: data classification documentation, third-party access approval, pen test sharing, data retention schedules, real-time access monitoring
Australian Privacy Act — AI-Specific Obligations
The Privacy Act and APPs are referenced in compliance backstories. The specific mapping of APP 1, 3, 6, and 11 to AI use cases was composed for the brief. These are accurate representations of the APPs but their application to Cloudcore’s AI plans is analysis, not sourced fact.
AI-Specific Data Impact Assessment Gap
The gap itself is sourced (multiple backstories confirm no AI governance framework). The specific list of AI concerns that the gap leaves unaddressed (training data consent, model bias, explainability, re-identification risk, etc.) was composed for the brief.
All Infrastructure Cost Benchmarks
Every cost figure in the benchmarks section was invented based on typical Australian market rates in 2024-2025:
| Component | Invented Range | Basis |
|---|---|---|
| Cloud data warehouse | $36-72K/yr | Typical Snowflake/BigQuery pricing at moderate scale |
| ETL platform | $18-36K/yr | Fivetran/dbt Cloud mid-tier pricing |
| Data catalogue | $12-24K/yr | Commercial tooling; open-source alternative noted |
| MDM | $24-48K/yr | Implementation-heavy; conservative estimate |
| ML platform | $36-96K/yr | SageMaker/Azure ML compute costs are highly variable |
| ML engineer salary | $180-250K/yr | SOURCED from Karen Lee backstory |
| AI contractor | $2-3.5K/day | Typical Australian specialist consulting rates |
| MLOps tooling | $6-18K/yr | Commercial options; MLflow as free alternative noted |
| API gateway | $6-24K/yr | AWS API Gateway available through existing partnership |
| Event streaming | $18-48K/yr | Managed Kafka pricing at modest scale |
| Integration middleware | $24-60K/yr | Significant implementation cost acknowledged |
Sensitivity Classifications
The data classification levels (PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED) are sourced from the draft data classification policy (POL-DATA-001). The assignment of specific data sources to these levels was invented, though informed by the policy’s descriptions of each tier.
Retention Periods
| Detail | Status |
|---|---|
| Security logs: 12 months online, 7 years archived | 7-year audit trail retention is sourced; 12-month online is invented |
| Infrastructure telemetry: 90 days full, 1-year downsampled | Entirely invented |
| Support tickets: indefinite, no formal policy | Invented; plausible given no data retention schedule is documented for tickets |
| CRM: indefinite, no hygiene schedule | Invented |
| Billing: 7 years | Sourced (financial compliance requirement in data management policy) |
| HR: 7 years post-departure, access logs 12 months | 7-year retention sourced; 12-month access log period invented |
This companion document is for instructor reference. It is not intended for student distribution unless adapted.