Cloudcore Networks

Data Infrastructure Audit Results

Purpose of This Document

This document presents the results of an internal audit of Cloudcore’s data infrastructure, conducted to assess readiness for AI and advanced analytics initiatives. The audit examined data quality, integration architecture, compliance posture, and infrastructure cost context. Findings are intended to support infrastructure planning decisions.

Data Inventory

The audit identified seven major data sources across Cloudcore’s environment. Each was assessed for volume, quality, ownership, sensitivity, and specific issues.

1. Infrastructure Telemetry

System: Prometheus and Grafana
Volume: ~2.1 million metric data points per day across ~2,500 VMs and ~200 physical/virtual servers
Quality score: 4.5 / 5
Data owner: Martin Nguyen, Cloud Service Operations Manager
Sensitivity: INTERNAL
Retention: 90 days at full resolution; downsampled to 1-year rolling archive

Quality justification: Infrastructure telemetry is machine-generated, consistently structured, and timestamped. Collection is automated via Prometheus exporters with minimal human intervention. Data completeness is estimated at 98.5%; the remaining 1.5% represents brief gaps during maintenance windows or agent restarts. This is Cloudcore’s cleanest and most AI-ready data source.

Specific issues: Metric naming conventions are inconsistent across older and newer deployments (e.g., cpu_usage_percent vs. node_cpu_utilisation). Approximately 8% of metrics lack standardised labels for client attribution, making per-client analysis difficult without cross-referencing provisioning records.
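A lightweight normalisation layer can bridge the naming gap before any per-client or cross-fleet analysis. The sketch below (Python, the language the development team already uses for data scripts) maps legacy metric names to the newer convention. Only the cpu_usage_percent/node_cpu_utilisation pair appears in the audit; the other mapping entries are hypothetical placeholders.

```python
# Sketch: normalise legacy Prometheus metric names to the newer convention.
# Only cpu_usage_percent -> node_cpu_utilisation is documented in the audit;
# the remaining entries are illustrative assumptions.
LEGACY_TO_CANONICAL = {
    "cpu_usage_percent": "node_cpu_utilisation",
    "mem_used_bytes": "node_memory_used_bytes",    # hypothetical
    "disk_io_ops": "node_disk_io_operations",      # hypothetical
}

def canonical_metric(name: str) -> str:
    """Return the canonical metric name; unknown names pass through unchanged."""
    return LEGACY_TO_CANONICAL.get(name, name)
```

Applying a mapping like this at query or ingestion time avoids rewriting historical time series while giving analytics a single naming scheme.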


2. Security Event Logs

System: Splunk SIEM (aggregating CrowdStrike EDR, Palo Alto firewall, Auth0, VPN, application logs)
Volume: ~12 GB per day; 500 to 800 alerts generated daily
Quality score: 4.0 / 5
Data owner: Sophia Martines, CISO
Sensitivity: CONFIDENTIAL
Retention: 12 months online; 7 years archived (compliance requirement)

Quality justification: Log data is machine-generated and well-structured. Splunk normalises formats across sources. The September 2024 breach provided a fully documented incident with timestamped entries across VPN, database, firewall, EDR, SIEM, and application logs, offering a validated dataset for anomaly detection model training.

Specific issues: Alert classification accuracy is estimated at 72%; the remaining 28% are miscategorised or lack sufficient context for automated triage. Some legacy log sources (pre-2021 systems) use non-standard timestamp formats requiring manual parsing. The ratio of false positives to true positives in daily alerts is approximately 6:1, contributing to alert fatigue.
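The 6:1 false-positive ratio translates directly into a rough sizing figure for triage workload. A minimal sketch of that arithmetic:

```python
def true_positive_estimate(daily_alerts: int, fp_to_tp_ratio: float = 6.0) -> float:
    """Estimate daily true positives given a false-positive-to-true-positive ratio.

    With ratio r, true positives are 1 part in (r + 1) of total alerts.
    """
    return daily_alerts / (fp_to_tp_ratio + 1)
```

At 500 to 800 alerts per day, this puts genuine positives at roughly 71 to 114 daily, which frames the potential payoff of improving automated classification.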


3. Support Ticket Data

System: Internal ticketing system
Volume: ~45,000 tickets historically (3.5 years); ~1,200 new tickets per month
Quality score: 3.0 / 5
Data owner: Samantha Wong, Customer Support Lead (reports to Sarah Thompson, COO)
Sensitivity: INTERNAL (some tickets contain CONFIDENTIAL client details)
Retention: Indefinite; no formal retention policy applied

Quality justification: Ticket data is semi-structured. Category and priority fields are reliably populated (98% complete), and resolution times are tracked. However, free-text descriptions vary significantly in detail and consistency.

Specific issues:

  • 18% of tickets have miscategorised priority levels (identified by comparing resolution urgency against assigned priority)
  • 12% of ticket descriptions contain fewer than 20 words, providing insufficient detail for text-based analysis or automated routing
  • Customer identifiers are inconsistently formatted: 22% use account numbers, 31% use email addresses, and the remainder use company names with variable spelling
  • No structured field for root cause; resolution notes are free-text with no controlled vocabulary
  • Historical tickets prior to mid-2022 lack satisfaction scores (field was added later)
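The inconsistent customer identifiers could be triaged with a simple classifier before any linkage to CRM or billing records. The sketch below assumes a particular account-number shape (letters, optional hyphen, digits), which is illustrative rather than Cloudcore's actual format:

```python
import re

# Assumed account-number shape; Cloudcore's real format may differ.
ACCOUNT_RE = re.compile(r"^[A-Z]{2,4}-?\d{4,8}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def classify_identifier(value: str) -> str:
    """Classify a raw customer identifier as email, account number, or company name."""
    value = value.strip()
    if EMAIL_RE.match(value):
        return "email"
    if ACCOUNT_RE.match(value):
        return "account_number"
    return "company_name"  # fallback bucket; needs fuzzy matching downstream
```

Classifying identifiers first lets each bucket be resolved with the appropriate strategy: exact lookup for account numbers and emails, fuzzy matching for company names.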

4. Customer Records (CRM)

System: HubSpot CRM
Volume: ~85,000 contact records; ~4,200 company records; 500+ active client accounts
Quality score: 2.0 / 5
Data owner: Lisa Chen, CMO (marketing data); Sales team (pipeline data)
Sensitivity: CONFIDENTIAL
Retention: Indefinite; no data hygiene schedule

Quality justification: The 2022 CRM migration from a legacy contact management system introduced significant data quality problems that have never been systematically addressed.

Specific issues:

  • 15% of contact records are confirmed duplicates (same person, different records created by different teams)
  • An additional estimated 8% are probable duplicates requiring manual review
  • 26% of company records have incomplete industry classification
  • 34% of contact records are missing job title or role information
  • Email bounce rate on the full database is 11.2%, indicating substantial stale data
  • Lead source attribution is missing for 41% of records created before the HubSpot migration
  • Sales pipeline data is unreliable; many sales staff continue to track opportunities in personal spreadsheets rather than HubSpot
  • No integration with billing or support systems; customer health must be assessed manually by cross-referencing exports
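A first pass at the duplicate problem does not require specialist tooling. The heuristic below treats an exact email match as decisive and falls back to name similarity; the field names (email, name) and the 0.85 threshold are assumptions, not Cloudcore's actual HubSpot schema or a validated cut-off:

```python
from difflib import SequenceMatcher

def likely_duplicates(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Flag two contact records as probable duplicates.

    Exact (case-insensitive) email match is decisive; otherwise compare
    normalised full names. Field names and threshold are illustrative.
    """
    if a.get("email") and a.get("email", "").lower() == b.get("email", "").lower():
        return True
    ratio = SequenceMatcher(
        None, a.get("name", "").lower(), b.get("name", "").lower()
    ).ratio()
    return ratio >= threshold
```

Pairwise comparison across 85,000 records is quadratic, so in practice candidates would first be blocked by company or email domain before applying a check like this.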

5. Billing and Financial Data

System: Internal billing and invoicing system
Volume: ~6,000 invoices per year; monthly usage records for 500+ clients
Quality score: 3.5 / 5
Data owner: Aisha Rahman, CFO
Sensitivity: RESTRICTED (contains payment information)
Retention: 7 years (financial compliance)

Quality justification: Financial data is relatively well-maintained due to audit requirements and regulatory obligations. Invoice records are complete and reconciled monthly.

Specific issues:

  • Usage metering data requires manual validation against service records; discrepancies found in approximately 6% of monthly billing cycles
  • Product categorisation has changed twice in the past three years (CloudSync, DataVault, SecureLink, Analytics Pro are current names); historical data uses legacy product codes that are not consistently mapped
  • Client contract terms are stored as PDF attachments rather than structured data, making automated analysis of contract value, renewal dates, and SLA terms impossible without manual extraction
  • Revenue attribution by sector relies on manually maintained spreadsheet classifications, not CRM data
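Consistent product-code mapping is a prerequisite for longitudinal revenue analysis. A minimal sketch, in which the legacy codes are placeholders (the audit only establishes that a consistent mapping does not yet exist):

```python
# Hypothetical legacy-to-current product mapping. The current product names
# come from the audit; the legacy codes are illustrative placeholders.
LEGACY_PRODUCT_MAP = {
    "CS-01": "CloudSync",
    "DV-01": "DataVault",
    "SL-01": "SecureLink",
    "AP-01": "Analytics Pro",
}

def map_product(code: str) -> str:
    """Map a legacy product code to its current name; flag unmapped codes."""
    return LEGACY_PRODUCT_MAP.get(code, f"UNMAPPED:{code}")
```

Flagging rather than silently dropping unmapped codes makes the residual mapping gap measurable during any historical revenue reconciliation.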

6. Sales Performance Data

System: Combination of HubSpot CRM and regional spreadsheets
Volume: Quarterly records across 6 regions (North, South, East, West, Central, Metro) and 4 product lines
Quality score: 2.5 / 5
Data owner: Sales team; no single owner
Sensitivity: INTERNAL
Retention: ~2 years structured; earlier data in inconsistent formats

Quality justification: Sales data exists in two parallel systems. Marketing tracks leads and top-of-funnel metrics in HubSpot. Individual sales representatives maintain pipeline and closed-deal data in personal or regional spreadsheets.

Specific issues:

  • No single source of truth for sales pipeline; HubSpot and spreadsheet figures often conflict
  • Regional reporting formats are not standardised; Metro and North regions use different column structures
  • Customer segment definitions (Small, Medium, Large, Enterprise, SME) vary between sales reports and CRM records
  • Win/loss data is incomplete; approximately 35% of closed-lost opportunities have no recorded reason for loss
  • Sales representative attribution is clean for 2023 onward but unreliable for earlier periods
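Until the two systems are consolidated, the conflicts between HubSpot and spreadsheet figures can at least be surfaced automatically. The sketch below assumes per-region pipeline totals and an arbitrary 5% tolerance; both are illustrative choices:

```python
def reconcile(hubspot: dict, spreadsheet: dict, tolerance: float = 0.05) -> list:
    """Return regions whose pipeline figures diverge by more than `tolerance`.

    Inputs are {region: pipeline_total} mappings; the shape and the 5%
    tolerance are assumptions for illustration.
    """
    flagged = []
    for region in sorted(set(hubspot) | set(spreadsheet)):
        h = hubspot.get(region, 0.0)
        s = spreadsheet.get(region, 0.0)
        base = max(abs(h), abs(s), 1.0)  # avoid division by zero
        if abs(h - s) / base > tolerance:
            flagged.append(region)
    return flagged
```

Publishing a weekly divergence report like this would quantify the "no single source of truth" problem and show whether adoption measures are working.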

7. HR and Access Control Data

System: Auth0 (identity), Active Directory (on-premise), HR management system
Volume: 47 active employee records; ~120 historical records; access permissions across all systems
Quality score: 3.0 / 5
Data owner: Karen Lee, HR Manager; Raj Patel, IT Manager (shared)
Sensitivity: RESTRICTED
Retention: Employee records: 7 years post-departure; access logs: 12 months

Quality justification: Employee master data is maintained by HR and is generally accurate. However, access control data has known integrity issues.

Specific issues:

  • ~40% of employees have broader system access than their role requires (identified in quarterly access review)
  • Role-based access control (RBAC) definitions are incomplete; actual permissions often diverge from documented role templates
  • Onboarding and offboarding access changes involve manual coordination between HR, IT, and department managers, with no automated workflow
  • Account termination timing is inconsistent: policy states 24 hours, HR procedure states 2 hours, and actual practice varies
  • Auth0 migration from Okta (December 2023) left some access policies referencing the old identity provider
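Access-permission drift of the kind found in the quarterly review is straightforward to detect mechanically once role templates are complete. A minimal set-difference sketch (the permission names are illustrative, not Cloudcore's actual entitlements):

```python
def access_drift(actual: set, role_template: set) -> dict:
    """Report where granted permissions diverge from the documented role template."""
    return {
        "excess": sorted(actual - role_template),   # broader access than the role requires
        "missing": sorted(role_template - actual),  # documented but not granted
    }
```

Running this per employee against each role template would turn the quarterly manual review into a continuous check, though it depends on first closing the gap in RBAC definitions noted above.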

Data Quality Summary

Data Source | Quality Score | AI Readiness | Key Barrier
Infrastructure telemetry | 4.5 / 5 | High | Metric naming inconsistency
Security event logs | 4.0 / 5 | Medium-High | Alert classification accuracy; false positive ratio
Support tickets | 3.0 / 5 | Medium | Inconsistent customer identifiers; sparse descriptions
Billing and financial | 3.5 / 5 | Medium | Contract data unstructured; product code mapping
HR and access control | 3.0 / 5 | Low | Access permission drift; manual processes
Sales performance | 2.5 / 5 | Low | Dual systems; no single source of truth
Customer records (CRM) | 2.0 / 5 | Low | Duplicates, missing fields, no integration

Pattern: Cloudcore’s infrastructure and operational data is relatively clean and well-structured, reflecting mature engineering practices. Customer-facing and commercial data is significantly messier, reflecting the organisational challenges of the CRM migration, rapid growth, and siloed teams. This contrast is the central data quality challenge for any AI initiative targeting customer experience or commercial outcomes.


Data Value Pyramid Assessment

The data value pyramid maps an organisation’s analytics maturity from descriptive (what happened) through diagnostic (why), predictive (what will happen), and prescriptive (what should we do).

Level | Status | Evidence
Descriptive (what happened) | Partially achieved | Power BI dashboards exist for operational metrics. Support metrics (resolution time, satisfaction, ticket volume) are reported weekly. Financial reporting is monthly. However, cross-system views require manual assembly by the data team.
Diagnostic (why it happened) | Minimal | Root cause analysis is performed manually for major incidents. No automated correlation between data sources. Jamal Al-Sayed’s team can investigate specific questions but there is no self-service diagnostic capability.
Predictive (what will happen) | Not attempted | No predictive models exist. Capacity planning uses historical trend extrapolation in spreadsheets. Churn risk is identified reactively (after customers raise concerns), not proactively.
Prescriptive (what should we do) | Not attempted | No automated decision support. Resource allocation, staffing, and investment decisions are based on experience and judgment, not data-driven recommendations.

Assessment: Cloudcore is operating primarily at Level 1 (descriptive) with pockets of Level 2 (diagnostic) for security incidents and major operational issues. Moving to predictive analytics would require solving the data integration challenge first, as no single system currently holds the cross-functional data needed for meaningful prediction.


Integration Architecture Assessment

Current Approach

Cloudcore’s integration architecture is best described as point-to-point with manual bridges. There is no integration middleware, enterprise service bus, or API gateway connecting internal systems.

Integration Type | Examples | Status
Automated point-to-point | Prometheus to Grafana; CrowdStrike to Splunk; GitHub Actions to ArgoCD | Working well within functional silos
Batch file transfer | Service usage data to billing (daily batch); support metrics to Power BI (weekly export) | Functional but error-prone; manual validation required
Manual data transfer | CRM data to financial reporting; support data to customer health assessment; sales data consolidation | Labour-intensive; relies on the 2-person data team
API integration | HubSpot lead capture from website; Auth0 SSO across applications | Limited to a few well-defined use cases

Existing ETL Processes

Cloudcore has no formal ETL platform. Data movement between systems relies on:

  • Scheduled Python scripts (maintained by the development team) for billing data aggregation
  • Manual CSV exports from individual systems into Power BI
  • Splunk’s built-in log collection and normalisation (security data only)
  • Prometheus federation for infrastructure metrics consolidation

These processes are fragile, undocumented, and maintained by individuals rather than teams. The data team has flagged that the loss of either Jamal Al-Sayed or his junior analyst would create immediate knowledge gaps in how data is extracted and transformed.

What Is Missing

Capability | Current State | Impact
Data warehouse | Does not exist | No single source of truth for cross-functional analytics; every analysis requires manual data assembly
Master data management (MDM) | Does not exist | Customer identifiers, product codes, and segment definitions are inconsistent across systems
Real-time data pipelines | Does not exist | All cross-system data movement is batch or manual; minimum latency is daily
API gateway | Does not exist | No centralised API management, rate limiting, or access control for internal integrations
Data catalogue | Does not exist | No inventory of available datasets, their definitions, or their lineage; tribal knowledge only

Compliance and Data Handling

Current Compliance Posture

Framework | Status | Relevance to AI
ISO 27001 | Certified (achieved ~18 months ago) | Requires documented risk assessment for new technology initiatives including AI; controls A.12.1.2 (change management) and A.14.2.2 (system change control) apply
SOC 2 Type II | Compliant (renewed annually) | AI systems processing customer data must meet SOC 2 trust service criteria for security, availability, and confidentiality
Australian Privacy Act (APPs) | Compliant | AI systems using personal information must comply with Australian Privacy Principles; APP 6 (use and disclosure) and APP 11 (security) are most relevant
Notifiable Data Breaches (NDB) | Compliant | Any AI system with access to personal information falls under NDB reporting obligations if compromised
GDPR | Compliant (EU customer data) | AI decisions affecting EU data subjects may trigger Article 22 (automated decision-making) requirements; Data Protection Impact Assessments required
HIPAA | Partially compliant (in progress) | Healthcare client data used for AI training would require Business Associate Agreement coverage and additional safeguards

Healthcare Client Contract Requirements

Cloudcore’s healthcare clients (representing approximately 25% of revenue) operate under contractual terms that include:

  • All patient-adjacent data must remain within Australian data centres
  • Data access must be logged and auditable
  • Any new system processing healthcare data requires prior written notification to the client
  • Annual security assessments must be provided to the client
  • Breach notification within 24 hours (stricter than the NDB scheme’s “as soon as practicable” standard)

AI implication: Any AI system trained on or processing healthcare client data would require individual client notification and potentially contract amendments. Using healthcare data for model training (even anonymised) may require explicit consent depending on contract terms.

Finance Client Contract Requirements

Finance sector clients (representing approximately 20% of revenue) have similarly strict requirements:

  • Data classification and handling procedures must be documented and provided
  • Third-party access to client data (including AI vendor platforms) requires prior approval
  • Regular penetration testing results must be shared
  • Data retention and deletion must follow agreed schedules
  • Real-time monitoring of access to financial data is required

AI implication: Sending finance client data to external AI platforms (e.g., cloud-hosted ML services) may breach third-party access clauses unless explicitly approved. On-premise or private-cloud AI deployment may be necessary for finance workloads.

Australian Privacy Act Obligations Relevant to AI

The Australian Privacy Act and Australian Privacy Principles create several obligations relevant to AI deployment:

  • APP 1 (Open and transparent management): Organisations must have a clearly expressed privacy policy covering how AI uses personal information
  • APP 3 (Collection): Personal information should only be collected where reasonably necessary; AI training data collection must be justified
  • APP 6 (Use and disclosure): Personal information collected for one purpose cannot be used for a materially different purpose (e.g., support ticket data collected for service improvement cannot be repurposed for marketing AI without consent)
  • APP 11 (Security): Organisations must take reasonable steps to protect personal information from misuse, interference, and loss; this extends to AI model security and training data protection
  • Notifiable Data Breaches scheme: Any eligible data breach involving AI systems must be reported to the OAIC and affected individuals as soon as practicable, with the assessment of a suspected breach completed within 30 days

Gap: No AI-Specific Data Impact Assessment Process

Cloudcore currently has no process for assessing the data protection implications of AI initiatives. The existing Data Protection Impact Assessment (DPIA) process covers new systems and data handling changes but does not address AI-specific concerns including:

  • Training data sourcing and consent
  • Model bias and fairness assessment
  • Automated decision-making transparency
  • Model output explainability
  • Training data retention and deletion
  • Re-identification risk from anonymised datasets
  • Cross-border data transfer for cloud AI processing

The data classification policy (POL-DATA-001) remains in draft status and has not been formally approved, further complicating the governance foundation for AI data handling.


Infrastructure Cost Benchmarks

The following cost ranges are based on Australian market rates for organisations at Cloudcore’s scale (~500 clients, ~47 employees, two data centres). All figures are annual unless noted.

Data Platform Costs

Component | Estimated Annual Cost (AUD) | Notes
Cloud data warehouse (e.g., Snowflake, BigQuery, Redshift) | $36,000 to $72,000 | Based on moderate query volume and ~5 TB storage; scales with usage
ETL/data integration platform (e.g., Fivetran, dbt Cloud) | $18,000 to $36,000 | Depends on number of connectors and data volume
Data catalogue and governance tooling | $12,000 to $24,000 | Could start with open-source alternatives to reduce cost
Master data management | $24,000 to $48,000 | Significant implementation effort beyond licensing

AI/ML Platform Costs

Component | Estimated Annual Cost (AUD) | Notes
Managed ML platform (e.g., SageMaker, Azure ML) | $36,000 to $96,000 | Highly variable; depends on compute usage and model training frequency
ML engineer salary | $180,000 to $250,000 | Market rate for Perth/Sydney; scarce talent pool
AI/ML contractor or consulting engagement | $2,000 to $3,500 per day | For specialist advisory or implementation support
MLOps tooling (experiment tracking, model registry) | $6,000 to $18,000 | Could start with open-source (MLflow) at minimal cost

Integration Infrastructure Costs

Component | Estimated Annual Cost (AUD) | Notes
API gateway (e.g., Kong, AWS API Gateway) | $6,000 to $24,000 | AWS API Gateway available through existing partnership
Event streaming platform (e.g., Kafka, managed equivalent) | $18,000 to $48,000 | Only needed if real-time pipelines are required
Integration middleware | $24,000 to $60,000 | Significant implementation cost beyond licensing

Cost Context

Against the proposed $250,000 AI investment envelope, these benchmarks illustrate the trade-offs:

  • A data warehouse plus basic ETL tooling would consume $54,000 to $108,000 annually, leaving limited room for AI-specific investment
  • A single ML engineer at market rate ($180,000 to $250,000) would consume most or all of the budget alone
  • Leveraging existing AWS or Azure partnerships for managed AI services could reduce platform costs but still requires skilled staff to build and maintain models
  • The most cost-effective path may involve using AI features already embedded in existing tools (Splunk ML analytics, HubSpot predictive lead scoring, CrowdStrike AI threat detection) while building foundational data infrastructure

Cross-References

For additional context, supporting resources are available on the Cloudcore Networks website.


Cloudcore Networks is a fictional company created for educational purposes. Any resemblance to real organisations is coincidental.