# Architecture Diagrams

## 1. Overall System Architecture

```mermaid
graph TB
    subgraph "Client Applications"
        USER_APP["User Application<br/>(Scala / Java / Python / R)"]
        SPARK_SHELL["spark-shell / pyspark<br/>(REPL)"]
        SPARK_SUBMIT["spark-submit<br/>(Launcher)"]
        CONNECT_CLIENT["Spark Connect Client<br/>(gRPC)"]
    end

    subgraph "Driver Process"
        SC["SparkContext"]
        SS["SparkSession"]
        DAG["DAGScheduler"]
        TS["TaskSchedulerImpl"]
        SB["SchedulerBackend"]
        BM_D["BlockManagerMaster"]
        LLB["LiveListenerBus"]
        UI["Spark Web UI<br/>(Jetty)"]
        CONNECT_SRV["Spark Connect Server<br/>(gRPC)"]
    end

    subgraph "Cluster Manager"
        STANDALONE["Standalone Master/Worker"]
        YARN_RM["YARN ResourceManager"]
        K8S_API["Kubernetes API Server"]
    end

    subgraph "Executor Process (x N)"
        EXEC["Executor"]
        TR["TaskRunner"]
        BM_E["BlockManager"]
        MS["MemoryStore"]
        DS["DiskStore"]
        SM["ShuffleManager"]
    end

    subgraph "External Storage"
        HDFS["HDFS"]
        S3["S3 / GCS"]
        HIVE_MS["Hive Metastore"]
        KAFKA["Apache Kafka"]
    end

    USER_APP --> SC
    USER_APP --> SS
    SPARK_SHELL --> SC
    SPARK_SUBMIT --> SC
    CONNECT_CLIENT --> CONNECT_SRV

    SS --> SC
    CONNECT_SRV --> SS
    SC --> DAG
    DAG --> TS
    TS --> SB
    SC --> BM_D
    SC --> LLB
    LLB --> UI

    SB -->|"Launch/Manage<br/>Executors"| STANDALONE
    SB -->|"Launch/Manage<br/>Executors"| YARN_RM
    SB -->|"Launch/Manage<br/>Executors"| K8S_API

    SB -->|"Send Tasks<br/>(RPC/Netty)"| EXEC
    EXEC --> TR
    EXEC --> BM_E
    BM_E --> MS
    BM_E --> DS
    EXEC --> SM

    BM_E -->|"Register / Report<br/>Block Status (RPC)"| BM_D
    SM -->|"Shuffle Data"| DS

    EXEC -->|"Read/Write"| HDFS
    EXEC -->|"Read/Write"| S3
    SS -->|"Metadata"| HIVE_MS
    EXEC -->|"Consume/Produce"| KAFKA
```
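
The diagram shows the `SchedulerBackend` talking to one of three cluster managers. Which one is used is normally determined by the master URL passed to the application. The following is a minimal sketch of that selection, assuming the commonly documented URL forms (`local[...]`, `spark://`, `yarn`, `k8s://`). The function name `cluster_manager_for` is hypothetical and is not part of Spark.

```python
def cluster_manager_for(master: str) -> str:
    """Map a Spark master URL to the cluster manager box in the diagram.

    Hypothetical helper for illustration only; Spark's own resolution
    logic handles more cases than this sketch does.
    """
    if master == "local" or master.startswith("local["):
        # Local mode runs driver and executors in one JVM.
        return "local (no cluster manager)"
    if master.startswith("spark://"):
        return "Standalone Master/Worker"
    if master == "yarn":
        return "YARN ResourceManager"
    if master.startswith("k8s://"):
        return "Kubernetes API Server"
    raise ValueError(f"unrecognized master URL: {master!r}")


for url in ["local[*]", "spark://host:7077", "yarn", "k8s://https://host:6443"]:
    print(f"{url:28} -> {cluster_manager_for(url)}")
```

Local mode is included for completeness even though it does not appear in the diagram, since no external cluster manager is involved in that case.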

## 2. Spark SQL Query Execution Pipeline

```mermaid
graph LR
    SQL["SQL / DataFrame API"] --> PARSER["SparkSqlParser<br/>(ANTLR)"]
    PARSER --> UNRESOLVED["Unresolved<br/>LogicalPlan"]
    UNRESOLVED --> ANALYZER["Analyzer<br/>(Catalog Resolution)"]
    ANALYZER --> RESOLVED["Resolved<br/>LogicalPlan"]
    RESOLVED --> OPTIMIZER["Catalyst Optimizer<br/>(Rule-based)"]
    OPTIMIZER --> OPTIMIZED["Optimized<br/>LogicalPlan"]
    OPTIMIZED --> PLANNER["SparkPlanner<br/>(Strategies)"]
    PLANNER --> PHYSICAL["Physical<br/>SparkPlan"]
    PHYSICAL --> CODEGEN["Whole-Stage<br/>CodeGen"]
    CODEGEN --> EXEC["Execution<br/>(RDD)"]

    style PARSER fill:#4a90d9,color:#fff
    style ANALYZER fill:#4a90d9,color:#fff
    style OPTIMIZER fill:#4a90d9,color:#fff
    style PLANNER fill:#4a90d9,color:#fff
    style CODEGEN fill:#4a90d9,color:#fff
```
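
The optimizer stage in the pipeline above rewrites a plan by applying rules repeatedly until nothing changes. The sketch below illustrates that idea on a tiny expression tree in plain Python. It is a toy model, not Catalyst: the classes `Lit`, `Col`, `Add` and the functions `transform_up`, `fold_constants`, `drop_zero`, and `optimize` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def transform_up(expr, rule):
    """Apply `rule` to every node, children first (bottom-up)."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr):
    """Rule: replace Add(Lit, Lit) with a single literal."""
    if isinstance(expr, Add) and isinstance(expr.left, Lit) and isinstance(expr.right, Lit):
        return Lit(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr):
    """Rule: x + 0 and 0 + x simplify to x."""
    if isinstance(expr, Add):
        if expr.left == Lit(0):
            return expr.right
        if expr.right == Lit(0):
            return expr.left
    return expr


def optimize(expr, rules, max_iterations=100):
    """Run the rule batch until the tree stops changing (a fixed point)."""
    for _ in range(max_iterations):
        rewritten = expr
        for rule in rules:
            rewritten = transform_up(rewritten, rule)
        if rewritten == expr:
            return rewritten
        expr = rewritten
    return expr


plan = Add(Col("x"), Add(Lit(1), Lit(-1)))
print(optimize(plan, [fold_constants, drop_zero]))  # Col(name='x')
```

The second rule only fires because the first one has already produced `Lit(0)`, which is why rule batches are run to a fixed point rather than once.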

## 3. Module Dependency Diagram

```mermaid
graph TB
    subgraph "High-Level APIs"
        PYSPARK["PySpark<br/>(python/)"]
        SPARKR["SparkR<br/>(R/)"]
        MLLIB["MLlib<br/>(mllib/)"]
        GRAPHX["GraphX<br/>(graphx/)"]
        STREAMING["Spark Streaming<br/>(streaming/)"]
    end

    subgraph "SQL Engine"
        SQL_CORE["sql/core"]
        SQL_CATALYST["sql/catalyst"]
        SQL_API["sql/api"]
        SQL_HIVE["sql/hive"]
        SQL_THRIFT["sql/hive-thriftserver"]
        SQL_CONNECT["sql/connect"]
        SQL_PIPELINES["sql/pipelines"]
    end

    subgraph "Core Engine"
        CORE["core"]
    end

    subgraph "Connectors"
        KAFKA_SQL["connector/<br/>kafka-0-10-sql"]
        AVRO["connector/avro"]
        PROTOBUF["connector/protobuf"]
        KAFKA_DSTREAM["connector/<br/>kafka-0-10"]
    end

    subgraph "Resource Managers"
        YARN["resource-managers/<br/>yarn"]
        K8S["resource-managers/<br/>kubernetes"]
    end

    subgraph "Common Libraries"
        NET_COMMON["common/<br/>network-common"]
        NET_SHUFFLE["common/<br/>network-shuffle"]
        UNSAFE["common/unsafe"]
        KVSTORE["common/kvstore"]
        SKETCH["common/sketch"]
        UTILS["common/utils"]
        VARIANT["common/variant"]
    end

    PYSPARK -.->|"Py4J"| SQL_CORE
    SPARKR -.->|"RBackend (socket)"| SQL_CORE
    MLLIB --> SQL_CORE
    MLLIB --> CORE
    GRAPHX --> CORE
    STREAMING --> CORE

    SQL_THRIFT --> SQL_HIVE
    SQL_HIVE --> SQL_CORE
    SQL_CONNECT --> SQL_CORE
    SQL_PIPELINES --> SQL_CORE
    SQL_CORE --> SQL_CATALYST
    SQL_CATALYST --> SQL_API
    SQL_CORE --> CORE
    SQL_API --> UTILS

    KAFKA_SQL --> SQL_CORE
    AVRO --> SQL_CORE
    PROTOBUF --> SQL_CORE
    KAFKA_DSTREAM --> STREAMING

    YARN --> CORE
    K8S --> CORE

    CORE --> NET_COMMON
    CORE --> NET_SHUFFLE
    CORE --> UNSAFE
    CORE --> KVSTORE
    CORE --> SKETCH
    CORE --> UTILS
    SQL_CORE --> VARIANT
```
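
Because the dependency edges above form a directed acyclic graph, a valid build order can be derived with a topological sort. The sketch below does this with Python's standard `graphlib` module (Python 3.9+) over a subset of the edges copied from the diagram. The names `DEPENDS_ON` and `build_order` are hypothetical, and the edge list reflects the diagram rather than the actual build files.

```python
from graphlib import TopologicalSorter

# module -> modules it depends on (subset of the edges drawn in the diagram)
DEPENDS_ON = {
    "sql/hive-thriftserver": {"sql/hive"},
    "sql/hive": {"sql/core"},
    "sql/connect": {"sql/core"},
    "sql/core": {"sql/catalyst", "core", "common/variant"},
    "sql/catalyst": {"sql/api"},
    "sql/api": {"common/utils"},
    "mllib": {"sql/core", "core"},
    "core": {
        "common/network-common",
        "common/network-shuffle",
        "common/unsafe",
        "common/kvstore",
        "common/sketch",
        "common/utils",
    },
}


def build_order(depends_on):
    """Return modules ordered so that every dependency precedes its dependents.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = build_order(DEPENDS_ON)
for module in order:
    print(module)
```

The exact order among independent modules is not unique; only the relative order of a module and its dependencies is guaranteed.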

## 4. Job Execution Flow

```mermaid
sequenceDiagram
    participant User as User Application
    participant SC as SparkContext
    participant DAG as DAGScheduler
    participant TS as TaskScheduler
    participant SB as SchedulerBackend
    participant Exec as Executor
    participant BM as BlockManager

    User->>SC: action (collect, save, etc.)
    SC->>DAG: submitJob(rdd, partitions)
    DAG->>DAG: createStages (split at shuffle boundaries)
    DAG->>TS: submitTasks(taskSet)
    TS->>TS: Assign tasks by locality
    TS->>SB: launchTasks(taskDescriptions)
    SB->>Exec: LaunchTask (serialized task, via RPC)
    Exec->>Exec: TaskRunner.run()
    Exec->>BM: getOrCompute(blockId)

    alt ShuffleMapTask
        Exec->>BM: ShuffleWriter.write(records)
        Exec->>TS: StatusUpdate(finished, mapStatus)
        TS->>DAG: taskEnded(MapStatus)
    else ResultTask
        Exec->>TS: StatusUpdate(finished, result)
        TS->>DAG: taskEnded(result)
        DAG->>SC: jobCompleted(result)
        SC->>User: return result
    end
```
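
The "split at shuffle boundaries" step in the sequence above can be sketched as follows. This is a simplified model that assumes a linear lineage (no joins or branching) and treats each operation as either narrow or wide. The names `Op` and `split_into_stages` are hypothetical and do not correspond to DAGScheduler internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    wide: bool = False  # True when the operation requires a shuffle of its input


def split_into_stages(lineage):
    """Group a linear chain of operations into stages.

    A wide operation starts a new stage, because its input has to be
    shuffled before it can run. Narrow operations are pipelined into
    the current stage.
    """
    stages = [[]]
    for op in lineage:
        if op.wide and stages[-1]:
            stages.append([])
        stages[-1].append(op.name)
    return stages


lineage = [
    Op("textFile"),
    Op("flatMap"),
    Op("map"),
    Op("reduceByKey", wide=True),
    Op("mapValues"),
    Op("sortByKey", wide=True),
]

for i, stage in enumerate(split_into_stages(lineage)):
    kind = "ResultStage" if i == 2 else "ShuffleMapStage"
    print(f"stage {i} ({kind}): {' -> '.join(stage)}")
```

In this model the first two stages correspond to the `ShuffleMapTask` branch of the diagram and the last stage to the `ResultTask` branch.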

## 5. Memory Management Model

```mermaid
graph TB
    subgraph "JVM Heap (Executor)"
        subgraph "Spark Managed Memory (spark.memory.fraction)"
            subgraph "Execution Memory"
                EXEC_MEM["Shuffles, Joins,<br/>Sorts, Aggregations"]
            end
            subgraph "Storage Memory"
                STORE_MEM["Cached RDDs,<br/>Broadcast Variables"]
            end
        end
        subgraph "User Memory"
            USER_MEM["User Data Structures,<br/>Internal Metadata"]
        end
        subgraph "Reserved Memory (300MB)"
            RESERVED["System Reserved"]
        end
    end

    subgraph "Off-Heap Memory (optional)"
        OFF_EXEC["Off-Heap<br/>Execution Memory"]
        OFF_STORE["Off-Heap<br/>Storage Memory"]
    end

    EXEC_MEM <-->|"Dynamic<br/>Boundary"| STORE_MEM

    style EXEC_MEM fill:#e74c3c,color:#fff
    style STORE_MEM fill:#3498db,color:#fff
    style USER_MEM fill:#2ecc71,color:#fff
    style RESERVED fill:#95a5a6,color:#fff
    style OFF_EXEC fill:#e74c3c,color:#fff
    style OFF_STORE fill:#3498db,color:#fff
```
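
The on-heap regions in the diagram can be estimated with simple arithmetic. The sketch below assumes the documented defaults `spark.memory.fraction = 0.6` and `spark.memory.storageFraction = 0.5`, and the 300 MB reserved region shown above. The function name `unified_memory_regions` is hypothetical, and the heap size passed in is an approximation: the JVM typically reports slightly less usable heap than the configured maximum.

```python
RESERVED_MB = 300  # reserved memory, as shown in the diagram


def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the on-heap regions (in MB) for a given executor heap size.

    memory_fraction  ~ spark.memory.fraction (assumed default 0.6)
    storage_fraction ~ spark.memory.storageFraction (assumed default 0.5)
    """
    usable = heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved memory")
    spark_managed = usable * memory_fraction
    storage = spark_managed * storage_fraction  # initial share; the boundary is dynamic
    execution = spark_managed - storage
    user = usable - spark_managed
    return {
        "reserved": RESERVED_MB,
        "spark_managed": spark_managed,
        "storage": storage,
        "execution": execution,
        "user": user,
    }


for region, size in unified_memory_regions(4096).items():
    print(f"{region:14} {size:8.1f} MB")
```

The storage and execution values are only starting points: as the diagram's "Dynamic Boundary" edge indicates, either side can borrow from the other when it has free space.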

## 6. Spark Connect Architecture

```mermaid
graph LR
    subgraph "Client Process"
        CLIENT_APP["Client Application"]
        CLIENT_LIB["Spark Connect<br/>Client Library"]
        GRPC_CLIENT["gRPC Client"]
    end

    subgraph "Server Process (Driver)"
        GRPC_SRV["gRPC Server"]
        CONNECT_SVC["SparkConnectService"]
        PLANNER_C["Connect Planner"]
        SS_C["SparkSession"]
    end

    CLIENT_APP --> CLIENT_LIB
    CLIENT_LIB --> GRPC_CLIENT
    GRPC_CLIENT -->|"gRPC / Protobuf"| GRPC_SRV
    GRPC_SRV --> CONNECT_SVC
    CONNECT_SVC --> PLANNER_C
    PLANNER_C --> SS_C

    style GRPC_CLIENT fill:#f39c12,color:#fff
    style GRPC_SRV fill:#f39c12,color:#fff
```
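
The key idea in the diagram is that the client ships a description of the query (an unresolved plan) rather than running it, and the server turns that description into work on its own session. The toy model below illustrates this split in plain Python. It substitutes JSON for the gRPC / Protobuf transport and Python lists for the server-side `SparkSession`, and every function name in it is hypothetical.

```python
import json


# --- client side: build a plan description and serialize it ---
def range_plan(start, end):
    return {"op": "range", "start": start, "end": end}


def filter_even(child):
    return {"op": "filter_even", "input": child}


def encode_request(plan):
    # JSON stands in for the Protobuf encoding named in the diagram.
    return json.dumps({"plan": plan}).encode("utf-8")


# --- server side: decode the request, then plan and execute it ---
def execute_plan(plan):
    # Stand-in for the Connect Planner handing a plan to the SparkSession.
    if plan["op"] == "range":
        return list(range(plan["start"], plan["end"]))
    if plan["op"] == "filter_even":
        return [x for x in execute_plan(plan["input"]) if x % 2 == 0]
    raise ValueError(f"unknown operator: {plan['op']}")


def handle_request(payload):
    plan = json.loads(payload.decode("utf-8"))["plan"]
    return execute_plan(plan)


request = encode_request(filter_even(range_plan(0, 10)))
result = handle_request(request)
print(result)  # [0, 2, 4, 6, 8]
```

Nothing is computed on the client in this model; it only builds and encodes the plan, which mirrors the thin-client role shown in the "Client Process" box.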
