3T3D — A Vision Transformer Based 2D-to-3D Model for Architectural Design
2025 · team-member · 11-685

3T3D — A Vision Transformer Based 2D-to-3D Model for Architectural Design

ML · Deep Learning · Generative 3D

A vision-transformer pipeline that lifts a single 2D architectural floorplan into a structured 3D massing model. Trained on a synthetic dataset of paired plans and meshes generated procedurally in Grasshopper. Tests whether ViT attention can learn architectural priors directly from 2D-3D pairs.

Given three orthographic “napkin sketches” (top, side, front view) of a building, can a deep model produce a detailed 3D model of that architecture? Built a Vision Transformer (DINOv2-based) encoder + custom decoder that maps sketches to triplane representations. Trained on a custom dataset we generated ourselves because nothing existing was good enough.

architecture

Hook

Designers think in sketches. Three orthographic napkin sketches (top, side, front) capture most of what a building wants to be. We asked: can a 2D-to-3D transformer produce a detailed 3D model from just those three inputs? Answer: yes — using DINOv2 multi-view embeddings + patch fusion + a custom transformer decoder that outputs triplane features. Output is full 3D geometry the designer can refine.

Context

Course: 11-685 Introduction to Deep Learning, Spring 2025 — final project. Team of 4 (flat — no group leader): Graham Felton (1gfelton, holder of the canonical public repo), Chia Hui Yan, David Chen, Karthick Raja. Work and responsibilities were distributed across all four; the repo living under Graham’s account is a logistics choice, not a hierarchy.

Note: this core trio (David + Chia + Karthick) reunited the following semester for the [[2025-Fall—l43d-cad-mllm|CAD-MLLM L43D project]] — 3T3D was the DL warm-up, L43D the multimodal sequel.

Architecture

Four stages, end-to-end:

1. Input Processing

  • 3 orthographic sketches representing top / side / front views (analogous to floorplan + elevations)
  • Rescaled from 256×256 → 512×512 for the encoder
  • Concatenated along channel dim: tensors of shape [B, C, H, W]

2. Encoder — DINOv2 (frozen)

  • Pretrained DINOv2 Vision Transformer
  • Each view split into patches → patches become tokens
  • Patch embeddings extracted per view
  • The cls token is stripped (no classes in our training data)

3. Fusion

  • Patches from the three views that correspond to the same 3D location are summed into a single embedding
  • This yields a single fused feature vector passed to the decoder

fusion

4. Decoder — Custom Transformer

  • 6 decoder layers, 8 attention heads each, self-attention
  • Tuned balance between runtime, model size, and output quality

decoder

5. Progressive Upsampling

  • R¹⁶ → R¹²⁸ via progressive upsampling
  • Output: triplane representation of the 3D object

Dataset — built from scratch

No existing dataset was suitable for architectural design quality. We built a custom pipeline:

dataset creation

  1. Generate thousands of architectural renderings programmatically
  2. Run TripoSR on each rendering to get a 3D model
  3. Render top / front / side views of each 3D model
  4. Convert each view into a sketchy line drawing via Informative Drawings
  5. Result: triplets of (sketch front, sketch right, sketch top) paired with a 3D mesh (.obj)

Dataset samples:

Sketches3D Models
sketches3d models

Dataset structure (public, Google Drive):

Dataset/
├── sketch/
│   ├── front/     # Front view sketches (.png, .jpg)
│   ├── right/     # Right view sketches
│   └── top/       # Top view sketches
└── 3dmodel/       # 3D meshes (.obj files)

Download dataset

Training

  • Loss: L1 between predicted triplane features and ground-truth triplane features
  • Two-stage schedule:
    1. Freeze DINOv2 encoder; train decoder only until predictions are good
    2. Unfreeze entire model; fine-tune with differentiated learning rates (low for encoder, higher for decoder)
  • Compute: 37 epochs on a single A100; avg ~480 s/epoch; ~5 hours total
  • Tracking: Weights & Biases

training loss

Results

Model learns triplane representations — slowly but steadily. With better LR scheduling + hyperparameter tuning, we believe the score could improve substantially. Example output vs ground truth:

comparison

Outcomes

  • Novel contribution: a VT-based, 3-view sketch-to-3D model specifically for architectural design — not just object-level like most prior 2D→3D work
  • Custom dataset pipeline (architectural renderings → TripoSR → sketchy views → triplane 3D) is reusable
  • Published dataset on Google Drive
  • Final writeup in vault at /assets/3t3d-vit-2d-to-3d/3t3d_writeup.pdf (1.8 MB, 2025-04-28 final)
  • Final presentation video on YouTube
  • Flagship portfolio piece — David’s most substantial ML systems work alongside [[2025-Fall—l43d-cad-mllm|L43D CAD-MLLM]]; pairs nicely since they share 3/4 team members and continue the same “designer-facing generative 3D” thread

David’s contributions

From WhatsApp chat evidence + team discussion:

  • Literature review: shared CLIP, TripoSR, DINOv2 papers (arXiv:2109.14124, arXiv:1807.06358)
  • Dataset management — David provisioned + shared Google Drive folders on 2025-03-29 and during iteration
  • Overleaf report co-authoring (team explicitly called David “our overleaf master”)
  • Reference-research + experiment tracking

(David’s GitHub commits don’t all attribute to his chentianle1117 username — Colab sessions often commit under alt identities.)

Reference lineage

ReferenceUse
DINOv2 (arXiv:2304.07193)Encoder backbone
TripoSR (arXiv:2311.04400)Dataset generation
Informative DrawingsSketch conversion
pix2pix3DRelated-work baseline
Triplane representations (arXiv:2302.08509)Output representation
CLIPConsidered for early pipeline
Hunyuan3D-2Considered
TriplaneGaussianConsidered

Artifacts in vault

Under Portfolio//assets/3t3d-vit-2d-to-3d/:

FileSizePurpose
3t3d_writeup.pdf1.8 MBFinal project writeup (pulled from repo /img/)
project_notebook.ipynb5.1 MBFinal project notebook (from team WhatsApp)
dev_triplane.ipynb1.7 MBTriplane representation development notebook
dev_with_validation.ipynb2.9 MBTraining + validation notebook
arch_diagram.jpg, fusion_diagram.jpg, decoder_diagram.jpgArchitecture figures
dataset_creation1.jpg, data_sketch.png, data_3d.pngDataset pipeline + samples
val_train_loss.pngWandb training curves
comparison.jpgModel output vs ground truth
  • [[2025-Fall—l43d-cad-mllm]] — same core trio (David + Chia + Karthick) + Yizhuo Di, extending the generative-3D arc into multimodal CAD
KEEP EXPLORING Explore the latent space