ICRA 2026

SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

Ruogu Li, Sikai Li, Yao Mu, Mingyu Ding

AI summary

SldprtNet is a large-scale, fully aligned multimodal dataset of industrial CAD parts. Fine-tuning experiments on a subset indicate that combining rendered images with parametric text improves semantic-driven CAD generation over text-only input.

CAD generation, multimodal dataset, language-driven design, parametric modeling, 3D geometry, industrial parts

Problem

Existing CAD datasets are too small, lack multimodal alignment, and cannot support complex industrial parts or language-driven modeling workflows.

Approach

The authors curated over 242,000 industrial CAD parts and developed automated tools to align each model with parametric text scripts, composite multi-view images, and natural language descriptions.
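As a rough illustration of what "aligned" means here, the sketch below groups files by part identifier and reports which modalities are missing for each part. The file suffixes for the script, image, and description (`.txt`, `.png`, `.md`) and the one-stem-per-part naming are assumptions made for this example; the summary only confirms the `.step` and `.sldprt` model formats, and the actual SldprtNet layout may differ.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed suffixes, one per modality. Only .step and .sldprt come from the
# paper; .txt (script), .png (composite image), .md (description) are
# placeholders chosen for this sketch.
REQUIRED_SUFFIXES = {".step", ".sldprt", ".txt", ".png", ".md"}


def group_by_part(filenames):
    """Map each part identifier (file stem) to the set of suffixes found."""
    groups = defaultdict(set)
    for name in filenames:
        path = PurePosixPath(name)
        groups[path.stem].add(path.suffix.lower())
    return dict(groups)


def missing_modalities(filenames):
    """Return {part_id: sorted missing suffixes} for incomplete parts only."""
    report = {}
    for part_id, suffixes in group_by_part(filenames).items():
        missing = REQUIRED_SUFFIXES - suffixes
        if missing:
            report[part_id] = sorted(missing)
    return report


files = [
    "p1.step", "p1.sldprt", "p1.txt", "p1.png", "p1.md",
    "p2.step", "p2.png",
]
print(missing_modalities(files))
```

A part counts as fully aligned in this sketch only when every required suffix is present under the same stem, so `p1` passes and `p2` is reported as incomplete.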

Key results

242,606 aligned industrial CAD parts
Custom encoder/decoder tools supporting 13 CAD command types
Fully aligned multimodal samples across geometry, images, and text
Empirical validation showing multimodal inputs outperform text-only inputs
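The encoder/decoder pair listed above is described as a lossless mapping between models and structured text. The toy sketch below shows the round-trip property on a made-up text format. The `Command` class, the command names (`Circle`, `Extrude`), and the `name key=value` line syntax are all hypothetical; they are not the paper's format or its 13 command types.

```python
from dataclasses import dataclass, field


@dataclass
class Command:
    """One modeling step: a name plus numeric parameters (hypothetical)."""
    name: str
    params: dict = field(default_factory=dict)


def encode(commands):
    """Serialize commands to text, one 'Name key=value ...' line each.

    repr() of a Python float round-trips exactly through float(), which is
    what makes this toy format lossless for numeric parameters.
    """
    lines = []
    for cmd in commands:
        fields = [f"{key}={value!r}" for key, value in cmd.params.items()]
        lines.append(" ".join([cmd.name, *fields]))
    return "\n".join(lines)


def decode(text):
    """Parse the text produced by encode() back into Command objects."""
    commands = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *fields = line.split()
        params = {}
        for item in fields:
            key, value = item.split("=", 1)
            params[key] = float(value)
        commands.append(Command(name, params))
    return commands


cmds = [
    Command("Circle", {"cx": 0.0, "cy": 0.0, "r": 12.5}),
    Command("Extrude", {"depth": 0.1 + 0.2}),
]
assert decode(encode(cmds)) == cmds  # round trip preserves every value
```

The check on the last line is the point of the example: a representation is lossless when decoding the encoded text reproduces the original command sequence exactly, including floating-point values that have no short decimal form.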

Why it matters

It provides a foundational resource for advancing language-driven CAD modeling, geometric deep learning, and automated industrial design workflows.

Abstract

We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training/fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part’s appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a subset of the dataset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. SldprtNet features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.
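The composite image mentioned in the abstract merges seven rendered views into a single picture. A minimal sketch of how such a layout could be computed follows. The grid shape (4 columns), the 256-pixel tile size, and the function name `composite_layout` are assumptions for illustration only; the abstract does not specify how the seven views are arranged.

```python
import math


def composite_layout(n_views, tile_w, tile_h, columns):
    """Return the canvas size and the top-left pixel offset of each view.

    Views are placed row by row on a fixed-width grid; cells left over in
    the last row simply stay empty.
    """
    rows = math.ceil(n_views / columns)
    offsets = [
        ((index % columns) * tile_w, (index // columns) * tile_h)
        for index in range(n_views)
    ]
    canvas_size = (columns * tile_w, rows * tile_h)
    return canvas_size, offsets


# Assumed example: seven 256x256 renders on a 4-column grid.
size, offsets = composite_layout(n_views=7, tile_w=256, tile_h=256, columns=4)
print(size)     # canvas width and height in pixels
print(offsets)  # where each rendered view would be pasted
```

With an image library, each rendered view would be pasted at its offset on a blank canvas of the returned size, so a model receives one image per part instead of seven.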

Index terms

Data Sets for Robot Learning; Representation Learning; Data Sets for Robotic Vision