Proposed: Triangle Mesh Inference with MeshRWKV

Metadata

  • Status: Proposed
  • Deciders: V-Sekai
  • Tags: V-Sekai

Background

MeshGPT uses a transformer model trained on a learned geometric vocabulary to generate triangle meshes, producing clean, coherent, and compact output.

Problem

Adapting GPT-style generative models to the 3D domain is challenging, and automated triangle mesh generation needs a robust solution.

Strategy

The MeshRWKV model employs a sequence-based approach that autoregressively generates triangle meshes as sequences of triangles. Adapting this pipeline to RWKV4 can be broken down into two primary steps:

Step 1: Learning a Vocabulary for Triangle Meshes

The initial step involves learning a vocabulary that can accurately represent triangle meshes from a large collection of 3D object meshes. This vocabulary allows triangles to be encoded into and decoded from a shared embedding space.

A graph convolution encoder, specifically a series of SAGEConv graph convolution layers, processes the mesh in the form of a face graph. For each graph node, the input features include the nine positionally encoded vertex coordinates of the face triangle, its area, the angles between its edges, and the face normal.
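
A minimal sketch of such a face-graph encoder is shown below, assuming PyTorch and PyTorch Geometric's SAGEConv. The input feature dimension, hidden width, output width, and number of layers are illustrative assumptions rather than MeshGPT's actual configuration.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv


class FaceGraphEncoder(nn.Module):
    """Encodes a triangle mesh given as a face graph.

    Each node is one triangle face; two nodes are connected when their faces
    share an edge. Per-node input features concatenate the positionally
    encoded nine vertex coordinates, the face area, the three interior
    angles, and the face normal.
    """

    def __init__(self, in_dim: int, hidden_dim: int = 256, out_dim: int = 192, num_layers: int = 4):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
        self.convs = nn.ModuleList(
            [SAGEConv(dims[i], dims[i + 1]) for i in range(num_layers)]
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_faces, in_dim] per-face features
        # edge_index: [2, num_adjacencies] face-adjacency edges
        for i, conv in enumerate(self.convs):
            x = conv(x, edge_index)
            if i < len(self.convs) - 1:
                x = self.act(x)
        return x  # one latent feature vector per face
```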

These geometrically rich features capture the intricate details of 3D shapes and are quantized into codebook embeddings using residual quantization, which effectively reduces the sequence length of the mesh representation.
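
The following is a minimal sketch of residual quantization over learned codebooks, assuming PyTorch. The codebook size, quantization depth, and use of a straight-through estimator are illustrative assumptions, not the exact MeshGPT setup.

```python
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    """Quantizes continuous face features against a stack of codebooks."""

    def __init__(self, dim: int, codebook_size: int = 1024, depth: int = 3):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(depth)]
        )

    def forward(self, z: torch.Tensor):
        # z: [num_faces, dim] continuous features from the graph encoder
        residual = z
        quantized = torch.zeros_like(z)
        indices = []
        for codebook in self.codebooks:
            # Pick the nearest codebook entry for the current residual.
            dists = torch.cdist(residual, codebook.weight)   # [num_faces, codebook_size]
            idx = dists.argmin(dim=-1)                       # [num_faces]
            chosen = codebook(idx)                           # [num_faces, dim]
            quantized = quantized + chosen
            residual = residual - chosen
            indices.append(idx)
        # Straight-through estimator so gradients reach the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(indices, dim=-1)       # indices: [num_faces, depth]
```

Each face yields several codebook indices (one per quantization level), trading a slightly longer token stream per face for a much smaller codebook, which is the usual motivation for residual over plain vector quantization.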

These embeddings are then arranged in sequence and decoded by a 1D ResNet-34 network, which interprets the face features as a 1D sequence and outputs logits corresponding to the nine discrete coordinates of each face triangle. The codebook size C and the discretization resolution are chosen based on the specific requirements and constraints of the model.
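
Below is a heavily abbreviated sketch of such a 1D residual decoder, assuming PyTorch. The real decoder follows a ResNet-34 layout; here only a handful of basic blocks are stacked, and the channel width and number of discretization bins per coordinate are assumed values.

```python
import torch
import torch.nn as nn


class BasicBlock1D(nn.Module):
    """A simple pre-activation-free residual block over a 1D face sequence."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.conv2(self.act(self.conv1(x))))


class FaceSequenceDecoder(nn.Module):
    def __init__(self, dim: int = 192, channels: int = 256, num_blocks: int = 4, num_bins: int = 128):
        super().__init__()
        self.stem = nn.Conv1d(dim, channels, kernel_size=1)
        self.blocks = nn.Sequential(*[BasicBlock1D(channels) for _ in range(num_blocks)])
        # Nine discrete coordinates per face, each classified over num_bins bins.
        self.head = nn.Conv1d(channels, 9 * num_bins, kernel_size=1)
        self.num_bins = num_bins

    def forward(self, face_embeddings: torch.Tensor) -> torch.Tensor:
        # face_embeddings: [batch, num_faces, dim] -> treat the faces as a 1D sequence.
        x = face_embeddings.transpose(1, 2)              # [batch, dim, num_faces]
        x = self.blocks(self.stem(x))
        logits = self.head(x).transpose(1, 2)            # [batch, num_faces, 9 * num_bins]
        return logits.reshape(logits.shape[0], logits.shape[1], 9, self.num_bins)
```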

Step 2: Autoregressive Mesh Generation

The next step is to train RWKV4 on these quantized geometric embeddings. Given a sequence of geometric embeddings extracted from the triangles of a mesh, the model is trained to predict the codebook index of the next embedding in the sequence.
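
A minimal sketch of this next-index objective is shown below. Here rwkv_model is a placeholder for any sequence model that maps a batch of codebook indices to per-position logits over the codebook; the RWKV-specific architecture itself is not reproduced.

```python
import torch
import torch.nn.functional as F


def training_step(rwkv_model, optimizer, tokens: torch.Tensor) -> float:
    """One optimizer step of next-index prediction.

    tokens: [batch, seq_len] codebook indices for a batch of meshes.
    rwkv_model(inputs) is assumed to return [batch, seq_len, codebook_size] logits.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = rwkv_model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```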

Once trained, the model can be autoregressively sampled to predict sequences of embeddings. These embeddings can then be decoded to generate novel and diverse mesh structures with efficient, irregular triangulations similar to human-crafted meshes.
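
The sketch below illustrates this sampling loop under the same placeholder rwkv_model interface, with hypothetical start-of-mesh and end-of-mesh token ids used to begin and stop generation.

```python
import torch


@torch.no_grad()
def sample_mesh_tokens(rwkv_model, start_token: int, end_token: int,
                       max_len: int = 4096, temperature: float = 1.0) -> list[int]:
    """Autoregressively samples a sequence of codebook indices for one mesh."""
    tokens = torch.tensor([[start_token]])
    for _ in range(max_len):
        logits = rwkv_model(tokens)[:, -1, :] / temperature
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == end_token:
            break
    return tokens.squeeze(0).tolist()
```

The sampled indices are then mapped back to codebook embeddings and passed through the decoder from Step 1 to recover discrete face coordinates.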

Limitations

MeshGPT has made significant strides in direct mesh generation, but it is not without limitations. The autoregressive nature of the model results in slow sampling, with mesh generation times ranging from 30 to 90 seconds. The MeshGPT team's learned tokenization approach has been successful in reducing sequence lengths for single-object generation, but it may not be as effective when applied to scene-scale generation. This presents an opportunity for future enhancement.

Furthermore, MeshGPT’s current computational resources restrict it to using the GPT2-medium transformer, which is less sophisticated than models like Llama2. Given that larger language models tend to benefit from increased data and computational power, expanding these resources could significantly enhance MeshGPT’s performance and capabilities.

It’s important to note that the RWKV (Receptance Weighted Key Value) model’s context size is significantly larger than that of GPT2. In machine learning, context size refers to the amount of information from previous steps that a model can consider while making predictions for the next step. It essentially determines how much ‘memory’ the model has about past data.

The larger context size of RWKV means that it can remember and utilize more past information while generating outputs. This feature potentially enables it to handle larger and more complex tasks, provided there are sufficient computational resources.

However, a larger context size also implies higher computational demands. More processing power and memory are required to store and manage such large amounts of data. This could pose challenges in terms of hardware requirements and efficiency, particularly when dealing with very large datasets or complex tasks.

Benefits

Our approach produces compact meshes with sharp geometric details. It can suggest multiple shape completions for incomplete shapes.

Future Work

Because MeshGPT has already developed a process for converting mesh data into logits, applying the MeshRWKV model to skeletal animation becomes simpler.

The fundamental concept would be to represent a sequence of animation frames (from frame0 to frameN) as meshes. The recurrent nature of RWKV could offer significant advantages in this scenario: it might facilitate efficient encoding and decoding of sequential frames, which could lead to smooth and coherent generated animations.
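
As a speculative illustration of carrying RWKV's recurrent state across frames, the sketch below assumes a placeholder interface logits, state = rwkv_model.forward_state(tokens, state); this is not an actual RWKV API, and the example only scores a frame sequence rather than generating one.

```python
import torch


@torch.no_grad()
def score_animation(rwkv_model, frame_token_sequences: list[torch.Tensor]) -> torch.Tensor:
    """Accumulates log-likelihood over per-frame mesh token sequences.

    The recurrent state produced while reading frame k is reused for frame
    k + 1, so the model never has to re-read earlier frames.
    """
    state = None
    log_likelihood = torch.tensor(0.0)
    for frame_tokens in frame_token_sequences:       # frame0 .. frameN, each [seq_len]
        logits, state = rwkv_model.forward_state(frame_tokens.unsqueeze(0), state)
        log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        targets = frame_tokens[1:].unsqueeze(0).unsqueeze(-1)
        log_likelihood += log_probs.gather(-1, targets).sum()
    return log_likelihood
```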

With the mesh-to-logits conversion challenge addressed by MeshGPT, the primary focus can be on developing new training methodologies and loss functions tailored for animation data. This approach could significantly broaden the applicability of the MeshRWKV model, opening up exciting possibilities for future research and development in the field of 3D animation.

Drawbacks

Training the model may require significant computational resources. Learning a vocabulary for triangle meshes could be complex.

Alternatives

Other methods, such as neural field-based approaches, exist but often lose sharp geometric details or produce over-triangulated meshes.

Special Use Case

Generating complex 3D shapes for niche applications might need additional model training or fine-tuning.

Core & Implemented by Us?

Yes, this proposal is core to our work at V-Sekai and will be implemented by us.

References

  1. V-Sekai · GitHub
  2. V-Sekai/v-sekai-game
  3. MeshGPT
  4. Objaverse-XL
  5. Nous-Hermes-2-Vision
  6. Machine Learning in Elixir
  7. PyTorch Geometric SAGEConv
  8. Inductive Representation Learning on Large Graphs

Assisted by AI assistant Aria.