TII Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts
In the current landscape of computer vision, the standard operating procedure involves a modular ‘Lego-brick’ approach: a pre-trained vision encoder for feature extraction paired with a separate decoder for task prediction. While effective, this architectural separation complicates scaling and bottlenecks the interaction between language and vision.
The Technology Innovation Institute (TII) research team is challenging this paradigm with Falcon Perception, a 600M-parameter unified dense Transformer. By processing image patches and text tokens in a shared parameter space from the very first layer, the team has built an early-fusion stack that handles both perception and task modeling with high efficiency.

The Architecture: A Single Stack for Every Modality
The core design of Falcon Perception is built on the hypothesis that a single Transformer can simultaneously learn visual representations and perform task-specific generation.
Hybrid Attention and GGROPE
Unlike standard language models that use strict causal masking, Falcon Perception employs a hybrid attention strategy. Image tokens attend to each other bidirectionally to build a global visual context, while text and task tokens attend to all preceding tokens (causal masking) to enable autoregressive prediction.
To maintain 2D spatial relationships in a flattened sequence, the research team uses 3D Rotary Positional Embeddings, decomposing the head dimension into a sequential component and a spatial component via Golden Gate ROPE (GGROPE). GGROPE allows attention heads to attend to relative positions along arbitrary angles, making the model robust to rotation and aspect-ratio variations.
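The hybrid attention strategy described above can be made concrete with a small mask-construction sketch. This is an illustrative implementation, not the model's actual code: the function name and boolean-mask convention (True = "may attend") are assumptions.

```python
import numpy as np

def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Hybrid mask: image tokens attend bidirectionally among themselves,
    while text/task tokens attend causally to the full prefix."""
    n = num_image_tokens + num_text_tokens
    mask = np.zeros((n, n), dtype=bool)
    # Image block: fully bidirectional, building global visual context.
    mask[:num_image_tokens, :num_image_tokens] = True
    # Text/task rows: causal over image tokens and all earlier text tokens.
    for q in range(num_image_tokens, n):
        mask[q, : q + 1] = True
    return mask

mask = hybrid_attention_mask(4, 3)
assert mask[0, 3]       # image token sees a "later" image token
assert not mask[4, 5]   # text token cannot see a future token
```

In a real kernel this would be expressed as a mask function rather than a dense matrix, but the allowed/blocked pattern is the same.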
Minimalist Sequence Logic
The basic architectural sequence follows a Chain-of-Perception format:
[Image] [Text] <coord> <size> <seg> … <eos>.
This ensures that the model resolves spatial ambiguity (position and size) as a conditioning signal before generating the final segmentation mask.
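A minimal sketch of this Chain-of-Perception serialization, assuming string placeholder tokens; the actual special-token vocabulary and coordinate encoding are not specified in the article, so the token spellings here are illustrative.

```python
def serialize_instance(expr: str, box: tuple, mask_tokens: list) -> list:
    """Serialize one grounded instance in Chain-of-Perception order:
    coordinates, then size, then segmentation tokens, then end-of-sequence.
    Position and size come first so they condition the mask generation."""
    cx, cy, w, h = box
    seq = [f"<coord:{cx},{cy}>", f"<size:{w},{h}>"]
    seq += [f"<seg:{t}>" for t in mask_tokens]
    seq.append("<eos>")
    return [expr] + seq

tokens = serialize_instance("the red cup", (10, 20, 5, 8), [1, 2])
assert tokens == ["the red cup", "<coord:10,20>", "<size:5,8>",
                  "<seg:1>", "<seg:2>", "<eos>"]
```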
Engineering for Scale: Muon, FlexAttention, and Raster Ordering
The TII research team introduced several optimizations to stabilize training and maximize GPU utilization for these heterogeneous sequences.
Muon Optimization: The research team reports that employing the Muon optimizer for the specialized heads (coordinates, size, and segmentation) led to lower training losses and improved benchmark performance compared to standard AdamW.
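Muon's core idea is to replace each 2-D gradient matrix with an approximately semi-orthogonal update direction via a Newton-Schulz iteration. The sketch below uses the quintic coefficients popularized by the open-source Muon implementation; the paper's exact variant may differ, so treat this as an illustration of the technique rather than the model's training code.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Push the singular values of a 2-D gradient matrix toward 1,
    yielding the near-orthogonal update direction Muon uses."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm => singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                       # iterate on the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x
```

Because the polynomial acts on each singular value independently, a few iterations squeeze all of them toward 1 without an explicit SVD.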
FlexAttention and Sequence Packing: To process images at native resolutions without wasting compute on padding, the model uses a scatter-and-pack strategy. Valid patches are packed into fixed-length blocks, and FlexAttention is used to restrict self-attention within each image sample’s boundaries.
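The per-sample attention restriction boils down to a block-diagonal mask keyed on which image each packed patch came from. In FlexAttention this would be written as a `mask_mod` callback (roughly `lambda b, h, q, kv: ids[q] == ids[kv]`); the dense NumPy version below is a simplified sketch of the same rule.

```python
import numpy as np

def packed_image_mask(sample_ids: np.ndarray) -> np.ndarray:
    """Block-diagonal mask for packed sequences: a patch may only attend
    to patches from the same image sample, so padding-free packing never
    leaks attention across sample boundaries."""
    return sample_ids[:, None] == sample_ids[None, :]

# Three images with 3, 2, and 2 valid patches packed into one sequence.
ids = np.array([0, 0, 0, 1, 1, 2, 2])
mask = packed_image_mask(ids)
assert mask[0, 2] and not mask[2, 3]
```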
Raster Ordering: When multiple objects are present, Falcon Perception predicts them in raster order (top-to-bottom, left-to-right). This was found to converge faster and produce lower coordinate loss than random or size-based ordering.
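Raster ordering is a simple lexicographic sort. The sketch below sorts by the top-left corner of each `(x, y, w, h)` box; whether the model's targets are sorted by corner or by box center is an assumption here.

```python
def raster_order(boxes: list) -> list:
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right,
    matching the target ordering used for multi-object prediction."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))

boxes = [(40, 10, 5, 5), (5, 30, 5, 5), (10, 10, 5, 5)]
assert raster_order(boxes) == [(10, 10, 5, 5), (40, 10, 5, 5), (5, 30, 5, 5)]
```

A deterministic ordering like this gives the autoregressive decoder a consistent target sequence, which is plausibly why it converges faster than random ordering.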
The Training Recipe: Distillation to 685 GT
The model uses multi-teacher distillation for initialization, distilling knowledge from DINOv3 (ViT-H) for local features and SigLIP2 (So400m) for language-aligned features. Following initialization, the model undergoes a three-stage perception training pipeline totaling approximately 685 Gigatokens (GT):
In-Context Listing (450 GT): Learning to ‘list’ the scene inventory to build global context.
Task Alignment (225 GT): Transitioning to independent-query tasks using Query Masking to ensure the model grounds each query solely on the image.
Long-Context Finetuning (10 GT): Short adaptation for extreme density, increasing the mask limit to 600 per expression.
During these stages, the following task-specific serialization is used:
<image> expr1 <present> <coord> <size> <seg> <eoq> expr2 <absent> <eoq> <eos>
The <present> and <absent> tokens force the model to commit to a binary decision on an object’s existence before localization.
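A minimal parser for this serialization makes the existence-before-localization contract explicit: each query chunk ends at `<eoq>`, and the token right after the expression is the binary decision. Token spellings follow the article; the function itself is illustrative.

```python
def parse_queries(tokens: list) -> dict:
    """Split a serialized multi-query sequence on <eoq> and read the
    binary existence decision that precedes any localization tokens."""
    results = {}
    for chunk in " ".join(tokens).split("<eoq>"):
        parts = chunk.split()
        if len(parts) < 2:   # skip empty chunks and the trailing <eos>
            continue
        expr, decision = parts[0], parts[1]
        results[expr] = (decision == "<present>")
    return results

seq = ["expr1", "<present>", "<coord>", "<size>", "<seg>", "<eoq>",
       "expr2", "<absent>", "<eoq>", "<eos>"]
assert parse_queries(seq) == {"expr1": True, "expr2": False}
```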
PBench: Profiling Capabilities Beyond Saturated Baselines
To measure progress, the TII research team introduced PBench, a benchmark that organizes samples into five levels of semantic complexity to disentangle model failure modes.
Main Results: Falcon Perception vs. SAM 3 (Macro-F1)
Falcon Perception significantly outperforms SAM 3 on complex semantic tasks, particularly showing a +21.9 point gain on spatial understanding (Level 3).


FalconOCR: The 300M Document Specialist
The TII team also extended this early-fusion recipe to FalconOCR, a compact 300M-parameter model initialized from scratch to prioritize fine-grained glyph recognition. FalconOCR is competitive with several larger proprietary and modular OCR systems:
olmOCR: Achieves 80.3% accuracy, matching or exceeding Gemini 3 Pro (80.2%) and GPT 5.2 (69.8%).
OmniDocBench: Reaches an overall score of 88.64, ahead of GPT 5.2 (86.56) and Mistral OCR 3 (85.20), though it trails the top modular pipeline PaddleOCR VL 1.5 (94.37).
Key Takeaways
Unified Early-Fusion Architecture: Falcon Perception replaces modular encoder-decoder pipelines with a single dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. It utilizes a hybrid attention mask—bidirectional for visual tokens and causal for task tokens—to act simultaneously as a vision encoder and an autoregressive decoder.
Chain-of-Perception Sequence: The model serializes instance segmentation into a structured sequence (⟨coord⟩ → ⟨size⟩ → ⟨seg⟩), which forces it to resolve spatial position and size as a conditioning signal before generating the pixel-level mask.
Specialized Heads and GGROPE: To manage dense spatial data, the model uses Fourier Feature encoders for high-dimensional coordinate mapping and Golden Gate ROPE (GGROPE) to enable isotropic 2D spatial attention. The Muon optimizer is employed for these specialized heads to balance learning rates against the pre-trained backbone.
Semantic Performance Gains: On the new PBench benchmark, which disentangles semantic capabilities (Levels 0-4), the 600M model demonstrates significant gains over SAM 3 in complex categories, including a +13.4 point lead in OCR-guided queries and a +21.9 point lead in spatial understanding.
High-Efficiency OCR Extension: The architecture scales down to FalconOCR, a 300M-parameter model that achieves 80.3% on olmOCR and 88.64 on OmniDocBench. It matches or exceeds the accuracy of much larger systems like Gemini 3 Pro and GPT 5.2 while maintaining high throughput for large-scale document processing.
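The Fourier Feature coordinate encoding mentioned in the takeaways maps a low-dimensional coordinate into a high-dimensional sin/cos embedding that a regression head can use for fine spatial detail. The sketch below assumes power-of-two frequency bands, a common choice; the model's actual bands are not specified in the article.

```python
import numpy as np

def fourier_features(coords: np.ndarray, num_freqs: int = 8) -> np.ndarray:
    """Map normalized coordinates in [0, 1] to a sin/cos embedding of
    dimension D * 2 * num_freqs, where D is the coordinate dimension."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi  # hypothetical bands: pi * 2^k
    angles = coords[..., None] * freqs           # (..., D, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*coords.shape[:-1], -1)

emb = fourier_features(np.array([[0.25, 0.75]]))
assert emb.shape == (1, 32)  # 2 coords x (sin + cos) x 8 frequencies
```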
Check out the Paper, Model Weights, Repo, and Technical Details.
