Publications

NVIDIA

NVIDIA Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Large NVIDIA collaboration including I Karmanov

arXiv 2026

Open multimodal model spanning text, image, video, and audio, with strong long-context document understanding (MMLongBench-Doc, OCRBench-V2).

[Paper]

NVIDIA

Nemotron Parse 1.1

K Chumachenko, A Deshmukh, J Seppanen, I Karmanov, C Chen, L Voegtle, P Fischer, et al.

arXiv 2025

Follow-up to Eclair. Lightweight 885M-parameter model that adds a token-compressed variant (20% speed gain), improved reading order for floating elements, and longer output sequences. Released as open weights with an optimized NIM container.

[Paper] [Code / Demo]

NVIDIA

NVIDIA Nemotron Nano V2 VL

140+ authors including I Karmanov

arXiv 2025

Vision-language model built on a hybrid Mamba-Transformer architecture for document understanding, long video comprehension, and reasoning. Supports a 128K-token context, with token reduction for higher throughput.

[Paper]

NVIDIA

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

199 authors including I Karmanov

arXiv 2025

8B and 56B hybrid Mamba-Transformer models offering up to 3x faster inference than comparable Transformers (Qwen-2.5, Llama-3.1) at equal or better accuracy. Eclair was used for PDF-to-text extraction in the pre-training data pipeline.

[Paper]

First Author

Eclair: Extracting Content and Layout with Integrated Reading Order for Documents

I Karmanov (first author), A Deshmukh, L Voegtle, P Fischer, K Chumachenko, T Roman, J Seppanen, J Parmar, J Jennings, A Tao, K Sapra

arXiv 2025

Multimodal encoder-decoder for document understanding. Extracts formatted text (markdown/LaTeX), bounding boxes with semantic classes, and reading order. Asymmetric design (larger vision encoder, lightweight decoder), no positional embeddings in the decoder, multi-token prediction, and a single prompt-controlled output format. Adopted across NVIDIA's pipelines for pre-training data and VLM distillation. Introduces the DROBS benchmark.

[Paper]

NeurIPS

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Z Li, G Chen, S Liu, ... I Karmanov, L Voegtle, P Fischer, ... Z Yu (27 authors)

NeurIPS 2025

Data-centric approach to VLM post-training. Eagle2-9B matches models with up to 70B parameters. Later adopted as the VLM backbone of NVIDIA's GR00T-N1 robotic foundation model.

[Paper] [Code / Demo]

BMVC

Revisiting Single-gated Mixtures of Experts

A Royer, I Karmanov, A Skliar, B Ehteshami Bejnordi, T Blankevoort

BMVC 2022

Revisits simple single-gate MoE architectures with a base-model branch for early exit and regularization. Achieves efficiency-accuracy trade-offs comparable to more complex MoE approaches.

[Paper]

NeurIPS

Modality-Agnostic Topology Aware Localization

FG Zanjani, I Karmanov, H Ackermann, D Dijkman, S Merlin, M Welling, F Porikli

NeurIPS 2021

Unsupervised positioning using optimal transport on isometric embeddings, agnostic to input modality. Applied to WiFi and visual positioning.

[Paper] [Supplemental]

NeurIPS

Deep Learning Frameworks for Weakly-Supervised Indoor Localization

FG Zanjani, H Ackermann, D Dijkman, I Karmanov, et al.

NeurIPS 2021 (Competition & Demos)

Deep learning frameworks for weakly-supervised indoor positioning using WiFi and visual data.

First Author

WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels

I Karmanov (first author), F Zanjani, S Merlin, I Kadampot, D Dijkman

IEEE GLOBECOM 2021

First weakly supervised method for passive indoor positioning using WiFi CSI without precise location labels.

[Paper] [Video]

ICCV

Motion-Augmented Self-Training for Video Recognition at Smaller Scale

K Gavrilyuk, M Jain, I Karmanov, C Snoek

ICCV 2021

Self-training approach for video recognition that uses motion information to improve performance with limited labeled data.

[Paper]

WCNC

Hand Gesture Recognition using 802.11ad mmWave Sensor in the Mobile Device

Y Ren, J Lu, A Beletchi, Y Huang, I Karmanov, D Dijkman

IEEE WCNC 2021

Hand gesture recognition using mmWave radar sensing on mobile devices.

[Paper]

Economics

European Identity and Redistributive Preferences

J Costa-Font, F Cowell (with research contributions from I Karmanov)

CESifo Working Paper / LSE

Empirical causal inference (diff-in-diff) examining how changes in European identity affect preferences for redistribution. Contributed data generation, simulations, and econometric analysis.

[Paper]

2026

NVIDIA Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

2025

Nemotron Parse 1.1

NVIDIA Nemotron Nano V2 VL

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

Eclair: Extracting Content and Layout with Integrated Reading Order for Documents

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

2022

Revisiting Single-gated Mixtures of Experts

2021

Modality-Agnostic Topology Aware Localization

Deep Learning Frameworks for Weakly-Supervised Indoor Localization

WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels

Motion-Augmented Self-Training for Video Recognition at Smaller Scale

Hand Gesture Recognition using 802.11ad mmWave Sensor in the Mobile Device

2015

European Identity and Redistributive Preferences