Research Scientist / AI Systems

Building systems that put AI on a bicycle.

Humans did not beat the condor by becoming better animals; they did it by building the bicycle. I see machine intelligence similarly: progress comes not only from stronger models, but from the algorithms, tools, memory, feedback loops, and data systems that amplify them.

Masafumi Oyamada

小山田 昌史

Chief Scientist, NEC Corporation

Research Interests

I study AI capability as a layered systems problem: Algorithm, Model, Agent, and Data.

At the algorithmic layer, I work on inference-time methods that draw more reasoning from a fixed model. At the model layer, I study search and feedback loops that help models improve. At the agent layer, I build composable systems that combine memory, tools, observations, and planning. At the data layer, I study how retrieval, documents, tables, and external knowledge ground AI in real tasks.

Recent projects include test-time scaling, self-improving agents, workplace-learning browser automation, retrieval-augmented generation, and data-centric AI.

Layered AI systems diagram: algorithm, model, agent, and data connected by search, feedback, and evaluation.

Latest Notes

Think Deeper, Generate Less

Accepted to AI4Science @ ICML

Under the same token budget, algorithm discovery improved more by spending tokens on stronger individual program edits than by generating more candidates.

Peer Review Is Not Just a Checklist

Accepted to ACL (Findings)

Automated reviewers aligned best with human judgments when guided by official conference criteria, not by prompts that imitate reviewer behavior.

Research Archive

Selected Publications

Research highlights on the algorithms, agents, retrieval systems, and data infrastructure that amplify language models.

AI4Science @ ICML 2026

Think Deeper, Generate Less

Effective Harness Engineering for Algorithm Discovery with Coding Agents

Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

Runs that spend more tokens per algorithm cluster near higher Circle Packing scores.

Under the same token budget, algorithm discovery improved more by spending tokens on stronger individual program edits than by generating more candidates. On Circle Packing, this suggests a quality-per-iteration bottleneck for evolutionary search with coding agents.

  1. Vesper with gpt-5.2-codex reached about 2.636 on Circle Packing using 40M tokens.
  2. Broader search produced many more candidates, but with less work per algorithm.
  3. Higher tokens per algorithm aligned with stronger final scores in these runs.

ACM CAIS (Demos) 2026

Agents That Learn the Workplace

cotomi Act: Learning to Automate Work by Watching You

Masafumi Oyamada, Kunihiro Takeoka, Kosuke Akimoto, Ryoma Obara, Masafumi Enomoto, Haochen Zhang, Daichi Haraguchi, Takuya Tamura

Ordinary browsing is distilled into a shared knowledge workspace that both users and agents can use.

Browser agents usually know the page but not the organization behind the task. cotomi Act turns ordinary browsing into shared memory: task boards, timelines, and wiki knowledge that the agent can consult as future work unfolds.

  1. Passive behavior logs are distilled into editable organizational artifacts.
  2. The user and agent read and write the same workspace.
  3. Measured success improves as behavior-derived knowledge accumulates.

ACL (Findings) 2026

Peer Review Is Not Just a Checklist

Evaluating the Impact of Reviewer Guideline Design on LLM-Based Automated Peer Review

Haowen Li, Yoichi Ishibashi, Masafumi Oyamada

RMSE comparisons show that rubric-style reviewer-imitating guidelines often move LLM scores farther from human judgments.

Automated reviewers aligned best with human judgments when guided by official conference criteria, not by prompts that imitate reviewer behavior. The result suggests that peer-review automation needs carefully designed evaluation standards, while overly rigid rubrics can push LLM scores away from human assessments.

  1. Official guidelines produced the strongest score alignment across several model and conference settings.
  2. Reviewer-imitating rubric prompts often increased error instead of reducing it.

Preprint 2026

Observation Design Is Model-Dependent

Read More, Think More: Revisiting Observation Reduction for Web Agents

Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada

Lower-capability web agents improve with compact observations, while stronger models benefit from richer page context and more thinking budget.

Web-agent context is not just a compression problem. The best page representation changes with model capability and thinking budget, so observation pipelines should adapt to the agent rather than standardize on one reduced format.

  1. Accessibility trees are compact and often better for lower-capability models.
  2. HTML gives stronger models extra structure for grounding actions.
  3. Observation history adds value, especially when represented compactly as diffs.
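
The idea of representing observation history as diffs can be illustrated with a minimal sketch. It assumes observations are plain-text accessibility-tree dumps and uses Python's standard `difflib`; the snapshot contents and the `observation_diff` helper are hypothetical and are not taken from the paper.

```python
import difflib


def observation_diff(previous: str, current: str) -> str:
    """Return only the lines that changed between two page observations."""
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
        n=0,  # no surrounding context lines, to keep the diff compact
    )
    return "\n".join(diff)


# Hypothetical accessibility-tree snapshots before and after typing a query.
before = "\n".join([
    'button "Search"',
    'textbox "Query" value=""',
    'link "Help"',
])
after = "\n".join([
    'button "Search"',
    'textbox "Query" value="laptops"',
    'link "Help"',
])

print(observation_diff(before, after))
```

Only the changed textbox line appears in the diff, so an agent's history could carry one full observation plus a sequence of small deltas instead of repeated full pages.
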
ICLR 2026

A Ceiling for Majority-Vote Sampling

Best-of-∞ -- Asymptotic Performance of Test-Time Compute

Junpei Komiyama, Daisuke Oba, Masafumi Oyamada

Accuracy improves with more generations but flattens near the asymptotic majority-vote limit.

More generations can improve LLM answers, but majority voting has a limit. Best-of-infinity studies the N → ∞ regime to separate the accuracy a model could reach with unlimited sampling from the finite compute needed to get close.

  1. Analyzes best-of-N majority voting in the large-sample limit.
  2. Shows why gains from extra generations taper toward an asymptote.
  3. Frames test-time scaling as a question of marginal value.
EACL (Findings) 2026

Chunk Documents by Meaning

SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation

Nobuhiro Ueda, Yuyang Dong, Krisztián Boros, Daiki Ito, Takuya Sera, Masafumi Oyamada

SCAN groups fragmented page elements into larger semantic regions that preserve document context.

Rich documents are hard for RAG because a page can mix charts, text, titles, and images. SCAN shows that retrieval improves when chunks follow semantic regions instead of tiny layout fragments, giving LLMs and VLMs the context they need without processing whole pages.

  1. Groups contiguous document components into coherent semantic boxes.
  2. Improves end-to-end textual RAG by up to 9.4 points and visual RAG by up to 10.4 points.
  3. Evaluated across English and Japanese document QA datasets.

NeurIPS 2025

Let the Model Choose the Step Size

DISC: Dynamic Decomposition Improves LLM Inference Scaling

Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen

DISC adapts reasoning step boundaries instead of committing to a fixed whole-solution, token-level, or sentence-level split.

LLM inference scaling often depends on where a solution is split into steps. DISC makes that boundary adaptive: guided by rollout feedback, it takes coarser steps when progress is clear and finer steps when a decision looks risky, improving compute efficiency across math and coding benchmarks.

  1. Adapts step sizes during inference instead of fixing them by tokens, sentences, or whole solutions.
  2. Spends more sampling effort on difficult prefixes rather than distributing compute uniformly.
  3. Reduces pass@10 error rate by 5.0%, 6.7%, and 10.5% against static decomposition baselines.

EMNLP 2025

Agents for Post-Training Design

LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

Taro Yano, Yoichi Ishibashi, Masafumi Oyamada

LaMDAgent searches post-training pipelines by looping through action enumeration, action selection, model evaluation, and memory updates.

LaMDAgent reframes model adaptation as an iterative design loop: enumerate possible training actions, build a candidate model, evaluate it, and remember what worked. The key idea is that an LLM agent can search over complete post-training pipelines, not just tune isolated hyperparameters.

  1. Searches over SFT, model merging, datasets, prompts, and hyperparameters.
  2. Uses task feedback and memory to guide the next model-building action.

COLM 2025

A Crow Rarely Hatches a Falcon

Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance

Takuya Tamura, Taro Yano, Masafumi Enomoto, Masafumi Oyamada

The lineage-aware method tracks actual benchmark performance more closely across BBH, IFEval, MATH, MMLU-Pro, and the aggregate score.

Model ancestry carries performance information. When LLMs are fine-tuned or merged, their descendants often inherit enough structure from their parents that a lineage-aware predictor can forecast benchmark behavior more accurately.

  1. Models lineage as a family graph rather than treating each model as independent.
  2. Compares against matrix factorization, collaborative filtering, and model-lineage averaging baselines.
  3. Covers BBH, GPQA, IFEval, MATH, MMLU-Pro, and MuSR.

ACL 2025

Attribution as a Data Design Problem

On Synthesizing Data for Context Attribution in Question Answering

Gorjan Radevski, Kiril Gashteovski, Shahbaz Syed, Christopher Malon, Sebastien Nicolas, Chia-Chien Hung, Timo Sztyler, Verena Heußer, Wiem Ben Rim, Masafumi Enomoto, Kunihiro Takeoka, Masafumi Oyamada, Goran Glavaš, Carolin Lawrence

SynQA reaches high attribution F1 with a 1B model while much larger baselines trail behind.

A small attribution model can become competitive when the training data has the right structure. SynQA shows that synthetic evidence paths can train a 1B-parameter model to identify answer-supporting context across several QA settings.

  1. Reframes attribution quality as a question of supervision design, not only model scale.
  2. Reports strong F1 compared with larger zero-shot and ensemble baselines.
  3. Useful for QA systems that need evidence highlighting without relying only on very large models.

Preprint 2025

LLM Judges Need Rubrics

An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability

Yusuke Yamauchi, Taro Yano, Masafumi Oyamada

Across BIGGENBench and EvalBiasBench, removing criteria or references changes LLM-judge reliability across evaluator models.

LLM judges are more reliable when they receive explicit evaluation criteria, not just a reference answer. For open-ended tasks, reliability is shaped by how scoring is specified, and that design choice can materially change results even with the same evaluator model.

  1. Tested on BIGGENBench and EvalBiasBench.
  2. Removing criteria and references sharply reduced reliability.
  3. Clear rubrics made Chain-of-Thought scoring less important.

SIGIR 2025

When Expansion Needs Knowledge

LLM-based Query Expansion Fails for Unfamiliar and Ambiguous Queries

Kenya Abe, Kunihiro Takeoka, Makoto P. Kato, Masafumi Oyamada

Expansion improves retrieval mainly when the query matches knowledge the model already has.

LLM query expansion is not a free retrieval boost. It helps when the model already has enough knowledge about the query, but can hurt search quality when the query asks about unfamiliar concepts or under-specified intent.

  1. Tested across sparse and dense retrievers, including BM25, Contriever, and E5-base-v2.
  2. The failure is not just hallucination; missing or biased knowledge can produce expansion terms that move retrieval away from relevant documents.

Preprint 2025

Reasoning from Ordinary Text

Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

Ordinary legal text is augmented with reconstructed hidden thoughts for reasoning-oriented training.

Can reasoning data be mined from documents that were never written as step-by-step solutions? Reasoning CPT treats ordinary text as the visible trace of an author’s hidden thinking, reconstructs that process synthetically, and turns broad corpora into training data for reasoning.

  1. Uses Law and STEM corpora without task-specific reasoning labels.
  2. Improves MMLU accuracy across STEM, social sciences, humanities, and other categories.
  3. Connects a simple intuition, text reflects thought, to a scalable training recipe.

BioNLP@ACL 2024

RAG Changes Confidence, Not Just Answers

Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain

Shintaro Ozaki, Yuta Kato, Siyuan Feng, Masayo Tomita, Kazuki Hayashi, Wataru Hashimoto, Ryoma Obara, Masafumi Oyamada, Katsuhiko Hayashi, Hidetaka Kamigaito, Taro Watanabe

Retrieval shifts answer probabilities from an overconfident wrong choice toward the evidence-supported medical answer.

A retrieved passage can make a model more or less certain even when the final answer is the same. This work treats output probability as a signal for how medical LLMs respond to evidence, revealing behavior that accuracy alone can miss.

  1. Measures entropy, best probability, accuracy, and calibration error.
  2. Frames confidence as a separate axis for evaluating RAG systems.
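
Two of the measures listed above, entropy and best probability, are simple to compute from a model's distribution over answer choices. The sketch below uses made-up probabilities for one four-choice question to show a case where retrieval leaves the selected answer unchanged but makes the model far more certain; the numbers are illustrative assumptions, not results from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer choices."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def best_probability(probs):
    """Probability assigned to the most likely answer choice."""
    return max(probs)


def predicted_choice(probs):
    """Index of the most likely answer choice."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical probabilities over choices A-D for a single question.
without_retrieval = [0.40, 0.30, 0.20, 0.10]
with_retrieval = [0.85, 0.05, 0.05, 0.05]

for name, probs in [("no RAG", without_retrieval), ("RAG", with_retrieval)]:
    print(
        name,
        "choice:", "ABCD"[predicted_choice(probs)],
        "entropy:", round(entropy(probs), 3),
        "best prob:", best_probability(probs),
    )
```

Accuracy is identical in both rows, while entropy and best probability differ, which is the kind of change an accuracy-only evaluation would not show.
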
NAACL 2025

LLM-Designed Merging Rules Improve Math Reasoning

Can Large Language Models Invent Algorithms to Improve Themselves?

Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

LLM-designed merging rules outperform the seed model and human-designed baselines on GSM8k while matching the best MATH score.

The work asks whether a language model can invent useful model-improvement algorithms, not just apply human-written ones. In model merging, self-generated Python algorithms improve GSM8k from 70.1% to 76.1% and match the strongest MATH result among the compared human-designed methods.

  1. Top discovered GSM8k result: 76.1%, versus 71.9% for Task Arithmetic.
  2. MATH rises from 0.5% for the seed model to 8.5% for the best discovered model.

EMNLP 2024

Small Models, Large Preprocessing Gains

Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing.

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

Across six data-preprocessing tasks, Jellyfish turns 7B-13B base models into much stronger local solvers with large average gains.

Jellyfish shows that local 7B-13B LLMs can become strong data-preprocessing solvers when tuned on task descriptions, injected knowledge, and reasoning traces. The key lesson is practical: for cleaning tables, targeted instruction data can unlock large gains without relying only on bigger hosted models.

  1. Improves over the base models by average DP gains of +35.97, +18.60, and +21.61 points for the 7B, 8B, and 13B variants.
  2. Covers error detection, data imputation, schema matching, entity matching, column type annotation, and attribute value extraction.

CIKM 2024

A Roadmap for Table-Aware LLMs

On the Use of Large Language Models for Table Tasks.

Yuyang Dong, Masafumi Oyamada, Chuan Xiao, Haochen Zhang

The material moves from table-task basics to five LLM strategies: prompting, fine-tuning, RAG, agents, and vision-language models.

How should LLMs be used with tabular data? Table-aware LLM work can be organized around five practical routes: prompting, fine-tuning, RAG, agents, and vision-language models, making a fast-moving design space easier to compare and apply.

  1. Covers table understanding, text-to-SQL, preprocessing, querying, and cleansing.
  2. Connects academic methods with practical data-analysis and data-quality workflows.

Preprint 2024

Indirect Evidence Matters

LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization.

Masafumi Enomoto, Kunihiro Takeoka, Kosuke Akimoto, Kiril Gashteovski, Masafumi Oyamada

LightPAL moves from a directly retrieved passage to indirectly relevant context by walking a passage graph.

Broad summarization queries often need passages that do not match the query directly. LightPAL treats retrieval as navigation through a passage graph, using random walks to surface indirectly relevant context while improving retrieval and summarization metrics in many settings.

  1. Initial retrieval supplies seed passages; graph walks add contextual neighbors.
  2. The design targets cases where relevant passages are sparse across large document collections.
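
The seed-then-walk pattern can be sketched as follows. This is a minimal illustration that assumes the passage graph is already available as an adjacency dictionary; the graph, the passage IDs, and parameters such as `num_walks` and `walk_length` are hypothetical, and the sketch does not cover how the graph itself is constructed.

```python
import random
from collections import Counter


def expand_with_random_walks(graph, seeds, num_walks=200, walk_length=3,
                             top_k=3, rng_seed=0):
    """Rank non-seed passages by how often short random walks from the seeds visit them."""
    rng = random.Random(rng_seed)
    visits = Counter()
    for seed in seeds:
        for _ in range(num_walks):
            node = seed
            for _ in range(walk_length):
                neighbors = graph.get(node, [])
                if not neighbors:
                    break
                node = rng.choice(neighbors)
                if node not in seeds:
                    visits[node] += 1
    return [node for node, _ in visits.most_common(top_k)]


# Hypothetical passage graph: p1-p2-p3 are linked, p4-p5 form a separate cluster.
passage_graph = {
    "p1": ["p2"],
    "p2": ["p1", "p3"],
    "p3": ["p2"],
    "p4": ["p5"],
    "p5": ["p4"],
}

context = expand_with_random_walks(passage_graph, seeds=["p1"])
print(context)
```

Passages that are never reachable from a seed (here `p4` and `p5`) receive no visits, while `p3`, which does not neighbor the seed directly, can still surface through `p2`.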

As target-language data becomes scarcer, multilingual two-stage training overtakes the single-stage alternatives in this setting.

Low-resource LLM training is not simply a matter of repeating scarce data more often. As target-language data shrinks, the best setup can shift from monolingual single-stage training to multilingual two-stage training, with the switch point depending on compute.

  1. Compares multi-epoch, multilingual, and two-stage training in one search space.
  2. Identifies a data-dependent transition in the best training strategy.
  3. Helps practitioners choose what to tune first when target text is scarce.

VLDB Workshops 2024

A Stress Test for LLM Data Work

Large Language Models as Data Preprocessors.

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

LLMs rival specialized preprocessors on several tabular benchmarks, while gaps remain across tasks and datasets.

Tabular preprocessing asks whether LLMs can operate over structured records, not just fluent text. Across error detection, imputation, schema matching, and entity matching, the results are promising but uneven: LLMs can rival specialized tools on some benchmarks, while efficiency and reliability still shape their practical value.

  1. Evaluates four preprocessing categories across public tabular datasets.
  2. GPT-4 reaches top scores on several imputation and entity-matching benchmarks.

*SEM 2024

Three Criteria for Better Zero-Shot Keywords

Relevance, Diversity, and Exclusivity: Designing Keyword-augmentation Strategy for Zero-shot Classifiers.

Taro Yano, Kunihiro Takeoka, Masafumi Oyamada

Keyword augmentation is framed as a two-stage process: generate candidate terms from class names, then rerank them for relevance, exclusivity, and diversity.

Effective keyword augmentation for zero-shot text classification depends on more than semantic similarity. REDEX ranks candidate keywords by task-aware relevance, inter-class exclusivity, and intra-class diversity, improving performance across fully zero-shot and generalized zero-shot settings.

  1. Defines three keyword properties that connect class-name augmentation to classification behavior.
  2. Automatically generates and reranks keywords without extra knowledge bases or labeled data.
  3. Improves average zero-shot accuracy over non-augmented and NPPrompt baselines in the reported experiments.

IEEE Big Data 2023

No Single Best LLM Team

Towards Large Language Model Organization: A Case Study on Abstractive Summarization.

Krisztián Boros, Masafumi Oyamada

Single-call, networked, and hierarchical agent organizations route summarization work through different communication patterns.

Can LLMs summarize better by working as an organization? This work models summarization as a DAG of cooperating LLM agents and shows a practical lesson: network-like and hierarchical workflows can improve faithfulness or quality in some settings, but the best structure depends on the task.

  1. Tested on 5 summarization datasets across news, patents, and dialogue.
  2. Compared similarity, G-Eval, and factuality metrics.
  3. Organizational structure matters more than adding agents blindly.

Preprint 2023

Local LLMs for Private Data Cleaning

Jellyfish: A Large Language Model for Data Preprocessing.

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

Raw data preprocessing tasks are converted into instruction data for tuning compact local Jellyfish models.

Can data preprocessing get LLM-style flexibility without sending sensitive tables to an external API? Jellyfish shows that instruction-tuned 7B-13B local models can handle multiple data cleaning tasks on modest hardware while keeping data closer to where it lives.

  1. Covers error detection, data imputation, schema matching, and entity matching.
  2. Uses DP-specific instruction data, knowledge injection, and reasoning distillation.
  3. Targets local, single-GPU deployment for privacy-sensitive preprocessing.

VLDB 2023

Joinable-Table Discovery as Vector Search

DeepJoin: Joinable Table Discovery with Pre-trained Language Models.

Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, Masafumi Oyamada

DeepJoin combines training, offline indexing, and online search into a vector-retrieval pipeline for joinable columns.

DeepJoin reframes joinable-table discovery as learned retrieval. Instead of comparing raw column values one by one, it trains a language model so joinable columns land near each other in vector space, then uses approximate nearest-neighbor search to find candidates quickly.

  1. The same embedding pipeline supports exact equi-joins and fuzzier semantic joins.
  2. Column-to-text transformations let names, values, statistics, titles, and context inform the representation.
  3. Experiments report stronger precision than approximate baselines on Webtable and Wikitable datasets.
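
The shape of the pipeline (column-to-text, embed, index offline, search online) can be sketched with toy components. Everything here is a stand-in: `toy_embed` uses character-trigram counts instead of a trained language model, brute-force cosine search replaces approximate nearest-neighbor search, and the serialization format in `column_to_text` is an assumption rather than the one used in the paper.

```python
import math
from collections import Counter


def column_to_text(name, values, max_values=5):
    """Serialize a column as text: its name, a few values, and a simple statistic."""
    sample = ", ".join(str(v) for v in values[:max_values])
    return f"{name}: {sample} (distinct={len(set(values))})"


def toy_embed(text):
    """Stand-in for a trained encoder: sparse character-trigram counts."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(columns):
    """Offline step: embed every column in the data lake once."""
    return [(name, toy_embed(column_to_text(name, values))) for name, values in columns]


def search(query_column, index, top_k=1):
    """Online step: embed the query column and return the closest indexed columns."""
    q = toy_embed(column_to_text(*query_column))
    scored = sorted(((cosine(q, vec), name) for name, vec in index), reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical data-lake columns and query column.
lake = [
    ("country", ["Japan", "France", "Brazil", "Kenya"]),
    ("price_usd", [19.99, 5.5, 120.0, 42.0]),
]
index = build_index(lake)
print(search(("nation", ["Japan", "Brazil", "Canada"]), index))
```

The query column shares values with `country` despite having a different name, so it ranks first; a learned encoder would be expected to capture such overlap far more robustly than trigram counts.
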
eCom@SIGIR 2023

Transporting Preferences Across Domains

Cross-Domain User Similarity without Overlapping Attributes via Optimal Transport Theory.

Genki Kusano, Masafumi Oyamada

Optimal transport spreads the music attribute "opera" across nearby movie attributes instead of forcing a single label match.

If “opera” in music is close to several movie genres, its influence should be distributed across them, not forced into one hand-made label match. ATP uses optimal transport to translate preference weight between attribute spaces, then compares users after that translation.

  1. Works without requiring identical attribute names across datasets.
  2. Evaluated on user matching and cross-domain recommendation tasks.

EMNLP (Findings) 2023

Training Retrieval QA for Mismatched Context

Context Quality Matters in Training Fusion-in-Decoder for Extractive Open-Domain Question Answering.

Kosuke Akimoto, Kunihiro Takeoka, Masafumi Oyamada

Exact-match accuracy shifts sharply when FiD is trained on contexts whose evidence quality differs from the evaluation setting.

Retrieval-augmented QA depends not only on the passages supplied at inference time, but also on the quality of passages seen during training. FiD can overfit to that context quality, so a model trained on clean evidence may lose accuracy when retrieval becomes noisier.

  1. Separates context quality from context quantity during FiD training.
  2. Shows train-test mismatch in retrieval quality as a hidden source of QA brittleness.

SIGIR 2022

Testing Whether Added Data Helps

Table Enrichment System for Machine Learning.

Yuyang Dong, Masafumi Oyamada

The interface compares model performance before and after enrichment while showing the added columns that feed the prediction task.

Table augmentation is only valuable if it improves the model that will use it. The system makes enrichment an evaluable workflow: add external columns, train with the enriched table, and compare prediction quality before and after enrichment.

  1. Connects table discovery to ML evaluation in one workflow.
  2. Reports before-and-after metrics such as F1, precision, and recall.
  3. Uses feature-importance views to inspect how added columns affect the model.

EMNLP (1) 2021

Pretrained Knowledge Helps Small Taxonomies Grow

Low-resource Taxonomy Enrichment with Pretrained Language Models.

Kunihiro Takeoka, Kosuke Akimoto, Masafumi Oyamada

Musubu maintains higher Edge-F1 than the baselines across low-resource training sizes.

The useful signal for taxonomy enrichment is not only in the seed taxonomy. Musubu uses pretrained language models as a source of implicit parent-child knowledge, improving parent prediction when only a small number of hierarchy examples are available.

  1. Evaluated on SemEval and real commerce taxonomies.
  2. Achieved the strongest overall Edge-F1 and Hierarchical-F1 among compared methods.

SFDI 2021

Features From Change, Not Similarity

Entity Matching with String Transformation and Similarity-Based Features

Kazunori Sakai, Yuyang Dong, Masafumi Oyamada, Kunihiro Takeoka, Takeshi Okadome

A transformed pair can move toward stronger agreement; the size of that movement becomes a matching feature.

Entity matching breaks when two names look similar for the wrong reason, or different for a superficial one. The key idea is to measure how similarity changes after string transformations, so a gain or drop becomes evidence about whether two records name the same entity.

  1. Similarity gain captures the largest useful increase after transformation.
  2. Dissimilarity gain captures the largest useful decrease after transformation.
  3. The same signals can support both classification and sample selection.
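
The two signals above can be sketched as features computed from how a similarity score moves under string transformations. The sketch assumes `difflib.SequenceMatcher` as the similarity function and a small, hypothetical set of transformations; the paper's actual similarity measures and transformation set are not specified here.

```python
import string
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical transformations applied to both strings of a pair.
TRANSFORMATIONS = {
    "lowercase": str.lower,
    "strip_punctuation": lambda s: s.translate(
        str.maketrans("", "", string.punctuation)),
    "drop_spaces": lambda s: s.replace(" ", ""),
}


def similarity_changes(a: str, b: str):
    """Return (similarity gain, dissimilarity gain) over all transformations."""
    base = similarity(a, b)
    deltas = [similarity(t(a), t(b)) - base for t in TRANSFORMATIONS.values()]
    similarity_gain = max(0.0, max(deltas))      # largest increase
    dissimilarity_gain = max(0.0, -min(deltas))  # largest decrease
    return similarity_gain, dissimilarity_gain


print(similarity_changes("ACME Corp.", "acme corp"))
print(similarity_changes("abc", "abc"))
```

A pair that differs only by casing gains a lot of similarity after lowercasing, which is evidence that the raw difference was superficial; both values could then be fed to a classifier alongside the base similarity.
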
ICWSM 2021

Linking Users Without Profiles

User Identity Linkage for Different Behavioral Patterns across Domains.

Genki Kusano, Masafumi Oyamada

Linking purchase and browsing histories reveals cross-domain behavior that a single service would miss.

Can two accounts be matched when all you see is behavior, not demographics? Purchase logs and browsing logs can support likely identity matches when their histories are mapped into a shared behavioral representation.

  1. Targets cross-domain linkage where the two services record different kinds of actions.
  2. Uses NLP-style representations to compare behavioral histories rather than profile attributes.
  3. Evaluated on Instacart, CTR, and Amazon datasets.

PAKDD (3) 2021

Quality Control Without Throwing Data Away

Quality Control for Hierarchical Classification with Incomplete Annotations.

Masafumi Enomoto, Kunihiro Takeoka, Yuyang Dong, Masafumi Oyamada, Takeshi Okadome

Partial annotations still identify useful regions of the class hierarchy, even when the exact leaf label is missing.

Crowdsourced hierarchical labels often stop before the most specific class. Instead of discarding those incomplete answers, this method combines the label hierarchy with worker-specific reliability to improve leaf-label estimation.

  1. Treats partial annotations as structured evidence, not missing data.
  2. Models workers as stronger for some branches of the hierarchy than others.
  3. Improves hierarchical F-score on real and synthetic datasets.

ICDE 2021

Finding Joins Beyond Exact Matches

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach.

Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, Masafumi Oyamada

PEXESO builds an offline vector index so online queries can retrieve joinable tables by similarity.

Data lakes hide useful joins behind spelling variants, formatting differences, and semantic near-matches. PEXESO embeds textual table values as high-dimensional vectors, so join discovery can search by similarity in value space rather than exact string equality.

  1. Targets joinable columns, not only duplicate cell values.
  2. Uses vector similarity to recover joins that equi-joins miss.
  3. Designed for data integration, augmentation, and analysis workflows.

VLDB J. 2021

Repair Top-k Results Instead of Recomputing Them

Continuous top-k spatial-keyword search on dynamic objects.

Yuyang Dong, Chuan Xiao, Hanxiong Chen, Jeffrey Xu Yu, Kunihiro Takeoka, Masafumi Oyamada, Hiroyuki Kitagawa

Each object update is first matched to affected queries, then only those top-k lists are refilled.

When people, posts, or devices move and change keywords, continuous spatial-keyword search can be maintained efficiently by treating each update as a targeted repair. The system identifies only the registered queries whose top-k answers may change, then refills those lists with grid-indexed spatial and keyword evidence.

  1. Splits maintenance into affected-query detection and top-k refilling.
  2. Indexes both dynamic objects and standing queries inside grid cells.
  3. Supports single-object updates and batched updates over time intervals.

AAAI 2020

Learning from Hesitation

Learning with Unsure Responses.

Kunihiro Takeoka, Yuyang Dong, Masafumi Oyamada

Unsure responses cluster near the dog-wolf decision boundary, showing how hesitation can identify difficult training examples.

When annotators choose Unsure, they may be flagging the hardest examples rather than adding unusable noise. This work treats those responses as boundary information, letting classifiers learn from ambiguity instead of throwing it away.

  1. Introduces an unsure loss based on distance to the decision boundary.
  2. Uses both confident labels and unsure responses during training.
  3. Improves over baselines in synthetic and real annotation settings.
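
One way to picture a loss "based on distance to the decision boundary" is the sketch below. It assumes a linear classifier, uses logistic loss for confident labels, and penalizes unsure examples by their geometric distance to the hyperplane. This penalty form, the `unsure_weight` parameter, and the toy data are illustrative assumptions and should not be read as the paper's exact formulation.

```python
import math


def score(w, b, x):
    """Linear decision function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def confident_loss(w, b, x, y):
    """Logistic loss for a confidently labeled example with y in {-1, +1}."""
    return math.log1p(math.exp(-y * score(w, b, x)))


def unsure_loss(w, b, x):
    """Distance from x to the hyperplane w.x + b = 0 (assumed penalty form)."""
    return abs(score(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


def total_loss(w, b, labeled, unsure, unsure_weight=1.0):
    """Combine confident-label loss with the penalty on unsure examples."""
    loss = sum(confident_loss(w, b, x, y) for x, y in labeled)
    loss += unsure_weight * sum(unsure_loss(w, b, x) for x in unsure)
    return loss


# Hypothetical data: two confident labels and one example marked Unsure.
labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unsure = [(0.0, 1.0)]

through_unsure = total_loss((1.0, 0.0), 0.0, labeled, unsure)
away_from_unsure = total_loss((1.0, 1.0), 0.0, labeled, unsure)
print(through_unsure, away_from_unsure)
```

Both candidate boundaries fit the confident labels equally well in this toy case, but the one passing through the unsure example has the lower total loss, so the hesitation acts as extra information about where the boundary lies.
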
IEEE BigData 2019

Reusable Feature Engineering Patterns

Extracting Feature Engineering Knowledge from Data Science Notebooks.

Masafumi Oyamada

Notebook code is transformed into typed syntax trees so repeated feature-engineering operations can be discovered across projects.

Expert feature engineering often hides in ordinary notebook code, but naming differences make those operations difficult to reuse. This work turns code into typed syntax patterns so recurring transformations can be discovered across data-analysis projects.

  1. Converts source code and notebooks into ASTs.
  2. Uses type and semantic inference to normalize symbols.
  3. Mines frequent subgraphs as candidates for reusable operations.
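
The three steps above can be approximated in a few lines with Python's standard `ast` module. This sketch is a deliberate simplification: it replaces every variable name with a placeholder instead of performing type or semantic inference, and it counts repeated call subtrees rather than running a frequent-subgraph miner. The notebook snippets are hypothetical.

```python
import ast
from collections import Counter


class NormalizeNames(ast.NodeTransformer):
    """Replace every variable name with a placeholder so naming differences vanish."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)


def call_patterns(source: str):
    """Return a normalized dump of every call expression in the source."""
    tree = NormalizeNames().visit(ast.parse(source))
    return [ast.dump(node) for node in ast.walk(tree) if isinstance(node, ast.Call)]


def frequent_patterns(sources, min_support=2):
    """Keep patterns that occur in at least min_support different sources."""
    counts = Counter()
    for source in sources:
        counts.update(set(call_patterns(source)))
    return [pattern for pattern, count in counts.items() if count >= min_support]


# Hypothetical notebook cells that differ only in variable naming.
notebooks = [
    'df["age"] = df["age"].fillna(0)',
    'data["age"] = data["age"].fillna(0)',
    'frame["price"] = frame["price"].round(2)',
]

for pattern in frequent_patterns(notebooks):
    print(pattern)
```

The first two cells collapse to the same normalized pattern even though one uses `df` and the other `data`, which is the effect symbol normalization is meant to achieve.
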
AAAI 2019

Column Meaning Is Contextual

Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables.

Kunihiro Takeoka, Masafumi Oyamada, Shinji Nakadai, Takeshi Okadome

Neighboring columns help resolve an ambiguous numeric Height column against labels such as Age or Temperature.

A column’s meaning is rarely contained in its header alone. Meimei uses cell values and neighboring columns as context, so a numeric field like height is not confused with age or temperature just because the values look plausible.

  1. Models dependencies among column concepts and the table title concept.
  2. The full model reports stronger nDCG@5 than the version without column interdependency.

IEEE BigData 2018

Reuse the Aggregates Analysts Repeat

Accelerating Feature Engineering with Adaptive Partial Aggregation Tree.

Masafumi Oyamada

Frequently accessed ranges get finer cached summaries, letting later aggregation queries reuse more prior work.

Feature engineering often asks the same range-aggregation question in slightly different ways. APA-tree caches partial aggregate results in an adaptive tree, so later queries can reuse prior work instead of rereading the same records from disk.

  1. Targets repeated range queries for statistics such as maximum and standard deviation.
  2. Stores partial results over data groups in an imbalanced binary tree.
  3. Refines cached summaries where the workload repeatedly returns.

J. Inf. Process. 2018

One Format for Sparse and Dense Vectors

Compressed Vector Set: A Fast and Space-Efficient Data Mining Framework.

Masafumi Oyamada, Jianquan Liu, Shinji Ito, Kazuyo Narita, Takuya Araki, Hiroyuki Kitagawa

Runtime speedups vary by compression scheme, but the compressed representations extend useful acceleration from sparse bag-of-words data to dense datasets where ordinary sparse storage performs poorly.

Sparse formats are efficient only when the data really stays sparse. CVS asks whether one compressed representation can handle both sparse and dense vector sets efficiently, while still supporting the operations needed by learning algorithms.

  1. Uses a block-based compressed vector representation.
  2. Runs primitive vector operations without full decompression.
  3. Reports lower memory use and faster processing than conventional sparse vector representations.

ICDM 2017

Behavior-Specific Demographic Experts

Relational Mixture of Experts: Explainable Demographics Prediction with Behavioral Data.

Masafumi Oyamada, Shinji Nakadai

Customer groups and item groups are learned together, then each customer group gets its own demographic prediction model.

A single demographic predictor assumes the same behavioral signals matter for every customer. R-iSVM learns customer and item clusters from purchase histories, then trains a local expert for each customer group, using behavior as interpretable model structure rather than hand-built features.

  1. Replaces one global classifier with cluster-specific prediction models.
  2. Uses purchase behavior to reveal readable customer and item groups.
  3. Evaluated on MovieLens 1M, Ta-Feng, and BeiRen.

SAC 2013

Transactions for Continuous Queries

Continuous query processing with concurrency control: reading updatable resources consistently.

Masafumi Oyamada, Hideyuki Kawashima, Hiroyuki Kitagawa

The same stream events can produce different aggregates depending on whether the query reads one resource version or mixes versions.

Streaming systems often need mutable reference data, but ordinary continuous-query execution does not guarantee a consistent view of it. This work adapts database concurrency control to streaming: two-phase locking, snapshots, and optimistic validation preserve serializable results while processing continues.

  1. Defines read-only transactions derived from continuous-query outputs.
  2. Coordinates those transactions with resource-update transactions.
  3. Compares three concurrency-control strategies for stream processing.
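
The difference between reading one resource version and mixing versions can be sketched as follows. This covers only a snapshot-style read, one of the three strategies named above, and it is an illustration under assumptions rather than the paper's implementation: `VersionedResource`, `run_window`, and the price-table scenario are hypothetical.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping, Tuple


class VersionedResource:
    """Hypothetical mutable reference data that keeps immutable versions."""

    def __init__(self, initial: Dict[str, int]):
        self._versions = [MappingProxyType(dict(initial))]

    def commit(self, updates: Dict[str, int]) -> None:
        # An update transaction publishes a new version; old ones stay readable.
        new_version = dict(self._versions[-1])
        new_version.update(updates)
        self._versions.append(MappingProxyType(new_version))

    def snapshot(self) -> Mapping[str, int]:
        return self._versions[-1]


def run_window(
    events: List[Tuple[str, int]],
    resource: VersionedResource,
    updates: Dict[str, int],
    use_snapshot: bool,
) -> int:
    """Aggregate one window while an update lands between two events."""
    snap = resource.snapshot()          # version fixed at window start
    total = 0
    for idx, (item, qty) in enumerate(events):
        if idx == 1:
            resource.commit(updates)    # simulated concurrent update
        view = snap if use_snapshot else resource.snapshot()
        total += view[item] * qty
    return total
```

With events `[("a", 1), ("b", 2)]`, initial prices `{"a": 10, "b": 20}`, and an update `{"b": 30}`, the snapshot read returns 50, which matches running the whole window before the update. The mixed read returns 70, a value that corresponds to no serial order of the window and the update.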

3PGCIC 2011

Streaming Transactions Without Waiting

Efficient Invocation of Transaction Sequences Triggered by Data Streams.

Masafumi Oyamada, Hideyuki Kawashima, Hiroyuki Kitagawa

Asynchronous invocation lets later stream tuples move forward while earlier transaction results are still pending.

High-rate data streams can stall when every tuple waits for its database transaction to finish. Decoupling transaction invocation from result waiting improved throughput by an order of magnitude in the reported experiments, while an order-preserving variant kept results aligned with the stream.

  1. Models transaction sequences triggered by incoming stream tuples.
  2. Compares synchronous, asynchronous, and order-preserving asynchronous invocation.
  3. Shows that waiting strategy can dominate end-to-end stream transaction throughput.
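
The three invocation styles compared above can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration, not the system from the paper: `transaction` is a hypothetical stand-in that sleeps instead of calling a database, and the worker count is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Tuple

Stream = List[Tuple[int, float]]  # (tuple id, simulated transaction latency)


def transaction(tuple_id: int, delay: float) -> int:
    """Hypothetical stand-in for a database transaction triggered by a tuple."""
    time.sleep(delay)
    return tuple_id


def invoke_sync(stream: Stream) -> List[int]:
    # Each tuple waits for its own transaction before the next one starts.
    return [transaction(t, d) for t, d in stream]


def invoke_async(stream: Stream, workers: int = 4) -> List[int]:
    # Invocation is decoupled from waiting; results arrive in completion order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transaction, t, d) for t, d in stream]
        return [f.result() for f in as_completed(futures)]


def invoke_async_ordered(stream: Stream, workers: int = 4) -> List[int]:
    # Same decoupling, but results are collected in stream order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transaction, t, d) for t, d in stream]
        return [f.result() for f in futures]
```

In this sketch `invoke_async_ordered` always returns results in stream order, while `invoke_async` may return them in whatever order the transactions finish. Any throughput difference depends on the latency of the real transactions, so the sketch makes no claim about the speedup.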

Archive

News Archive

Paper Accepted to AI4Science @ ICML 2026

Our paper "Effective Harness Engineering for Algorithm Discovery with Coding Agents" has been accepted to AI4Science @ ICML 2026.

Publication

Paper Accepted to ACM CAIS 2026 🤖

Our paper "cotomi Act: Learning to Automate Work by Watching You" has been accepted to ACM CAIS'26 (Demos).

Publication

Paper Accepted to ACL 2026 (Findings) 📋

Our paper "Evaluating the Impact of Reviewer Guideline Design on LLM-Based Automated Peer Review" has been accepted to ACL 2026 (Findings).

Publication

Paper Accepted to ICLR 2026

Our paper "Best-of-∞ -- Asymptotic Performance of Test-Time Compute" has been accepted to ICLR 2026.

Publication

Paper Accepted to EACL 2026 (Findings)

Our paper "SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation" has been accepted to EACL 2026 (Findings).

Publication

Paper Accepted to NeurIPS 2025

Our paper "DISC: Dynamic Decomposition Improves LLM Inference Scaling" has been accepted to NeurIPS 2025.

Publication

NEC's cotomi Act Web Agent Beats Humans on WebArena Benchmark! 🤖

NEC's cotomi Act achieved an 80.4% success rate on the WebArena benchmark, a world first, surpassing human performance (78.2%) on web browser automation tasks. Press Release

News

Paper Accepted to EMNLP 2025 🤖

Our paper "LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents" has been accepted to EMNLP 2025.

Publication

Paper Accepted to COLM 2025 🦅

Our paper "Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance" has been accepted to COLM 2025.

Publication

Paper Accepted to ACL 2025 📝

Our paper "On Synthesizing Data for Context Attribution in Question Answering" has been accepted to ACL 2025.

Publication

Paper Accepted to SIGIR 2025 🔍

Our paper "LLM-based Query Expansion Fails for Unfamiliar and Ambiguous Queries" has been accepted to SIGIR 2025.

Publication

Paper Accepted to BioNLP 2025 🏥

Our paper "Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain" has been accepted to BioNLP 2025 (a workshop co-located with ACL 2025).

Publication

Paper Accepted to NAACL 2024 main track 🤖

Our paper "Can Large Language Models Invent Algorithms to Improve Themselves?" has been accepted to NAACL 2024 main track!

Publication

Keynote at ComSys 2024 🗣️

Presented on self-improving LLMs, RAG, and action models at the ComSys 2024 conference.

Talk

Paper Accepted to IEEE Big Data 2024 🤖

Our poster paper "Towards Automated Workflow Construction for AI Agents: A Preliminary Study" has been accepted to IEEE Big Data 2024.

Publication

Paper Accepted to EMNLP 2024 main track 🪼

Our paper "Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing" has been accepted to EMNLP 2024.

Publication

Lecture at Kobe University 🎓

Presented on Real-world Large Language Model Development and Recent Research Trends (such as Test-time Compute) at Kobe University.

Talk

Talk at ACM MM 2024 🎯

Presented at ACM MM'24 on NEC's large language model development.

Talk

Tutorial Accepted to CIKM 2024 📊

Our tutorial "On the Use of Large Language Models for Table Tasks" has been accepted to CIKM 2024.

Publication

Talk at IPSJ Seminar 🎤

Presented at IPSJ Seminar on NEC's large language model development.

Talk

Paper Accepted to *SEM@NAACL 2024 🎯

Our paper "Relevance, Diversity, and Exclusivity: Designing Keyword-augmentation Strategy for Zero-shot Classifiers" has been accepted to *SEM@NAACL 2024.

Publication

Paper Accepted to VLDB 2023 🔍

Our paper "DeepJoin: Joinable Table Discovery with Pre-trained Language Models" has been accepted to VLDB 2023.

Publication

Paper Accepted to IEEE Big Data 2023 📚

Our paper "Towards Large Language Model Organization: A Case Study on Abstractive Summarization" has been accepted to IEEE Big Data 2023.

Publication

Paper Accepted to EMNLP Findings 2023 🔍

Our paper "Context Quality Matters in Training Fusion-in-Decoder for Extractive Open-Domain Question Answering" has been accepted to EMNLP Findings 2023.

Publication

Paper Accepted to PAKDD 2023 🤝

Our paper "QA-Matcher: Unsupervised Entity Matching Using a Question Answering Model" has been accepted to PAKDD 2023.

Publication

Paper Accepted to SIGIR 2022 📊

Our paper "Table Enrichment System for Machine Learning" has been accepted to SIGIR 2022.

Publication