Senior Principal Researcher

NEC

Biography

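I work on research and development in the areas of knowledge and language, toward "a world where everyone can make the most of their abilities." I am also an occasionally lazy programmer.

I currently lead an AI research and development group at NEC that focuses on large language models (LLMs). If you are interested in building LLMs, understanding them, and putting them to practical use, please get in touch.

Interests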

🤖🏢 LLM Organization & AI Agents (BigData 2023)
🔍🤖 Retrieval-Augmented Language Models (EMNLP Findings 2023)
🔍 Data Profiling and Data Discovery (AAAI 2019, ICDE 2021, SIGIR 2022, VLDB 2023)
💽 Heterogeneous Data Integration (ICWSM 2021, SFDI 2021, PAKDD 2023, eCom 2023)
📰 Information Extraction from Unstructured / Semi-structured Data (BIGDATA 2019, EMNLP 2021)
🧠 Knowledge Acquisition (AAAI 2020, PAKDD 2021)
(past) 🕸 Statistical Relational Learning (ICDM 2017, PAKDD 2017)
(past) 💽 Light-weight Materialization of Queries (BIGDATA 2018)
(past) ⌨ Source Code Analysis (BIGDATA 2019)
(past) 🤖 Machine Learning on Compressed Data (JIP 2018, APWEB 2014)
(past) ⚡ Transactional Data Stream Processing (ACR 2013, SAC 2013)

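Awards

IPSJ Transactions on Databases Outstanding Paper Award, 2021

Information Processing Society of Japan (IPSJ)

IPSJ Yamashita SIG Research Award, 2019

Information Processing Society of Japan (IPSJ)

Outstanding Paper Award, 2018

WebDB Forum

JSAI Annual Conference Award, 2016

JSAI

Best Interactive Presentation Award, 2015

DEIM

Best paper runner up award, 2015

APWeb

Outstanding Interactive Presentation Award, 2014

DEIM

Department of Computer Science Chair's Award, 2013

University of Tsukuba

Student Presentation Award, 2013

DEIM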

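Education

Ph.D. in Engineering, 2018

University of Tsukuba

Master of Engineering, 2013

University of Tsukuba

Bachelor of Information Science, 2011

University of Tsukuba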

We are hiring (Internship / Full-time)

Our research team (knowledge-based learning) at NEC Corporation is seeking motivated full-time researchers and internship students who are passionate about working on interdisciplinary research issues that arise from real-world enterprise business. We aim to contribute to both industry and academia. Our research results are commercialized and used by various enterprises such as retailers and consumer products companies. We publish our results in top-level computer science venues (e.g., AAAI, ICDE, ICDM, BigData).

Research topics include, but are not limited to:

Highly scalable ML-aided data integration (DB + ML + HPC)
Crowd-sourcing for data cleaning (DB + ML + HCI)
Machine learning on data sketches (ML + DB)
Knowledge-driven AutoML based on source code analysis (SE + NLP + ML)

Preferred Skills

Business-level English (writing & speaking)
Basic coding skills for data science tasks
Basic knowledge of data structures and algorithms
Academic experience (publication) in one of the following research areas:
- Database Systems (DB)
- Machine Learning (ML)
- Software Engineering (SE)
- Natural Language Processing (NLP)
- Human-Computer Interaction (HCI)
- Information Retrieval (IR)

Contact

Please drop me an E-mail with your CV if you are interested in working with us.

Software

I enjoy building tools that improve productivity. A full list is available on GitHub.

iKeySnail

Provides fully-configurable hardware keyboard functionalities for web browsing on iOS (iPadOS)

xKeySnail

Yet another keyboard remapping tool for X environment

percol

An interactive grep tool in your terminal

KeySnail

Allows you to bind commands to key sequences in Mozilla Firefox

MiSPLi

A Lisp implementation and REPL written in JavaScript, which supports static scoping, lexical closures, macros, and basic special forms

org-js

A parser for org-mode notation written in JavaScript

zlc.el

Provides zsh-like completion for the minibuffer in Emacs

Chaotic Canvas

A chaos fractal generator written in JavaScript

js-doc.el

Adds JSDoc-related functionality to Emacs

lemon-mode.el

A major-mode for LEMON Parser Generator.

Publications

Full list

Genki Kusano, Masafumi Oyamada

2023 SIGIR eCom

Cross-Domain User Similarity without Overlapping Attributes via Optimal Transport Theory

Yuyang Dong, Chuan Xiao, Masafumi Enomoto, Takuma Nozawa, Masafumi Oyamada

2023 VLDB

DeepJoin: Joinable Table Discovery with Pre-trained Language Models

Shogo Hayashi, Yuyang Dong, Masafumi Oyamada

2023 PAKDD

QA-Matcher: Unsupervised Entity Matching Using a Question Answering Model

Yuyang Dong, Masafumi Oyamada

2022 SIGIR

Table Enrichment System for Machine Learning (Demo)

Kunihiro Takeoka, Kosuke Akimoto, Masafumi Oyamada

2021 EMNLP

Low-resource Taxonomy Enrichment with Pretrained Language Models

Kazunori Sakai, Yuyang Dong, Masafumi Oyamada, Kunihiro Takeoka, Takeshi Okadome

2021 SFDI

Entity Matching with String Transformation and Similarity-Based Features

Genki Kusano, Masafumi Oyamada

2021 ICWSM

User Identity Linkage for Different Behavioral Patterns across Domains

Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, Masafumi Oyamada

2021 ICDE

Efficient Joinable Table Discovery in Data Lake: A High-Dimensional Similarity-Based Approach

Masafumi Enomoto, Kunihiro Takeoka, Yuyang Dong, Masafumi Oyamada, Takeshi Okadome

2021 PAKDD

Quality Control for Hierarchical Classification with Incomplete Annotations

Yuyang Dong, Chuan Xiao, Hanxiong Chen, Jeffrey Xu Yu, Kunihiro Takeoka, Masafumi Oyamada, Hiroyuki Kitagawa

2020 VLDB Journal

Continuous Top-k Spatial-Keyword Search on Dynamic Objects

Kunihiro Takeoka, Yuyang Dong, Masafumi Oyamada

2020 AAAI 2020

Learning from Unsure Responses

Many annotation systems provide an unsure option among the labels, because annotators have different levels of expertise and may not have enough confidence to choose a label for some assigned instances. However, existing approaches learn only from labels with a clear class name and ignore the unsure responses. Because unsure responses also account for a proportion of the dataset (e.g., about 10-30% in real datasets), existing approaches incur high costs, such as paying more money or taking more time to collect enough labeled data.

Masafumi Oyamada

2019 IEEE Big Data 2019

Extracting Feature Engineering Knowledge from Data Science Notebooks

Designing good features for machine learning models, which is called feature engineering, is one of the most important tasks in data analysis. Well-designed features, which capture the characteristics of the data, improve the predictive performance and explainability of the model. Since good features generally reflect deep knowledge of the business domain of the data and of the analysis task, feature engineering is considered one of the most difficult phases in data analysis.

Kunihiro Takeoka, Masafumi Oyamada, Shinji Nakadai, Takeshi Okadome

2019 AAAI 2019

Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables

Given a large amount of table data, how can we find the tables that contain the contents we want? A naive search fails when the column names are ambiguous, such as if columns containing stock price information are named “Close” in one table and named “P” in another table. One way of dealing with this problem that has been gaining attention is the semantic annotation of table data columns by using canonical knowledge.

Masafumi Oyamada

2018 IEEE Big Data 2018

Accelerating Feature Engineering with Adaptive Partial Aggregation Tree

The range aggregation query is a fundamental operation in the feature engineering phase of machine learning tasks. It computes statistics, such as the maximum and the standard deviation, over a subset of records. Since feature engineering is a trial-and-error process, data analysts repeatedly issue large numbers of range aggregation queries with different range conditions, which results in a heavy workload. To accelerate such repetitive range aggregation queries, we propose the Adaptive Partial Aggregation Tree (APA-tree), which drastically reduces the amount of I/O incurred in executing range aggregation queries.
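To make the idea of a range aggregation query concrete, the following minimal Python sketch answers repeated range queries (maximum and standard deviation) by reusing precomputed per-block partial aggregates. It is not the APA-tree from the paper, whose structure the abstract does not describe. The names `Partial` and `BlockAggregates`, the fixed block size, and the in-memory list are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Partial:
    """Mergeable summary of some records: count, sum, sum of squares, max."""
    n: int
    s: float
    ss: float
    mx: float

    @staticmethod
    def of(values):
        return Partial(len(values), sum(values),
                       sum(v * v for v in values), max(values))

    def merge(self, other):
        return Partial(self.n + other.n, self.s + other.s,
                       self.ss + other.ss, max(self.mx, other.mx))

    def std(self):
        """Population standard deviation derived from the summary."""
        mean = self.s / self.n
        return math.sqrt(max(self.ss / self.n - mean * mean, 0.0))


class BlockAggregates:
    """Hypothetical helper: fixed-size blocks with precomputed partials."""

    def __init__(self, values, block_size=4):
        self.values = list(values)
        self.block_size = block_size
        self.blocks = [Partial.of(self.values[i:i + block_size])
                       for i in range(0, len(self.values), block_size)]

    def query(self, lo, hi):
        """Aggregate values[lo:hi] (half-open range)."""
        if not 0 <= lo < hi <= len(self.values):
            raise ValueError("empty or out-of-bounds range")
        acc, i = None, lo
        while i < hi:
            b = i // self.block_size
            start = b * self.block_size
            end = start + self.block_size
            if i == start and end <= hi:
                part = self.blocks[b]          # whole block: reuse the partial
                i = end
            else:
                stop = min(hi, end)            # partial block: scan raw records
                part = Partial.of(self.values[i:stop])
                i = stop
            acc = part if acc is None else acc.merge(part)
        return acc


index = BlockAggregates([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], block_size=4)
result = index.query(1, 9)
print(result.mx, result.std())
```

Only the records at the two ragged ends of the range are scanned. Every fully covered block contributes its stored summary, which is the general reason partial aggregation reduces the work of repeated queries.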

Masafumi Oyamada, Jianquan Liu, Shinji Ito, Kazuyo Narita, Takuya Araki, Hiroyuki Kitagawa

2018 Journal of Information Processing (JIP)

Compressed Vector Set: A Fast and Space-Efficient Data Mining Framework

In this paper, we present CVS (Compressed Vector Set), a fast and space-efficient data mining framework that efficiently handles both sparse and dense datasets. CVS holds a set of vectors in a compressed format and conducts primitive vector operations, such as lp-norm and dot product, without decompression. By combining these primitive operations, CVS accelerates prominent data mining or machine learning algorithms including k-nearest neighbor algorithm, stochastic gradient descent algorithm on logistic regression, and kernel methods.
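As a rough illustration of what "primitive vector operations without decompression" can look like, the sketch below computes an lp-norm and a dot product directly on run-length-encoded vectors. The abstract does not specify the compression format used by CVS, so run-length encoding is an assumption here, and the function names `rle_encode`, `rle_lp_norm`, and `rle_dot` are hypothetical.

```python
def rle_encode(values):
    """Encode a sequence as a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_lp_norm(runs, p=2):
    """lp-norm computed per run: each run contributes length * |value|^p."""
    return sum(length * abs(v) ** p for v, length in runs) ** (1.0 / p)


def rle_dot(a, b):
    """Dot product of two run-length-encoded vectors of equal length."""
    if sum(n for _, n in a) != sum(n for _, n in b):
        raise ValueError("vectors must have the same length")
    total = 0
    i = j = 0
    left_a = left_b = 0   # unconsumed length of the current runs
    va = vb = 0
    while True:
        if left_a == 0:
            if i == len(a):
                break
            va, left_a = a[i]
            i += 1
        if left_b == 0:
            if j == len(b):
                break
            vb, left_b = b[j]
            j += 1
        step = min(left_a, left_b)   # overlap of the two current runs
        total += va * vb * step
        left_a -= step
        left_b -= step
    return total


x = rle_encode([0, 0, 0, 2, 2, 5, 0, 0])
y = rle_encode([1, 1, 3, 3, 3, 3, 0, 0])
print(rle_dot(x, y), rle_lp_norm(x, p=1))
```

The cost of both operations depends on the number of runs instead of the number of elements, so vectors with long runs of repeated values are processed faster than their uncompressed form.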

Masafumi Oyamada, Shinji Nakadai

2017 ICDM 2017

Relational Mixture of Experts: Explainable Demographics Prediction with Behavioral Data

Given a collection of basic customer demographics (e.g., age and gender) and their behavioral data (e.g., item purchase histories), how can we predict sensitive demographics (e.g., income and occupation) that not every customer makes available? This demographics prediction problem is modeled as a classification task in which a customer’s sensitive demographic y is predicted from the customer’s feature vector x. So far, two lines of work have tried to produce a “good” feature vector x from the customer’s behavioral data: (1) application-specific feature engineering using behavioral data and (2) representation learning (such as singular value decomposition or neural embedding) on behavioral data.

Katsufumi Tomobe, Masafumi Oyamada, Shinji Nakadai

2017 PAKDD 2017

Link Prediction for Isolated Nodes in Heterogeneous Network by Topic-Based Co-clustering

This paper presents a new probabilistic generative model (PGM) that predicts links for isolated nodes in a heterogeneous network using textual data. In conventional PGMs, a link between two nodes is predicted on the basis of the nodes’ other existing links. This method makes it difficult to predict links for isolated nodes, which happens when new items are recommended. In this study, we first naturally expand the relational topic model (RTM) to a heterogeneous network (Hetero-RTM).

Masafumi Oyamada, Jianquan Liu, Kazuyo Narita, Takuya Araki

2014 APWeb 2014

MOARLE: Matrix Operation Accelerator Based on Run-Length Encoding

Masafumi Oyamada, Hideyuki Kawashima, Hiroyuki Kitagawa

2013 ACM SIGAPP Applied Computing Review

Data Stream Processing with Concurrency Control

Masafumi Oyamada, Hideyuki Kawashima, Hiroyuki Kitagawa

2013 ACM SAC 2013

Continuous Query Processing with Concurrency Control: Reading Updatable Resources Consistently

A recent trend in data stream processing shows the use of advanced continuous queries (CQs) that reference non-streaming resources such as relational data in databases and machine learning models. Since non-streaming resources could be shared among multiple systems, resources may be updated by the systems during the CQ execution. As a consequence, CQs may reference resources inconsistently, and lead to a wide range of problems from inappropriate results to fatal system failures.

Masafumi Oyamada, Hideyuki Kawashima, Hiroyuki Kitagawa

2011 SMDMS 2011

Efficient Invocation of Transaction Sequences Triggered by Data Streams

Experience

Research Fellow (Executive Professional)

NEC Corporation

Jan 2024 – Present Tokyo, Japan

Research Fellow and the Head of a Generative AI Research Group.

Senior Principal Researcher / Director

NEC Corporation

Apr 2022 – Dec 2023 Tokyo, Japan

Senior Principal Researcher and the Director of a research group that focuses on data management, natural language processing, and data mining issues. Also the product manager of NEC Data Enrichment service, an ML-based data preparation platform.

Principal Researcher

NEC Corporation

Apr 2020 – Mar 2022 Tokyo, Japan

Principal investigator of a research team (knowledge-based learning) and product manager of an ML-based data management software product.

Senior Researcher

NEC Corporation

Apr 2017 – Mar 2020 Tokyo, Japan

Principal investigator of a research team (knowledge-based learning). Research topics include

Data Management (Data Integration, Data Indexing, …)
Machine Learning (Multi-label Classification)
Human-Computer Interaction (Crowd Computing)
Information Extraction (Knowledge Extraction)

Researcher

NEC Corporation

Apr 2013 – Mar 2017 Tokyo, Japan

Research on customer behavior data analytics. Research topics include

Bayesian Modeling of Customer Behavior
Context-aware Recommendation
Statistical Relational Learning

Research Staff (Internship)

NTT Research Laboratories

Aug 2011 – Aug 2011 Tokyo, Japan

Research and development on a data stream processing system.

Software Engineer (Part-time)

Clear-Code Inc.

May 2011 – Mar 2013 Tokyo, Japan

Software development. Developed web browser extensions, E-mail reader extensions, and search engine backends. (JavaScript, Ruby, C++).

Software Engineer (Internship)

Hatena Inc.

Aug 2010 – Aug 2010 Tokyo, Japan

Developed a Safari browser extension for the social bookmarking web service Hatena Bookmark.