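I work on research and development in the areas of knowledge and language, toward "a world in which everyone can realize their full potential." I am also an occasionally lazy programmer.
I currently lead a research and development group at NEC working on AI, with a focus on large language models (LLMs). If you are interested in building LLMs, understanding them, and putting them to practical use, please get in touch.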
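IPSJ Transactions on Databases Outstanding Paper Award, 2021
Information Processing Society of Japan (IPSJ)
IPSJ Yamashita SIG Research Award, 2019
Information Processing Society of Japan (IPSJ)
Outstanding Paper Award, 2018
WebDB Forum
JSAI Annual Conference Excellence Award, 2016
JSAI
Best Interactive Presentation Award, 2015
DEIM
Best Paper Runner-up Award, 2015
APWeb
Outstanding Interactive Presentation Award, 2014
DEIM
Department of Computer Science Chair's Award, 2013
University of Tsukuba
Student Presentation Award, 2013
DEIM
Ph.D. in Engineering, 2018
University of Tsukuba
Master of Engineering, 2013
University of Tsukuba
Bachelor of Information Science, 2011
University of Tsukuba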
Our research team (knowledge-based learning) at NEC Corporation is seeking motivated full-time researchers and internship students who are passionate about working on interdisciplinary research problems that arise from real-world enterprise business. We aim to contribute to both industry and academia. Our research results are commercialized and used by a range of enterprises, such as retailers and consumer products companies. We publish our results in top-level computer science venues (e.g., AAAI, ICDE, ICDM, BigData).
Research topics include, but are not limited to:
Please send me an email with your CV if you are interested in working with us.
I like building tools that improve productivity. A list is available on GitHub.
Provides fully configurable hardware keyboard functionality for web browsing on iOS (iPadOS)
Yet another keyboard remapping tool for the X environment
An interactive grep tool in your terminal
Allows you to bind commands to key sequences in Mozilla Firefox
A Lisp implementation and REPL written in JavaScript, which supports static scoping, lexical closures, macros, and basic special forms
A parser for org-mode notation written in JavaScript
Provides zsh-like completion for the minibuffer in Emacs
A chaos fractal generator written in JavaScript
Adds jsdoc-related functionality to Emacs
A major mode for the LEMON Parser Generator
Many annotation systems include an "unsure" option among the labels, because annotators differ in expertise and may not be confident enough to choose a label for some of their assigned instances. However, all existing approaches learn only from labels with a clear class name and ignore the unsure responses. Because unsure responses account for a sizable share of the data (e.g., about 10-30% in real datasets), ignoring them raises the cost, in money or time, of collecting enough labeled data.
Designing good features for machine learning models, known as feature engineering, is one of the most important tasks in data analysis. Well-designed features, which capture the characteristics of the data, improve the predictive performance and explainability of the model. Since good features generally reflect deep knowledge of the business domain of the data and of the analysis task, feature engineering is considered one of the most difficult phases of data analysis.
Given a large amount of table data, how can we find the tables that contain the contents we want? A naive search fails when column names are ambiguous, for example when columns containing stock price information are named “Close” in one table and “P” in another. One approach to this problem that has been gaining attention is the semantic annotation of table columns using canonical knowledge.
The range aggregation query is a fundamental operation in the feature engineering phase of machine learning tasks: it computes statistics, such as the maximum or the standard deviation, over a subset of records. Because feature engineering is a trial-and-error process, data analysts repeatedly issue large numbers of range aggregation queries with varying range conditions, which results in a heavy workload. To accelerate such repetitive range aggregation queries, we propose the Adaptive Partial Aggregation Tree (APA-tree), which drastically reduces the amount of I/O incurred when executing them.
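The general idea of answering range aggregation queries from cached partial aggregates can be illustrated with a plain in-memory segment tree. This is a minimal sketch and not the APA-tree itself: it is static, does not adapt to the query workload, and does not model I/O. The names `Agg` and `PartialAggregateTree` are hypothetical and chosen only for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Agg:
    """Mergeable partial aggregate (hypothetical helper for this sketch)."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    maximum: float = -math.inf

    @staticmethod
    def of(value: float) -> "Agg":
        return Agg(1, value, value * value, value)

    def merge(self, other: "Agg") -> "Agg":
        # Merging is commutative, so partial results can be combined in any order.
        return Agg(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
            max(self.maximum, other.maximum),
        )

    def std(self) -> float:
        """Population standard deviation derived from the partial sums."""
        if self.count == 0:
            return math.nan
        mean = self.total / self.count
        return math.sqrt(max(self.total_sq / self.count - mean * mean, 0.0))


class PartialAggregateTree:
    """Static segment tree whose nodes cache partial aggregates."""

    def __init__(self, values):
        self.n = len(values)
        self.nodes = [Agg()] * (2 * self.n)
        for i, v in enumerate(values):
            self.nodes[self.n + i] = Agg.of(v)
        for i in range(self.n - 1, 0, -1):
            self.nodes[i] = self.nodes[2 * i].merge(self.nodes[2 * i + 1])

    def query(self, lo: int, hi: int) -> Agg:
        """Aggregate values[lo:hi] by merging O(log n) cached nodes."""
        result = Agg()
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                result = result.merge(self.nodes[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = result.merge(self.nodes[hi])
            lo //= 2
            hi //= 2
        return result
```

For example, `PartialAggregateTree(values).query(2, 6)` returns the count, sum, maximum, and (via `std()`) the standard deviation of `values[2:6]` without rescanning those records, which is the kind of reuse that repeated trial-and-error queries can benefit from.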
In this paper, we present CVS (Compressed Vector Set), a fast and space-efficient data mining framework that efficiently handles both sparse and dense datasets. CVS holds a set of vectors in a compressed format and performs primitive vector operations, such as the lp-norm and the dot product, without decompression. By combining these primitive operations, CVS accelerates prominent data mining and machine learning algorithms, including the k-nearest neighbor algorithm, stochastic gradient descent for logistic regression, and kernel methods.
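To give a feel for what "primitive operations without decompression" can mean, here is a minimal sketch that computes a dot product and an lp-norm directly on run-length-encoded vectors. Run-length encoding is an assumption made purely for illustration; the text above does not say which compressed format CVS uses, and the function names `rle_encode`, `rle_dot`, and `rle_lp_norm` are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Encode a sequence as a list of (value, run_length) pairs."""
    return [(v, sum(1 for _ in group)) for v, group in groupby(values)]


def rle_dot(a, b):
    """Dot product of two run-length-encoded vectors of equal length.

    Walks both run lists in lockstep and multiplies whole overlapping runs
    at once, so neither vector is expanded back to its full length.
    """
    i = j = 0
    rem_a = a[0][1] if a else 0  # elements left in the current run of a
    rem_b = b[0][1] if b else 0  # elements left in the current run of b
    total = 0.0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        total += a[i][0] * b[j][0] * step
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return total


def rle_lp_norm(a, p=2):
    """lp-norm of a run-length-encoded vector, one term per run."""
    return sum(abs(v) ** p * n for v, n in a) ** (1.0 / p)
```

The cost of both functions depends on the number of runs rather than on the vector length, which is why operating on the compressed form can pay off when the data compresses well.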
Given a collection of basic customer demographics (e.g., age and gender) and their behavioral data (e.g., item purchase histories), how can we predict sensitive demographics (e.g., income and occupation) that not every customer makes available? This demographics prediction problem is modeled as a classification task in which a customer’s sensitive demographic y is predicted from the customer’s feature vector x. So far, two lines of work have tried to produce a “good” feature vector x from the customer’s behavioral data: (1) application-specific feature engineering using behavioral data and (2) representation learning (such as singular value decomposition or neural embedding) on behavioral data.
This paper presents a new probabilistic generative model (PGM) that predicts links for isolated nodes in a heterogeneous network using textual data. In conventional PGMs, a link between two nodes is predicted on the basis of the nodes’ other existing links. This makes it difficult to predict links for isolated nodes, a situation that arises when new items are recommended. In this study, we first naturally extend the relational topic model (RTM) to a heterogeneous network (Hetero-RTM).
A recent trend in data stream processing is the use of advanced continuous queries (CQs) that reference non-streaming resources, such as relational data in databases and machine learning models. Since non-streaming resources can be shared among multiple systems, they may be updated by those systems during CQ execution. As a consequence, CQs may reference resources inconsistently, which can lead to a wide range of problems, from inappropriate results to fatal system failures.
Principal investigator of a research team (knowledge-based learning). Research topics include
Research on customer behavior data analytics. Research topics include