Department of Computer Science and Technology

Education background

Bachelor of Automation, University of Science and Technology Beijing, China, 1981;

Master of Automation, Kobe University, Japan, 1989;

Ph.D. in Information Engineering, Nagoya Institute of Technology, Japan, 1990.

Social service

Beijing Computer Federation: Deputy Director (2004-);

Department of Computer Science and Technology, Tsinghua University: Vice Dean (2004-2007);

State Key Laboratory of Intelligent Technology and Systems, Tsinghua University: Deputy Director (2002-).

Areas of Research Interest / Research Projects

Intelligent Information Processing, Machine Learning, Text Mining

Bioinformatics

National Basic Research Program of China (The 973 Program): Large-Scale and Real-Text Oriented Chinese Computing: Theories, Methods and Tools (1998-2002);

National 863 High-Tech Program: Research on Bio-Information: Extraction, Evaluation and Integration (2007-2008);

International Project funded by the International Development Research Centre (Canada): Breaking the Barriers to Internet Access (2009-2014).

Research Status

I am interested in a wide range of research topics in intelligent information processing. Before 2004, my group focused on Optical Character Recognition (OCR), Speech Signal Processing and Human-Computer Interaction. The theoretical results of this work have been applied in two systems: OCR Engine (a system for handwritten character recognition) and Aurora (a system developed for blind and visually impaired computer users). Thanks to its high accuracy, OCR Engine was used in the Fifth National Census of China in 2000. The Aurora system provides users with a set of tools including a screen reader, a speech recognition server, a Braille translator, and an email editor. Aurora has a large base of organizational users (schools for the blind) and individual users in China, and was adopted by China's First Computer Certification Test for the Blind. Since 2004, our work has focused on text mining, in particular biomedical text mining and Question Answering (QA).

Information distance is a fundamental concept in our theoretical research. To measure informational similarity between individual objects, we proposed the conditional information distance model, the MIN information distance model, and the multiple-object information distance model. These models extend traditional information distance theory and provide effective routes to practical application. We have successfully applied them to many text mining tasks, such as pattern optimization in biomedical text mining, measuring the information distance between a question and its answer in QA, content similarity measurement in multi-document summarization, and typical/overall document extraction in product review summarization. Our multi-document summarization system ranked 1st in both TAC (Text Analysis Conference) 2008 and TAC 2009, a well-recognized international evaluation for Natural Language Processing methodologies.

Improving the user experience of acquiring information from the internet is the long-term goal of our research. Based on our research results in text mining, we have developed online systems that help users acquire domain-specific or general information from the internet. These systems include ONBIRES (Ontology-Based Biological Relation Extraction System), QUANTA (Question Answering System), and PROCAR (Product Review Mining System). Our bioinformatics text mining system ranked 1st in several tests of BioCreative II, a critical assessment of information extraction systems in the bioinformatics domain, and has attracted the attention of NCBI (United States National Center for Biotechnology Information), a prestigious institute that provides biomedical databases and research tools for scientific users. Since 2008, our collaboration with NCBI has focused on function summarization and gene/protein named entity identification. At present, NCBI is planning to apply our biomedical text mining techniques in PubMed, its biomedical literature search engine. We expect our techniques will soon serve biomedical researchers worldwide through PubMed. Our Question Answering (QA) system aims to provide users with direct answers to general questions (e.g. weather reports, stock quotes or currency exchange rates). These answers are extracted from the internet. The QA system can be customized to accommodate different information sources, yielding search engine-based QA, encyclopedia-based QA and community-based QA, respectively. This improves efficiency while retaining generality, and is most useful in mobile scenarios. We hope our internet information acquisition service will be available to anyone, at any time, anywhere.

My internet information acquisition research is currently supported by Canada's International Development Research Centre (IDRC). IDRC provided us with one million Canadian dollars in financial support, and appointed me as the IDRC Research Chair in Information Technology to conduct this research.

Honors And Awards

Science and Technology Progress Award by Ministry of Education, Second Class: An Off-line Handwriting Recognition System for Chinese Characters and Numbers (1997);

National Award for Science and Technology Progress, Second Class: OCR Character Input System for the Fifth Census in China (2004);

IDRC Research Chair in Information Technology (2009).

Academic Achievement

[1] M. Huang, S. Ding, H. Wang and X. Zhu. Mining Physical Protein-protein Interactions from Literature. Genome Biology 2008, 9 (Suppl 2):S12

[2] H.N. Wang, M.L. Huang, and X.Y. Zhu. Extract Interaction Detection Methods from the Biological Literature. BMC Bioinformatics 2009, 10(Suppl 1):S55

[3] H.N. Wang, S.L. Ding, M.L. Huang and X.Y. Zhu. Exploiting and Integrating Rich Features for Biological Literature Classification. BMC Bioinformatics, 2008, 9(Suppl 3):S4

[4] Y. Hao, X.Y. Zhu, M.L. Huang, and M. Li. Discovering patterns to extract protein-protein interactions from the literature: Part II. Bioinformatics, August 1, 2005; 21(15): 3294-3300. (5.7)

[5] M.L. Huang, X.Y. Zhu, Y. Hao, D.G. Payan, K.B. Qu and M. Li. Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics, July, 2004.

[6] S.L. Ding, G. Cong, C. Yew and X.Y. Zhu. Using Conditional Random Fields to Extract Contexts and Answers of Questions from Online Forums. Proc. Intl. Conf. on ACL, Columbus, Ohio, USA, 2008.

[7] F.T. Li, Y. Tang, M.L. Huang and X.Y. Zhu. Answering Opinion Questions with Random Walks on Graphs. Proc. the Joint Conf. ACL and Intl. Conf. on Natural Language Processing (ACL-IJCNLP 2009), Singapore, pp. 737-745. 2009.

[8] X. Zhang, Y. Hao, X.Y. Zhu and M. Li. Information Distance from a Question to an Answer. Proc. ACM SIGKDD, 2007, California, USA, pp. 874-883.

[9] H.N. Wang, M.L. Huang, and X.Y. Zhu. A Generative Probabilistic Model for Multi-Label Classification. Proc. IEEE Intl. Conf. on Data Mining (ICDM 2008), Pisa, Italy, 2008.

[10] C. Long, M.L. Huang, X.Y. Zhu, and M. Li. Multi-document Summarization by Information Distance. Proc. IEEE Intl. Conf. on Data Mining (ICDM 2009), Miami, USA, 2009.

[11] C. Long, X.Y. Zhu, M. Li, and B. Ma. Information Shared by Many Objects. Proc. Intl. ACM Conf. on Information and Knowledge Management (CIKM 2008), California, USA, 2008.

[12] L. Zhuang, F. Jing and X.Y. Zhu. Movie review mining and summarization. Proc. Intl. ACM Conf. on Information and Knowledge Management (CIKM 2006), Arlington, VA, USA, 2006.