Extracting Multiwords From Large Document Collection Based N-Gram

M. Nirmala; Dr. E. Ramaraj

doi:https://doi.org/10.14445/22492615/IJPTT-V3I5P104

Volume 3 | Issue 3 | Year 2013 | Article Id. IJPTT-V3I5P104 | DOI : https://doi.org/10.14445/22492615/IJPTT-V3I5P104

Citation:

M. Nirmala, Dr. E. Ramaraj, "Extracting Multiwords From Large Document Collection Based N-Gram," International Journal of P2P Network Trends and Technology (IJPTT), vol. 3, no. 3, pp. 38-41, 2013. Crossref, https://doi.org/10.14445/22492615/IJPTT-V3I5P104

Abstract

Multiword terms (MWTs) are relevant strings of words in text collections. Once automatically extracted, they can be used by an Information Retrieval system to suggest conceptually interesting refinements of its users' information needs. These multiword terms point to relevant information, often corresponding to topics and subtopics in the text collection, and may be especially useful for refining highly generic queries. A new approach is proposed for finding collocations in text documents. A collocation is a set of words that occur together in a corpus more often than would be expected by chance. Collocations are extracted based on the frequency of the joint occurrence of the words as well as the frequency of the individual occurrences of each word in the whole text. Intuitively, for a set of words to be extracted as a collocation, the joint occurrence of the words must be high relative to the occurrences of the constituent individual words.
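
The frequency comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not state which association measure the authors use, so the example below substitutes pointwise mutual information (PMI) over adjacent word pairs as a stand-in. The function name `bigram_pmi`, the `min_count` threshold, and the sample text are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI (an assumed measure, not the paper's)."""
    unigrams = Counter(tokens)                   # individual word frequencies
    bigrams = Counter(zip(tokens, tokens[1:]))   # joint (adjacent pair) frequencies
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), joint in bigrams.items():
        if joint < min_count:
            # Skip rare pairs: PMI is unreliable for very low counts.
            continue
        p_joint = joint / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # High score: the pair co-occurs more often than its parts predict.
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample text, not taken from the paper.
text = (
    "information retrieval systems index text "
    "information retrieval helps users refine queries "
    "users read text and users write text"
)
tokens = text.lower().split()
ranked = bigram_pmi(tokens)

for pair, score in ranked:
    print(" ".join(pair), round(score, 2))
```

In this sample, only the pair "information retrieval" occurs at least twice, so it is the only candidate returned. Its positive score reflects a joint frequency that is high relative to the individual frequencies of its two words. Extending the same counting to longer n-grams would require a measure defined for more than two words, which is left open here.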

Keywords

Multiword terms (MWTs), Information, Collocations, Extraction, Text Document.
