WORKSHOP IV

Annotated Bibliography

Structured Information Searches

Refining Research Questions


  1. Given that there are multiple types of text classifiers, depending on the methods used to build them, is it possible to determine which one is most efficient for a given task?

  2. Does combining different types of classifiers result in greater performance and benefit?

Annotated Bibliography

Reference | Justification | Notes


1

@ARTICLE{Sebastiani02,
  author  = {Fabrizio Sebastiani},
  title   = {Machine learning in automated text categorization},
  journal = {ACM Computing Surveys},
  year    = {2002},
  volume  = {34},
  number  = {1},
  pages   = {1--47},
  url     = {http://www.math.unipd.it/~fabseb60/Publications/ACMCS02.pdf}
}


This paper presents a very thorough survey of the literature on automatic text classification. It serves to contextualize the topic of interest and gives a general view of how this application of machine learning has developed.

Its author, Fabrizio Sebastiani, is among the most prominent researchers in the field.

The article presents, clearly and in an orderly way, the background and definition of the text classification task, the most relevant fields of application, the machine learning notions involved in the area, and the issues related to indexing. It then focuses on the construction of a classifier.

It also reviews several of the algorithms used to build classifiers, and then evaluates and compares them against each other.

Finally, it discusses the future outlook of the field.

Although aimed at the scientific community, the way the article is written makes it easy to read, and the examples given are quite clear.

2

@ARTICLE{Attardi98,
  author  = {Attardi, Giuseppe and Di Marco, Sergio and Salvi, Davide},
  title   = {Categorization by context},
  journal = {Journal of Universal Computer Science},
  year    = {1998},
  volume  = {4},
  number  = {9},
  pages   = {719--736},
  url     = {http://www.jucs.org/jucs_4_9/categorisation_by_context}
}


This article describes the categorization-by-context technique and presents experimental results obtained from a preliminary implementation of it.

The technique uses the context it perceives from the structure of HTML documents to extract information that is useful for classifying the documents they refer to.



3

@ARTICLE{Bayer98,
  author  = {Thomas Bayer and Ulrich Kressel and Heike Mogg-Schneider and Ingrid Renz},
  title   = {Categorizing paper documents. A generic system for domain and language independent text categorization},
  journal = {Computer Vision and Image Understanding},
  year    = {1998},
  volume  = {70},
  number  = {3},
  pages   = {299--306},
  url     = {http://www.idealibrary.com/links/doi/10.1006/cviu.1998.0687/pdf}
}

The paper proposes a generic model for text categorization based on a statistical analysis of representative text structures.

It is useful for understanding this type of model.

In the proposed system, the representative features are derived automatically from the training texts by applying statistical information and general linguistic knowledge.

The presented system can easily be adapted to new domains or different languages.



4

@ARTICLE{Bekkerman03,
  author  = {Ron Bekkerman and Ran El-Yaniv and Naftali Tishby and Yoad Winter},
  title   = {Distributional word clusters vs.\ words for text categorization},
  journal = {Journal of Machine Learning Research},
  year    = {2003},
  volume  = {3},
  pages   = {1183--1208},
  url     = {http://www.jmlr.org/papers/volume3/bekkerman03a/bekkerman03a.pdf}
}

Describes a way to classify texts by combining distributional word clustering with a Support Vector Machine (SVM) classifier.

It shows that combining these methods yields high performance in text categorization.

It uses the Information Bottleneck method, which generates a compact and efficient representation of the documents.

The combination proposed in this article is compared with one based solely on SVMs over a simple bag-of-words (BOW) representation.

The comparison was carried out on three well-known datasets; on one of them (20 Newsgroups) the word-cluster method significantly outperforms the word-based representation in categorization accuracy or representation efficiency.

On the other two, the opposite holds.
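For reference, the bag-of-words representation that serves as the baseline in this comparison can be sketched in a few lines of Python. This is a generic illustration, not the paper's implementation; the vocabulary and example document are invented:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Represent a document as a vector of term counts over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["text", "svm", "cluster"]
print(bag_of_words("SVM training on text text corpora", vocab))  # [2, 1, 0]
```

Word-cluster representations replace the per-word dimensions above with one dimension per cluster of distributionally similar words, which is what makes the representation more compact.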


5

@ARTICLE{Bennett05,
  author  = {Paul N. Bennett and Susan T. Dumais and Eric Horvitz},
  title   = {The Combination of Text Classifiers Using Reliability Indicators},
  journal = {Information Retrieval},
  year    = {2005},
  volume  = {8},
  number  = {1},
  pages   = {67--100},
  url     = {http://www.kluweronline.com/issn/1386-4564}
}


Argues that, across the range of text classifiers, different classifiers behave in different ways, which is why several efforts have been made to build a better metaclassifier by combining classifiers.

It proposes a way to combine them.
It is useful for focusing on the construction of metaclassifiers, since it introduces the background of the field, explains procedures for generating them, and gives a thorough review of the comparative studies that evaluate this methodology.

The article presents a probabilistic method for combining classifiers that takes into account the context-sensitive reliability of each classifier's contributions.

It presents indicators of how the classifiers perform in different situations.



6

@INPROCEEDINGS{Cardoso03,
  author    = {Ana Cardoso-Cachopo and Arlindo L. Oliveira},
  title     = {An Empirical Comparison of Text Categorization Methods},
  booktitle = {Proceedings of SPIRE-03, 10th International Symposium on String Processing and Information Retrieval},
  year      = {2003},
  editor    = {Mario A. Nascimento and Edleno S. De Moura and Arlindo L. Oliveira},
  pages     = {183--196},
  address   = {Manaus, BR},
  publisher = {Springer Verlag, Heidelberg, DE},
  url       = {http://www.gia.ist.utl.pt/~acardoso/spire03.pdf}
}



Presents a performance comparison of several text-categorization methods on two different datasets.

The results are reported using the Mean Reciprocal Rank (MRR) as the overall performance measure, since it is an evaluation measure commonly used for these tasks.

The authors focused on evaluating the following methods: the Vector model and Latent Semantic Analysis (LSA), a classifier based on Support Vector Machines (SVM), and variations of the Vector and LSA models with k-Nearest Neighbors.
Among their results, they highlight that, in general, SVMs and k-NN LSA performed better than the other methods in a statistically significant way.
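The Mean Reciprocal Rank used to report these results is simple to compute: for each test document, take the reciprocal of the rank at which the correct category was returned, and average over all documents. A minimal Python sketch (the example ranks are invented):

```python
def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks: for each test document, the 1-based rank at which
    the correct category was returned, or None if it was never returned."""
    reciprocals = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
    return sum(reciprocals) / len(reciprocals)

# Correct category ranked 1st, ranked 2nd, and never returned:
print(mean_reciprocal_rank([1, 2, None]))  # 0.5
```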



7

@INPROCEEDINGS{Cohen96a,
  author    = {William W. Cohen and Yoram Singer},
  title     = {Context-sensitive learning methods for text categorization},
  booktitle = {Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval},
  year      = {1996},
  editor    = {Hans-Peter Frei and Donna Harman and Peter Sch{\"{a}}uble and Ross Wilkinson},
  pages     = {307--315},
  address   = {Z{\"{u}}rich, CH},
  publisher = {ACM Press, New York, US},
  note      = {An extended version appears as~\cite{Cohen99}},
  url       = {http://www.research.whizbang.com/~wcohen/postscript/sigir-96.ps}
}


Compares two text-categorization algorithms: RIPPER and sleeping experts for phrases.

It explains the concepts underlying their differences and concludes that, despite these, both methods perform very well on a variety of categorization problems, and in many cases outperform previously used learning methods.

The article is valuable because it confirms the usefulness of classifiers that represent contextual information.

The algorithms mentioned in this work build classifiers that allow the context of a word w to affect how the presence or absence of w contributes to the classification.

It lays out the concepts on which these methods differ: different notions of what constitutes a context, different ways of combining contexts to build a classifier, different methods for searching combinations of contexts, and different criteria for which contexts should be included in that combination.

8

@INPROCEEDINGS{Larkey1996,
  author    = {Leah S. Larkey and W. Bruce Croft},
  title     = {Combining classifiers in text categorization},
  booktitle = {Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval},
  year      = {1996},
  editor    = {Hans-Peter Frei and Donna Harman and Peter Sch{\"{a}}uble and Ross Wilkinson},
  pages     = {289--297},
  address   = {Z{\"{u}}rich, CH},
  publisher = {ACM Press, New York, US},
  url       = {http://cobar.cs.umass.edu/pubfiles/1combo.ps.gz}
}

After analyzing experimental results, the article concludes that a combination of different classifiers produces better results than any single one used individually.

Three different types of classifiers for text categorization (k-nearest-neighbor, relevance feedback, and Bayesian independence) were evaluated in the medical domain. The classifiers were applied individually and in combination.

For this specific medical categorization problem, the new query formulations and weighting methods used in the k-nearest-neighbor classifier led to better behavior of that classifier.
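The simplest form of classifier combination in this line of work is a majority vote over the individual predictions. The following is a generic Python sketch of that idea, not the article's actual combination scheme (the category labels are invented):

```python
from collections import Counter

def combine_by_vote(predictions):
    """Return the category predicted by the most individual classifiers."""
    (label, _), = Counter(predictions).most_common(1)
    return label

# Three classifiers vote on a medical category; the majority label wins:
print(combine_by_vote(["cardiology", "oncology", "cardiology"]))  # cardiology
```

More refined schemes, like the one in this article, weight or average the classifiers' scores instead of counting votes, but the voting sketch captures why a combination can beat any single member.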


9

@INPROCEEDINGS{Bilenko2004,
  author    = {Mikhail Bilenko and Sugato Basu and Raymond J. Mooney},
  title     = {Integrating constraints and metric learning in semi-supervised clustering},
  booktitle = {ICML '04: Proceedings of the twenty-first international conference on Machine learning},
  year      = {2004},
  pages     = {11},
  address   = {New York, NY, USA},
  publisher = {ACM Press},
  doi       = {http://doi.acm.org/10.1145/1015330.1015360},
  isbn      = {1-58113-828-5},
  location  = {Banff, Alberta, Canada}
}

This paper presents new methods for the two approaches used in semi-supervised clustering, and also introduces a new semi-supervised clustering algorithm.

Their experimental results showed that the unified approach produces better clusters than either individual approach, as well as than previously proposed semi-supervised clustering algorithms.

Semi-supervised clustering uses a small amount of labeled data to aid unsupervised learning.

Previous work in the area has used supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm toward a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm.

10

@ARTICLE{Lam99a,
  author  = {Lam, Wai and Ruiz, Miguel E. and Srinivasan, Padmini},
  title   = {Automatic text categorization and its applications to text retrieval},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year    = {1999},
  volume  = {11},
  number  = {6},
  pages   = {865--879},
  url     = {http://www.cs.uiowa.edu/~mruiz/papers/IEEE-TKDE.ps}
}


The experiments carried out in this work show that automatic categorization improves retrieval performance compared with the case where there is no categorization.

Hence, automatic text categorization is important as an application for improving text retrieval.

The authors develop an approach to automatic text categorization and investigate its application to text retrieval. The approach derives from combining a learning paradigm (known as instance-based learning) with an advanced document-retrieval technique (retrieval feedback).

To demonstrate the effectiveness of the proposed approach, the authors used two document collections from the MEDLINE database.
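Instance-based learning, the paradigm the authors build on, is classically exemplified by k-nearest-neighbor classification over term-vector similarity. The following is a generic k-NN sketch under that interpretation, not the system described in the paper (the training examples are invented):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k=3):
    """training: list of (term_count_dict, label); vote among the k nearest."""
    nearest = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    (label, _), = Counter(lbl for _, lbl in nearest).most_common(1)
    return label

training = [
    ({"goal": 1, "match": 1}, "sports"),
    ({"stocks": 2}, "finance"),
    ({"market": 1, "stocks": 1}, "finance"),
]
print(knn_classify({"stocks": 1, "market": 1}, training))  # finance
```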


11

@INPROCEEDINGS{Lewis94,
  author    = {Lewis, David D. and Marc Ringuette},
  title     = {A comparison of two learning algorithms for text categorization},
  booktitle = {Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval},
  year      = {1994},
  pages     = {81--93},
  address   = {Las Vegas, US},
  url       = {http://www.research.att.com/~lewis/papers/lewis94b.ps}
}




Previous research on automated text categorization has mixed machine learning and knowledge engineering methods, making it difficult to draw conclusions about the performance of particular methods. In this paper we present empirical results on the performance of a Bayesian classifier and a decision tree learning algorithm on two text categorization data sets. We find that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives. The stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization. However, even this algorithm is aided by an initial prefiltering of features, confirming the results found by Almuallim and Dietterich on artificial data sets. We also demonstrate the impact of the time-varying nature of category definitions.



12

@INPROCEEDINGS{Zaiane2002,
  author    = {Osmar R. Zaiane and Maria-Luiza Antonie},
  title     = {Classifying text documents by associating terms with text categories},
  booktitle = {ADC '02: Proceedings of the 13th Australasian database conference},
  year      = {2002},
  pages     = {215--222},
  address   = {Darlinghurst, Australia},
  publisher = {Australian Computer Society, Inc.},
  isbn      = {0-909925-83-6},
  location  = {Melbourne, Victoria, Australia}
}





Today, text categorization is a necessity due to the very large amount of text documents that we have to deal with daily. Many techniques and algorithms for automatic text categorization have been devised and proposed in the literature. However, there is still much room for improving the effectiveness of these classifiers, and new models need to be examined. We propose herein a new approach for automatic text categorization. This paper explores the use of association rule mining in building a text categorization system and proposes a new fast algorithm for building a text classifier. Our approach has the advantage of a very fast training phase, and the rules of the classifier generated are easy to understand and manually tuneable. Our investigation leads to conclude that association rule mining is a good and promising strategy for efficient automatic text categorization.

13

@INPROCEEDINGS{Moulinier96a,
  author    = {Isabelle Moulinier and Jean-Gabriel Ganascia},
  title     = {Applying an existing machine learning algorithm to text categorization},
  booktitle = {Connectionist, statistical, and symbolic approaches to learning for natural language processing},
  year      = {1996},
  editor    = {Stefan Wermter and Ellen Riloff and Gabriele Scheler},
  pages     = {343--354},
  publisher = {Springer Verlag, Heidelberg, DE},
  note      = {Published in the ``Lecture Notes in Computer Science'' series, number 1040},
  url       = {http://www-poleia.lip6.fr/~moulinie/wijcai.ps.gz}
}




The information retrieval community is becoming increasingly interested in machine learning techniques, of which text categorization is an application. This paper describes how we have applied an existing similarity-based learning algorithm, CHARADE, to the text categorization problem and compares the results with those obtained using decision tree construction algorithms. From a machine learning point of view, this study was motivated by the size of the inspected data in such applications. Using the same representation of documents, CHARADE offers better performance than earlier reported experiments with decision trees on the same corpus. In addition, the way in which learning with redundancy influences categorization performance is also studied.

14

@ARTICLE{Mooney2005,
  author    = {Raymond J. Mooney and Razvan Bunescu},
  title     = {Mining knowledge from text using information extraction},
  journal   = {SIGKDD Explor. Newsl.},
  year      = {2005},
  volume    = {7},
  number    = {1},
  pages     = {3--10},
  address   = {New York, NY, USA},
  publisher = {ACM Press},
  issn      = {1931-0145},
  doi       = {http://doi.acm.org/10.1145/1089815.1089817}
}





An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.

15

@INPROCEEDINGS{Larkey98,
  author    = {Leah S. Larkey},
  title     = {Automatic essay grading using text categorization techniques},
  booktitle = {Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval},
  year      = {1998},
  editor    = {W. Bruce Croft and Alistair Moffat and Van Rijsbergen, Cornelis J. and Ross Wilkinson and Justin Zobel},
  pages     = {90--95},
  address   = {Melbourne, AU},
  publisher = {ACM Press, New York, US},
  url       = {http://cobar.cs.umass.edu/pubfiles/ir-121.ps}
}





Several standard text-categorization techniques were applied to the problem of automated essay grading. Bayesian independence classifiers and k-nearest-neighbor classifiers were trained to assign scores to manually-graded essays. These scores were combined with several other summary text measures using linear regression. The classifiers and regression equations were then applied to a new set of essays. The classifiers worked very well. The agreement between the automated grader and the final manual grade was as good as the agreement between human graders.


16

@INPROCEEDINGS{Lam97,
  author    = {Wai Lam and Kon F. Low and Chao Y. Ho},
  title     = {Using a Bayesian Network Induction Approach for Text Categorization},
  booktitle = {Proceedings of IJCAI-97, 15th International Joint Conference on Artificial Intelligence},
  year      = {1997},
  editor    = {Martha E. Pollack},
  pages     = {745--750},
  address   = {Nagoya, JP},
  publisher = {Morgan Kaufmann Publishers, San Francisco, US}
}





We investigate Bayesian methods for automatic document categorization and develop a new approach to this problem. Our new approach is based on a Bayesian network induction which does not rely on some major assumptions found in a previous method using the Bayesian independence classifier approach. The design of the new approach as well as its justification are presented. Experiments were conducted using a large scale document collection from Reuters news articles. The results show that our approach outperformed the Bayesian independence classifier as measured by a metric that combines precision and recall measures.
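The Bayesian independence classifier used as the baseline in this paper corresponds to what is now usually called Naive Bayes. The following is a minimal multinomial Naive Bayes with add-one smoothing, offered as a generic sketch of that baseline rather than the paper's implementation (the toy documents are invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (token_list, label). Returns count tables."""
    word_counts = defaultdict(Counter)   # per-class term frequencies
    class_counts = Counter()             # class priors as document counts
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify_nb(tokens, word_counts, class_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(token | class),
    with add-one smoothing, under the naive independence assumption."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb([(["stocks", "market"], "finance"), (["match", "goal"], "sports")])
print(classify_nb(["market", "stocks"], *model))  # finance
```

The Bayesian network approach the paper proposes relaxes exactly the independence assumption this baseline makes between tokens.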

17

@INPROCEEDINGS{Amati97a,
  author    = {Gianni Amati and Fabio Crestani and Flavio Ubaldini and Stefano De Nardis},
  title     = {Probabilistic Learning for Information Filtering},
  booktitle = {Proceedings of RIAO-97, 1st International Conference ``Recherche d'Information Assistee par Ordinateur''},
  year      = {1997},
  editor    = {Luc Devroye and Claude Chrisment},
  pages     = {513--530},
  address   = {Montreal, CA},
  note      = {An extended version appears as~\cite{Amati99}},
  url       = {http://www.cs.strath.ac.uk/~fabioc/papers/97-riao.pdf}
}





In this paper we describe and evaluate a learning model for information filtering which is an adaptation of the generalised probabilistic model of Information Retrieval. The model is based on the concept of ``uncertainty sampling'', a technique that allows for relevance feedback both on relevant and non relevant documents. The proposed learning model is the core of a prototype information filtering system called ProFile.


18

@INPROCEEDINGS{Liu:2004:TCL,
  author    = {Bing Liu and Xiaoli Li and Wee Sun Lee and Yu, Philip S.},
  title     = {Text Classification by Labeling Words},
  booktitle = {Proceedings of the Eighteenth National Conference on Artificial Intelligence},
  year      = {2005},
  month     = {July},
  address   = {San Jose, CA},
  publisher = {AAAI Press}
}





Traditionally, text classifiers are built from labeled training examples. Labeling is usually done manually by human experts (or the users), which is a labor intensive and time consuming process. In the past few years, researchers investigated various forms of semi-supervised learning to reduce the burden of manual labeling. In this paper, we propose a different approach. Instead of labeling a set of documents, the proposed method labels a set of representative words for each class. It then uses these words to extract a set of documents for each class from a set of unlabeled documents to form the initial training set. The EM algorithm is then applied to build the classifier. The key issue of the approach is how to obtain a set of representative words for each class. One way is to ask the user to provide them, which is difficult because the user usually can only give a few words (which are insufficient for accurate learning). We propose a method to solve the problem. It combines clustering and feature selection. The technique can effectively rank the words in the unlabeled set according to their importance. The user then selects/labels some words from the ranked list for each class. This process requires less effort than providing words with no help or manual labeling of documents. Our results show that the new method is highly effective and promising.

19

@INPROCEEDINGS{Ramakrishnan:2005:MHA,
  author    = {Ganesh Ramakrishnan and Chitrapura, Krishna Prasad and Raghu Krishnapuram and Pushpak Bhattacharyya},
  title     = {A Model for Handling Approximate, Noisy or Incomplete Labeling in Text Classification},
  booktitle = {Proceedings of the Twenty-Second International Conference on Machine Learning},
  year      = {2005},
  month     = {August},
  address   = {Bonn, Germany},
  url       = {http://www.machinelearning.org/proceedings/icml2005/papers/086_HandlingApproximate_RamakrishanEtAl.pdf}
}





We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of text documents, the model estimates the joint distribution of training documents and class labels by using a generalization of the Expectation Maximization algorithm. The estimates can be used in standard classification models to reduce error rates. Since uncertainties in the labeling are taken into account, the model provides an elegant mechanism to deal with noisy labels. We provide an intuitive modification to the EM iterations by re-estimating the empirical distribution in order to reinforce feature values in unlabeled data and to reduce the influence of noisily labeled examples. Considerable improvement in the classification accuracies of two popular classification algorithms on standard labeled data-sets with and without artificially introduced noise, as well as in the presence and absence of unlabeled data, indicates that this may be a promising method to reduce the burden of manual labeling.

20

@INPROCEEDINGS{Chai02,
  author    = {Kian M. Chai and Hwee T. Ng and Hai L. Chieu},
  title     = {Bayesian online classifiers for text classification and filtering},
  booktitle = {Proceedings of SIGIR-02, 25th ACM International Conference on Research and Development in Information Retrieval},
  year      = {2002},
  editor    = {Micheline Beaulieu and Ricardo Baeza-Yates and Sung Hyon Myaeng and Kalervo J{\"{a}}rvelin},
  pages     = {97--104},
  address   = {Tampere, FI},
  publisher = {ACM Press, New York, US},
  url       = {http://doi.acm.org/10.1145/564376.564395}
}





This paper explores the use of Bayesian online classifiers to classify text documents. Empirical results indicate that these classifiers are comparable with the best text classification systems. Furthermore, the online approach offers the advantage of continuous learning in the batch-adaptive text filtering task.


