Index
Introduction
Text classification process
Word-based representation
Graph-based representation
Semantic relationship
Application of text classification algorithm
Observation
Conclusion

Text classification is the task of assigning unlabeled natural language documents to a predefined set of categories. The classification task may depend on various factors such as the structure of the data, the size of the processed data, and so on. Many real-world problems, however, must consider a huge amount of data to classify from many sources. Large-scale text classification assigns text to thousands of classes; in some cases each document may belong to only one class, while in others it may belong to more than one. Hierarchical relationships can offer additional information to a classification system that can improve scalability and accuracy. This work investigates various methods used for text classification in NLP, including both machine learning and deep learning techniques, and also describes the evaluation measures commonly used for classification systems.

Introduction

Text classification addresses the problem of assigning documents to a predefined set of classes. Consider the case of binary classification, in which there is only one class and each document either belongs to it or not. Spam filtering is an example, where emails are classified as fraudulent or not. In machine learning a classifier can be trained on positive and negative instances to perform this classification automatically, but it is rarely 100% correct, even in the simplest cases. In large-scale text classification the volume of documents to be processed can be very large (hundreds of thousands or even millions), leading to a large vocabulary (the distinct, unique words in the documents, also known as types). One aspect of multilabel classification is that classes are linked to each other; this can be a parent-child relationship that forms a hierarchy. A class taxonomy offers additional information to a classification system, which can be leveraged to improve its scalability or its accuracy.

Text Classification Process

The goal of text classification is to automatically classify text documents into one or more defined categories. Classes are selected from a previously established taxonomy (a hierarchy of categories or classes). The task of representing a given document in a form suitable for the data mining system is called document representation. Since data can be structured or unstructured, the form of representation is very important for the classification process, i.e. representation as instances with a fixed number of attributes. Plain text documents are converted into a fixed number of attributes in a training set. This process can be done in several ways.

Word-based representation

The process of assigning a part of speech to a given word in a document is called part-of-speech tagging, commonly referred to as POS tagging. Parts of speech can be nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their subcategories. A part-of-speech tagger, or POS tagger, tags words automatically. Taggers use different types of information for tagging words, such as dictionaries, lexicons and rules. Dictionaries contain the category or categories of a particular word. That is, a word can belong to more than one category; for example, "run" is both a noun and a verb. Taggers use probabilistic information to resolve this ambiguity.
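As an illustration of POS tagging, the short Python sketch below tags an ambiguous word with NLTK's off-the-shelf tagger; the choice of tool, the example sentence and the resource names are illustrative assumptions rather than part of the original description.

```python
# A minimal POS-tagging sketch using NLTK (one possible tagger; the section
# does not prescribe a specific tool).  Requires: pip install nltk
# Note: resource names can differ slightly between NLTK versions.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

sentence = "They run a small shop and enjoy a long run in the park."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# The ambiguous word "run" should receive different tags depending on
# context: a verb tag in "They run a shop", a noun tag in "a long run".
print(tagged)
```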
Graph-based representation

Bag of words is the typical and standard way to model text documents and is reasonable for capturing word frequency, but BoW neglects structural and semantic information. In a graph representation, mathematical constructs are used to express the underlying relationships and data faithfully. Here a text can be conveniently represented as a graph in which feature terms are represented by vertices and the edges represent the connections between the feature terms. This model supports computations for tasks such as term weighting and classification, which are useful in many information retrieval applications. Graph-based representation is therefore a suitable method for representing text documents and addresses the shortcomings of the bag-of-words model in various text applications.

The document is modeled as a graph in which terms are represented by the vertices and relationships between terms by the edges:

G = {Vertex, EdgeRelation}

There are generally five different types of vertices in the graph representation:

Vertex = {F, S, P, D, C}, where F is a feature term, S a sentence, P a paragraph, D a document and C a concept.

EdgeRelation = {Syntax, Statistical, Semantic}

The edge relationship between two feature terms can be of different kinds in the context of the graph:
- Occurrences of words together in a sentence, paragraph, section or document.
- Words common to a sentence, paragraph, section or document.
- Semantic relationship: words with similar meanings, words spelled the same way but with different meanings, and opposite words.

The meaning of terms is not effectively captured by the bag-of-words approach. The relationships between terms can be preserved by maintaining a structural representation of the data, which requires higher-order system performance.

B. Building the vector space model

The vector space model, or VSM, is a representation of a set of documents as vectors in a common vector space and is fundamental to a number of IR operations ranging from scoring documents against a query to document classification and clustering. VSM is an algebraic model for representing text documents as vectors of identifiers, such as index terms. Feature subset selection for the text document classification task uses an evaluation function applied to a single word. Scoring of individual words can be done using measures such as Document Frequency (DF), Term Frequency (TF), etc. The feature extraction approach does not weight the terms in order to discard features with lower weight, but compacts the vocabulary based on the co-occurrences of the features.

TF-IDF: Term Frequency - Inverse Document Frequency uses all tokens in the dataset as the vocabulary. TF is the frequency of a token in each document; IDF is based on the number of documents in which the token is found, downweighting tokens that appear in many documents. The intuition of this measure is that an important word in a document will occur frequently and should be assigned a high score, but if the word occurs in too many documents it is probably not distinctive and is therefore given a lower score. The mathematical formula for this measure is:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

where t denotes the term, d denotes each document, and D denotes the collection of documents.

Advantages:
- Ease of calculation.
- It provides some basic metrics for extracting the most descriptive terms in a document.
- The similarity between two documents can easily be calculated with it.

Disadvantages:
- TF-IDF is based on the bag-of-words (BoW) model and therefore does not capture the position of words in the text, semantics, or co-occurrences in different documents.
- TF-IDF is only useful as a lexical-level feature.
- It cannot capture semantics (unlike, e.g., topic models or word embeddings).
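The following short Python sketch illustrates the TF-IDF weighting described above using scikit-learn's TfidfVectorizer; the toy corpus and the library choice are illustrative assumptions.

```python
# A short TF-IDF sketch with scikit-learn's TfidfVectorizer; the corpus and
# parameters are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()          # builds the vocabulary from all tokens
X = vectorizer.fit_transform(corpus)    # documents x vocabulary sparse matrix

# Words that appear in most documents (e.g. "the") get low weights, while
# words frequent in one document but rare overall get high weights.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```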
Principal component analysis: PCA is a classic multivariate data analysis tool and a useful technique for dimensionality reduction. Suppose there are N data samples, each expressed by n observed variables x1, x2, ..., xn; from these we obtain a data matrix of samples. PCA uses the variance of each feature to maximize its separability. It is an unsupervised algorithm. The steps of PCA are:
1. Standardize the data.
2. Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix.
3. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues, where k is the number of dimensions of the new feature subspace (k ≤ n).
4. Construct the projection matrix W from the k selected eigenvectors.
5. Transform the original dataset via W to obtain the new k-dimensional feature subspace.

Application of Text Classification Algorithm

A classification algorithm is a set of heuristics and calculations that creates a model from data. The algorithm first analyzes the data provided, and specific types of patterns or trends are identified. The algorithm then uses the results of this analysis over many iterations until the optimal parameters for creating the mining model have been found. These parameters are then applied to the entire dataset to extract actionable patterns and detailed statistics.

Machine Learning (ML) is an area of Artificial Intelligence (AI) consisting of a set of statistical techniques for solving problems. To apply ML techniques to NLP problems, unstructured text is converted into a structured format. Deep Learning (which includes Recurrent Neural Networks, Convolutional Neural Networks, and others) is one approach to machine learning; it is an extension of neural networks and can also be used for NLP tasks.

Fig. 2. Relationship between ML, Deep Learning and NLP

A. Machine Learning Techniques for Text Classification

Machine learning is a set of algorithms that analyze data, learn from it, and then apply what they have learned to make intelligent decisions. Two modeling techniques are briefly discussed below.

1) Naive Bayes classification: The Naive Bayes classifier is a supervised classifier that provides an approach for expressing positive, negative and neutral sentiment in text. It assigns words to their labels using the idea of conditional probability. The advantage of using Naive Bayes for text classification is that it needs only a small dataset for training. Raw data from the web undergoes preprocessing, with removal of numbers, HTML tags and special symbols, producing a sequence of words. Words are labeled as positive, negative or neutral, which is done manually by human experts. This preprocessing produces the labeled word sets used to prepare the training set. Consider a set of test words (a set of words without labels) and a window of n words (x1, x2, ..., xn) from a document. The conditional probability that a given data point y is the category of the n-word window, given the training set, is:

P(y | x1, x2, ..., xn) = P(y) * P(x1 | y) * P(x2 | y) * ... * P(xn | y) / P(x1, x2, ..., xn)
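A minimal sketch of Naive Bayes text classification is shown below, assuming scikit-learn's MultinomialNB over bag-of-words counts; the tiny labeled corpus is invented purely for illustration.

```python
# A minimal Naive Bayes text-classification sketch (scikit-learn); the
# labeled examples below are illustrative, not from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "great product, works perfectly",
    "absolutely love it",
    "terrible quality, broke in a day",
    "waste of money, very disappointed",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words counts feed the conditional probabilities P(word | class).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["love the quality"]))         # expected: ['positive']
print(model.predict_proba(["broke after a day"]))  # per-class probabilities
```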
2) J48 algorithm used for sentiment prediction: J48 is a decision-tree-based classifier used to produce rules for identifying the target terms. The feature space is split into distinct regions, followed by the assignment of instances to class labels through a hierarchical mechanism. Larger training collections are handled more efficiently by this strategy than by many other classifiers. On the test set, a node's level is raised when a neighboring feature satisfies the condition of the internal feature name in the same part of the tree. Two different branches of the decision tree are created step by step from the word-labeling task. The J48 algorithm uses an entropy function to test the ordering of terms from the test set. Additional features of J48 handle missing values, pruning of decision trees, continuous attribute value ranges, inference of rules, and so on, where (Term) can be a unigram, bigram or trigram.

B. Deep Learning Techniques for Text Classification

Deep learning is a machine learning technique that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts and more abstract representations computed in terms of less abstract ones. Two deep learning techniques are discussed below.

1) Convolutional Neural Network: CNNs have been widely used in image processing and have demonstrated relatively accurate results. In NLP, the inputs are texts or sentences represented as a matrix; when a CNN processes it, each row of the matrix corresponds to a token, typically a word, although it could also be a character. That is, each row is a vector that represents a word. Typically these vectors are word embeddings (low-dimensional representations), but they could also be one-hot vectors that index the word in a vocabulary. For a 10-word sentence with a 100-dimensional embedding, the input would be a 10 x 100 matrix. For example, consider sentence classification using the CNN shown in Figure 2.3. It has three filter region sizes, 2, 3 and 4, each with 2 filters. A univariate feature vector is then created from each of the six feature maps, and these 6 features are concatenated to form a feature vector for the second-to-last layer. The final softmax layer receives this feature vector as input and uses it to classify the sentence; binary classification is assumed, and consequently two possible output states are shown.
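The sketch below outlines, in Keras, a CNN of the kind just described, with filter region sizes 2, 3 and 4, two filters each, and a two-class softmax; the vocabulary size, embedding dimension and sentence length are assumptions chosen for illustration.

```python
# A compact Keras sketch of the CNN described above: filter region sizes
# 2, 3 and 4 with 2 filters each, max-pooled, concatenated and fed to a
# softmax over two classes.  Sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, sent_len, num_classes = 5000, 100, 10, 2

inputs = layers.Input(shape=(sent_len,), dtype="int32")   # word indices
x = layers.Embedding(vocab_size, embed_dim)(inputs)       # 10 x 100 sentence matrix

pooled = []
for region_size in (2, 3, 4):
    conv = layers.Conv1D(filters=2, kernel_size=region_size,
                         activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(conv))      # one value per filter

concat = layers.Concatenate()(pooled)                      # 6 features in total
outputs = layers.Dense(num_classes, activation="softmax")(concat)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```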
2) Recurrent Neural Network: The idea behind RNNs is to make use of sequential data. In a conventional neural network we assume that all inputs (and outputs) are independent of one another. For many tasks, however, this is a poor assumption: to predict the next word in a sentence, it helps to know which words preceded it. RNNs are called recurrent because they perform the same task for each element of a sequence, with the output depending on the previous computations. Another way to look at RNNs is that they have a "memory" that captures information about what has been computed so far. In theory RNNs can use information from arbitrarily long sequences, but in practice they are limited to looking back only a couple of steps. There are two uses of RNN models: ...
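As a companion sketch, the following Keras snippet shows one simple way a recurrent network (here an LSTM) can be applied to text classification; the layer sizes and the use of an LSTM cell are illustrative assumptions, not details from the text.

```python
# A minimal recurrent text-classification sketch (Keras LSTM).  The model
# reads the word sequence step by step, carrying a hidden "memory" state;
# the sizes below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, sent_len, num_classes = 5000, 100, 10, 2

model = tf.keras.Sequential([
    layers.Input(shape=(sent_len,), dtype="int32"),
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(64),   # hidden state summarises the sequence seen so far
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```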