This paper introduce an interesting topic, ordinal regression, which can be regarded as a categorized learning to rank problem. Learning to rank, though less familar to us than classification or regression, is receiving more and more attention as the popularity of web search. The proposed approach, RankSVM, deals with the problem of preference learning, i.e., whether object A is more preferable than object B.
To tackle this special machine learning problem, the paper proposed to use "feature difference" as input, and "relative perference label" as output, to convert preference learning to a pair-wise preference classification problem. Based on this idea, a support vector machine based algorithm is also described and theoretically explored. Sythetic examples and real-world problem evaluation demonstrate the effectiveness of the proposing rankSVM algorithm in such problems than traditional multi-classification or regression methods.
2008年6月10日星期二
Lecture 15 The pagerank citation ranking: Bringing order to the web
This paper describes the famous algorithm: PageRank, for ranking websites. Contrary to conventional text-based methods, pagerank utilizes the link structure of the Internet as an auxiliary information. The basic idea is the web page which an authoritative website has a link to it, or a large quantity of websites have a link to, should also be important. This is similar to the scenario that an academic paper is more important if it is cited by many papers, or cited by some important papers. This can be formulated as a directed graph, each node means a webpage, and the edge means a link. Each webpage carries a rank score, which is accumulated from the inedges (backlinks) and shared by outedges (forward links). Experimental result shows pagerank indeed imporves the quality of search result. Some said that the success of google is largely due to its utilization of pagerank.
2008年5月27日星期二
Lecture 14 Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
In the early years of data mining, most works adopt Apriori-like approaches to mine the association rules. These approaches need to generate candidate sets of different lengths, which is prohibitedly space and time consuming. For example, to generate candidate sets of length 2 from N transactions, the amount of candidates is C^N_2. In addition, to generate or prune such candidates, we need to scan the whole database numerous times. Many operations here are almost redundant.
To avoid generating candidate patterns, the famous FP-tree (frequent pattern tree) is proposed in this paper. FP-tree compressed the whole data set into a compact tree-like structure, and therefore avoids costly, repeated database scans.
The construction of FP tree is as follows:
Step 1. Scan the transaction database DB once (the first time), and derives a list of frequent items. Sort frequent items in frequency descending order. This ordering is important since each path of a tree will follow this order.
Step 2. Create a root of a tree, label with “null,”and scan the database the second time to generate a set of item prefix subtrees and the frequent item header table. Each node in the prefix subtree consists of item-name, count, and node-link, while each entry in the frequent-item header table consists of item-name and head of node-link.
Once the FP-tree is constructed, association rule mining can be performed with a rather compact data structure. We perform the identification of frequent patterns from the leaf part of the tree first.Examine the paths from the leaves to the root, and identify the frequent patterns from the paths.
As aknowledged in the paper, FP-tree does not automatically guarantee that it will be highly efficient since one may still encounter the combinatorial problem of candidate generation if one simply uses this FP-tree to generate and check all the candidate patterns. However, FP-tree revolutes the conventional school of thinking in data mining and has spawned many subsequent studies.
To avoid generating candidate patterns, the famous FP-tree (frequent pattern tree) is proposed in this paper. FP-tree compressed the whole data set into a compact tree-like structure, and therefore avoids costly, repeated database scans.
The construction of FP tree is as follows:
Step 1. Scan the transaction database DB once (the first time), and derives a list of frequent items. Sort frequent items in frequency descending order. This ordering is important since each path of a tree will follow this order.
Step 2. Create a root of a tree, label with “null,”and scan the database the second time to generate a set of item prefix subtrees and the frequent item header table. Each node in the prefix subtree consists of item-name, count, and node-link, while each entry in the frequent-item header table consists of item-name and head of node-link.
Once the FP-tree is constructed, association rule mining can be performed with a rather compact data structure. We perform the identification of frequent patterns from the leaf part of the tree first.Examine the paths from the leaves to the root, and identify the frequent patterns from the paths.
As aknowledged in the paper, FP-tree does not automatically guarantee that it will be highly efficient since one may still encounter the combinatorial problem of candidate generation if one simply uses this FP-tree to generate and check all the candidate patterns. However, FP-tree revolutes the conventional school of thinking in data mining and has spawned many subsequent studies.
2008年5月19日星期一
Lecture-13B On Spectral Clustering: Analysis and an algorithm
This paper presents a spectral clustering algorithm which I have actually implemented. Basically speaking, spectral clustering is different from conventional clustering in that spectral clustering clusters in the "eigen-space," and because of this fact, data points can be clustered together even they are not neibors in the conventional space. One important application is to apply spectral clustering to music sturcture analysis [1].
The aglorithmic description of the approach is as follows:
1. Contrstuct an affinity matrix A from S (the scaling parameter needs to be controlled).
2. Compute the Laplacian matrix L=D-1/2AD-1/2 (symmetric ans positive-definite).
3. Compute the egienvectors of L, and normalize them to Y.
4. Apply clustering algorithms such as kmeans to Y.
5. Apply the partition in Y to the original data points S.
Theoretical derivations and experimental results are also provided in the paper.
[1] Lie Lu, and Alan Hanjalic."Audio Keywords Discovery for Text-Like Audio Content Analysis and Retrieval", to appear IEEE Trans. on Multimedia, 2007-2008.
The aglorithmic description of the approach is as follows:
1. Contrstuct an affinity matrix A from S (the scaling parameter needs to be controlled).
2. Compute the Laplacian matrix L=D-1/2AD-1/2 (symmetric ans positive-definite).
3. Compute the egienvectors of L, and normalize them to Y.
4. Apply clustering algorithms such as kmeans to Y.
5. Apply the partition in Y to the original data points S.
Theoretical derivations and experimental results are also provided in the paper.
[1] Lie Lu, and Alan Hanjalic."Audio Keywords Discovery for Text-Like Audio Content Analysis and Retrieval", to appear IEEE Trans. on Multimedia, 2007-2008.
Lecture-13A Normalized Cuts and Image Segmentation
This paper presents a normalized-cut-based image segmentation method. It first describes the drawbacks of previous methods: 1. Image partitioning is to be done from the big picture downward, 2. greedy or gradient descent type approaches fail to find global optima for these high-dim, non-linear problems. 3. minimum spanning tree or limited neighborhood set approaches fail to capture global information.
The authors then describe normalized cut in the graph theoretic language. Normalized cut is different from "cut" in that it avoids unnatural bias for partitioning out small sets of points (by adding association to the denominator). Through the derivation of Section 2.1, the graph partitioning problem can be translated into a generalized eignevalue problem, and we can achieve the optimal partition by taking the second smallest eigenvalue.
The eigenvalues of the generalized eigenvalue problem also shows the other interesting property, i.e., one can subdivide the existing graphs, each time using the eigenvector with the next smallest eigenvalue. In practice, however, we cannot do this because of the "discrete-valued" constraint of the eigenvalues. The authors actually proposed a brute-forth way to transform continuous eigenvalue to the discrete one (in Section 3.1.3), and propose a simultanous K-way cut with multiple eigenvectors (in Section 3.3). The simultanous K-way cut seems interesting, as it oversegments first, and then merge by kmeans; however, its relationship with the typical normalized cut mehtod seems unclear and not intuitive.
Being such an excellent work, the authors further add values to this paper by discussing efficiency issue (in Section 3.1.2) and comparing the normalized cut to other graph theoretic approaches such as average cut. Moreover, amazingly the authors are able to interpret the approach in a physical sense. A good-written paper with such solid mathematical and physical background is really awesome.
The authors then describe normalized cut in the graph theoretic language. Normalized cut is different from "cut" in that it avoids unnatural bias for partitioning out small sets of points (by adding association to the denominator). Through the derivation of Section 2.1, the graph partitioning problem can be translated into a generalized eignevalue problem, and we can achieve the optimal partition by taking the second smallest eigenvalue.
The eigenvalues of the generalized eigenvalue problem also shows the other interesting property, i.e., one can subdivide the existing graphs, each time using the eigenvector with the next smallest eigenvalue. In practice, however, we cannot do this because of the "discrete-valued" constraint of the eigenvalues. The authors actually proposed a brute-forth way to transform continuous eigenvalue to the discrete one (in Section 3.1.3), and propose a simultanous K-way cut with multiple eigenvectors (in Section 3.3). The simultanous K-way cut seems interesting, as it oversegments first, and then merge by kmeans; however, its relationship with the typical normalized cut mehtod seems unclear and not intuitive.
Being such an excellent work, the authors further add values to this paper by discussing efficiency issue (in Section 3.1.2) and comparing the normalized cut to other graph theoretic approaches such as average cut. Moreover, amazingly the authors are able to interpret the approach in a physical sense. A good-written paper with such solid mathematical and physical background is really awesome.
2008年5月12日星期一
Lectur-12 A comparison of Algorithms for Inference and Lerning in Probabilitstic Graphical Models
Graphical is one of the most important probabilistic models which owns the following desirable properties: 1. reflect prior knowledge, 2. derive efficient algorithms for infrerence and learning, 3. easily translate the model to a different form, and 4. communicate the model to other researchers and users.
The notion of modularity is a central aspect of graphical models. A related concpet is the Markov blanket. A graphical model identfies the modules in the system and can bu used to derive algorithms that acheive expeonential speedups. Moreover, graphical models provide a way to link simpler models together in a principled fashion that respects the rules of probability theory. Famous graphical modesls include BN(bayesain network, undierected), MRF(Markov random field, directed), and FG(factor graph, subsumes BN and MRF).
It is often helpful to parameterize a graphical model. Moreover, if we can find the expoenential family paramterization, we can easily translate the problem at hand to some well-establish graph models, and employ the known solvers to solve it.
Well known solvers include ICM(iterated conditional modes), EM(expectation-maximization), and many variation forms of EM. ICM is simle, but often too greedy, and often lead to local optima. EM provides a more principled way to estimate hidden variable, and thus becomes the major solver in the propalistic field. Many of its variants are also detailed in this paper. For example, by Gibbs sampling and Monte Carlo methods, we can circumvent local minima. by variational techniques and the mean field method, the uncertainty in the neighboring random variables can be better acounted; by structured variational techniques, the dependencies between hidden random variables can be better modeled. A good conceptual comparison is tabulated in Table 1, and some empirical compasisons are also provided.
The notion of modularity is a central aspect of graphical models. A related concpet is the Markov blanket. A graphical model identfies the modules in the system and can bu used to derive algorithms that acheive expeonential speedups. Moreover, graphical models provide a way to link simpler models together in a principled fashion that respects the rules of probability theory. Famous graphical modesls include BN(bayesain network, undierected), MRF(Markov random field, directed), and FG(factor graph, subsumes BN and MRF).
It is often helpful to parameterize a graphical model. Moreover, if we can find the expoenential family paramterization, we can easily translate the problem at hand to some well-establish graph models, and employ the known solvers to solve it.
Well known solvers include ICM(iterated conditional modes), EM(expectation-maximization), and many variation forms of EM. ICM is simle, but often too greedy, and often lead to local optima. EM provides a more principled way to estimate hidden variable, and thus becomes the major solver in the propalistic field. Many of its variants are also detailed in this paper. For example, by Gibbs sampling and Monte Carlo methods, we can circumvent local minima. by variational techniques and the mean field method, the uncertainty in the neighboring random variables can be better acounted; by structured variational techniques, the dependencies between hidden random variables can be better modeled. A good conceptual comparison is tabulated in Table 1, and some empirical compasisons are also provided.
2008年5月5日星期一
Lecture-11B Learning Low-Level Vision
This paper seems to be the most cited paper for the famous Markov Random Field (MRF). The biggest assumption of MRF is the markovian property, i.e., the value of current state is only influenced by its neighbor. Under this assumption, we can construct a graph as the Fig. 2 in this paper, and applied MRF solvers, such as graph cut, beliefe propagation, etc, to solve it.
Imazingly, in fact many real problems can be formulated as an MRF problem, and once we can make such formulation, MRF solvers can often get you good result. What critical is to define the smoothness term and data term for the MRF model. Smoothness term defines the relationship between x, and data terms define the relationship between x and y. However, defining these two terms can be very difficult, and it worth taking patience to go through the examples in this paper to get some basic principles, but uh..., let's do it next time when I want to use MRF >_<
Imazingly, in fact many real problems can be formulated as an MRF problem, and once we can make such formulation, MRF solvers can often get you good result. What critical is to define the smoothness term and data term for the MRF model. Smoothness term defines the relationship between x, and data terms define the relationship between x and y. However, defining these two terms can be very difficult, and it worth taking patience to go through the examples in this paper to get some basic principles, but uh..., let's do it next time when I want to use MRF >_<
訂閱:
文章 (Atom)
