I read the poster MSSF: A Multi-Document Summarization Framework based on Submodularity (SIGIR 2011). It talked solving the problem using a budgeted maximum coverage problem based on submodularity. Some points to read about
- Let E be a finite set. The function f:Pow(E)->R is submodular if, for any S \subset T \subset E, and s \in E \ T we have
f(T U {s}) - f(T) <= f(S U {s}) - f(S).
Submodular functions are closed under non-negative linear combinations.
- Budgeted Maximum Coverage Problem (NP-Hard): Given a set E of sentences, where each subset S has an influence i(S) and a cost c(S), and a budget b, find a subset S of E which has the larget possible influence while the total cost does not exceed b.
- For example, b is the maximum number of sentences in the summary, c(S) =
number of sentences in S, and i(S) is a submodular function measuring
the influence of the summary S, then there exist algorithms to solve
this problem approximately. Different functions i(S) were proposed in
the poster, for different settings (simple summarization, query-focused
~, update ~, comparitive ~). For the simple case
i(S) = \sum{s_i \in E\S} \sum{s_j \in S} sim(s_i, s_j) - \sum{s_i,s_j\in S, s_i != s_j} sim(s_i, s_j) = f1(S) - f2(S)
- f1(S) measures to what degree S represents E, and f2(S) measures redundancies within S. f1(S) and -f2(S) are both submodular, so i(S) is also submodular.
Things to look up
- Budgeted Maximum Coverage Problem. Khuller, Moss, Naor. 1999, Minoux 1978.
- Maximum Marginal Relevance, Update summarization, Comparitive summarization, Query-focused summarization