联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> C/C++编程C/C++编程

日期:2018-07-25 07:33


Programming Assignment #6 CS671

1st submission due 7/27/2017 make-up submission due 8/2/2017

This assignment focuses on some more advanced features of Scala, including

? stackable traits and abstract override methods;

? iterators and the difficulties involved in working with them;

? streams and the danger of memory leaks;

? implicit conversions.

Framework description

Document sources

The package you are to develop implements a notion of a document source. A document source is similar to an iterator but produces its contents either as characters or as words, whichever is most convenient to the underlying application. As with iterators, words and characters are produced “on the fly” and are not stored. Correct implementations of document sources should be able to handle large documents without loading them entirely in memory.

A document source offers two methods:

? chars: produces the contents of the document as an iterator of characters;

? words: produces the contents of the document as an iterator of words.

Application code is expected to use one or the other of these methods, but not both. Since they draw their contents from the same source, using one after the other has already produced elements would be unreliable. A typical use of a document source has the form:

for (char <- docSource.chars) {...}

or

for (word <- docSource.words) {...}

In addition to these two methods, a document source can be enriched by adding the so-called “monadic” methods: foreach, filter, map and flatMap.1 These work in terms of characters and can be used to process or transform the document. Like standard iterators, document sources usually become invalid once one of their methods has been called. Unless stated otherwise, only one of the six methods chars, words, foreach, filter, map or flatMap can be called on a document source, and it can only be called once.

The monadic methods are parts of a “rich” document source type. Regular document sources can be converted into rich sources via implicit conversions. This allows an application to write:

for (c <- docSource) ...

instead of the equivalent:

for (c <- docSource.chars) ...

1 For the sake of simplicity, they have slightly-reduced signatures compared to their counterparts in trait FilterMonadic.

2

Document transformations

Document sources can be transformed at the level of characters or at the level of words. These transformations are handled through stackable traits. Traits that extend the Transformer trait can be mixed in document sources to introduce processing on both words and characters (filtering, transformation, etc.).

One difficulty that arises is that documents can be transformed in terms of words and/or characters and then consumed as words or as characters. For instance, a document can have its characters converted to uppercase and then fed as words to a document reader that will arrange words in lines. Or a document can have its accented letters converted, some of its words censored, and then be sent through a socket as packets of characters. Without knowing how exactly a document is to be transformed, an implementation runs the risk of introducing cycles: derive the characters from the words (to send character packets), which are derived from the characters (to convert accented letters), which are derived from the words (to remove censored words), etc.

This assignment relies on a base trait CharSource, in which a document’s characters are defined in terms of its words, which are defined in terms of its source (of characters). The characters of the document always derive from the words and the words always derive from the source. Character-level transformations are applied to the source—and are therefore reflected in both the words and the characters of the document—while word-level transformations are applied directly to the words and are reflected in the characters (but not in the source).

Example

Consider a document that starts as follows:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce vel.

and a class IteratorDocumentSource that produces a document source from it. Method words can be used to obtain the words from the text:

val doc = new IteratorDocumentSource(lorem)

doc.words // "Lorem ", "ipsum ", "dolor ", "sit ", "amet, ", "consectetur ", "adipiscing ", "elit. ", "Fusce ", "vel."

Trait PunctuationRemover operates at the character level and removes punctuation. If it is mixed in class IteratorDocumentSource, method words produces the words without punctuation:

val doc = new IteratorDocumentSource(lorem) with PunctuationRemover

doc.words // "Lorem ", "ipsum ", "dolor ", "sit ", "amet ", "consectetur ", "adipiscing ", "elit ", "Fusce ", "vel"

Note how the effects of a character-level filter is visible as the level of words.

Trait Capitalization works at the level of words and capitalizes them. If mixed in the class, method words produces instead:

val doc = new IteratorDocumentSource(lorem) with Capitalization

doc.words // "Lorem ", "Ipsum ", "Dolor ", "Sit ", "Amet, ", "Consectetur ", "Adipiscing ", "Elit. ", "Fusce ", "Vel."

Both traits can be mixed in together:2

val doc = new IteratorDocumentSource(lorem) with PunctuationRemover with Capitalization

2 In this particular case, mixing in the traits in reverse order would produce the same result, but in general, the order in which traits

are mixed in is significant.

3

doc.words // "Lorem ", "Ipsum ", "Dolor ", "Sit ", "Amet ", "Consectetur ", "Adipiscing ", "Elit ", "Fusce ", "Vel"

Finally, the resulting document can be processed at the character level using one of the monadic functions:

val doc = new IteratorDocumentSource(lorem) with PunctuationRemover with Capitalization map {

case 'u' => 'v'

case c => c

}

doc.words // "Lorem ", "Ipsvm ", "Dolor ", "Sit ", "Amet ", "Consectetvr ", "Adipiscing ", "Elit ", "Fvsce ", "Vel"

Note how the map method produces a document in which the words reflect all the transformations applied to the characters and words of the original document, as well as the character-level transformation due to the application of map.

Trait CharSource

This trait represents the base type that can be modified my mixing in transformers. Its source of characters is abstract. From it are derived words and from words are derived characters.

There are three types of words in a document:

? regular words: These words start with a non-whitespace character, continues with more non-whitespace characters and end with zero or more whitespace characters (but no newlines). For example, "ipsum " and "vel." are regular words.

? newlines: A sequence of one or more consecutive newlines is considered to form a word.

? whitespace: A sequence of one or more whitespace characters can also form a word, but only at the beginning of a line.

As an illustration, the following document:

Lorem?ipsum?dolor?sit?amet,?consectetur?adipiscing?elit.?Vivamus?porta?scelerisque?dolor.?Vestibulum?facilisis,?lorem?vel?porttitor?porttitor,?tortor?nisi?egestas?lorem,?vel?tempus?libero?nisi?ac?neque.

??Quisque?id?nibh?vulputate,?suscipit?elit?ullamcorper,?ullamcorper?lacus.???

contains the words:

"Lorem?", "ipsum?", "dolor?", "sit?", "amet,?", "consectetur?", "adipiscing?", "elit.?", "Vivamus?", "porta?", "scelerisque?", "dolor.", "\n", "Vestibulum?", "facilisis,?", "lorem?", "vel?", "porttitor?", "porttitor,?", "tortor?", "nisi?", "egestas?", "lorem,?", "vel?", "tempus", "\n", "libero?", "nisi?", "ac?", "neque.", "\n\n", "??", "Quisque?", "id?", "nibh?", "vulputate,?", "suscipit?", "elit?", "ullamcorper,?", "ullamcorper?", "lacus.???".

Method words produces the document’s words from the source; method chars produces the document’s characters from the words. Note that nothing is removed or modified at this stage: The concatenation of all the characters from any iterator—source, words or chars—produces the same string.

Note also that both methods should work lazily and without memory leaks. The contents of the document should never be loaded entirely in memory.

15 pts.

4

Transformers

Transformers are implemented as extensions to the trait Transformer:

trait Transformer { self: CharSource =>

def source: Iterator[Char]

def words: Iterator[String]

}

Transformers have a self-type of CharSource, which restricts the type of classes they can be mixed in. Transformers can override the behavior of source to implement character-level transformations, or the behavior of words to implement word-level transformations, or both. They have to make sure they rely on the behavior of the overridden method (super.source or super.words) to achieve stackability. Some transformers are trivial to implement (e.g., Capitalization) while others can be more difficult (e.g., LineBreaking). The transformers to implement are described below.

Trait Capitalization

This is a word-level transformer that capitalizes all the regular words (capitalization has no effect on whitespace). For instance, "ipsum." becomes "Ipsum." and "Lorem " remains "Lorem ".

Trait PunctuationRemover

This is a character-level transformer that removes punctuation marks from the text. Punctuation marks are any one of these characters:

!"#$%&’()*+,-./:;<=>?@[\]^_‘{|}~

Trait JustWords

This transformation removes the non-regular words (whitespace and newlines) and trims the regular words. For instance, JustWords applied to the document above produces the words:

"Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit.", "Vivamus", "porta", "scelerisque", "dolor.", "Vestibulum", "facilisis,", "lorem", "vel", "porttitor", "porttitor,", "tortor", "nisi", "egestas", "lorem,", "vel", "tempus", "libero", "nisi", "ac", "neque.", "Quisque", "id", "nibh", "vulputate,", "suscipit", "elit", "ullamcorper,", "ullamcorper", "lacus.".

Trait Censoring

This trait transforms the words of a document by censoring some of them. It relies on a predicate isCensored, which is left abstract. Censoring works by replacing the inside of a word by a series of X marks if the string passes the isCensored test. Note that words, in the context of this library, can contain non-letter characters like punctuation and whitespace. The inside of a word is defined to be the longest substring of the word that begins and ends with an alphabetic character. Alphabetic characters are those that satisfy Java’s Character.isAlphabetic method.3

For example, the inside of "Lorem " is "Lorem" and the inside of "ipso-facto!" is "ipso-facto".

If any word that contains at least one "o" is censored, the example text would become:

XXXXX ipsum XXXXX sit amet, XXXXXXXXXXX adipiscing elit. Vivamus XXXXX scelerisque XXXXX. Vestibulum facilisis, XXXXX vel XXXXXXXXX XXXXXXXXX, XXXXXX nisi egestas XXXXX, vel tempus XXXXXX nisi ac neque.

Quisque id nibh vulputate, suscipit elit XXXXXXXXXXX, XXXXXXXXXXX lacus.

3 See also \p{Alpha} if you prefer to use regular expressions.

8 pts.

8 pts.

8 pts.

8 pts.

5

Trait LineBreaking

This trait is used to reformat a document according to a desired line width. Words are trimmed to remove leading and trailing whitespace. Single newlines are ignored, but words that consist of multiple newlines are kept unchanged, as they represent paragraph breaks. Words are then produced by the iterator according to the following rules:

? A word is separated from the next word on a line by a single space character. The space is part of the first word (i.e., words never start with a space).

? Lines have no leading or trailing spaces.

? Lines are as long as they can be without exceeding the maximum line width.4 In order to achieve this property, additional, single-newline words are emitted by the iterator. Lines are not justified.

? No other newlines are added. In particular, no newlines are added around a multiple-newline word, since it is already starting a new paragraph (hence a new line).

As an illustration, the text above, with a maximum line width of 30, becomes:

Lorem ipsum dolor sit amet,

consectetur adipiscing elit.

Vivamus porta scelerisque

dolor. Vestibulum facilisis,

lorem vel porttitor porttitor,

tortor nisi egestas lorem, vel

tempus libero nisi ac neque.

Quisque id nibh vulputate,

suscipit elit ullamcorper,

ullamcorper lacus.

Note that all the lines are trimmed and, in particular, the trailing spaces after "lacus." have been removed.

Class IteratorDocumentSource

This class provides us with a default implementation of the type RichDocumentSource. It can be used to mix in traits in various combinations and observe the results. Note that this class implements the full (“rich”) document source interface, including the monadic operators.

Implicit function source2RichSource

This implicit function makes a RichDocumentSource out of a regular DocumentSource, thus adding to it the monadic functions. It does this by creating an instance of IteratorDocumentSource from the characters of the given document source.

One more page to go… ?

4 The only exception to this rule is the case of a word that is longer than the maximum line width: It is emitted alone as a line (that is too long).

6 pts.

16 pts.

4 pts.

6

Notes

? There will be several integration tests (whole-system) worth 27 points. The entire assignment is worth 100 points.

? More information of the functions to implement can be found at http://cs.unh.edu/~cs671/Programming/

? The “library” developed in this assignment doesn’t necessarily make much sense as a way to handle text documents. It is defined as it is for illustration purposes.

? As an example, the Capitalization transformer can be implemented as follows:

trait Capitalization extends Transformer { self: CharSource =>

abstract override def words = super.words map (_.capitalize)

}

? The most difficult transformer to implement is LineBreaking. All the other transformers are pretty straightforward.

? All transformers have to be implemented lazily, i.e., from iterator to iterator, without expanding entire documents. Given the limitations of scala.collection.Iterator, the easiest way is to use streams. Be careful, however, with memory leaks. They are avoided by making sure that the iterators you implement do not keep a reference to the head of their underlying stream.5

? It might be a good idea to turn off implicit conversions between document sources while debugging. The easiest way is to remove function source2RichSource. But remember to put it back before submitting or the grading code will not compile.

5 Something, apparently, that method Stream.iterator fails to do.

27 pts.


版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp