## Tuesday, July 12, 2011

### Xtreams version of the optimized prime factors factorial

I now want to use Xtreams on the optimized version of primeFactorFactorial:

optimizedPrimeFactorFactorialViaXtreams
| primes uniqPowers highestPower prevPower primeGroup primesGroupedByPower bitIterator terms numGroupsEstimate |

Forming a generator of primes:

primes := [:out | self orLessAllPrimesDo: [:prime | out put: prime ] ] reading.

Removing first prime, because we will handle power of 2 by a final bitShift:

primes get.

Forming two collections, the group of primes having the same power, and their unique powers.

numGroupsEstimate := 1 << (self highBit // 2).
uniqPowers := (Array new: numGroupsEstimate) writing.
primesGroupedByPower := (Array new: numGroupsEstimate) writing.
prevPower := nil.
primeGroup := Array new writing.
primes do: [:p |
| primePower |
primePower := self - (self sumDigitsInBase: p) // (p - 1).
prevPower
ifNil:
[uniqPowers put: primePower.
prevPower := highestPower := primePower]
ifNotNil: [prevPower ~= (prevPower := primePower)
ifTrue:
uniqPowers put: primePower.
primeGroup := Array new writing]].
primeGroup put: p].

Note that this part is not lazy, because I later want to reuse these two collections several times. This part is not the most elegant I came up with. It is much like using a (ending:) slicing sequence, but ending: is not splitting the groups where I want to...
Another option would be to count the number of equal powers, and use that in a (limiting:) slicing sequence, but limiting: is not taking a block argument...

Forming an iterator on the powers bits:

bitIterator := [:out |
| bitProbe |
bitProbe := 1.
[ bitProbe <= highestPower ]
whileTrue:
[out put: bitProbe.
bitProbe := bitProbe bitShift: 1 ] ] reading.

For each bit, grouping primes having the bit set in their associated power, and raise the product of this group to the power of corresponding bit:

terms := bitIterator collect: [:bit |
primesGroupedByPower += 0.
uniqPowers += 0.
(primesGroupedByPower selecting: [:pp |
pp += 0.
(uniqPowers get bitAnd: bit) ~= 0])
stitching rest karatsubaProduct raisedToInteger: bit].

Note that stitching is concatenating the substreams into a single stream, while += 0 is resetting the streams at the beginning.

Performing the product of products and shifting to account the power of 2:

^terms karatsubaProduct bitShift: self - self bitCount

There is still an operation I didn't Xtreamize, the karatsubaProduct. It consists in a recursive split. Not sure how to do it...

Now let us see if timing is decent:
Smalltalk garbageCollect.
[50000 optimizedPrimeFactorFactorialViaXtreams] timeToRun
-> 1321

Approximately the same as the non Xtreams version (+4%). That's pretty good.
But this version has a tricky bug: stitching is raising Incomplete if there is no substream, and this happens for some numbers. For example,
9 optimizedPrimeFactorFactorialViaXtreams
produces the groups ((3) (5 7)) with the powers (4 1). When the bitIterator generator is evaluating with bit = 2, there is no group having a power with this bit set, Incomplete is raised by stitching, and caught by the generator which closes the stream without the final term (3 raisedTo: 4). Thus it returns a wrong result... Ouch! That's one reason for I dislike Exceptions.

My single test on a huge number did not fail, and I only discovered the bug with the generator free version:

optimizedPrimeFactorFactorialViaXtreams2
| uniqPowers highestPower prevPower primeGroup primesGroupedByPower terms numGroupsEstimate |
numGroupsEstimate := 1 << (self highBit >> 1).
uniqPowers := (Array new: numGroupsEstimate) writing.
primesGroupedByPower := (Array new: numGroupsEstimate) writing.
prevPower := nil.
highestPower := 0.
primeGroup := Array new writing.
Integer primesUpTo: self + 1 do: [:p | p = 2 ifFalse: [
| primePower |
primePower := self - (self sumDigitsInBase: p) // (p - 1).
prevPower
ifNil:
[uniqPowers put: primePower.
prevPower := highestPower := primePower]
ifNotNil: [prevPower ~= (prevPower := primePower)
ifTrue:
uniqPowers put: primePower.
primeGroup := Array new writing]].
primeGroup put: p]].
terms := (1 to: highestPower highBit) collect: [:bit |
primesGroupedByPower += 0.
uniqPowers += 0.
[| product |
product := (primesGroupedByPower selecting: [:pp |
pp += 0.
(uniqPowers get bitAt: bit) ~= 0])
stitching rest karatsubaProduct.
1 to: bit - 1 do: [:i | product := product squared].
product] on: Incomplete do: ].
^terms karatsubaProduct bitShift: self - self bitCount

Overall, I find Xtreams powerfull, far more than Smalltalk Stream. But there are still things lacking. For example, the grouping was not obvious. The stitching of empty streams was surprising. I might need a readingStitching that works with an empty stream.

Xtreams is interesting both for its functional like aspects and for its potential use in parallelism, a lot like this. From this later point of view, I find it easy to pipe serial operations with Xtreams in case of a single assembly line. It also seems possible to fork processing for each element when they are independent (this is data parallelism - and could be applied for performing karatsubaProduct of each term in above example, and thus get a very efficient parallel version of factorial).
On the other hand, having two parallel assembly lines does not seem that easy. There is a duplicating: operation that should allow such parallel streams, but it necessarily uses a writeStream. I'd like the assembly chain to be as easy as:

fork := [:in :out1 :out2 | | item | item := in get. out1 put: item. out2 put: item] forking.
line1 := [:in :out | out put: in get op1] filtering.
line2 := [:in :out | out put: in get op2] filtering.
join := [:in1 :in2 :out | out put: (in1 get op3: in2 get)] joining.
factory := generator | fork | (line1 & line2) | join.

Funily, I used unix pipe | to denote serial and ampersand & to denote parallel, but I find the contrary more natural: || makes me think of parallelism, and & of sequencing. Just a point of view...

From above point of view, the differences between read and write streams seems a bit arbitrary. Filters are inherently both. The Generator pattern could be used to turn any WriteStream (push) into a ReadStream (pull), but it's current implementation has potential dead lock conditions. But wait, it's possible I missed some features, or miss-used some others, Xtreams is evolving rapidly. There is a new post about using Xtreams with parallel VM, I have to inquire a bit more!

## Saturday, July 9, 2011

### Testing new Xtreams generators

So, I upgraded Squeak port of Xtreams, and just tested it on the primeFactorFactorial example:

primeFactorFactorialViaXtreams
| primes powers factorial |
primes := [:out | Integer primesUpTo: self + 1 do: [:prime | out put: prime ] ] reading.
powers := primes collecting: [:p | p raisedToInteger: self - (self sumDigitsInBase: p) // (p - 1)].
factorial := powers injecting: 1 into: [:product :power | product * power].
^(factorial positioning buffer: (XTRingBuffer new: 1 class: Array))
-= 1;
get

This looks good, except the final sentence which is not simple to read to my taste, remember it just extract the last element...

Now let's see performance:

Smalltalk garbageCollect.
Time millisecondsToRun: [10000 primeFactorFactorial].
-> 297

Smalltalk garbageCollect.
Time millisecondsToRun: [10000 primeFactorFactorialViaGeneratorAndXtream].
-> 390

Smalltalk garbageCollect.
Time millisecondsToRun: [10000 primeFactorFactorialViaXtreams].
-> 945

Huh?
MessageTally spyOn: [10000 primeFactorFactorialViaXtreams].

- 820 tallies, 822 msec.

**Tree**
--------------------------------
Process: other processes
--------------------------------
47.8% {393ms} WeakArray class>>finalizationProcess
39.4% {324ms} WeakFinalizationList class>>initTestPair
|39.4% {324ms} Object class(Behavior)>>new
8.4% {69ms} primitives
17.0% {140ms} EventSensor>>eventTickler
17.0% {140ms} Delay>>wait
--------------------------------
Process: (40s) 86874: nil
--------------------------------

**Leaves**
39.4% {324ms} Object class(Behavior)>>new
17.0% {140ms} Delay>>wait
8.4% {69ms} WeakArray class>>finalizationProcess

**Memory**
old            -1,959,252 bytes
young        +1,059,496 bytes
used        -899,756 bytes
free        -1,059,496 bytes

**GCs**
full            3 totalling 455ms (55.0% uptime), avg 152.0ms
incr        19 totalling 65ms (8.0% uptime), avg 3.0ms
tenures        0
root table    0 overflows

MessageTally is not reliable with COG VM, but clearly, only one third of execution time is spent in the primeFactorFactorialViaXtreams. It seems like something is wrong with this image, and implementing the generator via interprocess synchronisation did favour those CPU hijackers!

## Wednesday, July 6, 2011

### Optimizing the product of prime powers

Remember how we generated the factorial from the product of its prime factors?
Let's try and optimize the product.
First trick is to use a Karatsuba multiplication, because multiplication is the major contributor in execution time.
The Squeak variant takes care to evaluate Karatsuba only on well balanced operands, and replace the recursion by an iteration if not well balanced:

LargePositiveInteger>>karatsubaTimes: anInteger
"eventually use Karatsuba algorithm to perform the multiplication"

| half xHigh xLow yHigh yLow low high mid xLen yLen |
(anInteger isLargeEnoughForKaratsuba
and: [self isLargeEnoughForKaratsuba]) ifFalse: [^self timesInteger: anInteger].

"Check if length ratio is more than 2, and engage a loop
to operate on integers with well balanced lengths.
but we hope to gain in lower level recursion"
(xLen := self digitLength) >= (yLen := anInteger digitLength)
ifTrue: [(half := xLen bitShift: -1) >= yLen
ifTrue: [^(0 to: xLen by: yLen) detectSum: [:yShift |
((self copyDigitsFrom: yShift + 1 to: yShift + yLen)
karatsubaTimes: anInteger)
bitShift: 8 * yShift]]]
ifFalse: [(half := yLen bitShift: -1) >= xLen
ifTrue: [^(0 to: yLen by: xLen) detectSum: [:xShift |
(self karatsubaTimes:
(anInteger copyDigitsFrom: xShift + 1 to: xShift + xLen))
bitShift: 8 * xShift]]].

"At this point, lengths are well balanced, divide each integer in two halves"
xHigh := self bitShift: -8 * half.
xLow := self lowestNDigits: half.
yHigh := anInteger bitShift: -8 * half.
yLow := anInteger lowestNDigits: half.

"Karatsuba trick: perform with 3 multiplications instead of 4"
low := xLow karatsubaTimes: yLow.
high := xHigh karatsubaTimes: yHigh.
mid := (xHigh + xLow karatsubaTimes: yHigh + yLow) - (low + high).

"Sum the parts of decomposition"
^low + (mid bitShift: 8*half) + (high bitShift: 16*half)

With a bit of tuning for the COG VM:

LargePositiveInteger>>isLargeEnoughForKaratsuba
^self digitLength >= 160

SmallInteger is not large enough for Karatsuba and falls back to regular multiplication *, we omit the code here.
The copyDigitsFrom:to: and lowestNDigits: are just using a primitive for splitting a LargePositiveInteger, they are omitted too.
We'd also better optimize squared which is a simple case:

LargePositiveInteger>>squared
"Eventually use a divide and conquer algorithm to perform the multiplication"

| half xHigh xLow low high mid |
self isLargeEnoughForKaratsuba ifFalse: [^self * self].

"Divide digits in two halves"
half := self digitLength bitShift: -1.
xHigh := self bitShift: -8 * half.
xLow := self lowestNDigits: half.

"Use Karatsuba"
low := xLow squared.
high := xHigh squared.
mid := xLow karatsubaTimes: xHigh.

"Sum the parts of decomposition"
^low + (mid bitShift: 8*half+1) + (high bitShift: 16*half)

Then use squared in raisedToInteger:

Number>>raisedToInteger: anInteger
| bitProbe result |

anInteger negative ifTrue: [^(self raisedToInteger: anInteger negated) reciprocal].
bitProbe := 1 bitShift: anInteger highBit - 1.
result := self class one.
[
(anInteger bitAnd: bitProbe) = 0 ifFalse: [result := result * self].
bitProbe := bitProbe bitShift: -1.
bitProbe > 0 ]
whileTrue: [result := result squared].

^result

We have a reasonable implementation for Karatsuba, but we cannot use a naïve:

Integer>>karatsubaPrimeFactorFactorial

"Recompose the factorial from the prime factors, knowing the power of each prime."
^((Integer primesUpTo: self + 1) collect: [:p |
p raisedToInteger: self - (self sumDigitsInBase: p) // (p - 1)]) karatsubaProduct

Collection>>karatsubaProduct
^self inject: 1 into: [:product :element | product karatsubaTimes: element]
Karatsuba is just a drag in this case because we always multiply a large integer with a small one. The only advantage comes from using squared in raisedToInteger:, and instead of 19136ms, we get:

Smalltalk garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x karatsubaPrimeFactorFactorial]].
->  18408 18503

So the second trick is essential: evaluate the product of terms in a divide and conquer fashion, dividing the numbers to multiply in two groups and recursively:

SequenceableCollection>>karatsubaProduct
"Compute the product of self elements, using Karatsuba multiplication"
self isEmpty ifTrue: [^1].
^self karatsubaProductFrom: 1 to: self size by: 1

SequenceableCollection>>karatsubaProductFrom: startIndex to: stopIndex by: inc
| nextInc nextIndex |
(nextIndex := startIndex + inc) > stopIndex ifTrue: [^self at: startIndex].
nextInc := inc * 2.
^(self karatsubaProductFrom: startIndex to: stopIndex by: nextInc) karatsubaTimes:
(self karatsubaProductFrom: nextIndex to: stopIndex by: nextInc)

Smalltalk garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x karatsubaPrimeFactorFactorial]].
-> 12627 12539

Better, but there is room toward the prime swing variant which performs the job in less than 7000ms. Let's adapt our strategy and use a third trick: try and square as many LargeInteger as possible. One idea is to group the terms having same powers, multiply them together, then raise to the prescribed power. But if term p is raised to the power of 7, we will have to evaluate (p squared squared) * p squared * p. So we will store p into three collections,
• the collection of terms raisedTo: 1;
• the collection of terms raisedTo: 2;
• the collection of terms raisedTo: 4.
The last point, is that we won't bother with powers of 2, because the fastest variants also don't. Here we come with this algorithm:

Integer>>optimizedPrimeFactorFactorial
"This is the optimized version of primeFactorFactorial"

| powers primes |
self <= 1 ifTrue: [^1].
primes := (Integer primesUpTo: self + 1) allButFirst.
powers := primes collect: [:e | self - (self sumDigitsInBase: e) // (e - 1)].
^(primes karatsubaProductPowers: powers) bitShift: self - self bitCount

SequenceableCollection>>karatsubaProductPowers: powers
"Compute the product of self elements, each raised to the corresponding powers"
| lastPower rank terms |
(lastPower := powers size) = 0 ifTrue: [^1].
rank := 1.
terms := OrderedCollection new: (powers at: 1) highBit.
[rank <= (powers at: 1)]
whileTrue:
[ [ rank > (powers at: lastPower) ] whileTrue: [lastPower := lastPower - 1].
terms add: ((((1 to: lastPower) select: [:e | ((powers at: e) bitAnd: rank) ~= 0]) collect: [:i | self at: i]) karatsubaProduct raisedToInteger: rank).
rank := rank bitShift: 1].
^terms karatsubaProduct

Smalltalk garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x optimizedPrimeFactorFactorial]].
->  8436 8433

The optimized prime factor algorithm is about 2.25 times faster than the naive prime factors on small integers up to 3000.
Let's instrument the algorithm to check if the terms are well balanced. We add the line
(terms collect: [:e | e highBit]) inspect.
and check 50000 optimizedPrimeFactorFactorial
-> an OrderedCollection(49775 49532 49490 49130 49165 48285 49407 48651 41929 40562 36299 26336 38838 55004 25969)

It's remarkable, that we decomposed the factorial of 50000 into terms having about 50000 bits!

The score on a larger integer :

[50000 factorial] timeToRun. -> 10288
[50000 primeFactorFactorial] timeToRun. -> 8853
[50000 optimizedPrimeFactorFactorial] timeToRun. -> 1263
[50000 primeSwingFactorial] timeToRun. -> 1004

Now a challenge: write optimizedPrimeFactorFactorial with Xtreams instead of SequenceableCollection!

## Tuesday, July 5, 2011

### Xtreams and Generators

In my last post I reported the lack of Generator in Xtreams framework.
Well, I just received this clarification  from Michael and Martin, the authors of Xtreams.
Definitely, Xtreams has a lot of potential!

## Monday, July 4, 2011

### Using generators on the primeFactorsFactorial example

Last example algorithm, primeFactorsFactorial did involve enumerating of several collections:
• The collection of primes up to n
• The collection of powers of these primes
An alternative is to work with a stream that generates each item on the fly, rather than collecting the whole sequence of values, and then pipe to other kind of streams that transform the values. This technique has two advantages:
1. avoiding allocation of large auxiliary collections that will later be thrown out, can help some algorithms to scale up.
2. piping could theoretically enable parallel execution if we had a multi-threaded VM (though in fact we keep some collections, the optimal performance being often reached by tuning some buffers size, since the cost of inter thread synchronisation is not null)
This does not really apply to our prime factors decomposition since the collection of primes are not huge, but we have this example available, let's use it and see how it would be possible in Squeak.
First, Squeak provides an primesUpTo:do: taking a block as argument. That's all we need.
We can transform this block into a stream using a Generator. Let see in class comment how to use it:

| generator |
generator := Generator on: [:g| Integer primesUpTo: 100 do:[:prime| g yield: prime]].
[generator atEnd] whileFalse:[Transcript show: generator next].

OK, so a new Generator is associated to an outer block, [:g| Integer primesUpTo: 100 do:[:prime| g yield: prime]], which tells which method to execute to produce the sequence of values.
The inner block [:prime| g yield: prime] just tells the generator that a new prime is available.
Each time we ask for generator next, the generator machinery arranges to advance the execution of outer block until it reaches execution of the inner block, and then switch back and return the prime provided to this block.
This is known as the coroutine technique, and implemented by using a clever a simple hack manipulating the call stack. Look at the implementation by yourself.
Note that a Generator now understand value: and can be used directly in place of the inner block, try (Generator on: [:g| Integer primesUpTo: 100 do: g]) contents.
Can we apply this to our primeFactorFactorial?

primeFactorFactorialViaGenerator
| generator |
generator := Generator on: [:g | Integer primesUpTo: self + 1 do: g].
^ (generator collect: [:p | p raisedToInteger: self - (self sumDigitsInBase: p) // (p - 1)]) product

Hmm, it will not work, because a Generator understands only one iteration operator, do:, but not collect: Squeak streams are a pity.
More over, the collect: operation if implemented by means of do: would not be lazy but rather produce the whole collection of powers. We would have only eliminated the first collection of primes, but not the collection of powers of primes.
One good and recent framework - Xtream - It is an interesting alternative and provides the lazy operations we're looking for; follow the squeaksource link to get download instructions. Let's see how we can lazily pipe the operations through streams:

primeFactorFactorialViaGeneratorAndXtream
| generator stream factorial |
generator := Generator on: [:g | Integer primesUpTo: self + 1 do: g].
on: [generator atEnd
ifTrue: [Incomplete raise]
ifFalse: [generator next]])
collecting: [:p | p raisedToInteger: self - (self sumDigitsInBase: p) // (p - 1)])
injecting: 1 into: [:product :power | product * power].
^ [[factorial := stream get] repeat]
on: Incomplete
do: [factorial]

Unfortunately Xtream has no Generator, so we have to wrap it up in a BlockClosureReadStream.
Another problem is that injecting:into: is not reducing the stream to a single element, instead it answers a stream of all the intermediate results. Since we only need the last one, we have to repeat the operation by ourself, capture and answer the last element when the stream starves (we catch this Incomplete exception).
Once I proposed an alternative selector ^[stream get] repeatUntil: Incomplete, with a slight modification, this would have been useful; or maybe we could just extend XTReadStream to respond to last.

Let see how the generator variant performs:

Smalltalk garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x primeFactorFactorialViaGeneratorAndXtream] ]
-> 22835 22529

This is to be compared to the 19136ms of naive form. Well, that was expected, we eliminated the intermediate collections, but they did not cost that much in this case. Instead we pay a price for indirection and maybe also the Generator trick forces COG VM to un-optimize some portions. But that doesn't matter, we saw some smart streaming techniques.

A more difficult exercise would be to transform a divide and conquer recursive product evaluation variant into a streaming equivalence... Yet I do not know how, but if I find it, this will be another post.

## Sunday, July 3, 2011

### Evaluating a factorial from its prime factors

Again, I will evaluate a suboptimal algorithm because it is nice and fun in Smalltalk.
I guess in other languages too, especially functional languages, but you know my favourite path.
The exploration started with this little factorial contest submitted by Paul Baumann.

We already know the fastest algorithm are the prime swing variant (read this wiki to get an explanation of swing by Peter Luschny), and simple and clever split recursive algorithm submitted by John Brant.
In both cases, only the odd products are performed, in a divide and conquer fashion which produces well balanced large integers (this is important for accelerated products like Karatsuba Toom or FFT). The final power of two being replaced by a bitShift: which saves many costly LargeInteger multiplications.

Our suboptimal variant of interest consists in evaluating the factorial from its decomposition in prime factors. We know that each prime <= n will be compose n factorial, but we also know the power of each of these primes thanks to Legendre. This algorithm can be naively transcribed in Smalltalk:

Integer>>primeFactorFactorial
"Recompose the factorial from the prime factors, knowing the power of each prime."
^((Integer primesUpTo: self + 1) collect: [:p |
p raisedToInteger: self - (self sumDigitsInBase: p) // (p - 1)]) product

primesUpTo: is a method found in Squeak which use a sieve of Eratosthenes, up to means not inclusive, so we need the disgracious + 1. Then Squeak lacks two methods, sumDigitsInBase: and product.

A naive product, could be simply:

Collection>>product
^self inject: 1 into: [:product :element | product * element]

With this definition, Collection new product is answering 1, and we don't need any protection for evaluating 0 primeFactorFactorial.
For sure, it won't be efficient and should better be replaced by a divide and conquer variant, but we'll see this in another post.

A naive sumDigitsInBase: could be:

Integer>>sumDigitsInBase: base
^(self printStringBase: base) inject: 0 into: [:sum :digit | sum + digit digitValue]

OK, we're done...

1000 primeFactorFactorial. -> Error: subscript is out of bounds: 303

Oops! Unfortunately, it's not only suboptimal, it won't work for primes above 36 because Squeak has no idea which character to use to represent digit 37. Anyway, even with unicode we would only have a limited set of characters. Huge, but limited.

It does not matter anyway, we can just reimplement printStringBase: and instead of streaming out digits converted to characters, just sum the digits.

A naive algorithm would be to just remove the last digit (the remainder by a division by base):

Integer>>sumDigitsInBase: base
| sum |
sum := 0.
q := self abs.
[q = 0]
whileFalse:
[sum := sum + (q - ((q := q // base) * base))].
^sum

Since we probably won't compute the factorial of a LargeInteger, this shall be enough.
For LargeInteger, we would use a divide and conquer method in order to avoid repeated LargeInteger divisions, and replace them by as many SmallInteger divisions as possible. This would be the exact image of printStringBase:, so we won't develop it here.

Now we are done with this very naive implementation. Let's see how it performs on COG VM:
Smalltalk garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x factorial]].
-> 29753 29780

Smalltalk garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x primeFactorFactorial]].
-> 19204 19136

Incredible, with this very naive code we already gain 33%. In fact, any implementation will probably beat factorial performance, but not its simplicity.

## Saturday, July 2, 2011

### Revisiting the sieve of Eratosthenes

Welcome, this is my first blogger blogging about Smalltalk, and its Squeak flavour.
Today, I will revisit the sieve of Eratosthenes for fun.
I don't think I need to introduce this famous algorithm.

The first idea is to use integers as arrays of bits, because it is fun.
If p is a composite integer, then the bit of rank p will be set to 1.

The second idea is to use an integer division to produce the repeated bit pattern using this property:

(2r11111111111 // 2r11) printStringBase: 2. -> '1010101010'
(2r11111111111111 // 2r111) printStringBase: 2. -> '100100100100'
(2r1111111111111111111 // 2r1111) printStringBase: 2. -> '1000100010001000'

OK, I think you get the idea of division. Now we need two things:
1. get the right number of 1 left of //
2. remove the trailing 1 in the quotient
We can do both with bitShift: or using the binary selector <<
If p is prime, and we want all primes up to n, then the adequate bit pattern is produced by:

1 << (n // p - 1 * p) - 1 // (1 << p - 1) << (p << 1 - 1).

Mind the parenthesis, above expression is a bit tricky...
Remember, Smalltalk binary operators read from left to right, all with same precedence.
First operation produces the 1s left of //:
1 << (n // p - 1 * p) - 1
Indeed, we need a length which is multiple of p in order to get a trailing 1 in the quotient
And we need enough bits to obtain a quotient with a leading 1 of rank <= n - (2*p-1)

Second operation is the division by appropriate number of 1s (p):
// (1 << p - 1)

Third operation shift by 2*p - 1 to obtain the trailing 0s:
<< (p << 1 - 1)
Indeed, we need p - 1 to align the 1s on rank multiples of p, + p to clear the trailing 1.

Of course, I can propose a few variants, for example a single 1 is enough left of //:
1 << (n // p - 1 * p) // (1 << p - 1) << (p << 1 - 1).
We can just complete the missing trailing 0s:
1 << (n - p)  // (1 << p - 1) << (p << 1 - 1 - (n \\ p)).
Or arrange to have first p - 1 trailing 0s produced by the division:
1 << (n // p * p - 1) // (1 << p - 1) << p.

The last one looks good, we are ready to produce the composite mask up to n:

p := 1.
composite := 1.
[ p <= n ]
whileTrue:
[(composite bitAt: p) = 0
ifTrue:
["p is prime, mark all multiple of p up to n as composite"
composite := composite bitOr: 1 << (n // p * p - 1) // (1 << p - 1) << p ].
p := p + 1].

Of course we could stop the loop at n sqrtFloor, increment by 2 from second loop etc...
We could also enumerate primes by feeding a block with p inside the ifTrue: [ ].
But let's have fun by enumerating primes from the composite mask with bit tricks.
For example, we want to collect primes in an Array.

To know the number of primes up to n, we just need to count the number of bits set to 0.
Or subtract n with number of bit set to 1.

numberOfPrimes := n - composite bitCount.

Then, we just enumerate by finding and removing next bit set to zero, starting at least significant bit.

(Array new: numberOfPrimes) collect: [:void |
| isolatedBit |
"find next zero in composite"
isolatedBit := composite bitXor: composite + 1.
"remove next zero in composite"
composite := composite bitOr: isolatedBit.
isolatedBit highBit]

Indeed, 2rxxxxx01111 + 1 will generate a carry up to trailing 0, and result in 2rxxxxx10000, the leading xxxxx being unchanged.
The bitXor: then will remove the leading xxxxx, and result in 2r11111.
We then just need to get the rank of the highest bit, which is highBit's job...

An alternative is to first reverse the bits of composite mask.
Here we will just print the primes in the Transcript:

composite := 1 << n - 1 - composite.
[composite > 0] whileTrue:
[Transcript cr; show: composite lowBit.
composite := composite bitAnd: composite - 1]

Same here, 2rxxxxx10000 - 1 will generate a carry up to trailing 1 and result in 2rxxxxx01111, the leading xxxxx being unchanged. Then bitAnd: will just clear all but the leading xxxxx.

Of course, this method is not really efficient because each bitOr: and bitXor: operation will allocate a new LargePositiveInteger, not counting all the bitShift:, additions and subtractions, but do we care? Squeak Integer class>>primesUpTo: is already fast, we don't need another one...

Nastily, I removed the comments, obfuscated the names, and factored out repeated code with additional temps. It's not a good idea because some operations leaked out the primality condition block, but just for the pleasure to quizz colleagues:

quizz: aBlock
"Could you tell which serie will feed aBlock when this message is sent to an integer n?
What the return value means?"
| s t x y z |
s := (x := y := 1) << self - 2.
[(x := x + 1) <= self]
whileTrue:
[y := (z := y) << 1 + 1.
(s bitAnd: (t := z + 1)) = 0
ifFalse:
[aBlock value: x.
s := s bitAnd: z << (self // x * x + x) // y + t]].
^s

Did you notice the variant?