
  • tipoo - Friday, February 12, 2016 - link

    Pretty interesting stuff. As they open up a bit more and provide more data, I'm cautiously letting myself believe there's a possibility of this taking off, but I'm not budging my optimism meter past 1 until this ships and is positively reviewed.

    I wonder if it's more likely they'll be bought out first. If Intel sees a credible competitor, that's certainly possible.
  • willis936 - Friday, February 12, 2016 - link

    The prospect of actual single-threaded performance increases is exactly what the future of computing needs. I'm not as concerned with the existence of the technology as I am with its adoption. Competing with Intel is more than just making a good processor. This company will have to convince other companies to integrate a lot of the controllers and interfaces that Intel does for them.
  • easp - Friday, February 12, 2016 - link

    If they can actually deliver real single-threaded performance increases, the world will beat a path to their door. On-chip peripherals and off-chip interfaces are cookie-cutter in comparison.
  • Xenonite - Wednesday, February 17, 2016 - link

    "If they can actually deliver real single-threaded performance increases, the world will beat a path to their door."

    Sadly, this is not how the semiconductor industry works. AMD could, for instance, DOUBLE their single-threaded integer performance by simply tweaking their Zen design to use 4x~5x the currently planned TDP of 95W, using a larger die to spread the increased current load over more transistors, doubling their L1 and L2 cache sizes, and adding a low-latency last-level cache.

    If done before tape-out, AMD can work with the foundry to optimize the transistors' characteristics and operating points, which would easily allow for a doubling in single-threaded throughput.
    Even if the raw clock rate couldn't simply be doubled, they could use the additional power budget to run MUCH more aggressive speculative execution, and to widen their superscalar pipeline to be at least as wide as the average instruction pipeline is long.

    Since IPC does not need to be tied to instruction latency, each core could easily complete around 5~6 instructions per clock by having, say, 10~12 fully functional superscalar pipelines (each pipeline able to complete any instruction independently, without having to rely on shared logic blocks).
  • Xenonite - Wednesday, February 17, 2016 - link

    Sorry, I submitted the post before I was finished. Basically, it boils down to the fact that no one (other than myself XD) would be willing to pay for such a processor. Even if AMD managed to totally thrash Intel in absolute performance, no one would care. And with no mass consumer support, their shareholders would never approve such a project in the first place.

    The main reason why VISC is doomed to fail is quite similar: you simply cannot attract investors with raw performance in 2016.
    Even if they actually had a ~5x single-threaded performance lead over Intel's fastest consumer desktop chips, they STILL wouldn't get the billions of dollars that they need to do a mass-market rollout of their arch.

    The whole situation is making me really morbid and depressed; what I wouldn't give to go back to the Pentium 3 days.
  • Demiurge - Friday, February 12, 2016 - link

    16-wide ILP isn't going to be a mass-market solution... most designs are barely using 1.5 instructions per cycle, let alone 4. Given the stellar shift to CPU and then GPU-based vector processing... I might be missing something here, but I would say that there is already 16-wide ILP in certain specialized operations that actually benefit, such as video processing for example.

    Incidentally, does anyone remember Transmeta Code-Morphing Software? If not, look it up...
  • sonicmerlin - Saturday, February 13, 2016 - link

    You didn't even read the fracking article.
  • name99 - Saturday, February 13, 2016 - link

    What he's saying (perfectly legitimately) is that
    - there is a LONG history of companies praising to the skies superficially good ideas which actually turned out not to matter much
    - VISC's unwillingness to provide SPECInt numbers, even after being so strongly excoriated about this by the entire tech press and academic world, STRONGLY suggests that what they're peddling does not work the way they claim. It likely provides a great speedup for much FP code (speedup which you can also get by using a GPU, the preferred path of traditional companies), and very little speedup for standard integer code.

    The speed at which they claim they can execute also makes one wonder. Even Apple (likely right now the best funded CPU design-house out there, with the simplest target in their sights, working on more or less traditional designs) aims for a major core every two years, with a minor upgrade in between. These guys, with vastly fewer engineers and money, and trying to do something more innovative, believe they can spin a major upgrade every year...

    That seems extremely unlikely, so the only real question is: are they bullsh*tting only the press/their investors, or are they also bullsh*tting themselves?
  • Samus - Monday, February 15, 2016 - link

    The parallels with Transmeta ring a bell with me, too, and yes, I did read the article. I'm inclined to have an immature capitalist response to things like this, specifically: if a company as big as Intel, with some of the best engineers in the world who are often open to radical ideas, hasn't bothered trying this kind of instruction decoder, it is likely because the pros did not outweigh the cons. After all, Jackson technology (Hyper-Threading) is some form of what's going on here, just not targeting specific requests.
  • Azethoth - Wednesday, February 17, 2016 - link

    Agreed, and reading the list of unanswered questions it sounds a lot like they are trying to look good in very specific circumstances, rather than being naturally best in class. The competition is GPU + CPU cores. Unless you prove superiority despite all the tricks the competition has available you cannot succeed. What they propose sounds like it needs to break down the normal inter core separation that lets them operate independently so that they can realize single threaded speedup. I am not an EE, so I assume it is at least possible. I am not sure it can be done practically though.
  • Bleakwise - Tuesday, March 14, 2017 - link

    "Floating point code"

    "Integer code"

    Do you have any idea what you're talking about?

    Bulldozer does "flaoting poitn code" faster than the fucking 1080Ti

    At least one one thread. Unless you're going to go wide it doesn't help.

    The point of this isn't to "go wide" it's to massively increase speculation ability.

    The 1080Ti has ZERO speculative ability, NONE. GPUs simply don't do branching; that's not what GPUs do. They rely on ACE units and SMX units and so on to balance thousands of cores.

    A CPU on the other hand has more speculative branches than cores.

    The SIMD and SIMT work that GPUs do is not "FPU code".

    FFS
  • dcbronco - Friday, February 12, 2016 - link

    AMD helped finance this. They may already have a stake and I would bet some right of first refusal. They used their investment in HBM to get earlier access than NVIDIA; I doubt they would have invested without some sort of incentive for themselves.
  • Bleakwise - Tuesday, March 14, 2017 - link

    Of course not.
  • bcronce - Saturday, February 13, 2016 - link

    There is no such thing as a free lunch. They are trading something. Their benchmarks are for single-thread performance, and the graphs showed much greater efficiency and performance than Intel. Very impressive, and I'm sure they'll be great for something.

    The problem is the platform sounds great for highly coupled cores and very wide single thread execution with few data dependencies. Could be great for computation.

    What I'm wondering is how their platform scales for IO workloads like web servers, file servers, or even video games. Suddenly a large part of the work is communicating with other devices and synchronizing many cores.

    One thing that helped ARM for a long time is that they were mostly single-core and only recently multi-core. They didn't use to have complex cache coherency like x86. This dramatically reduced transistor counts, increased efficiency, and allowed for great decoupled core performance. But as soon as you wanted two cores to work together, it went to crap. Cache coherency is hardware-accelerated inter-core communication. Amdahl's law was not very forgiving to ARM's non-cache-coherent cores for anything except GPU-like workloads.

    Based on the description, VISC sounds like it needs highly coupled cores to maintain low latency and high bandwidth. This is probably why they also seem to have lower frequency. Keeping many parts far away from each other in sync takes time. But lower frequency also means lower voltage, and power consumption scales with the square of the voltage and linearly with frequency (a rough worked form follows below).

    I wonder how tightly coupled they can keep 4, 8, or 16 cores. Maybe they don't need the core counts for their target workloads, or possibly they can stay competitive with a fraction of the core count by having better efficiency in power and IPC.

    In the end, I'm sure they'll at least find a niche market and I'm glad some new ideas are making it out there. I wouldn't be surprised if they can take over the dual or quad core market, forcing Intel to add more cores.
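    The voltage/frequency point above is the usual dynamic-power relationship; a rough form of it (alpha is the activity factor, C the switched capacitance) is:

    ```latex
    P_{\text{dyn}} \approx \alpha \, C \, V^{2} \, f
    ```

    So, to a first approximation, if voltage scales down with frequency, dropping both by 15% cuts dynamic power to roughly 0.85^3, or about 61% of the original, which is the kind of trade being described in the comment above.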
  • Bleakwise - Tuesday, March 14, 2017 - link

    It's not a "free lunch"

    Obviously all of this crap is going to cost die space; it's not free.

    If all we cared about was raw processing power we'd just make 2046kb-wide vector units and ignore branching and speculation altogether.

    Bulldozer has better theoretical performance than Haswell i5s. I'd rather have the extra out-of-order pipes, the SMT unit to use any unused pipes, better branch prediction and so on; in the real world this stuff wins the day.

    Not everyone can become a world class programmer and re-factor all their code so that it spreads across thousands of cores like it can on a GPU.

    Sometimes it's not even possible. Sometimes what you need is branch prediction; branch prediction lets you see the future - LITERALLY, this is what the CPU does. Obviously the more branches you predict, the more cycles you're wasting on that thread, because the more speculations you get wrong.

    You also reduce the number of misses and increase cache hits.

    As for coupling 4 or 16 cores, they haven't even talked about going beyond 4 cores. Obviously it doesn't scale to infinity; if you're getting 90% speculative accuracy you can only gain 10% more. Spending 30% of your transistor budget to bind up 8 or 16 cores, versus 10% of your budget on 4, for a 10% performance gain would be dumb.

    You'd be much better off going for more clock speed, or reducing latency, adding a victim cache, or L2 cache coherency, or beefing up the GPU, a better memory controller, or just beefing up your underlying branch predictor.
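    As a back-of-the-envelope way to see the cost being weighed above, the CPI added by mispredictions is roughly the product of branch frequency, mispredict rate, and the flush penalty; the numbers below are assumptions purely for illustration:

    ```latex
    \text{CPI}_{\text{branch}} \approx f_{\text{branch}} \cdot m \cdot P_{\text{flush}}
    \approx 0.20 \cdot 0.05 \cdot 15 = 0.15
    ```

    Which is why shrinking the mispredict rate (a better predictor) or the flush penalty (a shorter pipeline) can be worth more than extra width or extra bound-together cores.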
  • Bleakwise - Tuesday, March 14, 2017 - link

    You'll never get perfect speculation anyway, unless a language is developed that puts limits on the number of branches possible per X lines of code and keeps the number of branches below the number the CPU can handle. You're ALWAYS going to have to deal with the risk of cache misses.

    I'm not sure there is even anything you could gain from a 100% branch prediction hit guarantee beyond having no lost cycles on a miss. Getting there, even through a core-binding fabric/bus like this across 16 cores, would blow your transistor budget to the point that you could hardly afford a reasonably sized cache in the first place.

    You'd be better off just reducing the number of stages in the pipelines or just adding more pipelines to each core instead of blowing your budget on this fabric.

    For example, binding together 100 in-order CPUs to make a virtual 100-pipeline CPU would be ridiculously expensive and power hungry vs just having an 8-core superscalar CPU with 12 out-of-order pipelines in each CPU.
  • tipoo - Friday, February 12, 2016 - link

    Question: since this is testing their core design in isolation and the rest of the package hasn't been built around it, is that accounted for in the comparisons to other SoCs, which all have far more die area dedicated to non-core stuff than the cores?
  • Flunk - Friday, February 12, 2016 - link

    If VISC is not an acronym then don't capitalize it, idiots.

    The technology looks like it could be really good, I'm hoping we see some practical applications.
  • smilingcrow - Friday, February 12, 2016 - link

    They can capitalize it for any reason they like; it's just a word, so nothing to GYKIATO (get your knickers in a twist over).
  • andychow - Friday, February 12, 2016 - link

    It's an acronym, you can't trademark acronyms, so now they claim it's not an acronym. Legal bs 101.
  • ddriver - Saturday, February 13, 2016 - link

    All abbreviations are capitalized, not just acronyms, idiot. Whether it is an acronym or not depends on how it is pronounced.
  • erple2 - Saturday, March 12, 2016 - link

    Enough people confuse initialisms and acronyms that it probably doesn't matter anymore.
  • FunBunny2 - Friday, February 12, 2016 - link

    If SMI wants this to be believed, then just publish a paper (in a peer reviewed journal) showing how VISC invalidates Amdahl's Law. This is, after all, what they're really claiming.
  • willis936 - Friday, February 12, 2016 - link

    Could you explain how they're claiming to invalidate Amdahl's law?
  • FunBunny2 - Friday, February 12, 2016 - link

    as I read the piece, SMI is implying/claiming performance improvement in running serial code in a parallel fashion, and Amdahl says you can't do that. if, OTOH, the claim is that VISC is able to suss out parallel execution in superficially serial code, then that process has to be proven to exist algorithmically. as the piece goes to some length to describe, based on what's been provided by SMI, it's much like smoke and mirrors.
  • Arnulf - Friday, February 12, 2016 - link

    I read their claims as an expansion of superscalar design. Nothing new here and certainly nothing breaking any "laws". It still cannot magically make non-parallelizable code run faster than it normally would.
  • Samus - Monday, February 15, 2016 - link

    If their decoder can break up serial code and run it through different cores optimized to do different things better, this would theoretically complete the code faster because there will be no pipeline penalty.

    Personally, I think we have better odds of seeing a quantum processor before this type of thing takes off, though. That is to say, no time soon.
  • gamerk2 - Friday, February 12, 2016 - link

    Kinda. There will still be limited performance simply because some operations cannot be made parallel under any circumstances, but Soft Machines is really taking ILP to the extreme here.
  • xthetenth - Friday, February 12, 2016 - link

    No it really isn't and you're profoundly misunderstanding Amdahl's law. All that says is how much an improvement to a portion of a workload's execution speed will affect the workload as a whole's execution speed. Meanwhile what they're doing is trying to extract parallelism from single threads, which means that they're speeding up a greater fraction of the code. Funnily enough, you can use Amdahl's law to predict when this method (shrinking the non-improved section to allow higher maximum speed) is more effective than things like clocking higher.

    I suspect what you're doing is confusing the law with an explanation/common use of the law because it is very popular to use it to show there's a limit on the gains that can be made by parallel processors.
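    For reference, the law itself, with p the fraction of execution time that benefits and s the speedup applied to that fraction:

    ```latex
    S_{\text{overall}} = \frac{1}{(1 - p) + \frac{p}{s}},
    \qquad
    \lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - p}
    ```

    Nothing in it says the improved fraction has to come from adding threads; it applies just as well to widening the portion of a single thread that the hardware can overlap, which is the point being made above.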
  • Drumsticks - Friday, February 12, 2016 - link

    It doesn't really invalidate Amdahl's Law. Serial code still can't be run on multiple cores. As I understand it, it only allows extracting more ILP using an ultra ultra wide design when possible.
  • xdrol - Saturday, February 13, 2016 - link

    I somehow fail to see why scheduling 2 threads onto 2 cores of 4-wide pipelines - including overhead from 'in-thread' cross-core communication - should be more effective than a 2-thread 8-wide SMT core (aka Skylake - it's not 6-wide, and SMT is fine-grained; threads don't 'wait' like the article suggests).
  • Alexvrb - Saturday, February 13, 2016 - link

    Right! It's like getting the best of narrow and wide designs at the same time. You can go wide or narrow and more/less threads as needed. It'll probably need a lot of OS support to work well. Still, the concept is interesting, and if their translation layer is fast, it could eventually handle legacy well enough.
  • FunBunny2 - Saturday, February 13, 2016 - link

    -- You can go wide or narrow and more/less threads as needed.

    but no known processor (or design algorithm) can create parallelism in serial code. just because a cpu wants to implement ILP to a greater extent than extant processors, it can't make parallelism from nothing; it can only discover "hidden" parallelism that extant processors are missing. not something I'd bet on.
  • Alexvrb - Sunday, February 14, 2016 - link

    I was talking about the processor itself. The CPU can act like a wide or narrow design on demand. Whether or not a particular piece of code will benefit was not something I was discussing. My point is that where it helps, it can go wider than current designs. Where it doesn't help it can scale back and go narrow, leaving more cores available for other threads.

    In other words this doesn't displace multi-threading. A single piece of demanding software may still want to run multiple threads concurrently, to indirectly extract more parallel performance - such as a game splitting up AI, physics, audio, rendering, networking, etc into their own threads. I don't think their design eliminates the necessity of doing this sort of thing.

    However they can boost average efficiency with a narrow design and lots of cores (similar to mobile ARM designs), without losing performance vs high-power designs (and in some cases gaining performance) because they can act as a wide pipeline by combining cores. It's a flexible form of virtual cores that people will tend to just simplify as "reverse HyperThreading".

    This is all just in theory of course; their implementation has to prove itself. Not to mention the difficulties they'll face with ISA translation, at least in the near term. If their technology takes off and gets licensed out, there will be ports of modern OSes and APIs, and thus apps will be ported to run natively (in the case of Windows, cloud compilation would handle the majority of RunTime apps).
  • Samus - Monday, February 15, 2016 - link

    If the translation layer does what they say it does, that is exactly what this processor can do. It can break up serial code for parallel processing. I don't know how, or how efficiently, it can do this. To analyze serial code and say, ohh, there's this complex part in the middle and the rest is simple, and send the complex part to one core and the rest to another, and somehow reassemble it after it's processed, seems impossible. We have all seen promising tech flop before; Cyrix and Transmeta had some radical ideas for the way x86 worked, and in the end neither could trump Intel or AMD.
  • Alexvrb - Monday, February 15, 2016 - link

    What he was saying is that some code can NOT be made parallel. They CAN take single threads and break them up, and when it's possible they can find parallel processing opportunities. But some tasks are inherently serial. Neither the programmer, nor the compiler, nor the VISC processor can make inherently serial tasks parallel. For example, if A has to happen before you can work on B.

    Uh, at least not with a conventional binary architecture. I don't know much about quantum processors.
  • easp - Friday, February 12, 2016 - link

    I think uninformed pundits/press/commenters will miss the limits imposed by Amdahl's law.

    It's still plausible to me, though, that this approach will allow more efficient use of silicon and power by allowing better allocation of processor resources at runtime than is possible with traditional compilers, operating system scheduling, hardware scheduling and organization of execution resources.

    Whether they can establish a viable foothold in today's competitive landscape is another issue.
  • Sufiyan - Saturday, February 13, 2016 - link

    If anything this shows that Amdahl's law is still true.
  • Bleakwise - Tuesday, March 14, 2017 - link

    They never said it violates Amdahl's law.

    In fact they said that 2 cores give a speedup of 53%.

    Amdahl's law says it would be a 100% speedup maximum.

    Since when is 53% > 100%?
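    As a rough check using the formula quoted earlier in the thread, and assuming the 53% figure is an overall speedup on two cores, the implied parallelizable fraction works out to:

    ```latex
    1.53 = \frac{1}{(1 - p) + \frac{p}{2}}
    \;\Rightarrow\;
    p = 2\left(1 - \frac{1}{1.53}\right) \approx 0.69
    ```

    which sits comfortably under the 2x (100%) ceiling that two cores would give at p = 1, consistent with the point being made here.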
  • Bleakwise - Tuesday, March 14, 2017 - link

    Of course, it does beg the question...

    Why don't we just make 24-wide CPU pipelines and allow for 3-way SMT and fatten the cores up with more units instead?
  • Bleakwise - Tuesday, March 14, 2017 - link

    I mean IBM does this with the POWER8 very successfully.
  • Bleakwise - Tuesday, March 14, 2017 - link

    If you would like to know how a superscalar CPU can beat an in-order CPU....
    https://en.wikipedia.org/wiki/Instruction-level_pa...

    https://en.wikipedia.org/wiki/Superscalar_processo...
    https://en.wikipedia.org/wiki/Instruction-level_pa...

    So a processor with 6 pipelines can do
    1*2*3*4*5*6 in one instruction,
    a processor with 12 pipelines can do
    1*2*3*4*5*6*7*8*9*10*11*12
    in one clock cycle.

    This is the opposite of Hyper-Threading, which allows my 4770k with 5 pipelines to do
    1*2*3*4*5
    or
    1*2*3 and 4*5
    or
    1*2 and 3*4*5
    all in one clock cycle.
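    A small C sketch of the distinction being gestured at above (hypothetical snippet, not from the article): independent operations are what extra pipelines can actually overlap, while a dependent chain gains nothing from width alone.

    ```c
    /* Illustrative only: which kinds of work a wide superscalar core can
       overlap in a single cycle, and which it cannot. Values are arbitrary. */
    #include <stdio.h>

    int main(void) {
        int x0 = 1, x1 = 2, x2 = 3, x3 = 4;
        int y0 = 5, y1 = 6, y2 = 7, y3 = 8;

        /* Independent operations: no result feeds another, so a core with
           four (or more) ALU pipelines can issue all of these together. */
        int a = x0 + y0;
        int b = x1 + y1;
        int c = x2 + y2;
        int d = x3 + y3;

        /* Dependent chain: every multiply needs the previous result, so
           extra pipelines sit idle; only speculation, clock speed, or
           latency improvements help here, not width. */
        int p = 1;
        p *= 2;
        p *= 3;
        p *= 4;
        p *= 5;
        p *= 6;

        printf("%d %d %d %d %d\n", a, b, c, d, p);
        return 0;
    }
    ```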
  • jjj - Friday, February 12, 2016 - link

    What they do with the A72 in their slides is a huge red flag. They clock it above 3GHz on 16FF to make it look bad. When you don't need to distort the truth, why do it? I was excited about them but they lost all credibility with this.
    vs ARM it will be hard for them, assuming ARM will have yearly updates and a broader range of cores. Area will also matter a lot. Of course, vs ARM the proper math when it comes to perf, power, thermal and area would be to include dark silicon. ARM is at 8-10 cores in 2-3 clusters but we might see even more than that (I would add a gaming cluster, as GPU perf is a rather complicated problem right now).

    Hope we do get to see them in commercial products, and I wonder about their longer-term plans. Would be interesting if they would aim for a lot more cores at very low power, and even cooler if they would aim to use different types of cores - as undoable as all that might be lol. For glasses we need a huge step forward that process and packaging might fail to enable soon enough, and even server might find such a path preferable. Would love to see 1T 32PC at 50-100mW on 5nm. Or, to just go crazy, it would be great if they could reach low enough power (thermal) to stack logic and go monolithic 3D, since folks are not quite able to do that, for now.
    Guess it would be great if you could ask them how far they think they can push the number of cores in a thread.
  • gamerk2 - Friday, February 12, 2016 - link

    Odds are, Soft Machines gets acquired by Intel (who want a low-power core for mobile - and hey, ARM support to eliminate the lack of mobile x86 software to boot) or NVIDIA (who want a CPU core, and hey, already have ARM-based tablets; x86 support is a bonus and could allow full NVIDIA-branded PCs).
  • jjj - Friday, February 12, 2016 - link

    It would be easier for Intel or ARM to just copy. Additionally, a sale to Intel would be difficult with Samsung and AMD as investors in SM.
  • fiodhkf - Friday, February 12, 2016 - link

    I don't understand these results. How are the Skylake SPECint and SPECfp scores so low? On spec.org the weakest Skylake part I could quickly find is the Celeron G3900 at 2.8 GHz and 2MB L3 (and huge power consumption, but let's ignore that for now). It has CINT2006 of ~45 and CFP2006 of ~61. Can the i5-6200U be that much slower?
  • extide - Friday, February 12, 2016 - link

    Because those are NOT the results of a Skylake chip; those are their adjusted results for a chip that is equivalent to Skylake but with 1MB L2, no L3, and made on TSMC's 16nmFF+ - a chip that will NEVER exist in the wild and is POINTLESS to compare to, as these guys will never be competing against a made-up chip, only the actual stuff released by Intel and other people.
  • fiodhkf - Friday, February 12, 2016 - link

    In the second performance/watt comparison figure the blue curve is supposed to(?) show the true unscaled-for-cache Skylake (power is probably scaled to TSMC 16nmFF+, but surely they're not scaling the performance as well). Even there the Skylake SPEC scores are only about half of what they should be according to results on spec.org.
  • Exophase - Friday, February 12, 2016 - link

    The spec.org scores are using ICC, which has optimizations that game a few SPEC2006 subtests like crazy. They also apply auto-par and pointer compression optimizations that aren't applied in GCC. There are also some extra optimizations for peak, if you're looking at that, but it doesn't make a huge difference in the overall score.

    All of this adds up to big differences in SPEC score.
  • fiodhkf - Friday, February 12, 2016 - link

    Thanks, that was pretty much what I guessed would be one explanation for the difference. Still, I'm a bit surprised by the low Skylake scores even when compared to some (old) AMD processors where the spec.org scores used Open64. But I don't care quite enough to try myself.
  • vladx - Sunday, February 14, 2016 - link

    If it works, Intel or ARM won't be able to copy them because they've already patented the techniques used.
  • valinor89 - Saturday, February 13, 2016 - link

    AMD, Samsung and GlobalFoundries are chief investors, so it is doubtful Intel or Nvidia will be able to acquire this company.
  • xthetenth - Friday, February 12, 2016 - link

    Why is that such a red flag? They show the optimal part of the curve for A72, and they show the suboptimal tail for all of them, although they extend it farther for the A72 to show what it takes to get it up to the same performance level (basically it's non-viable and that they're in a different class if accurate), and they say as much. There's a huge list of objections the article raises and that isn't on it for pretty good reason. It's just not nearly as big a deal as the rest.
  • Andrei Frumusanu - Friday, February 12, 2016 - link

    It's incorrect to simply extend a curve of an existing design beyond its design operating range. It's perfectly possible to design the physical implementation to be optimized at very high frequencies - in such a case the curve would be less steep but consume higher power at the low frequencies. Extending the curve of a low-power design is relatively misleading in this case.
  • extide - Friday, February 12, 2016 - link

    Yeah, and I don't think they should adjust the Intel cores at all. Intel chips come as Intel makes them, that's it. You will never see the Skylake arch on TSMC or GF foundry processes. You should take the results from the Intel chip as they are, because that is what you will be competing against, not some made-up adjusted result that will never exist in the wild.

    As for adjusting the OTHER chips, well, OK, I see what they are going for here, but I still think they took it a bit too far, like adjusting for more or less cache. Although you can see those other chips on various processes, from GF and TSMC, so the process correction isn't really as big of a red flag to me.
  • name99 - Saturday, February 13, 2016 - link

    The curve is not illegitimate; you're missing the point. The goal of the curve is not to show how great their CPU is, it is to show how great their TECHNOLOGY is (ie their microarchitecture). This is best done by comparisons that hold all else equal (ie same process, same compiler, same caches, etc; only different microarchitecture).

    If you're going to criticize the presentation, criticize it on grounds that actually make sense:

    - their "performance" score is garbage because they claim to be in the business of speeding up SINGLE-THREADED code, but then mix in a number of benchmarks that are very naturally parallelized. This is much like comparing an ARMv8 CPU with NEON switched off to an x86 using AVX-512, to test matrix multiplication speed --- it's simply NOT telling you anything about single-threaded performance.

    - the robustness of their normalizations is dodgy and they provide little evidence that the ways in which they have normalized are legit.
  • gamerk2 - Friday, February 12, 2016 - link

    This is where CPUs are eventually going to go, since it's really the only way to get maximized CPU performance without adding a lot of power-hungry components onto the die.

    That being said, the likely outcome is someone (Intel most likely, possibly NVIDIA) acquires Soft Machines and integrates their IP onto their own chips.
  • vladx - Sunday, February 14, 2016 - link

    Doubt it; the only chance for Nvidia would be to license it, and Intel would most likely be blocked from buying such a company.
  • Avendit - Friday, February 12, 2016 - link

    How does this all compare to the Transmeta/Crusoe parts? Those had a different purpose but did have the translation abstraction layer approach, yet didn't seem to go anywhere, unfortunately. Are there any parallels or learnings to be had?
  • xdrol - Saturday, February 13, 2016 - link

    The Crusoe was (and Denver is) a VLIW design; it needed software translation to run *anything*, deciding which pipeline ports to schedule (a hard optimization problem). Here the translation is supposedly just an ARMv8-to-internal-ISA mapping; scheduling is still done by hardware, like in a normal superscalar design.
  • Jtaylor1986 - Friday, February 12, 2016 - link

    Excellent article Ian. Thanks
  • jjj - Friday, February 12, 2016 - link

    One more thing.
    Any clue about thermal management? Can they turn off individual physical cores, or do they just lower clocks? Being able to do both would be interesting.
  • matt321 - Friday, February 12, 2016 - link

    This would make sense for someone like Apple to buy/invest in/license the technology for their own processor development. They could have common cores with translations for both ARM and x86 (for iOS and OS X respectively) with the long-term goal of migrating completely to the VISC ISA.
  • extide - Friday, February 12, 2016 - link

    This is interesting, because I have thought about doing a processor design somewhat like this for a long time. Remember when BD was coming out and there were rumors of "reverse Hyperthreading"? Well, this is kinda that.

    I had thought that someone should make a suuuper wide CPU, like 20 or 30 wide, put TONS of execution resources on it, and then put a bunch of hyperthreads on it. That way a single thread could use all 20-30 execution resources, if possible, or you could have multiple threads sharing all that. Like instead of a quad core with 2 threads/core, have a super core with 8+ threads, and then maybe a couple of those.
  • extide - Friday, February 12, 2016 - link

    Although, I had always thought that engineers had thought of this already, and that maybe it was a bad idea due to some reason I don't understand, and that's why we haven't ever seen a design like that. Well, this is pretty similar to my idea, except they aren't making a super core; they are allowing a thread to use resources from several cores, if it needs them.
  • Exophase - Friday, February 12, 2016 - link

    The problem is that going wider decreases efficiency and slows down critical paths. So the processor that's N * 2 wide will have to be a lot slower and/or less efficient than the one that's N wide. If software can rarely extract enough parallelism to go beyond N wide then the N * 2 wide version will almost always be worse. There's a good balance point to be found here.

    Some components in the CPU even scale worse than linearly as they increase in width. The wiring can increase quadratically or even exponentially.

    In practice, a lot of the code that you could realistically extract a ton of ILP from is the type of code that's easiest to vectorize or thread (and a lot of vector- and thread-friendly code can run well on GPUs). What remains, outside of some benchmarks anyway, is mostly code that has fairly limited ILP due to eventually hitting mispredicted branches or very long dependency chains. Branch mispredictions are particularly bad on a CPU that has a ton of instructions in flight due to being very wide, because that much more energy is wasted on failed speculation.
  • Oxford Guy - Friday, February 12, 2016 - link

    So why wasn't Prescott really great (narrow and deep) versus the G5 (very wide and shallow)?
  • Exophase - Saturday, February 13, 2016 - link

    It's like I said, "there's a good balance point to be found here."

    Faster clocks need higher voltage which scales super-linearly with power consumption. They require longer pipelines which have worse branch misprediction penalties. They take more cycles to talk to other components that don't scale with CPU clock like RAM. More transistors (more space, power) are thrown at these things to try to compensate, like better branch predictors and more reordering, more aggressive prefetching, etc.

    So there's a balancing act between two extremes and what makes the most sense will depend on the manufacturing process, target market and various other things.

    G5 was actually not very wide and shallow anyway. It was a 2.7GHz processor in 2003 and was supposed to hit 3GHz. It had a 16-21 stage pipeline with up to > 200 instructions in flight. That's not shallow at all. 4 wide decode with 2x ALU + 2x L/S is not really that wide either.
  • AlexTi - Friday, February 12, 2016 - link

    If an algorithm is developed which can split current single-threaded code into "threadlets" that can be run in parallel, why can't it be used in compilers to make multi-threaded code run on existing architectures? Especially in environments which use a JIT?
  • extide - Friday, February 12, 2016 - link

    Because a compiler can only schedule instructions to the CPU's front end. This is scheduling of instructions to different ports on the back end of the CPU. The compiler can't tell the CPU what port an instruction goes down; the CPU picks that. The compiler only gets to pick what instructions are issued, and in what order, and of course modern CPUs can even change that order if they deem it faster to do so.
  • Exophase - Friday, February 12, 2016 - link

    To have any realistic chance of working the threading speculation/detection has to have a large dynamic component (detecting threadlets as they become desirable at runtime) and has to have architectural support for very lightweight thread splitting, merging, and inter-thread communication.

    That can't be provided by compilers targeting existing instruction sets.
  • AlexTi - Friday, February 12, 2016 - link

    Thanks, I think I got the point finally. This looks similar to what the instruction scheduler currently does for execution units in a conventional CPU. The virtualization layer + CPUs will be a kind of very wide core. Right?
    It was already noted in the article that making current CPUs wider is problematic and not universally beneficial. So this new engine should be much more efficient than current implementations.

    Good thing is that we'll see eventually :)
  • Senti - Friday, February 12, 2016 - link

    Bullshit. I have no idea how technically incompetent writers have to be to reprint that marketing nonsense again and again.

    First of all, this brings absolutely no advantages over the existing fat core + SMT concept. More IPC per core with more pipelines is not done - not because it's hard to do without that 'virtual cores' nonsense, but simply because there are not enough actually independent instructions that can be automatically extracted from real code during the parts where performance matters.

    "Alternatively, if multiple programs or threads want to use the hardware, then a single core is inaccessible to additional threads while the first thread is still in use (though this can be avoided somewhat by simultaneous multithreading or SMT which will let another thread have access when the first has encountered a stall such as waiting for L1/L2 memory)." - total lies. That describes coarse-grained multithreading which is not very popular atm. For example, Intel HT allows usually 2 threads to execute simultaneously dynamically sharing pipelines of the same core all the time. POWER8 uses 8 'virtual' threads per core.

    Why no one splits instructions from the same thread over several cores (other than the obvious reason that there are not enough independent instructions to split)? Almost quote from the text: "cross-core communication adds latency and reduces performance".

    Instruction set emulation? Far from a new concept. Why is it not popular? The reason is very simple: significant overhead. Try translating something non-trivial like AVX/NEON instructions to some generic internal instruction set.

    Finally, the last point: everyone can draw cute performance graphs and huge numbers in marketing presentations, but how about giving actually working chips for independent reviews of performance and power efficiency on real code?
  • vladx - Sunday, February 14, 2016 - link

    Skim the article again, there's a roadmap so let's see how things will go from here.
  • Exophase - Friday, February 12, 2016 - link

    There's another big question with their power measurements. They take differences between idle and 1C and 1C and 2C to cancel out the static contribution of other peripherals. But this still ignores the dynamic contribution.

    For example, we can look at Cortex-A72, where ARM claims that one core at 2.5GHz on the TSMC 16FF+ process will consume about 750 mW. In Kirin 950, the power consumption appears to be about 900 mW at 2.3GHz. Is ARM exaggerating or is Huawei's implementation inferior to ARM's expectations? The discrepancy can actually be pretty easily explained by losses in the PMIC/VRMs, the SoC's memory controller, and the DRAM - all components which use more power the more the CPU load increases.

    This is especially a factor for wall measurements because they take into account an additional AC/DC converter. While it's possible that Soft Machines included these figures in their power estimations, I doubt it since they didn't mention it, and like ARM it's more practical and beneficial for them to work with core power estimations only.

    So there could easily be another 25+% that the non-VISC platforms are being penalized.

    Something else that raises a red flag to me is the 16FF+ test chip. There are only 100 pins. Once power, ground, and various control signals are accounted for, that leaves a very small interface to either a memory controller or memory (if the controller is integrated). Even a single-channel 32-bit interface would be a hard fit. So does this chip really represent both realistic power consumption and realistic performance? I think they're trading one for the other on this one, and that makes me question the applicability of the power numbers they've given for it.
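    A toy numerical sketch of the concern raised above, with made-up wattages (none of these are measured values), showing how the idle-vs-load differencing can fold shared dynamic power into the "core" figure:

    ```c
    /* Toy numbers only (assumptions, not measurements). */
    #include <stdio.h>

    int main(void) {
        double p_idle       = 0.50; /* W: platform idle (static + background)   */
        double p_core       = 0.90; /* W: what the core itself draws under load */
        double p_shared_dyn = 0.25; /* W: PMIC/VRM losses, memory controller and
                                       DRAM activity that only appear under load */

        double p_1c = p_idle + p_core + p_shared_dyn; /* reading with 1 core loaded */
        double est  = p_1c - p_idle;                  /* the differencing method    */

        printf("estimated core power %.2f W vs actual %.2f W (%.0f%% high)\n",
               est, p_core, 100.0 * (est - p_core) / p_core);
        return 0;
    }
    ```

    With these made-up numbers the method overstates core power by about 28%, the same ballpark as the 25+% penalty suggested above.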
  • Arnulf - Friday, February 12, 2016 - link

    Since one cannot buy these "scaled" chips, IMHO it'd make more sense for SMI to publish performance-per-watt figures for real hardware and let the market decide whether their concept is attractive enough. Yes, Intel may have a process node advantage; yes, different CPUs are targeting different performance and power profiles; but at least it's a straight comparison, and if VISC doesn't beat its entire competition at at least one metric then it's destined to fail anyway.

    Oh and the remark in the article regarding "VISC advantage" because of it using twice the number of cores while running a single thread in tests - who cares as long as it comes out on top in performance per watt? If they can beat other CPUs by using more cores, kudos to them!
  • ppi - Saturday, February 13, 2016 - link

    Regarding core count, I would direct you to the recent AT article on Android usage of multiple cores. A simplified conclusion may be that Android tends to utilise 4 cores pretty well.

    In the real world, this significantly reduces the impact of distributing a single thread over multiple cores.
  • kgardas - Friday, February 12, 2016 - link

    Interesting stuff, but to be honest, combining "simpler" cores into a more complex one is also done in software on SPARC64. At least Fujitsu mentions this in some of their Hot Chips presentations for SPARC64 VII. So you have a 4-core CPU with 4-wide cores and you can combine them in software (compiler) into 8-wide or more depending on your needs for instruction parallelism.
    Another thing is that something like that is IIRC also supported by POWER8, where you do have a lot of duplicated resources, but not enough, so in case one thread is able to consume all core resources you may switch off the 7 others. IIRC IBM's compilers contain some optimizations for this too.
    Pity that you have mentioned Itanium only in this negative way. Honestly speaking, the Itanium design was really great and it's a real pity that Intel stopped developing it and never provided any OoO designs on this architecture. Whether Denver will be successful we will see, but NVIDIA still counts on it for some designs, which may be interesting in light of the fact that they have been using ARM's core (A57) for some time now and don't need Denver that much. Also, automotive does not care if Denver is there or not, yet NVIDIA pushes it there, so I would bet they have a really good reason for it. Perhaps their VLIW is good for some special tasks...
    So to me this whole thing looks like they are on another round for money.
  • Oxford Guy - Friday, February 12, 2016 - link

    "Honestly speaking Itanium design was really great and really pity that Intel stopped developing it"

    The market disagreed so, if you're right, it's a pity the market dictates product success to such a degree.
  • kgardas - Saturday, February 13, 2016 - link

    The market is not dominated by excellent technology, but by the average or mediocre, in fact. "Good enough" is the enemy of any excellence.
    Also, in comparison with AMD64, which is just a pile of hacks to prolong the x86 architecture's life, IA-64 was a clean design on a green field, and it's really a pity Intel couldn't push it further -- also due to AMD64's existence in the market.
  • Alexvrb - Monday, February 15, 2016 - link

    That kind of thinking is what led to the Itanic, the unsinkable chip. Too bad they ran afoul of a giant costberg.

    But no, you're right, everyone should switch to a new ISA because Intel says it's better, even though it means switching to an Intel-exclusive ISA that will cost you dearly now, and even more dearly later when you become dependent on a product that only Intel can produce.

    If Intel's only desire was a better design for everyone, they would have worked with AMD and freely extended licensing agreements for IA-64 to them so they could both produce IA-64 chips. The outcome could very well have been different in that scenario. But that is not what they did, and they paid for it. Of course, Intel is such a giant that they can afford to take such failures in stride. AMD cannot afford another flop - Zen, Polaris, and eventually a ZenPolaris APU have to achieve at least a significant degree of success.
  • FunBunny2 - Monday, February 15, 2016 - link

    -- Of course, Intel is such a giant that they can afford to take such failures in stride.

    the irony, of course, is that Intel got where it is just because, in ~1980, Intel had one foot in chapter 7 and one foot on a banana peel. IOW, an easily controlled peon for IBM to abuse, and thus the 8088 came to be. if IBM had secured the BIOS, life for both would have turned out rather differently, I suspect.
  • Alexvrb - Monday, February 15, 2016 - link

    Yeah that is ironic. Intel was trying to avoid making a similar mistake, and in doing so they screwed up - but the SIZE of the failure was tiny in comparison. One thing about Intel is that they have better foresight and planning than IBM. IBM always was caught up in their own world. Intel probably was working on backup plans for Itanium failure before it even launched, regardless of how high they thought its chances were.
  • diediealldie - Friday, February 12, 2016 - link

    Thanks for the great article.

    Anyway, I'd wait for actual working silicon at high frequency (not 0.5GHz) to figure out if it's real or not.

    Since they're making an abstraction layer with real silicon, demonstrating it on a slow chip will not be enough to convince industry experts (the hardware will be complicated, so making it work at very high frequency is also a big challenge).
    High-frequency chips require more pipeline stages, thus latencies and cache efficiency get worse, hardware blocks stop working, etc., so a high-frequency chip is what Shasta really has to demonstrate.
  • dcbronco - Friday, February 12, 2016 - link

    Interesting that AMD is switching to SMT with Zen and is one of the big financiers of Soft Machines, and VISC works well with SMT. I also wonder if an OS written for VISC would give a boost to APUs, or would the bottlenecks kill any advantage.
  • zodiacfml - Saturday, February 13, 2016 - link

    Interesting, as the numerous patents created can be beneficial to other CPU makers, but for them, creating a compelling chip that could sell would be miraculous.
  • haplo602 - Saturday, February 13, 2016 - link

    Lots of marketing and even more estimates and projections. Looks like a very long road ahead to an actual working chip with peripherals. And a lot of obstacles still to clear.

    While the idea is interesting, it is highly impractical. The higher in frequency they go, the more the final compositing overhead will bite them. There's a cost to traversing the layers of the design from thread to CPU and back with results. I do not see that explained anywhere.

    Another point is, I think this is not a new idea. It is a fairly obvious extension, so expect that Intel already went that route and hit a dead end.
  • vladx - Sunday, February 14, 2016 - link

    Intel is too conservative to come up with such ideas; they'd rather milk the cow as long as possible.
  • hMunster - Sunday, February 14, 2016 - link

    There's one important question I don't see addressed: how do you run at higher IPC when you have a conditional branch every few instructions? A 16-wide virtual CPU using 4 physical cores is all nice and dandy, but 16 instructions will, in normal x86 code, contain at least 2 branches. I can't see them doing a lot of speculative execution because that drives up the power consumption. So how do they solve this?
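    To put a number on it, using the comment's own ratio of at least 2 branches per 16 instructions and assuming (purely for illustration) a 95% per-branch prediction rate, the chance a 16-instruction issue window is entirely on the correct path is:

    ```latex
    P(\text{window fully correct}) \approx 0.95^{2} \approx 0.90
    ```

    so roughly one window in ten contains wasted speculative work, and the waste grows with the number of branches kept in flight, which is exactly the power concern being raised.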
  • KAlmquist - Sunday, February 14, 2016 - link

    If I understand the article correctly, the difference between VISC and SMT is that in SMT there is a single scheduler which manages all of the execution units. VISC implements a two stage scheduling algorithm. In the first stage, an operation is assigned to a core. In the second stage, the scheduler for that core assigns the operation to an execution unit.

    The downside of SMT is that the amount of silicon required to implement the scheduler grows faster than the number of execution units. So as you add more threads and more execution units, it becomes harder and harder to keep the cost of the scheduler to a reasonable level.

    In the second stage of VISC, you have multiple schedulers, each feeding a small number of execution units, which keeps these schedulers simple. In the first stage, the schedulers require at least some awareness of all the execution units. For example, if you have an integer multiply instruction, you want to send it to a core that doesn't have other integer multiply operations outstanding rather than just choosing the core with the smallest total number of outstanding operations. What may keep the first-stage scheduling reasonably simple is that it doesn't appear to do any instruction reordering (though it does have to do the bookkeeping to keep track of which instructions have been retired).

    In short, VISC appears to be intended to scale better than SMT as you add more threads and execution units.

    What is strange, then, is that Soft Machines isn't talking about building an 8-thread device like IBM's POWER8. Instead, they have two- and four-thread designs, and are mostly talking about the former. A two-thread VISC design makes sense only if you believe that the SMT approach is already hitting its limits with two threads.

    My sense is that VISC is not going to be a game changer, but Soft Machines could be successful if ARM Holdings screws up. If ARM has a major screw-up technologically (like AMD did with Bulldozer), Soft Machines could end up with a superior product. Conversely, if ARM screws up on customer relations, all Soft Machines would need is something close to technological parity with ARM to win customers.
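    A toy sketch of the two-level scheduling idea as read from this comment (a paraphrase of the description above, not Soft Machines' actual design; core and unit counts are arbitrary): a first stage picks a core by outstanding load, then that core's own scheduler picks an execution unit.

    ```c
    /* Toy model of two-level scheduling as described above; not real hardware. */
    #include <stdio.h>
    #include <string.h>

    #define CORES 2
    #define UNITS_PER_CORE 2              /* execution units per core (arbitrary) */

    typedef struct {
        int outstanding[UNITS_PER_CORE];  /* ops currently queued per unit */
    } Core;

    /* Stage 1: global scheduler picks the least-loaded core. */
    static int pick_core(const Core cores[]) {
        int best = 0, best_load = 1 << 30;
        for (int c = 0; c < CORES; c++) {
            int load = 0;
            for (int u = 0; u < UNITS_PER_CORE; u++)
                load += cores[c].outstanding[u];
            if (load < best_load) { best_load = load; best = c; }
        }
        return best;
    }

    /* Stage 2: that core's local scheduler picks the least-busy unit. */
    static int pick_unit(const Core *core) {
        int best = 0;
        for (int u = 1; u < UNITS_PER_CORE; u++)
            if (core->outstanding[u] < core->outstanding[best]) best = u;
        return best;
    }

    int main(void) {
        Core cores[CORES];
        memset(cores, 0, sizeof cores);

        for (int op = 0; op < 8; op++) {      /* dispatch 8 dummy operations */
            int c = pick_core(cores);         /* first (global) stage        */
            int u = pick_unit(&cores[c]);     /* second (per-core) stage     */
            cores[c].outstanding[u]++;
            printf("op %d -> core %d, unit %d\n", op, c, u);
        }
        return 0;
    }
    ```

    The point being made above is that each local scheduler only has to track a couple of units, so its cost stays roughly flat as cores are added, whereas a single SMT scheduler has to track every execution unit at once.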
  • Shadowmaster625 - Monday, February 15, 2016 - link

    When Intel purchased Altera I immediately began to visualize all sorts of great potential breakthroughs in single-threaded IPC. I imagine that within 5 years, we will have at least a modest number of FPGA cells integrated within Intel CPU cores. These cells will be programmed on the fly with application-specific DSPs that will be capable of completing commonly used combinations of instructions MUCH faster than the general x86 instruction set would allow. I expect this to be the single largest breakthrough in computing of the last 20 years. Within 10 years, I expect the CPU itself to create its own DSP code on the fly as it profiles its own instruction loading in real time. The potential here is utterly massive. Think about what ASICs have done for bitcoin mining... Soon they will be able to do that for JavaScript!
  • FunBunny2 - Monday, February 15, 2016 - link

    -- capable of completing commonly used combinations of instructions MUCH faster than the general x86 instruction set would allow. I expect this to be the singularly largest breakthrough in computing of the last 20 years.

    that's what the real cpu/RISC core/micro-architecture has done for decades. tweaked continually.

    -- I imagine that within 5 years, we will have at least a modest number of FPGA cells integrated within Intel CPU cores.

    done: http://www.extremetech.com/extreme/184828-intel-un...
    "This new Xeon+FPGA chip will fit in the standard E5 LGA2011 socket, but the integrated FPGA will allow each chip to be customized to specific workloads."
  • Shadowmaster625 - Monday, February 15, 2016 - link

    That's not what I mean. That is of course a good start, but what I'm talking about is programmable logic linked tightly to the actual execution units of the CPU core. Smaller blocks, probably only a square millimeter or perhaps even less. But many of them. Just like Skylake has 6 execution units. One of these programmable blocks would be only about the same size as one of those existing execution units. They would have direct access to the prefetcher and scheduler and instruction/data caches. They would be power gated.
  • dustwalker13 - Saturday, February 20, 2016 - link

    yes it looks good on paper ... but up to now that is all it does.

    silicon existing at HQ is so much smoke and mirrors until some independent source has an actual go at it and publishes results.

    it looks promising, but so did a million other things that ended up as just another failure or, worse, a scam.

    i will keep an eye on this one, but for now there simply is nothing to see other than mirror images produced by a lot of hot air.
  • mikato - Saturday, February 20, 2016 - link

    So why did they come out of stealth mode?
  • TruePath - Saturday, April 16, 2016 - link

    I've been curious for a long time why more wasn't done to use parallel resources to extract instruction-level parallelism.

    However, what puzzles me is why do so much of the work on the fly at run time. Sure, one needs to be able to respond to dynamic performance information like failed speculation, but it seems like there is substantial overhead in translating the host ISA into native instructions and (I assume) encoding information into the native instructions about resource needs and dependencies.

    Even before a program is run, knowledge of the exact processor would enable software to translate the ISA (targeting the exact chip), hint at resource needs and perform a degree of instruction reordering (over a larger window than in hardware).

    So why not push as much of this into the software as possible? One can even cache the results of software ISA translation. Is it just a desire to be totally hardware compatible?
