39 Comments

  • brucethemoose - Thursday, September 15, 2022 - link

    "This change makes the SIMD pipeline width identical to Arm’s Cortex parts (which are all 128-bit, the minimum size for SVE2), but it does mean that Arm is no longer taking full advantage of the scalable part of SVE by using larger SIMDs. I expect we’ll find out why Arm is taking this route once they do a full V2 deep dive, as I’m curious whether this is purely an efficiency play or something more akin to homogenizing designs across the Arm ecosystem"

    I propose an even simpler reason: faster NEON performance, which is what basically all existing hand-coded ARM intrinsics target now.

    And other customers are probably using binaries without SVE2, or compilers that can't emit SVE2 instructions from autovectorized code yet.
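
    For illustration, here's the shape of a fixed-width NEON loop, which is roughly what most hand-coded ARM SIMD in the wild boils down to (a minimal sketch, not from any particular codebase). Four 128-bit pipelines can simply keep four of these iterations in flight at once:

        #include <arm_neon.h>

        /* Fixed 128-bit NEON add: always 4 floats per iteration,
           regardless of how wide the hardware's SIMD units are. */
        void neon_add(float *dst, const float *a, const float *b, int n) {
            int i;
            for (i = 0; i + 4 <= n; i += 4)
                vst1q_f32(dst + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
            for (; i < n; i++)   /* scalar tail */
                dst[i] = a[i] + b[i];
        }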
  • mode_13h - Saturday, September 17, 2022 - link

    I think there's wisdom in your point about the amount of Neon code in the wild. But it really makes you think. If narrow SVE isn't such a disadvantage, then maybe the whole SVE concept is overblown. Going with 4x 128-bit suggests the CPU is good enough at speculative execution that it can blast through SVE loops and dispatch instructions fast enough to keep the pipelines full. Otherwise, I doubt they'd have sacrificed width in what's meant to be their most HPC-oriented core.

    There's no way they would accept a regression in floating point IPC, going from V1 -> V2. So, maybe SVE > 128-bit is dead in any incarnation other than proprietary HPC cores that really prioritize FLOPS/W.
  • brucethemoose - Saturday, September 17, 2022 - link

    I bet the 4x 128-bit units are less area-efficient than 2x 256-bit, at the very least. But yeah, it's probably not a huge regression if they did it.

    However, the whole point of SVE2 is long-term standardization. There will come a point where devs on some platforms can basically assume an ARM CPU is an SVE2-capable one, as is happening with AVX/AVX2 now. And one doesn't want to be running 8+ 128-bit units on those future CPUs.
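
    Until that day, code has to probe at runtime. On Linux/aarch64 the check looks roughly like this (a sketch; HWCAP2_SVE2 is the hwcap bit defined in <asm/hwcap.h>):

        #include <sys/auxv.h>
        #include <stdio.h>

        #ifndef HWCAP2_SVE2
        #define HWCAP2_SVE2 (1UL << 1)   /* aarch64 hwcap2 bit, per <asm/hwcap.h> */
        #endif

        int main(void) {
            /* Pick a code path based on what the kernel reports. */
            if (getauxval(AT_HWCAP2) & HWCAP2_SVE2)
                puts("SVE2 path");
            else
                puts("NEON fallback");
            return 0;
        }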

    TBH there seem to be some much better HPC designs than Neoverse in the pipe, like the SiPearl Rhea or whatever Fujitsu comes up with.
  • mode_13h - Monday, September 19, 2022 - link

    One thing about SVE is that the *code* doesn't have to scale, even if the hardware can. We saw this in a recent glibc memcpy optimization, which only used up to 256-bit vectors. Running such code on 512-bit hardware should give no advantage over a 256-bit implementation.
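
    A hypothetical sketch of such a cap (not glibc's actual code): clamp the per-iteration stride to 32 bytes, so vectors wider than 256 bits sit partly idle:

        #include <arm_sve.h>
        #include <stddef.h>
        #include <stdint.h>

        void copy_capped(uint8_t *dst, const uint8_t *src, size_t n) {
            size_t vl = svcntb();                 /* hardware bytes per vector */
            size_t step = vl < 32 ? vl : 32;      /* never use more than 256 bits */
            for (size_t i = 0; i < n; i += step) {
                size_t rem = (n - i < step) ? n - i : step;
                svbool_t pg = svwhilelt_b8_s64(0, (int64_t)rem);
                svst1_u8(pg, dst + i, svld1_u8(pg, src + i));
            }
        }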

    BTW, one thing I find a little bit shocking is how vendor-specific some of the glibc optimizations are. It's not just checking for ARMv8.4 or whatever; they go so far as to look at the model ID of the specific CPU.
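
    For the curious, the mechanism behind that dispatch is a GNU ifunc resolver, which the dynamic linker runs once at startup. A minimal sketch (the memcpy_* names are made up; glibc's real resolvers additionally read the MIDR for the per-CPU cases):

        #include <stddef.h>

        #ifndef HWCAP_SVE
        #define HWCAP_SVE (1UL << 22)   /* aarch64 hwcap bit, per <asm/hwcap.h> */
        #endif

        typedef void *(*memcpy_fn)(void *, const void *, size_t);

        void *memcpy_generic(void *, const void *, size_t);
        void *memcpy_sve(void *, const void *, size_t);

        /* On aarch64, the dynamic linker hands the resolver AT_HWCAP as its
           first argument; whatever pointer it returns becomes my_memcpy. */
        static memcpy_fn resolve_memcpy(unsigned long hwcap) {
            return (hwcap & HWCAP_SVE) ? memcpy_sve : memcpy_generic;
        }

        void *my_memcpy(void *, const void *, size_t)
            __attribute__((ifunc("resolve_memcpy")));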
  • Wilco1 - Monday, September 19, 2022 - link

    It still scales: it works correctly at different vector sizes without needing to rewrite your code or write a new compiler backend for yet another vector extension. There is no guarantee that wider vectors give a speedup, since that is highly dependent on the microarchitecture. A64FX and Neoverse V2 are wildly different design points: wide vectors make sense for a low-clocked supercomputer, but a wide OoO core with narrow vectors wins on general-purpose code.
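
    Concretely, a vector-length-agnostic loop looks something like this (a sketch using the ACLE intrinsics). The same binary is correct on a 128-bit V2 and a 512-bit A64FX; it just covers fewer elements per iteration on the former:

        #include <arm_sve.h>
        #include <stdint.h>

        void vla_add(float *dst, const float *a, const float *b, int64_t n) {
            /* svcntw() = number of 32-bit lanes per vector, found at run time */
            for (int64_t i = 0; i < n; i += svcntw()) {
                svbool_t pg = svwhilelt_b32_s64(i, n);   /* predicate masks the tail */
                svfloat32_t va = svld1_f32(pg, a + i);
                svfloat32_t vb = svld1_f32(pg, b + i);
                svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
            }
        }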

    And yes, one needs to test for specific CPUs and features in string functions, since unfortunately not all microarchitectures have a fast SIMD unit (now *that's* the part that should be shocking...).
  • Dante Verizon - Thursday, September 15, 2022 - link

    They can't even beat Apple and now want to face AMD... What a joke.
  • Lakados - Thursday, September 15, 2022 - link

    You seem to be under the impression that the N1 chips didn’t already stand up against all the Intel and AMD server offerings at the time of their release.

    For the right workloads, the N1 CPUs are absolute beasts. The N2s look promising, especially if it’s true we can expect up to a 40% increase in performance.
  • smilingcrow - Thursday, September 15, 2022 - link

    Not a particularly good joke as Apple don't even compete in the server market.
    Even their workstation offerings are still half baked.
  • Samus - Friday, September 16, 2022 - link

    Seriously, talk about comparing Apples to Oranges. Apple left the server industry ages ago, which is too bad because OSX Server was a potent OS. The hardware was total shit for any application larger than a very small business. It's almost as if they intended OSX Server for residential use.

    Now, though, the Apple M CPU could be repurposed in the server space, but I doubt it or the underlying architecture would be appropriate for data centers. The GPU is proprietary and too weak, the cache hierarchy favors single-user multithreaded desktop workloads rather than wide parallelism, and it lacks basic pipeline- and die-level security features, specifically microsegmentation.

    Most obviously, though, Apple hasn't yet demoed a scalable implementation. They just keep blowing the chip up (Pro, Max) and gluing the results together (Ultra). I doubt they will ever use the M CPUs in their own data centers, let alone make them for consumers.

    Obviously Apple has the talent to make a data center CPU, but they haven't done so, and as a company obsessed with margins, I doubt they will unless there is a clear cost benefit.
  • Threska - Friday, September 16, 2022 - link

    I don't think at any point in their history they've had the interest or expertise to enter the server space. The prosumer server would be a very niche market.
  • mode_13h - Saturday, September 17, 2022 - link

    A multi-trillion-dollar company is big enough to enter any market it really wants. Their revenues dwarf even Intel's. There's virtually nothing they couldn't get into, if they thought it were critical to their core business strategy.
  • mode_13h - Saturday, September 17, 2022 - link

    I think there's a compelling reason for them to care about cloud, and that's for easy portability between phone <-> laptop/desktop <-> cloud. It'd be a big win if they could support a "compile once, run anywhere" model, even if "anywhere" were still restricted to just Apple platforms.

    Also, cloud computing is a massive market and Apple needs to keep growing. At some point, they can't afford to ignore a market that big.
  • MintBoy - Sunday, September 18, 2022 - link

    They will, once connectivity gets fast enough that we pivot back to thin clients, which is inevitable IMHO. There'll always be a niche for powerful end-user hardware, but Apple won't turn down the opportunity to 'lease' computing power to the general public for a monthly recurring fee.
  • PeachNCream - Thursday, September 15, 2022 - link

    As others have already implied, ARM CPUs are competitive - thus the design has landed in a few noteworthy products. Not only do they compete in performance metrics, they do well in density and perf/watt metrics.

    Unlike the Joe Average PC user, such as yourself, for-profit companies tend not to build brand loyalties that limit their options as they aim to maximize returns on their investments. If they do, they risk getting left behind by more nimble, flexible competitors.
  • DougMcC - Thursday, September 15, 2022 - link

    No one ever got fired for buying <strikethrough>IBM</strikethrough> <strikethrough>Dell</strikethrough> <strikethrough>Lenovo</strikethrough> Apple. Irrational brand loyalty among for-profit companies is very much a thing.
  • mode_13h - Saturday, September 17, 2022 - link

    Don't forget Microsoft!

    > Irrational brand loyalty among for-profit companies is very much a thing.

    True, but it's powered by a different dynamic than with consumers, as you imply. Among business customers, risk aversion tends to be a powerful force.

    Also, if you blaze a new path by going with a new market entrant, instead of the conventional option, it's a lot more work to justify your decision. So, another big factor is simply laziness.
  • ballsystemlord - Thursday, September 15, 2022 - link

    In comparing them to Apple, you first have to realize that (until the M1 was released?) all benchmarks were limited to what Apple allowed you to install on their iPhones.
    As in: if Nvidia could choose which benchmarks got run on AMD GPUs, would you trust the results?
  • The Hardcard - Thursday, September 15, 2022 - link

    That is inaccurate. If you have a developer account, you can run whatever you're willing to put in the work of compiling. The SPEC suite that Andrei ran is not in the App Store.
  • ballsystemlord - Thursday, September 15, 2022 - link

    Actually, I did not know that.

    But then, as was also pointed out in the reviews (and I should have mentioned it before), the code can, and sometimes will, take different paths depending on the device and its capabilities. For example, the settings games use are chosen by the developers and, in at least some cases, adjusted on the fly; you can't set a baseline for benchmarking via the options menu.
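
    A purely hypothetical sketch of the kind of device-based tuning I mean (the names are invented, not any real engine's API):

        #include <string.h>

        typedef enum { TIER_LOW, TIER_MEDIUM, TIER_HIGH } render_tier;

        /* The engine picks a quality tier from the reported device model, so
           two phones may never run the same workload even in the "same" scene. */
        render_tier pick_tier(const char *device_model) {
            if (strstr(device_model, "iPhone14")) return TIER_HIGH;
            if (strstr(device_model, "iPhone11")) return TIER_MEDIUM;
            return TIER_LOW;   /* unknown hardware gets the safe floor */
        }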
  • ballsystemlord - Thursday, September 15, 2022 - link

    But more importantly, although they lose massively in many things, they do have very competitive performance in others. (See the ARM server CPU reviews on this site to see what I mean.)

    Now, however mortifying their performance is in the benchmarks that they lose, and lose badly, the big companies purchase thousands of units at a time. So, if they need to run one specific workload, ARM could be the core to buy for that workload.

    Sort of like having an ASIC accelerator instead of a CPU.
  • name99 - Thursday, September 15, 2022 - link

    Someone doesn't understand the difference between single-threaded and throughput computing, or what cloud providers want from their chips...
  • michael2k - Thursday, September 15, 2022 - link

    They can’t do one without also doing the other, you realize?
  • lemurbutton - Thursday, September 15, 2022 - link

    That's funny because Apple is significantly ahead of AMD. Miles ahead. Probably 3-4 generations of improvements ahead.
  • Kangal - Friday, September 16, 2022 - link

    That's an interesting way to put it. The best junction point to compare these chipsets is in the 10W range, so things like large tablets, thin notebooks, small laptops, TV boxes, and mini PCs (all passively cooled). Here's what that looks like:

    SiFive FU740
    RockChip RK3588
    MediaTek K-1380
    Qualcomm 8CXg3
    Apple M1
    AMD r7-6800u
    Intel i7-1265u
  • mode_13h - Saturday, September 17, 2022 - link

    That list spans a massive range of price points and application targets. While a comparison would be interesting, there's a limited amount it could tell you, due to some using rather older IP and process nodes than others. Not coincidentally, those also tend to be the cheaper ones.
  • mode_13h - Saturday, September 17, 2022 - link

    I started confirming a few things for myself, and thought I'd share.

    * SiFive U740 - 4x 2-way in-order core @ 28 nm
    * RockChip RK3588 - 4x Cortex-A76 + 4x A55 @ 8 nm LP
    * MediaTek K-1380 - 4x A78 + 4x A55 @ 6 nm
    * Qualcomm 8CX Gen3 - 4x X1 + 4x A78 @ 5 nm
    * Apple M1 - 4x Firestorm + 4x Icestorm @ 5 nm
    * AMD R7 6800U - 8x Zen3+ @ 6 nm
    * Intel i7-1265U - 2x Golden Cove + 8x Gracemont @ Intel 7

    SiFive and Rockchip don't even belong in that list. Also, the tray price of the Intel CPU is probably 2-3x that of the Kompanio 1380, so it's rather out-of-place as well.

    That narrows it down to the usual suspects: Qualcomm, Intel, AMD, and Apple. However, since the X1 is basically just a beefed up A78, I'd argue even the 8CX Gen 3 is out of place.
  • Kangal - Sunday, September 18, 2022 - link

    That's funny, because the list actually started with Intel, AMD, and Apple. Then I remembered the Qualcomm and thought to look at alternatives.

    The most out-of-place option on that list is the RockChip, because every other chipset is the best of its category in some way. The SiFive is the best RISC-V that is actually built and not theoretical. The MediaTek is the best Cortex-A / ARMv8 you can buy. The Qualcomm is the best Cortex-X / ARMv9 you can buy. Apple is its own thing. Intel has the fastest single-core performance. AMD has the best overall x86 performance.

    So really it kind of boils down to Apple vs AMD, or rather: can AMD catch up in the next two years? That's a somewhat meaningless question since they don't run the same software or the same code, but it can give us a hint of what's possible out there. Maybe a properly optimised Windows 10 Pro running on an ARMv9 Qualcomm chipset with Nuvia cores? Then compare web browsing, regular computing, rendering, and gaming between native programs and an AMD Ryzen ultrabook. I suspect the Qualcomm will eventually overtake it on performance AND do so at lower power, as long as you avoid the legacy stuff and the unoptimised/rushed ports from x86-Windows to UWP-Windows.
  • mode_13h - Monday, September 19, 2022 - link

    > The MediaTek is the best Cortex-A / ARMv8 you can buy.
    > The Qualcomm is the best Cortex-X / ARMv9 you can buy.

    That's an artificial distinction between A-series and X-series.

    Also, 8CX Gen3 is still ARMv8, as hinted by the fact that it includes A78 cores. It's the X2 cores which are ARMv9.

    > can AMD catch up in the next 2-years?

    In what sense? Perf/W is the only area where Apple is significantly ahead. It's an important area, but Zen 4 seems to trounce M2 in single-thread performance.

    > I suspect the Qualcomm will eventually overtake the performance point
    > AND do so at lower power.

    I doubt Qualcomm will ever beat Apple on performance, and that's mainly because Apple's vertical integration lets them use larger dies with more cache. Qualcomm has to worry about the BOM price of their chips and how it compares to their rivals, while Apple only has to worry about the final product price.
  • Kangal - Wednesday, September 21, 2022 - link

    Qualcomm doesn't have to worry about the BOM. They have an intrinsic advantage over x86 chipsets, which are usually larger. AMD has an advantage with its chiplet design, but they have the same market/BOM concerns, so it's a wash. Intel, meanwhile, is the market leader: they set the tray price really high, and they have been bleeding money on their fabrication process. Not to mention both x86 companies spend a lot more on R&D, whilst a generic ARM licence is cheap. All in all, QC can afford to blow the budget and still undercut the competitors. And that's how it should be: those laptops should be priced cheaper, since they lack a critical/useful feature, namely backwards compatibility.

    No, I didn't mean Apple.

    I meant that on Windows it should already be pretty competitive between the QC 8CXg3 (4x big + 4x medium), the Intel i7-1265u (2x huge + 8x medium), and the AMD r7-6800u (8x big), when talking about a passively cooled, thin laptop at the 10W power level.

    I managed to find the GeekBench 5.4 results (single-core, multi-core), which are interesting:
    QC: 1100, 5000
    AMD: 1500, 9000
    Intel: 1700, 6000

    So that's in the current "Windows 10" era, with semi-optimised code, and a subpar design from Qualcomm, who have been a joke. They had an exclusivity contract with Microsoft and have been dragging their feet; since it recently lapsed, they're only now beginning to compete.

    In the near future, with "Windows 12", we should see applications become more evenly optimised between the architectures. That's when ARM will probably flex its advantages. And we might finally get to see those Apple A13 cores, I mean Nuvia cores, running on the platform. They've been announced for like four years, and been perpetually delayed by a redesign to fit the ARMv9 ISA. I feel like with the delayed gen-2 ARMv9 cores from the European team (Cortex-A730), we should see big improvements. With its derivative (Cortex-X4), there might not be much or any advantage left to the custom Nuvia cores.
  • smalM - Thursday, September 15, 2022 - link

    "Past that, it’s likely worth noting that while Arm’s presentation slides put bfloat16 and int8 matmul down as features, these are not new features."
    They may not be new, but they are also not part of ARMv9.0, so they got mentioned separately.
  • mode_13h - Saturday, September 17, 2022 - link

    What other core(s) have them?
  • smalM - Monday, September 19, 2022 - link

    Neoverse V1
  • smalM - Monday, September 19, 2022 - link

    All ARMv9 cores may have them too, but I'm not sure.
  • Silver5urfer - Thursday, September 15, 2022 - link

    Intel is toast until they get the SPR Xeon out, and it's delayed; shame on Intel, because they could have crushed these ARM processors. Now AMD has to take over the x86 HPC market alone and tackle them. However, once AMD releases Genoa with its unprecedented scalable performance, ARM will again relegate itself to a small subset of the market.

    As for ARM vs x86 for clients: for many users x86 dominates to the point of no return, simply because of the software and hardware choices available, plus a lot of forums for learning things. For example, any normal person can buy a decommissioned Xeon and build their own homelab, while with ARM they cannot.

    And an x86 PC destroys the locked-down, horrible garbage on mobile, since mobile ARM processors have batteries and limited lifetimes, and they get old very, very fast. An i7-2600K still runs, and a very old Intel Core 2 Quad Q6600 can even run games if you patch the exe around the mandatory SSE4 requirements. That's the beauty of x86: it's a real computer. ARM phones? Look back and see how they fare; even the latest and greatest is a big joke, especially since both OSes, Android and iOS, are crippled to death. Android used to be solidly open, but Google started to kill it from the inside out with Play Store API mandates, blacklisting, Scoped Storage, and other garbage UI changes. An ARM device ends up in a landfill while a socketed x86 component can run the latest OS without any BS.
  • Kangal - Thursday, September 15, 2022 - link

    Prison Mike:
    The worst thing about prison was the... ARM Dementors.
  • vinay001 - Friday, September 16, 2022 - link

    Well, many comments are comparing ARM with x86 in a generic way.

    But as of today, most ARM servers are targeted at specific loads, and in those targeted scenarios Arm is indeed beating x86. It also has to be considered that electricity costs for servers are way more than hardware costs; this is where Arm serves a purpose for its major customers.

    Almost all hyperscalers now provide Arm instances, be it Amazon, Microsoft, Google, Oracle, or the major Chinese ones.

    Comparing Arm for gaming is again incorrect, as that's not the purpose these Arm cores serve.

    Intel and AMD both know this, and they have low-power cores in the pipeline that will cater to such scenarios. But all the hyperscalers seem to be developing in-house solutions as well; Amazon is targeting 40-50% Graviton instances by 2025-26.

    So Arm, this time, looks to have some traction.
  • mode_13h - Saturday, September 17, 2022 - link

    > Comparing Arm for gaming is again incorrect as its not the purpose that these Arm cores solve.

    I know you mean gaming desktops/laptops, but plenty of gaming happens on phones. That means most major game engines probably have well-optimized ARM backends. So it seems like it shouldn't be a big leap for ARM to work its way into gaming-oriented Chromebooks and eventually even mini-desktops.
  • TeXWiller - Monday, September 19, 2022 - link

    "This change makes the SIMD pipeline width identical to Arm’s Cortex parts (which are all 128-bit, the minimum size for SVE2), but it does mean that Arm is no longer taking full advantage of the scalable part of SVE by using larger SIMDs."
    I see this as taking full advantage of the scalable part of the SVE architecture. These units should be able to churn through 2048-bit vectors just like the wider implementations, only a little slower.
  • Wilco1 - Monday, September 19, 2022 - link

    It's scalable indeed; it just means your loops process less data per iteration. It doesn't need to be slower, since an OoO core will execute multiple loop iterations in parallel (4x128 = 2x256 = 1x512).
