> ... for the issues to which AMD is vulnerable, it has implemented a full hardware-based security platform for them. The change here comes for the Speculative Store Bypass, known as Spectre v4, which AMD now has additional hardware to work in conjunction with the OS or virtual memory managers such as hypervisors in order to control. AMD doesn’t expect any performance change from these updates.
Which Intel CPU generation will have hardware fixes for these Spectre variants?
It will take years: https://www.wired.com/story/intel-meltdown-spectre-storm/.
I guess it will win them a lot of datacentre business. More than a few big "cloud" providers still run completely bare with regard to microcode patches, and some very likely do so intentionally.
Is it still the case that an AMD core is less powerful than an Intel core at the same frequency? I understand that AMD makes up for it with more cores, but in a cloud you get charged per core. Can a cloud substitute an AMD CPU for an Intel one?
That hasn't been true since the release of Ryzen in 2017. The reason AMD is lagging behind in single-core performance is that Intel CPUs can be clocked higher, often to 5GHz, whereas AMD usually only boosts to somewhere around 4.4GHz. Gamers care about a 12% difference; servers usually don't even go beyond 3GHz.
IPC varies according to the specific task, but Ryzen 1xxx and 2xxx have always had IPC on average comparable to Broadwell CPUs (excluding AVX workloads). So Intel has had a slight lead there from Skylake onwards.
According to what we're seeing, the situation seems to be reversed with the 3xxx series, where AMD seems to have a small but significant lead; we'll have to wait for independent benchmarks.
Regardless who comes out ahead on the benchmarks, competition is always a good thing!
It kinda matters because AMD hasn't been able to outperform Intel at single threaded tasks for like 5+ years.
They aren't competition if they fall off the map. So a slight edge would be great because it at least puts them back in the game.
That's not a fair comparison. Servers run at lower speeds because they have far more cores. AMD's desktop processors have more cores than Intel's, so you'd expect the clocks to be lower.
Servers run at lower speeds because server owners beyond a pretty trivial scale generally care more about perf/watt than about the number of servers they have, and the slower frequencies hit that sweet spot. There are some places where individual node perf matters, and you'll see lots of cores _and_ high clock speeds there. Borg scheduler nodes come to mind.
No, servers are lower speed because they have more cores and larger caches. Both of those take up more die space, which makes routing higher-speed clocks harder/impossible.
This is easy to prove. The highest-clocked Xeons you'll find are a special SKU exclusive to AWS. Sure enough, they have far fewer cores than the instances with lower clocks.
The cores have independent clock trees and PLLs. Half the point of going multi core instead of giant single core in the first place is so that you don't have to route clock lines all over the place.
What you're seeing isn't routing issues, but the fact that their newer process isn't up to snuff, and they don't have the proper yields on larger die sizes.
Like, I've shipped RTL and know pretty well how this stuff works.
All evidence hints at the opposite. The IPC gains from all the improvements in Zen 2 mean AMD should be equal or faster at the same frequency.
From the (possibly cherry-picked) benchmarks in the original announcement, it looks like single-thread performance is on par or slightly better even with a clock disadvantage.
Not sure how much overall effect it will have, but Windows also recently got an update to perform better with Zen, and these benchmarks don't include that update. I'm not sure what the story with CPU vulnerability mitigations was, but if those were turned off then Zen 2 could be handily beating Intel.
It certainly looks promising, but I'll still hold my excitement until we get some 3rd party benchmarks.
I believe the update optimizes where threads are scheduled across CCXs. It shouldn't affect a single-threaded benchmark.
I'm also skeptical of first party benchmarks, but I'm already pretty excited that I finally might be able to justify an upgrade from my old haswell setup.
IIRC, the benchmarks also don't include the spectre/meltdown patches for Intel. AMD, apparently, worked very hard to give a worst-case comparison for the most part.
> in a cloud you get charged per core. Can a cloud substitute an Intel for an AMD cpu?
In a cloud you typically pay for cores from a specific CPU type. Presumably any clouds that offer AMD CPUs will price them competitively.
I thought they had a “cpu x or equivalent” kind of language, like rental cars.
Yeah, it depends on what they deem to be equivalent. AWS already offer AMD cores, and they're 10% cheaper compared to Intel cores - https://aws.amazon.com/ec2/amd/
AWS mostly did away with that years ago; AFAIK first-tier cloud providers all promise a specific CPU model. And yes, an AMD vCPU is cheaper than an Intel vCPU.
It used to be the case, but it looks like that gap is almost nothing with zen 2. Clouds also often have different pricing for AMD cores.
What do you mean by "run completely bare with regards to microcode patches", that they don't apply microcode patches & errata?
Yes, that they have not applied the microcode patches that cover some of the Spectre/Meltdown family of bugs.
Do you have a source for that?
GCP was "fully fixed before it was known" according to the engineers I know there. I find it /highly/ unlikely that they don't have patched microcode.
I mean, the cloud business is the place with the most to lose from these kinds of issues, I am incredibly suspicious of the claim that cloud providers aren't patching their microcode.
Whether it's intels or their own modified variant of microcode I would fully expect them to be patched in some way.
Google's researchers played a big part in discovering / classifying / mitigating the vulnerabilities. They also developed the retpoline pattern. It is very likely that GCP was "fixed before it was known."
Indeed, this is why it's unlikely that they have the patched microcode.
How do you figure that?
They have much to lose from not applying these mitigations, especially if they're the people spending a fortune to find them.
If the grandparent's claim holds, they wouldn't have needed it because they already implemented a workaround themselves.
I honestly doubt that claim however and haven't heard it before this thread.
$499 for 12C/24T on the 3900X? $399 for 8C/16T on the 3800X? Those prices make this a very tempting lineup on top of the performance promised by AMD. Looking forward to seeing the reviews and news on X570 boards. Hopefully the manufacturers will be plugging in decent features to justify their threatened massive price hikes on AM4 boards with the new chipset.
For a single GPU machine, I’m not sure what the use case is for X570 over a much cheaper B450 board. Most games currently aren’t GPU-bus bandwidth limited (or rather the GPU itself is the bottleneck) so I suspect PCIe 4.0 won’t impact benchmarks much.
On all of these the GPU is directly connected to the CPU and there is no chipset in the way.
The benefits of X570 over B450 therefore have nothing to do with GPU performance but instead would be either overclocking capability or, more significantly, I/O to everything else.
B450 only provides 6 PCIe 2.0 lanes and two USB 3.1 Gen 2 ports. That's not a lot of expansion capability, especially with NVMe drives. Want 10GbE? Or a second NVMe drive? Good luck.
X570 gets to leverage double the bandwidth to the CPU in addition to being more capable internally. So you'll see more boards with more M.2 nvme slots as a result, for example. And thunderbolt 3 support. Check out some of the x570 boards shown off - the amount of connectivity they have is awesome. That's why you'd get x570 over b450.
It still seems to me that B450 are perfectly adequate for a Desktop workstation/gaming PC.
Most people do not need a second nvme drive or 10GbE.
10 GbE or M.2 NVMe performance is already significantly degraded by being on a PCH in the first place. More hops, higher latency, much lower IOPS. Don't do it if you can avoid it.
The thing is that most things aren't (currently) bottlenecked by PCIe 3.0. A 2080 Ti shows about 3% performance degradation by running in 3.0x8 mode. 4 lanes of PCIe 3.0 is 4 GB/s (32 Gb/s) which is plenty for 10 Gb/s networking... or even 40 Gb/s networking like Infiniband QDR (which runs at 32 Gb/s real speed after encoding overhead). So you can reasonably run graphics, 10 GbE, and one NVMe device off your 3.0x16 PEG lanes.
And AMD also provides an extra 3.0x4 for NVMe devices, so you can run graphics, 10 GbE, and NVMe RAID without touching the PCH at all.
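The lane arithmetic here checks out; a quick sketch (using the nominal per-lane transfer rates and encoding overheads for each PCIe generation, and ignoring protocol overhead beyond encoding):

```python
def per_lane_gb_per_s(gen):
    """Approximate usable one-direction bandwidth per PCIe lane, in GB/s.
    Gen 1/2 use 8b/10b encoding; gen 3/4 use the lighter 128b/130b."""
    transfer_gt = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0}[gen]
    efficiency = 8 / 10 if gen <= 2 else 128 / 130
    return transfer_gt * efficiency / 8  # GT/s -> GB/s

def link_gb_per_s(gen, lanes):
    return per_lane_gb_per_s(gen) * lanes

# 4 lanes of PCIe 3.0: ~3.94 GB/s, i.e. ~31.5 Gb/s -- enough for 10 GbE,
# and about the real data rate of QDR InfiniBand after encoding overhead.
x4_gen3 = link_gb_per_s(3, 4)
print(f"PCIe 3.0 x4: {x4_gen3:.2f} GB/s = {x4_gen3 * 8:.1f} Gb/s")

# And 2 lanes of PCIe 4.0 carry the same bandwidth as 4 lanes of PCIe 3.0.
print(f"PCIe 4.0 x2: {link_gb_per_s(4, 2):.2f} GB/s")
```

The same arithmetic is why an x8 slot is rarely a bottleneck for a GPU: it still carries ~7.9 GB/s of PCIe 3.0 bandwidth.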
The real use-case that I see is SuperCarrier-style motherboards that have PEX/PLX switches and shitloads of x16 slots multiplexed into a few fast physical lanes, like a 7-slot board or something. Or NVMe RAID/JBOD cards that put 4 NVMe drives onto a single slot. But right now there are no PEX/PLX switch chips that run at PCIe 4.0 speeds anyway, so you can't do that.
> So you can reasonably run graphics, 10 GbE, and one NVMe device off your 3.0x16 PEG lanes
Sure, but you won't find any board with a setup like that. You could also reasonably split the x4 NVMe lanes into 2x x2, but again you won't find such a setup.
You'll find no shortage of boards with everything wired up to the PCH, though, and it's "good enough" even if it isn't ideal. The extra bandwidth will certainly not be unwanted. Especially when you're also sharing that bandwidth with USB and sata connections.
> The real use-case that I see is SuperCarrier-style motherboards that have PEX/PLX switches and shitloads of x16 slots multiplexed into a few fast physical lanes, like a 7-slot board or something.
I think those use cases would instead just use threadripper or epyc. Epyc in particular with its borderline stupid 128 lanes off of the CPU.
I thought Thunderbolt 3 was Intel specific? So is the most tangible benefit to be able to have more full bandwidth NVMe drives?
I'd agree you probably won't see much of a performance gain once things are loaded when comparing B450 to the X570 boards, but using a PCIe 4.0 SSD will very likely improve boot and load times.
(I'm fairly certain for most gaming workloads, the bandwidth increase will only come into play when getting closer to 4k 144Hz, which is unlikely to be pushed out by first gen PCIe 4.0 GPUs.)
Since 2 lanes of PCIe4 are as fast as 4 lanes of PCIe3 I think it can make sense for more IO if you don't want to go for Threadripper. On top of that the motherboards I have read about thus far have better VRM setups for the CPU so for overclocking that could make sense as well.
You do have to be careful that the B450 board you buy can handle the power requirements.
Rumors claimed there would be a 3700 that was 8C/16T at 65 watts and only slightly lower clocked than the 3700X. It was also supposedly $200 or so (the 3700X is $329).
Here's hoping it comes out, looks like a great CPU for a relatively cheap desktop.
If the 3700X is $329 I would bet on the 3700 being around $250, at least at launch. ~$80 cheaper is how they currently have it priced; $200 for a slightly lower clock speed seems too cheap, and it'd cut into their 3600(X) range.
But if you look at the launch prices, the 2700 was $300 and the 2700X was $330. I think the difference is so small that it drives people who don't want to manually overclock to buy the X version and skip the hassle.
... If you can find any. I still can't find anyone to sell me a handful of AMD's embedded SoCs. I don't want to buy them on a board. I want to buy the chips themselves. I'm not sure what's up with that, if it's a supply issue or just a hard "no" to anybody that isn't an OEM.
This article isn't about embedded SoCs? It's about desktop CPUs that are readily available at retail. Embedded SoCs are a pain to buy from most vendors except the Chinese ones you can find on Alibaba.
My experience with AMD's supply problem is across more than just their embedded product line, but the supply problem is worst for me in their embedded product line. I'm sorry, I appear to have had some sort of mind frame shift recently, which is causing me to add layers of abstraction where inappropriate, resulting in communication errors.
It was fun seeing their materials refer to their branch predictor as a TAGE predictor. I remember hearing about the original paper for that when I was too young and inexperienced to focus long enough to understand it; then I saw a TAGE predictor show up in Chris Celio's BOOM repository, and I read it through.
If what AMD says is true, and the new (for them) TAGE predictor in their industry-leading microarchitecture has 30% fewer branch mispredictions than the last one, it feels very cool that one can read and somewhat understand the operation of a similar predictor in the leisure hours of a few days.
Also those caches are huge, wow.
It took a little hunting, is this the paper? http://www.irisa.fr/caps/people/seznec/JILP-COTTAGE.pdf
Interesting; Intel is rumored to use TAGE but it was never confirmed officially. AMD claimed to use a perceptron based predictor in the past.
They still will, just for L1 only; TAGE is for the lower levels.
The BTB was the real problem with Naples and the reason it sucked on non-trivial, branchy, pointer-chasing workloads like MySQL. With the improved branch prediction resources I’ll be interested in head-to-head of this rig vs. a Kaby Lake (or later).
Sounds interesting and I don't recall reading that before - can you point to some online article about it?
Also AMD's often-smaller cache sizes, which still seem to be a problem if you compare to Ice Lake and above.
What? Zen/Zen+ had larger L1 icache than Intel, same size L1 dcache and L2 and L3. Zen 2 actually decreases the L1 icache to 32KB but in exchange increases L1 associativity, micro-op cache size, BTB size, and L3 size.
Zen’s gigantic cache wasn’t very effective on account of the way it is arranged in itty bitty little shards.
8MB is itty bitty?
There are a few narrow workloads where having a huge unified cache is an advantage, but it generally isn't. If you have many independent processes or VMs it can actually be worse, because when you have one thrashing the caches it would ruin performance across the whole processor rather than being isolated to a subset.
Meanwhile most working sets either fit into 8MB or don't fit into 64MB. When you have a 4MB working set it makes no difference and when you have a 500GB one it's the difference between a >99% miss rate and a marginally better but still >99% miss rate.
Where it really matters is when you have a working set which is ~16MB and then the whole thing fits in one case but not the other. But that's not actually that common, and even in that case it's no help if you're running multiple independent processes because then they each only get their proportionate share of the cache anyway.
So the difference is really limited to a narrow class of applications with a very specific working set size and little cache contention between separate threads/processes.
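A toy model makes the working-set point concrete (assuming idealized uniform-random accesses over the working set and a fully-associative cache, which is not how real caches behave, but it captures the shape of the argument):

```python
def approx_miss_rate(working_set_mb, cache_mb):
    """Crude model: the hit probability is just the fraction
    of the working set that fits in the cache."""
    if working_set_mb <= cache_mb:
        return 0.0
    return 1.0 - cache_mb / working_set_mb

# A 4MB working set: both an 8MB and a 64MB cache give a 0% miss rate.
# A 500GB working set: both give a >99% miss rate.
# Only an in-between set (~16MB here) sees a real difference.
for ws_mb in (4, 16, 500 * 1024):
    print(ws_mb, approx_miss_rate(ws_mb, 8), approx_miss_rate(ws_mb, 64))
```

Under this model the 8MB-vs-64MB question only matters in the narrow band where the working set fits in one but not the other, which is the claim above.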
The L3 cache is still unified across all sockets. I'm unsure what the previous comment was talking about, but how does AMD's design differ from Intel's in a way that prevents one bad process from blowing out the cache?
And most people don't run a bunch of vms. Single thread performance still dominates and latency cannot be improved by adding cpus.
> I'm unsure what the previous comment was talking about, but how does AMD's design differ from Intel's in a way that prevents one bad process from blowing out the cache?
Ryzen/Epyc has cores organized into groups called a CCX, up to four cores with up to 8MB of L3 cache for the original Ryzen/Epyc. So Ryzen 5 2500X has one CCX, Ryzen 7 2700X has two, Threadripper 1950X has four, Epyc 7601 has eight.
Suppose you have a 1950X and a thread with a 500MB+ working set size which is continuously thrashing the caches because all its data won't fit. You have a total of 32MB L3 cache but each CCX really has its own 8MB. That's not as good for that one thread (it can't have the whole 32MB), but it's much better for all the threads on the other CCXs that aren't having that one thread constantly evict their data to make room for its own which will never all fit anyway.
This can matter even for lightly-threaded workloads. You take that thread on a 2700X or 1950X and it runs on one CCX while any other processes can run unmolested on another CCX, even if there are only one or two others.
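The isolation effect can be illustrated with a toy LRU simulation (cache sizes are in abstract "lines" and the access patterns are invented for illustration; this is not a model of the real hardware):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)
            self.hits += 1
        else:
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used

def run(good_cache, thrash_cache, rounds=200):
    """One well-behaved thread with a resident 60-line set, interleaved
    with a streaming thread that touches 10 fresh lines per step."""
    t = 0
    for _ in range(rounds):
        for i in range(60):
            good_cache.access(f"g{i}")
            for _ in range(10):
                thrash_cache.access(f"t{t}")
                t += 1
    return good_cache.hits / (good_cache.hits + good_cache.misses)

unified = LRUCache(150)              # both threads share one cache
a, b = LRUCache(75), LRUCache(75)    # each thread gets its own partition
print("unified :", run(unified, unified))  # streaming thread evicts everything
print("split   :", run(a, b))              # resident set survives
```

In the unified case the streaming thread pushes the well-behaved thread's lines out before they are reused; with partitions, the 60-line set stays resident and hits nearly every time.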
> And most people don't run a bunch of vms.
That is precisely what many of the people who buy Epyc will do with it, and Epyc is the part with the highest number of cache partitions. The desktop quad cores with a single CCX have their entire L3 available to any thread.
> Single thread performance still dominates
If your workloads are all single-threaded then why buy a 16+ thread processor?
I didn't know the L3 wasn't shared across complexes. From what I understand, it's 4 cores per CCX and 2MB per core, so up to 8MB per complex.
While that might prevent one bad process from evicting things, it seems like it could lead to substandard cache utilization, especially on servers that just want to run one related thing well.
Also, sharing between L3s would seem to be a big issue, but I wasn't able to find info on how that is handled (multiple copies?). This would seem to help cloud systems isolate cache writes, though.
I work on mostly HPC and latency-sensitive things where I try to run a bunch of single threads with as little communication as possible, but they still need to share data (e.g., our logging goes to shm, our network ingress and egress hit a shared queue, etc.).
I would probably buy one as a desktop, but not for the servers. Also no AVX-512, where beyond the wider vectors the real gain seems to be the improved instruction set.
> While that might prevent one bad process from evicting things, it seems like it could lead to substandard cache utilization, especially on servers that just want to run one related thing well.
Right, that's the trade off. Note that it's the same one both Intel and AMD make with the L2, and also what happens between sockets in multi-socket systems. And separation reduces the cache latency a bit because it costs a couple of cycles to unify the cache. But it's not as good when you have multiple threads fighting over the same data.
> I would probably buy one as a desktop, but not for the servers. Also no AVX-512, where beyond the wider vectors the real gain seems to be the improved instruction set.
If you're buying multiple servers the thing to do is to buy one of each first and actually test it for yourself. We can argue all day about cache hierarchies and instruction sets, and that stuff can be important when you're optimizing the code, but it's a complex calculation. If you have the workload where a unified cache is better, but so is having more cores, which factor dominates? How does a 2S Xeon compare with a 1S Epyc with the same total number of cores? What if you populate the second socket for both? How much power does each system use in practice on your actual workload? How does that impact the clock speed they can sustain? What happens with and without SMT in each case?
When it comes down to it there is no substitute for empirical testing.
I bought the Ryzen 1700, 1800X and 2700X. Looking to upgrade to one of these, but will wait for benchmarks, AMD is really on a roll. The 3950X is extremely tempting especially considering I can drop it right in my existing AM4 motherboard. Allowing backwards compatibility where possible was the best move they made. I would've bought 1 CPU from them instead of 3 (and soon 4, and then 5 with Zen3), without it.
Zen 3 will most likely be on a new socket, AFAIK. Zen 2+ (4xxx series) will most likely be the last CPU for AM4, as AMD have stated they will keep sockets for 4 years. That's still fine, since Intel changes sockets about every year or so. I myself will be upgrading from a 1700X to either the new 8-core... but the 16-core looks so sweet...
AMD's slides showed Zen 3 as the next release (Ryzen 4000 series), as 7nm+/Zen 3 in 2020. Zen 2+ isn't a thing, unless AMD is simply calling it Zen 3. I'm not sure the names matter; Zen+ was better than I expected with the XFR2 changes, so I bought one. From what I can tell, they tied socket support to DDR4 support. I'm expecting Zen 3 next year to be the last AM4 CPU.
Doesn't really matter to me, with the value they've been delivering since the original Ryzen launch, I see no reason to not buy them all. People appreciate a good discount on a desirable CPU on Craigslist when its time to upgrade. It's just an easy swap, especially if you use an IC Graphite thermal pad instead of thermal paste.
Does anyone know if AMD is working on supporting transactional memory in their cpus?
It looks like Microsoft is adding code to Windows to schedule threads in a way that works with this configuration of cores and CCXs
Has Linux added similar code?
Are we calling this Mini-Numa or something else?
>> Has Linux added similar code?
Yes, in 4.15 patches emerged for TR/Epyc and waaaaaay back in 2.6 it had scheduler domains which can do the same thing.
I would call it a form of Non-Uniform Cache Architecture (NUCA). Linux definitely knows the cache topology and the scheduler is NUMA-aware but I don't know if the scheduler takes cache topology into account. It seems like it would be a small change to track L3s instead of NUMA nodes.
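On Linux the L3-sharing topology is already exposed via sysfs, so a scheduler (or a curious user) can recover which cores share an L3. A sketch, assuming the common layout where `cache/index3` is the L3 (the index numbering can vary by CPU):

```python
from collections import defaultdict
from pathlib import Path

def l3_domains(sysfs="/sys/devices/system/cpu"):
    """Group logical CPUs by which L3 cache they share.
    Returns {} on systems without the expected sysfs layout."""
    domains = defaultdict(list)
    for cpu in sorted(Path(sysfs).glob("cpu[0-9]*")):
        shared = cpu / "cache" / "index3" / "shared_cpu_list"
        if shared.exists():
            domains[shared.read_text().strip()].append(cpu.name)
    return dict(domains)

if __name__ == "__main__":
    for cpu_list, members in l3_domains().items():
        print(f"L3 shared by CPUs {cpu_list} ({len(members)} logical CPUs)")
```

On a 2700X this would print two groups of 8 logical CPUs, one per CCX; on a single-L3 Intel part it would print one group.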
It's NUMA, just like usual. The difference is the exact topology, and the real difference in latency (which changes the cost function of accessing from a different node).
The CCXes aren't NUMA, they all have the same access to memory, no system would show multiple NUMA domains on a desktop Ryzen.
Only Threadripper and EPYC are NUMA.
Zen 2 drops the NUMA because of the single I/O die with memory controller right?
I can't find confirmation, but that would make sense, single-socket EPYC and TR with Zen2 should be UMA
> Only Threadripper and EPYC are NUMA.
Yeah, I meant this. I don't use Ryzen so I think of Zen as being Threadripper and EPYC.
Why would we call it “mini”? It is straight up NUMA.
It's not NUMA since the distance from all the cores to main memory is the same (uniform).
Although Threadripper 2990 is not uniform to main memory, which adds a wrinkle
So now there’s at least 2 levels of Non Uniform access to manage by the Bios/OS
Don't forget the CCX level, too.
The full list of varying latencies is something like: SMT, inter-core, inter-CCX, inter-die, inter-socket. And even that misses a few subtleties.
Oh hrmm, my mistake. This is a welcome change from the older one.
AMD’s primary advertised improvement here is the use of a TAGE predictor, although it is only used for non-L1 fetches. This might not sound too impressive: AMD is still using a hashed perceptron prefetch engine for L1 fetches, which is going to be as many fetches as possible, but the TAGE L2 branch predictor uses additional tagging to enable longer branch histories for better prediction pathways. This becomes more important for the L2 prefetches and beyond, with the hashed perceptron preferred for short prefetches in the L1 based on power.
I found this paragraph confusing, is it talking about data prefetchers (Which would make sense b/c of the mention of short prefetches) or branch predictors? (Which would make sense b/c of the mention of TAGE and Perceptron)
A little of both. My understanding of the above paragraph is that the L1 predictor is trying to predict which code-containing cache lines need to stay loaded in L1, and which can be released to L2, by determining which branches from L1 cache-lines to L1 cache-lines are likely to be taken in the near future. Since L1 cache lines are so small, the types of jumps that can even be analyzed successfully have very short jump distances—i.e. either jumps within the same code cache-line, or to its immediate neighbours. The L1 predictor doesn’t bother to guess the behaviour of jumps that would move the code-pointer more than one full cache-line in distance.
Or, to put that another way, this reads to me like the probabilistic equivalent of a compiler doing dead code elimination on unconnected basic blocks. The L1 predictor is marking L1 cache lines as “dead” (i.e. LRU) when no recently-visited L1 cache line branch-predicts into them.
I was also confused by this, but my reading is this is entirely about branch prediction nothing about caching. In that context L1 and L2 simply refer to "first" and "second" level branch prediction strategies, and are not related to the L1 and L2 cache (in the same way that L1 and L2 BTB and L1 and L2 TLB are not related to L1 and L2 cache).
The way this works is that there is a fast predictor (L1) that can make a prediction every cycle, or at worst every two cycles, which initially steers the front end. At the same time, the slow (L2) predictor is also working on a prediction, but it takes longer: either it's throughput-limited (e.g., one prediction every 4 cycles) or it has a long latency (e.g., it takes 4 cycles from the last update to make a new one). If the slow predictor ends up disagreeing with the fast one, the front end is "re-steered", i.e., repointed to the new path predicted by the slow predictor.
This happens within only a few cycles, so it is much better than a branch misprediction: the new instructions haven't started executing yet, so it is possible the bubble is entirely hidden, especially if IPC isn't close to the max (as it usually isn't).
Just a guess though - performance counter events indicate that Intel may use a similar fast/slow mechanism.
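A toy cost model of that fast/slow arrangement (the accuracies, bubble, and penalty are made-up illustrative numbers, not AMD's, and the two predictors' correctness is treated as independent, which is a simplification):

```python
import random

RESTEER_BUBBLE = 3       # slow predictor overrides the fast one early on
MISPREDICT_PENALTY = 16  # full flush when even the slow predictor is wrong

def front_end_cycles(branches, fast_acc, slow_acc, seed=0):
    """1 cycle per branch, plus a short bubble when the slow predictor
    re-steers the fast one, plus a full flush when both are wrong."""
    rng = random.Random(seed)
    cycles = 0
    for _ in range(branches):
        cycles += 1
        slow_right = rng.random() < slow_acc
        fast_right = rng.random() < fast_acc
        if not slow_right:
            cycles += MISPREDICT_PENALTY   # discovered at execute time
        elif not fast_right:
            cycles += RESTEER_BUBBLE       # caught early in the front end
    return cycles

# With a perfect slow predictor, every fast-predictor miss costs only the
# small re-steer bubble instead of a full misprediction penalty.
print(front_end_cycles(1000, fast_acc=1.0, slow_acc=1.0))  # -> 1000
print(front_end_cycles(1000, fast_acc=0.0, slow_acc=1.0))  # -> 4000
```

The point of the split is visible in the extremes: a fast-predictor miss that the slow predictor catches costs a 3-cycle bubble rather than a 16-cycle flush.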
I guess I was being ridiculously optimistic hoping for the 16 core chip to be around the $600 mark :(
They're not just tacking on more cores because process improvements allow more of them on the same silicon.
The complexity of the chip is higher than the previous models, with three dies under the hood instead of one. The high end chips are closer to Threadripper than they are to the models they're replacing.
I think $750 is still a ridiculously good price, and Intel's feet are being held to the fire.
The 1950X cost $1000 2 years ago, and it looks like you can pick them up for $500 now. Both are 16 cores with two chiplets.
- 4 memory channels vs 2
- 64 PCIe lanes vs 16+4+4
The 3950X sounds like quite the improvement! Definitely a great deal IMO.
- 64MB of L3 cache vs 32MB
- DDR4-3200 vs DDR4-2666
- PCIe 4.0 lanes vs 3.0
- 3.5GHz base clock vs 3.4GHz
- 4.7GHz boost clock vs 4.0GHz
- 15% better instructions per clock
- Full AVX2 instead of emulating it with two 128-bit units
- 105W vs 180W TDP
The biggest difference between the 2 imo is that TR4 boards have 8 DIMM slots whereas Ryzen maxes out at 4 even with X570 afaik.
Ryzen 3000 specs say it can support 128GB of RAM, but it's hard to find 32GB DIMMs on the market.
So if you’re trying to build a workstation with lots of ram and more than one GPU, the Ryzen boards are too limited even if you’re willing to buy a nice one.
I think they've got the market segmented decently. Ryzen is the consumer chip after all, and its high-end configuration borders on prosumer for sure.
I feel like if you require 128GB of RAM and/or max out the PCIe, then you're going to have a bigger budget and Threadripper makes more sense.
Crucial makes 32GB DIMMs with ECC memory, not sure if non-ECC. Many Ryzen mobos accept ECC memory.
> Full avx2 instead of emulating with two 128-bit units
Just a note: Two units for fused multiply-add, otherwise it's four units with two multipliers and two adders.
> I think $750 is still a ridiculously good price
Threadripper 1950x comes with the same core count, more memory channels, more PCI-E lanes and more memory. You can grab one for $499 from amazon.
But you have to pay around $150 more for the motherboard, and a Threadripper-compatible cooler is also quite expensive due to the huge size of the CPU.
So you're not going to save more than a few bucks, but you'll get a slower, outdated CPU.
Most high-end AM4 motherboards have sufficient clearance to allow a TR cooler with an adapter plate on the AM4 board, so buying one for a later upgrade might be possible.
Note: I have a TR cooler running on my AM4 board (custom loop though so not completely comparable) and there is more than sufficient space to place it.
You can't use an AM4 motherboard with the 1950X - you have to use an X399-chipset/TR4 motherboard, which cost more than AM4 boards (and likely have adequate room for TR coolers)
This was a response to the idea of using a Ryzen as an alternative to a 1950X and solving possible thermal issues if they occurred. I never mentioned using a TR on an AM4.
It appears you may have misunderstood the comment you were replying to upthread - the original debate was if buying a 1950X at $499 would be cheaper/better than a $750 Ryzen. @lhoff pointed out that even when the 1950X is cheaper, you'd still need to buy relatively expensive coolers and mobo (for TR), meaning you won't be saving (much) on older tech. Thermal issues weren't the subject (except as an explanation on why TR4 coolers are expensive).
In turn, I misunderstood your reply to @lhoff, because in that context, I read it as a rebuttal of the idea that TR parts being expensive by suggesting an AM4 mobo + TR4 cooler as substitutes on a 1950X system.
The X570 boards won't be cheap and are probably comparable. Quality requirements for PCIe 4.0 are pretty significant. Depending on your needs, an older Threadripper might be a better move. I'm not that convinced and will probably go with a 3950X in September (unless the next generation of TR is compelling enough to wait longer).
My 4790K feels so outdated now...
Why are you comparing the price of a product that has yet to launch with the price of a product that has received many discounts over the years?
Because it doesn't matter as you'll be able to buy both.
That's the last product they have which contains the original Zen dies. They are probably just liquidating the remaining stock.
I wouldn't make the assumption that AMD could sustainably sell that much silicon at that price point.
And the TR 1950x is also much slower due to lower IPC, lower clock speeds, and NUMA.
On the other hand, a 12C with a handsome IPC improvement from last generation at $499 feels pretty tempting to me...
You can also buy into TR4 with a Threadripper 1950X for ~$500. I'm using one to write this post. It's a great system, and you can likely buy used for less than $500. I'm also assuming this price will fall further once they announce another TR4 chip.
I think the whole system price basically ends up as a wash, with TR4 motherboards being at least ~$300 and needing a ~$100 cooler; I'm assuming these consumer chips will continue to have bundled coolers. The 3950X also draws 75W less than the 1950X, so you can probably save a few bucks on the power supply and of course on your electric bill over time.
The performance comparison will be interesting though. The 3950X should be quite a bit faster than the 1950X when it's not bottlenecked by memory bandwidth, but of course the 1950X still has twice the memory channels. Slightly offset by the Zen2 memory controller supporting higher frequency RAM. So which one is better will depend heavily on workload. I suspect that for a developer workstation the 3950X would be the better performer, most compilation workloads are not very sensitive to bandwidth.
Yea, the platform cost is higher. I ended up making a build you could make for ~$1,500 now. It was ~$1,600 when I made it. The biggest feature I'm interested in is the availability of PCIe lanes. I want this for adding a 10G nic later and two GPUs at some point as well (host & guest).
If you don't need those features you're completely correct about the 3950x.
Ah, GPU passthrough?
My biggest problem with virtualization is USB. I have a libvirt setup with GPU passthrough that works great, but I have been unable to get a USB controller of any sort to pass through; it always winds up in a group with a bunch of other PCIe devices. And ordinary forwarding with SPICE or something isn't really sufficient for what I'd like to set up...
It's technically a security risk, but take a look at the ACS override patch that's out there. It'll forcefully split up the IOMMU groups for the hypervisor so you can do the passthrough, but it does mean that the cards that were in the same group before can technically see each other's DMA and other stuff on the bus. For anything other than a shared host it's pretty much fine though.
Have you tried IOMMU splitting? https://forums.unraid.net/topic/72027-iommu-group-splitting-...
(disclaimer - I don't own a board that can do this, I will one day, though).
I have, though it causes a lot of stability problems :(
This should be doable on desktop Ryzen. I currently have one GPU (x16) + NVMe SSD (x4) + 10G NIC (x4 from chipset). The 16x can be split x8/x8 for dual GPU.
https://linustechtips.com/main/topic/799836-pcie-lanes-for-r... — nice diagram of lanes
If you're running 10G through the chipset and running GPUs on a x8 config you're going to bottleneck yourself in my opinion.
My single port Intel X520 achieves line rate through x4 chipset lanes just fine.
GPUs generally don't come close to saturating x8 3.0 lanes, unless you have a very specific workload (like the new 3dmark bandwidth benchmark AMD used to demo PCIe 4.0).
Games don't do nearly enough asset streaming to use a lot of bandwidth, since the amount of assets used at the same time is limited by VRAM size, and most stuff is kept around for quite some time. Offline 3D renderers like Blender Cycles IIRC just upload the whole scene at once and then path tracing happens in VRAM without much I/O. For buttcoin mining, people literally use boards with tons of x1 slots + risers. No idea how neural nets behave, but would make sense that they also just keep updating the weights in VRAM.
Except this is PCIe 4 vs PCIe 3, so each PCIe 4 lane carries double the bandwidth; no, x8 PCIe 4 will not bottleneck anything. Unfortunately GPUs do not support PCIe 4 yet, but it's not a problem for integrated 10GbE.
I thought AMD said their new card was PCIe 4?
Yes, the Navi card of course supports gen4, and even the Vega20 did too. At least in the original Instinct variant (most places on the internet say that gen4 was cut on the consumer Radeon VII card)
I do wonder if they'll continue with Threadripper. If it exists in the next generation, it might simply be as rebadged and slightly nerfed EPYC chips rather than something custom.
AMD has already stated that they're going to continue with Threadripper.
It would leave a fairly big gap in the lineup with nothing to compete against Intel's X299 platform. AM4 is lacking in memory channels and PCIe lanes. Epyc has much lower clockspeeds, much more expensive CPUs, and more expensive motherboards than Threadripper.
> it might simply be as rebadged and slightly nerfed EPYC chips
Well, that is what first-gen Threadripper was. Same socket and all, but with half the connected DDR lanes and a pin telling the motherboard it's not EPYC.
The first-gen Threadripper only had two dies, and even the WX series was a bit weird internally, with two of the four dies not being able to perform IO.
I know it's not a big difference, but given the changes to IO and the 16 core consumer version, I don't see why there would be any internal difference to EPYC this time around (which this article claims will have a variable number of chiplets).
The only difference would be the number of DDR channels I guess, yeah.
I hope so as well; moreover, I hope they'd release a 64-core TR next year that could last a decade (even if it costs $2,500+). There are rumors Zen 3 should bring 4-way SMT, i.e. 4 threads/core instead of 2, so that might lead to ridiculous numbers of threads in normal systems.
As Lisa said, TRs were distinct from Epycs; I guess using UDIMMs vs RDIMMs and a much higher base clock (except for the high-frequency EPYC 7371) led to a few changes.
12C with 64MB cache for $499 is looking more and more like the sweet spot to me, too.
8C/16T for $329 at 65W sounds like incredible value for money for a budget workstation.
It almost makes me think the TDP of either the 3800x or 3700x is a publishing mistake. 3.6/4.4 to 3.9/4.5 on the same hardware can't possibly require such a dramatic voltage increase to go from 65w to 105w.
If it does, it means the process is struggling to produce chips at that speed, so the headroom is incredibly low and you can forget about overclocking.
It doesn't take much of a voltage increase at all to send power usage skyrocketing.
That said, the part you're missing is binning. The 3800X definitely gets the worst-binned chiplets, as evidenced by the 3900X and 3950X having the same TDP.
TDP is just a number these days. You're not getting 4 more cores for free, you're getting a cap put on your all-core power consumption at a lower clock rate. More cores, lower all-core boost clocks (the advertised clocks are single-core boost).
Even then, both AMD and Intel CPUs will pull significantly above their rated TDP when boosting. It's not quite a base-clock measurement (eg 9900K is more like 4.3-4.4 when 95W-limited) but it's definitely not a boost power measurement either.
Again, pretty much just a marketing number these days.
> You're not getting 4 more cores for free, you're getting a cap put on your all-core power consumption at a lower clock rate.
There's a mere 100MHz difference in base clock, which is what TDP is based off of. Nowhere close to enough of a reduction to fully explain +50% cores at the same TDP.
> Again, pretty much just a marketing number these days.
Not really no. You just need to understand it represents all core base frequency thermal design target, and not maximum power draw.
It's still based in reality, though. It's not some random made up number.
And binning is an extremely real thing with very significant impact. Not sure why you seem to be trying to outright dismiss it.
Same as with Threadripper, which eats 300W or so at 4GHz with applicable overvolting. (Rated 180W; you can overvolt it hard, even to 1.55V, if you can dissipate 450W, to get a humongous 4.2GHz all-core.)
To be fair 8 core 16 thread Ryzen chips could be had for as little as ~$200 for years now. With this release the respectable 2700X will probably fall pretty low in price till stocks run out.
At 65 W?
Yes, the Ryzen 1700 is a 65W part
Which can easily pull 200W+ when I test out my R7 1700 clocked up at 4.1GHz at 1.375V. Of course I know TDP isn't at all realistic to actual power draw once you stop using things at stock settings. It only cost me $159 on sale and it's easy enough to keep cool, so no real complaints here.
No, 2700X which is listed as a 105W TDP part.
I like small builds that are a good compromise between performance and power needs, and the 3700X looks sweet on 65W at that price point.
Considering that Intel charges almost $600 for their top 8-core, yes, that would be ridiculous. There were some leaked/rumored prices for Ryzen 3xxx and Navi that were way too low and I didn't understand why AMD would almost give away such good products; the simple truth is that they won't.
The Ryzen 7 1800X, the Zen flagship, was $499. The Ryzen 7 2700X, the Zen+ flagship, was $329.
That was a 34% decrease in cost for comparable models after 1 generation. If that trend follows then the comparable Zen2+/Zen3 model will only be around $500. So, hopefully you just need to wait a year.
The 2700X price cut occurred in the context of Intel releasing a 6C12T processor for $350 that kept up with AMD's $500 processor. i.e. very close in multithreaded performance and easily beating it in single-thread.
That's not going to happen this time around. Intel doesn't really have a response to 16C consumer processors. Best thing they can do is release the 10C chip they're working on... probably at $500 again. And they will be behind the 12C version that AMD has at $500 already.
The only similarly aggressive move that Intel could even make would be to drop 10C to the $350 segment (perhaps with Hyperthreading disabled), which would be a massive blow to their margins.
750 is pretty close at least.
I bet there will be such chips. They'll just have a few cores disabled. It's not like they make a separate die for every core configuration - that wouldn't make sense.
So in other words the $499 12C/24T 3900X?
That was a great article. I wonder if fixing the reliability of their performance counters to work with rr ( https://rr-project.org/ ) is anywhere on their radar.
Sadly, until that happens, AMD CPUs are dead to me. For a C++ (or C or Rust) developer, rr is just too much of a productivity boost to give up.
Anybody expect 7nm processes to result in longevity issues? As far as I understand (and IME) the first components to fail are capacitors. Might that begin to change?
Apropos the article, I'm trying to convince myself to build an EPYC 3201 server now rather than waiting for the Zen 2 version, for which I presume I'd have to wait until October or November at the earliest.
I think, electromigration will still kill the chip earlier than individual device failures.
Intel is said to have switched to cobalt wiring on its latest node, and seems to be paying dearly for it. TSMC and others went the conventional road, continuing to perfect the salicide process for smaller nodes without any issues.
Officially, Intel says 10nm has lithography problems. They did try a more aggressive node than TSMC's first "7nm", entirely using 193nm UV, and were the only company to attempt Self-Aligned Quadruple Patterning (SAQP) for the top metal layers.
First generation anything is not as good as it's going to be.
Up until recently every CPU generation was on a brand new node - there was no second gen on the same process. We never had reliability issues with CPUs before.
The Intel Atom C2000 product family had a terrible problem which would brick systems, see for example https://www.tomshardware.com/news/intel-cpu-failure-atom-pro... or https://www.theregister.co.uk/2017/02/06/cisco_intel_decline...
There was also a weak transistor on some Sandy Bridge chips in the SATA controller. But these failures are really quite rare.
I wonder if I'm misinterpreting TDP.
Take the 105W TDP chip vs., say, the 65W one: if a lighter task isn't saturating the cores, the power/heat generation would be similar, and the bigger chip doesn't really ramp up the heat/wattage unless heavier loads are thrown at it?
For sibling chips like this, yes. Two cores being run at 90% utilization each will draw about the same amount of power regardless of whether you bought the 6-core version or the 8-core version.
Similarly, 4 cores running on the 12 or 16 core chips should eat about the same amount of power as each other.
How does new instructions, register renaming, etc work with different compilers? Say I'm using Visual Studio to compile C++, will it take advantage of the new processor features by default? What about if the binary runs on a different CPU, will the compiler include feature checks and multiple code versions?
Not sure about VC++, but in gcc you can use -march=native to let the compiler compile the code with all instruction sets available on your CPU, I think there is a VC++ equivalent.
As for already compiled binary, depending on how it was compiled it may or may not work of a different CPU. Also the compiler doesn't do the runtime checks.
AIUI, register renaming is a runtime feature of the microcode and has nothing to do with the machine language interface.
Correct. The compiler can be tuned to schedule code in ways the renamer handles efficiently, but it's not an exposed feature.
RE new instructions, those appear to be mostly useful for things running on bare metal (like the OS kernel).
The under the hood stuff like true 256 bit registers, branch prediction, cache, etc, all is below the machine code level as other people have pointed out. The compiler doesn't know about it.
>What about if the binary runs on a different CPU, will the compiler include feature checks and multiple code versions?
This is referred to as multiple/dynamic code paths and it needs to be supported by the processor microarchitecture and compiler. AFAIK only the Intel Compiler and Intel processors support it with the -ax compilation flag.
In general you should pick a minimum architecture for your applications, since it will be forward compatible.
GCC and LLVM have supported multiple code versions based on feature detection for a few years, they call it function multiversioning. As far as I'm aware MSVC does not have this yet.
GCC has supported it for over 6 years, since 4.8. It was only added to clang in 7.0, released 8 months ago.
"new instructions" and "register renaming" are on opposite sides of the spectrum. It's up to your compiler to carch up with insn set additions, but register renaming is invisible.
I find these articles incredible, but then I remember I work and live on a Mac. And I can't wait to pay 5K for the content of one of those articles published 2 years ago.
Same here. I'd love to build a Ryzen box, and actually I'll do so for gaming, but my primary computer has to run macOS and unfortunately that requirement comes with a significant price tag, in addition to outdated hardware.
You can build a kick-ass Mini-ITX Ryzen box for less than a Mac monitor stand. Why not both?
As I said, I do. I have a separate gaming PC, which gets occasional use and will be upgraded to the Ryzen 3000 series later this year. I wish I could do the same with a reliable, legitimate macOS system - the value is just incredible over in the non-Apple PC world.
Can’t you do hackintoshes with Ryzen?
Not without a lot of messing around. With Intel you can pretty much just straight install macOS + Clover + FakeSMC and you're golden. With Ryzen you have to start messing around with custom kernels, which means updates will break things and also means a lot of programs won't work.
Last I checked a lot of progress had been made, but you're unable to run any 32-bit applications and some software such as the Adobe Creative Suite simply won't run.
I too thought it was hard, but recently found this that supposedly makes it way easier: https://kb.amd-osx.com/guides/HS/
I'm considering a new ryzen hackintosh build in july!
Going to give it one, single try... then I'm on to Linux as my primary. Will have a Windows VM for some work, and may keep a mac VM as well. Most of the stuff I do works fine in Linux, and it really looks like Manjaro and Pop_OS! have made a lot of progress beyond the general dev stuff I work on (mostly via Docker/Linux anyway).
I worry a bit about stability...
Oh yea, I wouldn't want to do that. I use a hackintosh but with an intel CPU and it just works flawlessly except for the whole NVidia GPU holding me back.
Well, I also want it to be reliable. This is my main computer, my daily driver, after all.
I did it, and it's great. It's possible to get Ryzen + RTX 2070 for the price of a Mac Mini. Of course it's a big ugly box, but it plays the latest games without a hitch, even on 120Hz/1440p "Ultra" graphics settings.
Apple hardware functions as a software protection dongle. If you want to use Apple software you must buy a dongle.
Except if you hackintosh
> And I can't wait to pay 5K for the content of one of those articles published 2 years ago.
Err, "can't wait" as in "it will be awesome when it arrives" or as in "I'm not going to wait for that, stupid Apple"?
I think the OP sarcasm was obvious.
I thought it was likely, but I also thought the wording ("for the content of one of those articles") was weird enough that I was confused. I think what was meant was "And I can't wait to pay 5K for the hardware described in one of those articles published 2 years ago."
Also, I would be a lot more confident in labeling it sarcasm if I didn't believe there were plenty of people actually eagerly awaiting new Apple hardware just like that.
In any case, Poe's law strikes again.
It was sarcasm. Sorry for the confusion. English is not my native tong.
It’s “tongue”! Just kidding...but it is :D
I think the 2 years ago was what makes it obvious for me. I'm not a native speaker either.
Another article on https://www.follownews.com/amd-zen-2-microarchitecture-analy...
Now... when will AMD fix power consumption on their video cards as well?
Funny about those downvotes.
AMD can deliver 8C/16T in 65 W but their GPUs need 50%+ more power than nvidia's for the same performance (up to 100% more at the 1080 lower end). You're saying I'm not right and they don't have a problem?
At this same event they also announced their 1st gen RDNA GPUs, called Navi. Also releasing to retail on July 7th. They supposedly go quite some way to reducing the gulf in GPU power usage/performance.
Going by their E3 presentation of the 5700 and 5700 XT last night they haven't done enough to curb power use, but only testing will tell once we have these cards in hand.
Agreed... I'm a bit torn on this, since they've done a lot for Linux support, may go 5700XT or Radeon VII to pair with 3950X
Not really. They have finally managed to reach Pascal-level perf/watt... on 7nm, competing against 16nm NVIDIA chips. It's more or less a GTX 1080 competitor as far as performance and perf/watt is concerned. So, roughly 3 years and a node's worth of uarch disadvantage behind NVIDIA.
It's a repeat of Vega, where AMD finally managed to reach Maxwell-level perf/watt... on 14nm, competing against 28nm NVIDIA chips. Once again they are years late and too expensive to boot.
They've managed to close the gap to Turing a little bit (because NVIDIA is still on a 16+ node rather than 7nm) but it's going to be a bloodbath when NVIDIA ports down to 7nm next year.
Price is the great equalizer, but once again AMD is choosing to price head-to-head with NVIDIA. Racing onto 7nm was not a cheap move for them.
They're also getting some breathing room because the RTX 2xxx series cards have been something of a disappointment. That could vanish quickly once the 3xxx series cards are announced.