Original Link: https://www.anandtech.com/show/13253/hot-chips-2018-arm-machine-learning-core-live-blog



04:57PM EDT - Arm officially announced Project Trillium earlier this year, as a way to bring machine learning to the Arm ecosystem. As part of the trade show today, we have a presentation where Arm is going to talk about its first machine learning core. The talk is set to start at 2pm PT / 9pm UTC.

05:01PM EDT - Combined effort of large team

05:01PM EDT - 150+ people

05:02PM EDT - Arm works with partners across many segments. Without fail, all those markets want machine learning acceleration

05:02PM EDT - Need to come up with a fundamental tech that can be used for many different segments

05:03PM EDT - ML processor focused for Neural Networks

05:03PM EDT - Optimized ground up architecture

05:03PM EDT - open source stack for easy deployment

05:03PM EDT - First design targets mobile with derivatives for additional segments

05:05PM EDT - Main component is the Machine Learning Processor. Biggest block is SRAM

05:05PM EDT - 16 compute engines per ML Proc

05:05PM EDT - 4 TOP/s of convolution at 1 GHz

05:05PM EDT - Targeting 3 TOP/W in 7nm and 2.5mm2

05:06PM EDT - INT8 support, support for NNAPI

05:06PM EDT - Optimized for CNN/RNN

05:06PM EDT - Support 16-bit, in case pixels are coming

05:06PM EDT - 8-bit is 4x perf over 16-bit

05:07PM EDT - Need to get four things right

05:08PM EDT - Static Scheduling, Efficient Convolutions

05:08PM EDT - CNNs are static analyzable. So you can optimize for deterministic data with stuff laid out in memory ahead of time

05:08PM EDT - Creates command stream for different parts in the processor

05:09PM EDT - Control Unit takes command stream

05:09PM EDT - No caches, simplified flow control. Simplified hardware (co-design with compiler). Predictable performance

05:10PM EDT - Compiler needs to know the details of the core, e.g. SRAM size

05:11PM EDT - Underlying requirement of high-throughput dot products are worth optimizing for

05:12PM EDT - Output feature maps are interleaved across compute engines

05:12PM EDT - So 16 engines = 16 output feature maps

05:12PM EDT - MAC Engine: 8 x 16-bit wide dot products per cycle

05:13PM EDT - Each engine is 2*8*16 = 256 ops/cycle

05:13PM EDT - 32b accumulators, 4.1 TOPs @ 1 GHz

05:13PM EDT - Gate for zeros to save power - 50% power reductions

05:13PM EDT - Keep inputs to datapaths stable to save energy

05:14PM EDT - Also have broadcast network

05:14PM EDT - Creates tensor block of activations that is broadcast to all the MAC engines

05:15PM EDT - when final result the 32b values are scaled back to 8b and sent to the programmable layer engine

05:15PM EDT - POP IP for Mac Engine for 16nm and 7nm

05:16PM EDT - Custom physical design with 40% area reduction and 10-20% power improvements

05:16PM EDT - DRAM power can be up to 50% without bandwidth saving

05:16PM EDT - use weight compression, activation compression, and tiling, to save DRAM power

05:17PM EDT - Spend dedicated silicon on compression is essential

05:18PM EDT - Histogram of 8x8 blocks in Inception V3 shows key ML matrixes

05:18PM EDT - Develop lossless compression for the popular 8x8 blocks

05:19PM EDT - 3.3x compression over standard implementation

05:19PM EDT - When training, many training weights taper down to zero

05:19PM EDT - Can force the weights to zero with no change in accuracy but offers better compression

05:20PM EDT - This allows for weight compression and 'pruning'

05:20PM EDT - Support a compression format to get the zeroes out

05:20PM EDT - Take the remaining non-zero can be clamped / normalized

05:20PM EDT - All happens in hardware

05:21PM EDT - Weights stay compressed until read from internal SRAM

05:21PM EDT - Very model dependent

05:21PM EDT - Tiling

05:22PM EDT - Tricks that the compiler can use to reduce bandwidth by using the SRAM

05:22PM EDT - Possible due to static scheduling

05:23PM EDT - State of the art in tiling is still evolving

05:23PM EDT - Programmable Layer Engine helps with future-proofing

05:24PM EDT - Use Arm M-class CPU and extended it with vector extensions for neural networks

05:24PM EDT - Engine can manipulate SRAM data

05:25PM EDT - Designed to scale in engines

05:25PM EDT - Designed to scale with MAC engine throughput

05:25PM EDT - Designed to scale with Machine Learning Processors for multiple workloads

05:26PM EDT - Q&A time

05:27PM EDT - Q: If there's one M-series core in the engine, is there 16 per MLP? A: Yes

05:28PM EDT - Q: What about lower than 8-bit? A: Industry momentum is around 8-bit. We're looking into lower bit levels, but that could be in the future

05:28PM EDT - Q: When will it be available for licencing ? A: Later this year

05:29PM EDT - That's a wrap. Come back at 3pm for a talk on Tachyum

Log in

Don't have an account? Sign up now