Speaker 1: The Tesla Bot will be real. Basically, if you think about what we're doing right now with the cars, Tesla is arguably the world's biggest robotics company, because our cars are like semi-sentient robots on wheels: neural nets recognizing the world, understanding how to navigate through the world. It kind of makes sense to put that onto a humanoid form. We're also quite good at sensors and batteries and actuators. So we think we'll probably have a prototype sometime next year that basically looks like this. It's intended to be friendly, of course, to navigate through a world built for humans, and to eliminate dangerous, repetitive, and boring tasks. We're setting it up such that, at a mechanical level, at a physical level, you can run away from it and most likely overpower it.

Speaker 1: Hopefully that never happens, but you never know. It's around 5'8" and has a screen where the head is for useful information, but otherwise it's basically got the Autopilot system in it, so it's got cameras, eight cameras. What we want to show today is that Tesla is much more than an electric car company, that we have deep AI activity in hardware, on the inference level, on the training level. I think we're arguably the leaders in real-world AI as it applies to the real world, and those of you who have seen the Full Self-Driving beta can appreciate the rate at which the Tesla neural net is learning to drive.

Speaker 2: Here I'm showing a video of the raw inputs that come into the stack, which the neural networks then process into the vector space, and you are seeing parts of that vector space rendered on the instrument cluster of the car. What I find kind of fascinating about this is that we are effectively building a synthetic animal from the ground up. The car can be thought of as an animal: it moves around, it senses the environment, and it acts autonomously and intelligently. And we are building all the components from scratch in-house. We are building, of course, all of the mechanical components, the body; the nervous system, which is all the electrical components; and, for our purposes, the brain of the Autopilot, and specifically for this section the synthetic visual cortex. We process individual images and make a large number of predictions about these images.

Speaker 2: For example, here you can see predictions of the stop sign, the stop lines, the lane lines, the edges, the cars, the traffic lights, the curbs, whether or not the car is parked, and all of the static objects like trash cans, cones, and so on. Everything here is coming out of the net, in this case out of the HydraNet. That was all fine and great, but as we worked toward FSD, we quickly found that this is not enough. Where it first started to break was when we started to work on Smart Summon. Here I am showing some of the predictions of only the curb detection task, and I'm showing it now for every one of the cameras. We'd like to wind our way around the parking lot to find the person who is summoning the car. The problem is that you can't drive directly on image-space predictions; you actually need to cast them out and form some kind of vector space around the car.
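To make the per-camera, multi-task setup just described a bit more concrete, here is a minimal sketch: one shared backbone per image, with separate lightweight heads for the different per-image tasks (curbs, traffic lights, vehicles, and so on). The module names, task list, channel counts, and layer sizes are illustrative assumptions, not the actual HydraNet.

```python
# Minimal sketch of a shared-backbone, multi-head per-image network.
# All task names and sizes are made up for illustration.
import torch
import torch.nn as nn

class MultiTaskImageNet(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Shared convolutional trunk over a single camera image.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One lightweight head per task, all reading the shared features.
        self.heads = nn.ModuleDict({
            "lane_edges":     nn.Conv2d(feat_dim, 1, 1),  # per-pixel edge mask
            "curbs":          nn.Conv2d(feat_dim, 1, 1),
            "traffic_lights": nn.Conv2d(feat_dim, 4, 1),  # e.g. none/red/yellow/green
            "vehicles":       nn.Conv2d(feat_dim, 6, 1),  # box / attribute channels
        })

    def forward(self, image):                 # image: (B, 3, H, W)
        shared = self.trunk(image)
        return {name: head(shared) for name, head in self.heads.items()}

# e.g. preds = MultiTaskImageNet()(torch.randn(1, 3, 128, 128))
```

The point of the shared trunk is that dozens of per-image tasks can reuse one feature computation, which is the property the talk relies on before moving everything into vector space.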
Speaker 2: So we attempted to build that vector space using C++, and developed what we called the occupancy tracker at the time. Here we see that the curb detections from the images are being stitched up across camera boundaries and over time. Now, there were two major problems with this setup. Number one, we very quickly discovered that tuning the occupancy tracker and all of its hyperparameters was extremely complicated. You don't want to do this explicitly by hand in C++; you want it to be inside the neural network and trained end to end. Number two, we very quickly discovered that image space is not the correct output space. You don't want to make predictions in image space; you really want to make them directly in the vector space. For example, in this video I'm showing single-camera predictions in orange and multi-camera predictions in blue. Basically, you can't predict these cars well if you are only seeing a tiny sliver of a car, so your detections and their positions are not going to be very good, but a multi-camera network does not have that issue.

Speaker 2: Here's another video from a more nominal situation, and we see that as the cars in this tight space cross camera boundaries, a lot of jank enters the predictions, and basically the whole setup just doesn't make sense, especially for very large vehicles like this one. We can see that the multi-camera networks struggle significantly less with these kinds of predictions. Here we are making predictions about the road boundaries in red, intersection areas in blue, road centers, and so on; we're only showing a few of the predictions to keep the visualization clean. This is done by the spatial RNN, and this is only showing a single clip, a single traversal, but you can imagine there could be multiple trips through here: a number of cars, a number of clips could be collaborating to build this map. It's effectively an HD map, except it's not in a space of explicit items; it's in a space of features of a recurrent neural network, which is kind of cool. I haven't seen that before.

Speaker 2: So here's putting everything together. This is roughly what our architecture looks like today. We have raw images feeding in at the bottom. They go through a rectification layer to correct for camera calibration and put everything into a common virtual camera. We pass them through residual networks to process them into a number of features at different scales. We fuse the multi-scale information with a BiFPN. This goes through a transformer module to re-represent it into the vector space, the output space. This feeds into a feature queue, in time or space, that gets processed by a video module like the spatial RNN, and then continues into the branching structure of the HydraNet, with trunks and heads for all the different tasks.
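Here is an illustrative skeleton of that camera-to-vector-space pipeline: per-camera backbones, fused image features, a transformer that re-represents them on a bird's-eye-view grid, a temporal video module, and per-task heads. Every module is a simplified stand-in (rectification is assumed to happen upstream, the fusion is not a real BiFPN, and the video module is a plain GRU), and all names, sizes, and the 8-camera / 20x20 grid are assumptions for illustration only.

```python
# Simplified sketch of a multi-camera -> bird's-eye-view perception stack.
import torch
import torch.nn as nn

class PerCameraBackbone(nn.Module):
    """Stand-in for the per-camera residual backbone."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):                      # (B, 3, H, W) -> (B, D, H/4, W/4)
        return self.net(x)

class BEVTransformer(nn.Module):
    """Cross-attention from learned BEV-grid queries to image-feature tokens,
    a stand-in for the 'transformer into vector space' step."""
    def __init__(self, dim=64, grid=20):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(grid * grid, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, N, D)
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        bev, _ = self.attn(q, tokens, tokens)
        return bev                             # (B, grid*grid, D)

class PerceptionStack(nn.Module):
    def __init__(self, dim=64, grid=20):
        super().__init__()
        self.backbone = PerCameraBackbone(dim)
        self.to_bev = BEVTransformer(dim, grid)
        self.video = nn.GRU(dim, dim, batch_first=True)  # stand-in video module
        self.heads = nn.ModuleDict({                     # HydraNet-style heads
            "curbs":   nn.Linear(dim, 1),
            "lanes":   nn.Linear(dim, 1),
            "objects": nn.Linear(dim, 8),
        })

    def forward(self, clip):
        # clip: (B, T, n_cameras, 3, H, W), a short rectified multi-camera clip.
        B, T, N, C, H, W = clip.shape
        bev_frames = []
        for t in range(T):
            feats = self.backbone(clip[:, t].reshape(B * N, C, H, W))
            tokens = feats.flatten(2).transpose(1, 2)          # (B*N, h*w, D)
            tokens = tokens.reshape(B, -1, tokens.shape[-1])   # stack all cameras
            bev_frames.append(self.to_bev(tokens))             # (B, G, D)
        bev_seq = torch.stack(bev_frames, dim=1)               # (B, T, G, D)
        G, D = bev_seq.shape[2], bev_seq.shape[3]
        # "Feature queue" over time: run the video module per BEV cell.
        fused, _ = self.video(bev_seq.permute(0, 2, 1, 3).reshape(B * G, T, D))
        fused = fused[:, -1].reshape(B, G, D)                  # latest state per cell
        return {name: head(fused) for name, head in self.heads.items()}

# e.g. outputs = PerceptionStack()(torch.randn(1, 2, 8, 3, 64, 64))
```

The design point being illustrated is the one from the talk: all cameras and all recent frames feed one shared vector-space representation, and every downstream task reads from that, rather than from per-camera image-space outputs.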
Speaker 3: Here we are planning to do a lane change. In this case, the car needs to do two back-to-back lane changes to make the left turn up ahead. For this, the car searches over different maneuvers. The first one it searches is a lane change that's close by, but the car brakes pretty harshly, so it's pretty uncomfortable. The next maneuver it tries is a lane change a bit later, so it speeds up, goes in front of the other cars, and does the lane change then, but now it risks missing the left turn.

Speaker 3: We do thousands of such searches in a very short time span. Because these are all physics-based models, these futures are very easy to simulate. In the end we have a set of candidates, and we finally choose one based on the optimality conditions of safety, comfort, and easily making the turn. So now the car has chosen this path, and you can see that as the car executes the trajectory, it pretty much matches what we had planned. The cyan plot on the right side here is the actual velocity of the car, and the white line underneath it was the plan. We are able to plan for 10 seconds and match that plan when we look at it in hindsight, so this is a well-made plan. A single car driving through some location can sweep out some patch around its trajectory using this technique, but we don't have to stop there.

Speaker 3: Here we collected different clips from the same location, from different cars maybe, and each of them sweeps out some part of the road. The cool thing is that we can bring them all together into a single giant optimization. Here these 16 different trips are organized and aligned using various features, such as road edges and lane lines. All of them should agree with each other and also agree with all of the image-space observations. Together, this produces an effective way to label the road surface, not just where the car drove, but also in other locations it hasn't driven yet. We don't need to stop at just the road surface; we can also reconstruct 3D static obstacles. Here, this is a reconstructed 3D point cloud from our cameras. The main innovation is the density of the point cloud. Typically these points require texture to form associations from one frame to the next, but here we are able to produce these points even on textureless surfaces, like the road surface or walls. This is really useful for annotating arbitrary obstacles that we can see in the world. Combining everything together, we can produce these amazing datasets that annotate all of the road texture, all the static objects, and all of the moving objects, even through occlusions, producing excellent kinematic labels.

Speaker 4: If we put all of it together, we get our training-optimized chip, the D1 chip. This was entirely designed by the Tesla team internally, all the way from the architecture to GDS-out and packaging. This chip has GPU-level compute with CPU-level flexibility, and twice the network-chip-level I/O bandwidth. But we didn't stop there. We integrated the entire electrical, thermal, and mechanical pieces to form our training tile: fully integrated, interfacing with a 52-volt DC input. It's unprecedented. This is an amazing piece of engineering. Our compute plane is completely orthogonal to the power supply and cooling, which makes high-bandwidth compute planes possible. What it is, is a nine-petaflop training tile. This becomes our unit of scale for our system, and this, it's real.
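Going back to the search-and-score planning that Speaker 3 described, here is a minimal sketch of the idea: sample many candidate maneuvers, roll each one forward with a simple physics model, score it on comfort, progress, and whether it still makes the turn, and keep the cheapest. The maneuver parameterization, kinematics, and cost weights are invented for illustration; this is not the production planner.

```python
# Sketch of sample-and-score trajectory planning over simple physics rollouts.
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    lane_change_time: float   # when to start the lane change (s), assumed parameter
    target_speed: float       # cruise speed after the lane change (m/s)

def rollout(c, v0, lead_speed=8.0, horizon=10.0, dt=0.1):
    """Trivial longitudinal model: follow the lead car's speed until the lane
    change starts, then track the candidate's target speed, with acceleration
    capped at +/- 3 m/s^2."""
    v, speeds, accels = v0, [], []
    for step in range(int(horizon / dt)):
        desired = lead_speed if step * dt < c.lane_change_time else c.target_speed
        a = max(-3.0, min(3.0, desired - v))   # crude proportional controller
        v = max(0.0, v + a * dt)
        speeds.append(v)
        accels.append(a)
    return speeds, accels

def cost(c, v0, turn_deadline):
    speeds, accels = rollout(c, v0)
    comfort = sum(a * a for a in accels)                  # penalize harsh accel/braking
    missed_turn = 1000.0 if c.lane_change_time > turn_deadline else 0.0
    progress = 0.1 * sum(speeds)                          # distance covered (dt = 0.1 s)
    return comfort + missed_turn - 0.05 * progress

def plan(v0=12.0, turn_deadline=6.0, n=2000):
    candidates = [
        Candidate(lane_change_time=random.uniform(0.5, 9.5),
                  target_speed=random.uniform(5.0, 20.0))
        for _ in range(n)
    ]
    return min(candidates, key=lambda c: cost(c, v0, turn_deadline))

# e.g. best = plan()  # thousands of rollouts are scored in well under a second
```

Even this toy version shows the trade-off from the talk: changing lanes very early scores poorly on comfort, while changing lanes too late is penalized for missing the turn, and the search picks a candidate in between.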