Everything revealed at Elon Musk's Tesla Bot event
10:32

Tech Industry
Speaker 1: The Tesla Bot will be real. Basically, if you think about what we're doing right now with the cars, Tesla is arguably the world's biggest robotics company, because our cars are like semi-sentient robots on wheels, with neural nets recognizing the world and understanding how to navigate through it. It kind of makes sense to put that onto a humanoid form. We're also quite good at [00:00:30] sensors and batteries and actuators. So we think we'll probably have a prototype sometime next year that basically looks like this. It's intended to be friendly, of course, to navigate through a world built for humans, and to eliminate dangerous, repetitive, and boring tasks. We're setting [00:01:00] it such that, at a mechanical level, at a physical level,

Speaker 1: you can run away from it, and most likely overpower it. Hopefully that never happens, but you never know. It's around 5'8", and has a screen where the head is for useful information. [00:01:30] Otherwise it's basically got the Autopilot system in it, so it's got eight cameras. What we want to show today is that Tesla is much more than an electric car company, that we have deep AI activity in hardware, on the inference level, on the training level. I think we're arguably the leaders [00:02:00] in real-world AI as it applies to the real world, and those of you who have seen the Full Self-Driving beta can appreciate the rate at which the Tesla neural net is learning to drive.

Speaker 2: So here I'm showing video of the raw inputs that come into the stack, and the neural network processes that into the vector space. You are seeing parts of that vector space rendered on the instrument cluster of the car.
Now, what I find kind of fascinating about this is that we are effectively [00:02:30] building a synthetic animal from the ground up. The car can be thought of as an animal: it moves around, it senses the environment, and it acts autonomously and intelligently. And we are building all the components from scratch in house. We are building, of course, all of the mechanical components of the body; the nervous system, which is all the electrical components; and, for our purposes, the brain of the Autopilot, and specifically for this section, the synthetic visual cortex. We process individual images and make a large number of predictions about these images.

Speaker 2: So for example, here you can see predictions [00:03:00] of the stop sign, the stop lines, the lane lines, the edges, the cars, the traffic lights, the curbs, whether or not the car is parked, all of the static objects like trash cans, cones, and so on. Everything here is coming out of the net, in this case out of the HydraNet. That was all fine and great, but as we worked towards FSD, we quickly found that it is not enough. Where this first started to break was when we started working on Smart Summon. Here I am showing some of the predictions of only the curb detection [00:03:30] task, and I'm showing it now for every one of the cameras. We'd like to wind our way around the parking lot to find the person who is summoning the car. The problem is that you can't drive just on image-space predictions. You actually need to cast them out and form some kind of a vector space around you. So we attempted to do this using C++ and developed what we called the occupancy tracker at the time.

Speaker 2: Here we see that the curb detections from the images are being stitched up across camera scenes and camera boundaries [00:04:00] and over time. Now, there were two major problems, I would say, with this setup.
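The "casting out" step described here, going from image-space detections to a vector space around the car, can be illustrated by unprojecting a pixel onto the ground plane. A minimal sketch under a flat-ground, pinhole-camera assumption; all names and numbers are illustrative, not Tesla's actual stack:

```python
import numpy as np

def image_to_ground(uv, K, cam_height):
    """Cast a pixel onto the flat ground plane below the camera.

    Assumes a forward-facing pinhole camera at height `cam_height`
    (metres) whose optical axis is parallel to the ground.
    """
    u, v = uv
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Ray direction in camera coordinates (x right, y down, z forward).
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    if ray[1] <= 0:
        return None  # ray points at or above the horizon; never hits ground
    t = cam_height / ray[1]      # scale so the ray drops cam_height metres
    point = t * ray
    return point[0], point[2]    # lateral offset, forward distance

K = np.array([[1000.0, 0.0, 640.0],   # made-up intrinsics for a 1280x720 camera
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
print(image_to_ground((640, 560), K, cam_height=1.5))  # (0.0, 7.5)
```

A pixel 200 rows below the principal point maps to a point 7.5 m ahead on the road; doing this for every curb detection in every camera is what populates the top-down space the occupancy tracker then has to stitch together.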
Number one, we very quickly discovered that tuning the occupancy tracker and all of its hyperparameters was extremely complicated. You don't want to do this explicitly by hand in C++; you want it inside a neural network, trained end to end. Number two, we very quickly discovered that image space is not the correct output space. You don't want to make predictions in image space; you really want to make them directly in the vector space. So for example, here in this video I'm showing single-camera predictions in orange and multi-camera predictions in blue. [00:04:30] Basically, you can't predict these cars well if you are only seeing a tiny sliver of a car, so your detections are not going to be very good and their positions are not going to be good, but a multi-camera network does not have that issue.

Speaker 2: Here's another video from a more nominal sort of situation, and we see that as cars in this tight space cross camera boundaries, there's a lot of jank that enters the predictions. Basically the whole setup just doesn't make sense, especially for very large vehicles like this one, and we can see that the multi-camera networks struggle significantly less with these kinds of predictions. So [00:05:00] here we are making predictions about the road boundaries in red, intersection areas in blue, road centers, and so on. We're only showing a few of the predictions here just to keep the visualization clean. And this is done by the spatial RNN. This is only showing a single clip, a single traverse, but you can imagine there could be multiple trips through here, and basically a number of cars, a number of clips, could be collaborating to build this map, effectively an HD map, except it's not in a space of explicit [00:05:30] items.

Speaker 2: It's in a space of features of a recurrent neural network, which is kind of cool. I haven't seen that before.
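The idea of a "spatial RNN" building map-like features across multiple traverses can be sketched as a 2D grid of hidden state around the car, where each trip's observations update only the cells it actually visits. This is purely illustrative of the mechanism, not Tesla's architecture:

```python
import numpy as np

# Hidden feature grid around the car: one scalar "confidence" per cell here,
# where the real system would keep a feature vector per cell.
H = np.zeros((8, 8))

def update(H, cells, value, gate=0.5):
    """Gated write: blend new evidence into the visited cells only."""
    for r, c in cells:
        H[r, c] = (1 - gate) * H[r, c] + gate * value
    return H

# Two "trips" through overlapping parts of the grid refine the same map.
H = update(H, [(3, c) for c in range(8)], value=1.0)  # trip 1: a full road row
H = update(H, [(3, c) for c in range(4)], value=1.0)  # trip 2: revisits half of it
print(H[3, 0], H[3, 7])  # 0.75 0.5 -- revisited cells carry more evidence
```

The key property, as in the talk, is that the "map" lives in the hidden state rather than as explicit items: more clips through the same location simply push the relevant cells toward agreement.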
So here's putting everything together. This is roughly what our architecture looks like today. We have raw images feeding in at the bottom. They go through a rectification layer to correct for camera calibration and put everything into a common virtual camera. We pass them through residual networks to process them into a number of features at different scales. We fuse the multi-scale information with a BiFPN. This goes through a transformer [00:06:00] module to re-represent it into the vector space, the output space. This feeds into a feature queue in time or space that gets processed by a video module, like the spatial RNN, and then continues into the branching structure of the HydraNet, with trunks and heads for all the different tasks.

Speaker 3: So here we are planning to do a lane change. In this case, the car needs to do two back-to-back lane changes to make the left turn up ahead. For this, the car searches over different maneuvers. [00:06:30] The first one it searches is a lane change that's close by, but the car brakes pretty harshly, so it's pretty uncomfortable. The next maneuver it tries is a lane change a bit late, so it speeds up, goes in front of the other cars, and finds the lane change, but now it risks missing the left turn.

Speaker 3: We do thousands of such searches in a very short time span. Because these are all physics-based models, these futures are very easy to simulate. In the end we [00:07:00] get a set of candidates, and we finally choose one based on the optimality conditions of safety, comfort, and easily making the turn. So now the car has chosen this path, and you can see that as the car executes this trajectory, it pretty much matches what we had planned. In the plot on the right side here, the cyan line is the actual velocity of the car, and the white line underneath it was the plan.
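The maneuver search just described can be sketched as: enumerate candidate lane-change timings, roll each one forward with a cheap physics model, score it on comfort and goal terms, and keep the cheapest. The braking model, cost weights, and turn deadline below are made-up stand-ins, not Tesla's planner:

```python
def simulate(change_at):
    """Toy rollout: changing lanes close by requires harsher braking."""
    decel_peak = max(0.0, 6.0 - 0.8 * change_at)   # m/s^2, illustrative model
    return decel_peak, change_at

def cost(decel_peak, change_at, turn_deadline=8.0):
    comfort = decel_peak ** 2                      # harsh braking hurts comfort
    missed_turn = 1000.0 if change_at > turn_deadline else 0.0
    return comfort + missed_turn

# The real system searches thousands of candidate futures; five suffice here.
candidates = [1.0, 3.0, 5.0, 7.0, 9.0]             # lane-change start time (s)
best = min(candidates, key=lambda c: cost(*simulate(c)))
print(best)  # 7.0 -- latest change that still comfortably makes the turn
```

This reproduces the trade-off from the talk: change too early and the rollout brakes harshly (high comfort cost); change too late and it misses the turn (large penalty); the chosen candidate sits in between.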
So we are able to plan for 10 seconds out and able to match that when we see it in hindsight, so this is a well-made plan. [00:07:30] A single car driving through some location can sweep out some patch around its trajectory using this technique, but we don't have to stop there.

Speaker 3: So here we collected different clips from the same location, from different cars maybe, and each of them sweeps out some part of the road. The cool thing is we can bring them all together into a single giant optimization. Here these 16 different trips are aligned using various features, such as road edges and lane [00:08:00] lines. All of them should agree with each other, and also agree with all of the image-space observations. Together, this is an effective way to label the road surface, not just where the car drove, but also in other locations that it hasn't driven yet. We don't have to stop at just the road surface. We can also reconstruct 3D static obstacles. Here, this is a reconstructed 3D point cloud from our cameras. The main innovation here is the density of the point cloud. Typically these points require texture [00:08:30] to form associations from one frame to the next, but here we are able to produce these points even on textureless surfaces like the road surface or walls. This is really useful to annotate arbitrary obstacles that we see in the world. Combining everything together, we can produce these amazing datasets that annotate all of the road texture, all the static objects, and all of the moving objects, even through occlusions, producing

Speaker 4: excellent kinematic labels. If we put all of it together, [00:09:00] we get our training-optimized chip, the D1 chip. This was entirely designed by the Tesla team internally, all the way from the architecture to GDS out and package.
This chip has GPU-level compute with CPU-level flexibility, and twice the network chip-level I/O bandwidth. But we didn't stop there. [00:09:30] We integrated the entire electrical, thermal, and mechanical pieces to form our training tile, fully integrated, interfacing with a 52-volt DC input. It's unprecedented. This is an amazing piece of engineering. Our compute plane is completely orthogonal to the power supply [00:10:00] and cooling, and that makes high-bandwidth compute planes possible. What it is, is a nine-petaflop training tile. This becomes our unit of scale for our system, and it's real.
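To put the "unit of scale" framing in perspective, a quick back-of-the-envelope calculation. Only the nine-petaflop-per-tile figure comes from the talk; the exaflop target is an arbitrary round number for illustration, not a stated system size:

```python
# Scaling from the stated 9-petaflop training tile to a hypothetical target.
tile_pflops = 9
target_pflops = 1000                                # one exaflop, for scale
tiles_needed = -(-target_pflops // tile_pflops)     # ceiling division
print(tiles_needed)  # 112 tiles would reach an exaflop at 9 PFLOPS each
```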
