Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Drop-0.5] Complete implementation for slowpath and CI based on the new architecture of opi-evpn-br #377

Merged
merged 16 commits into from
Apr 25, 2024

Conversation

mardim91
Copy link
Contributor

Introducing the complete implementation for CI and slowpath based on the new opi-evpn-br architecture that has been presented to opi-tsc in 15 Feb 2024 link

This PR includes:

  • Modular based implementation according to the architecture
  • Refactored GRPC server
  • Working CI without changing the allready upstream environment
  • Modified tests to be alligned with the new archicture
  • Slowpath common modules for every env (LGM, FRR)
  • CI module tailormade for CI env
  • Linux Vendor Module/ipu module tailormade for the Intel Mt.Evans IPU

This PR is versioned as drop-0.5 because it includes only the CI and slowpath implementation for an xPU.

The drop-1.0 that will come in a later PR will include also the fastpath implementation for an xPU

@mardim91 mardim91 requested a review from a team as a code owner March 25, 2024 14:48
Copy link

codecov bot commented Mar 25, 2024

Codecov Report

Attention: Patch coverage is 22.66635% with 3231 lines in your changes are missing coverage. Please review.

Project coverage is 24.93%. Comparing base (9001d2a) to head (9717c56).
Report is 1 commits behind head on main.

❗ Current head 9717c56 differs from pull request most recent head a9bae8e. Consider uploading reports for the commit a9bae8e to get more accurate results

Files Patch % Lines
pkg/infradb/infradb.go 25.88% 748 Missing and 68 partials ⚠️
pkg/LinuxGeneralModule/lgm.go 0.00% 702 Missing ⚠️
pkg/frr/frr.go 0.00% 336 Missing ⚠️
pkg/LinuxVendorModule/intele2000/intelE2000.go 0.00% 301 Missing ⚠️
pkg/LinuxCIModule/lci.go 0.00% 157 Missing ⚠️
pkg/utils/netlink.go 0.00% 92 Missing ⚠️
cmd/main.go 0.00% 79 Missing ⚠️
pkg/infradb/taskmanager/taskmanager.go 19.58% 78 Missing ⚠️
pkg/infradb/vrf.go 47.40% 69 Missing and 2 partials ⚠️
pkg/config/config.go 0.00% 66 Missing ⚠️
... and 20 more
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #377       +/-   ##
===========================================
- Coverage   58.92%   24.93%   -33.99%     
===========================================
  Files          30       44       +14     
  Lines        1816     5575     +3759     
===========================================
+ Hits         1070     1390      +320     
- Misses        650     4022     +3372     
- Partials       96      163       +67     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pkg/LinuxVendorModule/ipu/ipu.go Fixed Show fixed Hide fixed
pkg/LinuxGeneralModule/lgm.go Fixed Show fixed Hide fixed
@artek-koltun
Copy link
Contributor

Can we split this patch to manageable chunks? It is hard to review +6,528 −2,984
Would it make sense to introduce some preparation patches and then replace components one by one?

@mardim91
Copy link
Contributor Author

mardim91 commented Mar 27, 2024

Can we split this patch to manageable chunks? It is hard to review +6,528 −2,984 Would it make sense to introduce some preparation patches and then replace components one by one?

Hello @artek-koltun,

Unfortunately we cannot split this PR to multiple patches. It has been agreed on the TSC that because this is a rearchitecture effort of the opi-evpn-br the whole code will be developed in a fork and will be introduced as a single PR for review when significant milestone is reached. The milestone that we have reached is that we have a fully working CI env and support for IPU slowpth configuration

Co-authored-by: Vemula Venkatesh <[email protected]>
Co-authored-by: Patel Atul <[email protected]>
Co-authored-by: Saikumar Banoth <[email protected]>
Co-authored-by: Jambekar Vishakha <[email protected]>
Signed-off-by: Dimitrios Markou <[email protected]>
@mardim91
Copy link
Contributor Author

mardim91 commented Apr 9, 2024

Hello @sandersms @artek-koltun Can you please give us some reviews on this PR ?

Thank you in advance

@glimchb
Copy link
Member

glimchb commented Apr 9, 2024

Can we split this patch to manageable chunks? It is hard to review +6,528 −2,984 Would it make sense to introduce some preparation patches and then replace components one by one?

I completely agree with @artek-koltun
nobody can meaningfully review +6,528 −2,984 changes
and rubber stamp external contribution is not something we do in OPI

@mardim91
Copy link
Contributor Author

Can we split this patch to manageable chunks? It is hard to review +6,528 −2,984 Would it make sense to introduce some preparation patches and then replace components one by one?

I completely agree with @artek-koltun nobody can meaningfully review +6,528 −2,984 changes and rubber stamp external contribution is not something we do in OPI

Hello @glimchb,

In the TSC call in 14th of Dec 2023 has been agreed that because this is a rearchitecture effort we will start with a big PR when a significant Milestone (this PR here) has been reached and when this is merged more smaller PRs will follow with additional code changes.

TSC has agreed to that plan so we have followed this plan. So unfortunately I cannot break this PR to smaller ones.

Here you can find the recording with the plan explanation (00:36:20) and the decision (00:43:50)

https://zoom.us/rec/share/6MHNDuCm-7k6pwtpM6mYV-W2oJjQhTCS-d6199daX2i-I1T6m6s-W6e1MNiLIVNk.anxwm6C81S3uKX0H

Copy link
Contributor

@artek-koltun artek-koltun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments

Makefile Outdated
@@ -13,6 +13,10 @@ build:
@echo " > Building binaries..."
@CGO_ENABLED=0 go build -o ${PROJECTNAME} ./cmd/...

build-arm:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's unify with godpu
opiproject/godpu@6e3f317

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ipu_linux.Init()
frr.Init()

case "ci":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe in readme what those modes used for?
It would be nice to describe there also the main components, how they interact with each other etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @artek-koltun I will take a look on that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @artek-koltun I will add a readme in the docs folder the where I will describe the architecture and also how the different modes are used. I will also add some flow digramms there. I hope that will do the work. The reason that I add that in the docs folder and not in the main Readme is because I do not want to pollute that with too much information.

Hope that is alright

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @artek-koltun I have added the markdowns with the architecture and flow chart examples in the docs folder. You can take a look if you want.

cmd/main.go Outdated
var redisAddress string
flag.StringVar(&redisAddress, "redis_addr", "127.0.0.1:6379", "Redis address in ip_address:port format")
switch config.GlobalConfig.Buildenv {
case "ipu":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPU is only Intel specific term. If I am not mistaken, it was agreed to use more widespread dpu

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is generic enough and can be used for different IPU/dpus that you have underneath and as well for the CI env where you have no IPU/DPU. Here the case : ipu is when you have an IPU smartnic underneath. Even though that the code is generic some modules need to be provided by the vendor to address the specific smartnic that you have underneath so with this case: ipu we capture the case and we load some of the specific modules that are realted to the IPU card. If for instance you had a dpu bluefield as a card the vendor will need to provide some specific modules for that card and then we will create a case dpu bluefield for that specific card where the ppropriate modules will be loaded

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. I mean, let's change the name and use more widespread terminology, like "dpu"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is clear. According to the card that you have underneath you need to load certain modules. If you have an intel ipu for example you need to load the ipu_linux module. If you have a Marvell card you need to load the Marvel_linux. If you have a Bluefield you need to load a bluefield_linux module. So this is what this case does it loads the correct module related to the card that you have. So for intel IPU we have an ipu_linux module that we need to load. We cannot just change that to dpu because dpu can be Marvell or Bluefield. BTW by ipu here we are talking only for an Intel IPU. So the IPU case is pretty unique but the dpu case it isn't because you can have a Marvell dpu or a Bluefield dpu.

Is that more clear now ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've heard that the term ipu is also used to other dpus/xpus from different vendors within Intel.
Then let's be more specific and name it ipu_mev/mev since the description says Mount Evans or ipu_intel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes ok let me think about this a bit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @artek-koltun what do you think about intel_e2000? E2000 is the part name for a mount evans. So this means that all the vendor related files in the code will have the same name.

If you agree with this change we can start do the changes on the naming.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intel_e2000 sounds good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @atulpatel261194 Can you please update where is needed the name ipu with the name intel_e2000 ? Not only updating the config but also the golang filenames.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

cmd/main.go Outdated
options.Codec = utils.ProtoCodec{}
store, err := redis.NewClient(options)
rootCmd.PersistentFlags().StringVarP(&config.GlobalConfig.CfgFile, "config", "c", "config.yaml", "config file path")
rootCmd.PersistentFlags().IntVar(&config.GlobalConfig.GRPCPort, "grpcport", 50151, "The gRPC server port")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uint16Var for this port and below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

cmd/main.go Outdated
rootCmd.PersistentFlags().IntVar(&config.GlobalConfig.HTTPPort, "httpport", 8082, "The HTTP server port")
rootCmd.PersistentFlags().StringVar(&config.GlobalConfig.TLSFiles, "tlsfiles", "", "TLS files in server_cert:server_key:ca_cert format.")
rootCmd.PersistentFlags().StringVar(&config.GlobalConfig.DBAddress, "dbaddress", "127.0.0.1:6379", "db address in ip_address:port format")
rootCmd.PersistentFlags().StringVar(&config.GlobalConfig.FRRAddress, "frraddress", "127.0.0.1", "Frr address in ip_address format, no port")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPVar for ip addresses?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After carefully considering your suggestion and evaluating our current implementation,currently sticking with using a string for IP addresses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and what is the justification?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently FRRAddress isn't getting used, removed it as of now.

Dockerfile Outdated
mkdir -p /etc/iproute2/ && \
echo "255 opi_evpn_br" > /etc/iproute2/rt_protos && \
cat /etc/iproute2/rt_protos && \
ls -al /
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are those cat and ls needed only for debug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I believe this is for debug. We can remove those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

cmd/main.go Outdated
viper.SetConfigType("yaml")
viper.SetConfigName("config.yaml")
}
config.LoadConfig()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if LoadConfig failed? Should it be called only when a config file found? Should we return an error and print a message here if a config is malformed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added error log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it enough? Should we return an error to the top level an exit, since if a config is malformed, the user might get the unexpected results?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added error handling before unmarshal and also added panic logs

cmd/main.go Outdated
go runGatewayServer(grpcPort, httpPort)
runGrpcServer(grpcPort, tlsFiles, frrAddress, store)
// validateConfigs validates the config parameters
func validateConfigs() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a dedicated configs pkg, should we have config related things there?

also I see that we have only one rootCmd. Do we need a cobra then? To read config file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config validation is still in main...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe that is pre run for the cmd line argumnent which are necessary for setup bring up.
validating config file content is not intended as configs may differ for different vendors.
and if any error occurs we have Fatal log for read and load of the config file.

Copy link
Contributor

@artek-koltun artek-koltun Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we have that validateConfigs in main, how the vendors can provide their customization?
Are we going the core part? Would it make sense to move only that part into configs pkg?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as suggested, moved validateConfigs to config pkg.

cmd/main.go Outdated
flag.IntVar(&httpPort, "http_port", 8082, "The HTTP server port")
err := infradb.NewInfraDB(config.GlobalConfig.DBAddress, config.GlobalConfig.Database)
if err != nil {
log.Println("error in creating db", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we proceed running here? Can it work without DB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same as for validateConfigs probably a new patch has not been uploaded yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added panic log if db creation fails.

cmd/main.go Outdated
flag.StringVar(&frrAddress, "frr_addr", "127.0.0.1", "Frr address in ip_address format, no port")
// Create GRD VRF configuration during startup
if err := createGrdVrf(); err != nil {
log.Printf("Error in creating GRD VRF %+v\n", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it function if grd vrf is not created?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it cannot function because the GRD represents the underlay configuration which will be needed in the next drop [drop-1.0] where we will provide the fastpath modules which are responsible for offloading the slowpath configuration to the fastpath. So in order to configure correctly the fastpath for the underlay the GRD vrf is needed to be there so the faspath modules can find the correct inofrmation and offload the rules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so let's return an error from this function. In the current state, it will continue executing, which is not coorect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ok we wil take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the error log and exit call if grd vrf is not created

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Makefile Outdated
@@ -13,6 +13,10 @@ build:
@echo " > Building binaries..."
@CGO_ENABLED=0 go build -o ${PROJECTNAME} ./cmd/...

build-arm:
@echo " > Building binaries..."
@CGO_ENABLED=0 env GOOS=linux GOARCH=arm64 go build -o ${PROJECTNAME} ./cmd/...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @artek-koltun indicated to unify with godpu and you can also have GOOS=$(uname) to default to the local OS. If a different OS is needed such as cross-compile, then

GOOS ?= $(shell go env GOOS) # detect the os target from the environment variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

cmd/main.go Outdated
viper.SetConfigType("yaml")
viper.SetConfigName("config.yaml")
}
config.LoadConfig()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the LoadConfig validate the file information that it is reading in case of errors in the information? Should that be part of ValidateConfigs operation below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 will reply on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added error handling.

@@ -293,8 +293,6 @@ services:
/entrypoint.sh ls localhost:50151 opi_api.network.evpn_gw.v1alpha1.BridgePortService.UpdateBridgePort -l && \
echo toremove && \
apt update && apt install iproute2 -y && \
ip link add br-tenant type bridge vlan_default_pvid 0 vlan_filtering 1 vlan_protocol 802.1Q && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being removed since the configuration will be done through the API? Will this have implications on the CI testing and automation that need to be handled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason that we remove that is because the br-tenant is created now when the opi-evpn-br starts. That means that there is no need to create the br-tenant beforehand as docker compose command. It is not created through API but is created with linux netlink command when the Linux General Module (lgm) starts. There will be no implecations on the CI testing and automation.

func setUpTenantBridge() {

@sandersms
Copy link
Contributor

There is a ipu.go module that is under the LinuxVendorModule and the code looks to be relatively generic with the use of netlink for the controls for the network settings. Can this be considered more generic? Are there specifics that are related to the Intel IPU that I am missing here?

Access = iota
// Trunk bridge port type
Trunk = iota
)
Copy link
Contributor

@sandersms sandersms Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to use the definitions for the Bridge Port Type from the VLAN protobuf that is in the review in the network area in the future to keep things common. Also that various status indicators. Eventually we can import that protobuf for the definitions that overlap.

Copy link
Contributor Author

@mardim91 mardim91 Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we use the pb types. What you see here is the internal translation that happens from a PB object to an internal infradb domain object. So the protobufs types internally are translated to an infradb domain object which is more ruch and we can capture on that internal information that the user doesn't want to see but it is crucial fro the system to work. So we have chosen to do an internal translation to the BP type from the PB types to the internal infradb object. This way we will not need to impor the Protobufs everywhere but ony in the place where the actual transation happens. This transation from pb types ton the internal domain types (infradb) and vise versa is an architecture/design choice

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may just want to track this for refactoring/optimization with the PB schemes for the future as an item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't get that fully. Can you please explain what do you mean ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the new PB types that will cover that type of operation (common protobufs), we can open an issue to track the future change that would handle this from a common perspective as well as any definitions in order to simplify the definitions and not have them spread and duplicated in various locations. Nothing to do for now until the common protobufs are reviewed/merged into the repo. Once they are merged, then we open an issue to track migration to a common definition.

@mardim91
Copy link
Contributor Author

There is a ipu.go module that is under the LinuxVendorModule and the code looks to be relatively generic with the use of netlink for the controls for the network settings. Can this be considered more generic? Are there specifics that are related to the Intel IPU that I am missing here?

The reason that this module exist is because it captures the specific configuration that needs to be handled for each card specifics. So the LinuxVendoModules do exactly that. They take care the non generic part of the card. So if you have a Bluefiled card a coresponding module is needed to capture the Bluefiled specifics.

// NewTaskQueue initializes a TaskQueue
func NewTaskQueue() *TaskQueue {
return &TaskQueue{
channel: make(chan *Task, 200),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 200?

Also, I see that it is used in a very strange way...

go t.taskQueue.Enqueue(task)

If a component cannot keep up with the requests, it should block the producing threads, rather than allowing them to generate billions of goroutines, which will never be handled due to upcoming Out Of Memory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 200 number was a bit arbitrary in the beginning but afterwards has been tested and works without any problem for the amount of the objects that we expect to be created.

Also the number of the objects that we expect for the system to serve is not so many. In an EVPN system you do not generate so many VRFs, LBs, SVIs and BPs in parallel and in big numbers. Normally is a handfull of those. So we expect the current system to work really well in most of the cases and no OOM event to be produced. We have also ran some basic tests by creating 100 VRFs one after the other immediately and the system behaved correctly without any problem. The 100 VRFs are allready a lot. We do not expect to have in a real system more than 100. So I believe we are safe for now. If we face any problem in the future we can revisit this topic I believe. We have plans to make it multi-queue and parallel processing of the queues if we find any bottlenecks and is necessary. Of course if you have suggestions we a happy to examine them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still do not understand why we cannot block if there is no space in the channel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume that there is no space in the channel and the buffer is full with all the 200 slots of the queue occupied which is highly unlikely but let's assume that. If there is more requests comming that means that we will have some outstanding go routines that just wait to find space in the buffer to enqueue their tasks. So even if that is the case the number of those go routines will not be too many. So what exactly you mean to block ? What exactly should we block ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if no space, we block and wait while we have a free space in the channel -> a client cannot provide more requests and flood us with too many of them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artek-koltun Yes ok that makes sense. We can take a look on that but I was thinking as this is not a major problem right now we can probably open an issue on that and take a look in a later time so we can move a bit faster. WDYT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artek-koltun I will create an issue after merging this PR as without having this code in the tree the issue doesn't have any value

// translation of pb to domain object
domainLB := infradb.NewLogicalBridge(lb)
// Note: The status of the object will be generated in infraDB operation not here
if err := infradb.CreateLB(domainLB); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, CreateLB will send a msg to task manager, which will find someone to really create a bridge. But why do we return here ok and subsequently from the server's CreateLogicalBridge. Should we wait when it is really created? What if creation fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here by createLB you create a LogicalBridge intent in the DB. The actual intent is getting realized asychronously by the different components that put the configuration in place.

It is the same paradigm as in K8s. In K8s you create an intent in the database and then that intent get's realized by the different controllers. It is the same thing here.

If the intent is created in the DB then the call is succesful. We should not wait until the intent get's realized. If an error is occured during the realization of the intent then that is reported asychronously in the status of the object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an error is occured during the realization of the intent then that is reported asychronously in the status of the object
How will a client know that? By trying to get an object until it is returned successfully?

Also, our API does not say anything about intents. If a bridge returns ok, it implies that an object was created and functional.
If we introduce a concept of intents, it should be clearly stated in API. But I am not sure if it is a good idea to add intents 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should state something in the API. An API is an API how do we choose to implement the backend that supports that API is implementation specific to my view. In our case when the bridge for instane returns ok it means that the object has been created in the DB and the realization of the intent is ongoing. We do have Pending states in the API today that demostrates that functionality

https://github.com/opiproject/opi-api/blob/main/network/evpn-gw/component.proto#L37

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we have API update to indicate that pending state, ok


subscribers := eventbus.EBus.GetSubscribers("logical-bridge")
if subscribers == nil {
log.Println("CreateLB(): No subscribers for Logical Bridge objects")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error since no subscribers?

Copy link
Contributor Author

@mardim91 mardim91 Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes Maybe we should return an error if we do not have any subscribers. I will discuss this internally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @atulpatel261194 will take care of this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

vpns := make(map[uint32]bool)

subscribers := eventbus.EBus.GetSubscribers("logical-bridge")
if subscribers == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if GetSubscribers returns not nil, but an empty array? Will it be better to use len(subscribers) == 0?

Copy link
Contributor Author

@mardim91 mardim91 Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Inbanoth @atulpatel261194 can you please reply on this ? Is the nil check enough or we should use the length ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. len works with nil slices and with empty slices
  2. comparison with nil works only with nil slices.

I think option 1 is just safer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artek-koltun what you said is right.
But here, we always expect slice to be non-empty unless there is no key (ex: "logical-bridge") exist. and if there is no key exist then it return the zero value for the slice type, which is nil for slices. In such a case, both the nil check and the length check will return true.

So we can use either of the comparisons, also i verified both works for us.

i think we can keep nil comparison as it is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone update the code and by accident return an empty slice, the code will work differently, and it can take time to identify the root cause. I think it is better to just use len from the very beginning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Inbanoth will do the update

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

lbs := make(map[string]bool)
_, err = infradb.client.Get("lbs", &lbs)
if err != nil {
log.Println(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to remove fall back and remove set vpns

	err = infradb.client.Set("vpns", &vpns)
		if err != nil {
			log.Println(err)
			return err
		}

lb and other objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we do not believe we should fall back in the operations with the DB. If there is a need for this we can discuss it in the future. We can live with this right now. I believe if we see any need we can revisit in the future. I would say let's not be too picky with the code right now. Let's make sure that the basic functionality is working currently and we can make more picky changes and enhancements in additional PRs.

WDYT ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then at least an issue should be created after merging this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can create an issue just to register it as point for future discussions if that is needed just so we do not forget.

cmd/main.go Outdated
var redisAddress string
flag.StringVar(&redisAddress, "redis_addr", "127.0.0.1:6379", "Redis address in ip_address:port format")
switch config.GlobalConfig.Buildenv {
case "ipu":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've heard that the term ipu is also used to other dpus/xpus from different vendors within Intel.
Then let's be more specific and name it ipu_mev/mev since the description says Mount Evans or ipu_intel?

return nil, err
}

log.Printf("UpdateBridgePort(): Bridge Port with id %v is not found so it will be created", in.BridgePort.Name)
Copy link
Contributor

@sandersms sandersms Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Update really do an auto-create if the port is not found or should it just return not found? The update may not have all the parameters needed since it typically has a update mask of parameters being changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends. According to the google AIP guidelines if we have the AllowMissing to true then it will create the object otherwise it will return an error. Check line 103.

Also the createBridgePort function internally validates also the received Spec so if a parameter is missing that needs to be there it will return error.

The Update functions are not yet tested. We have only the implementation there but is untested for the update case. We will test that in the future but right now is not first priority to have a fully working update. We have tested the Create, Delete, Get and List but not the update yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so, if the check in line 103 doesn't execute it falls through and then calls the "createBridgePort (function in line 27) and handles the processing there based on the parameters passed in and doesn't worry about the update mask settings.

My concern is that if only the update mask parameters are provided in the update request, then somewhere the defaults need to be set for the interface being created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in order to allow the creation of an object using the field mask we first validate that the filed mask includes all the mandatory parameters. We do not really have any defaults. If the mandatory parameters are not set then we will return an error to the user.

But the update call is an untested call so far. I would like to test that in the future and revisit the matter. However I think that we are pretty safe with the current implementation that we have now

return nil, err
}

log.Printf("UpdateSvi(): Svi with id %v is not found so it will be created", in.Svi.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Update really do an auto-create if the port is not found or should it just return not found? (see port/grpc.go comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered on that above

return nil, err
}

log.Printf("UpdateVrf(): Vrf with id %v is not found so it will be created", in.Vrf.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Update really do an auto-create if the port is not found or should it just return not found? (see port/grpc.go comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered on that above

@sandersms
Copy link
Contributor

There is a ipu.go module that is under the LinuxVendorModule and the code looks to be relatively generic with the use of netlink for the controls for the network settings. Can this be considered more generic? Are there specifics that are related to the Intel IPU that I am missing here?

The reason that this module exist is because it captures the specific configuration that needs to be handled for each card specifics. So the LinuxVendoModules do exactly that. They take care the non generic part of the card. So if you have a Bluefiled card a coresponding module is needed to capture the Bluefiled specifics.

It makes sense; but, the current file doesn't seem to have any IPU specifics in the file, instead all of the calls are netlink commands. That seems to be more generic in usage unless I am missing something.

@mardim91
Copy link
Contributor Author

There is a ipu.go module that is under the LinuxVendorModule and the code looks to be relatively generic with the use of netlink for the controls for the network settings. Can this be considered more generic? Are there specifics that are related to the Intel IPU that I am missing here?

The reason that this module exist is because it captures the specific configuration that needs to be handled for each card specifics. So the LinuxVendoModules do exactly that. They take care the non generic part of the card. So if you have a Bluefiled card a coresponding module is needed to capture the Bluefiled specifics.

It makes sense; but, the current file doesn't seem to have any IPU specifics in the file, instead all of the calls are netlink commands. That seems to be more generic in usage unless I am missing something.

Yes there are netlink commands but the names of the interfaces that are passed as parameters and the way that those interfaces are handled are IPU specific.

For instance here the way that we handle these mux interfaces as we call them is something that is IPU specific and is not the same as DPU Bluefield for instance where they use representor interfaces.

log.Printf("LVM : VlanID %v value passed in Logical Bridge create is greater than 16 bit value\n", BrObj.Spec.VlanID)
return false
}
//TODO: Update opi-api to change vlanid to uint16 in LogiclaBridge
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsigned integers are not recommended by AIP, since not all languages have a good support for them https://linter.aip.dev/141/forbidden-types

Copy link
Contributor Author

@mardim91 mardim91 Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 Can you please change the comment there from uint16 to int16 and put also this link in the comment: https://linter.aip.dev/141/forbidden-types

We can create an issue to update also the other fields of the API to use signed integers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,31 @@
grpcport: 50151
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the file name: config-intele2000.yaml -> config-intel2000.yaml

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

@mardim91 mardim91 Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name of the chip is intel-e2000 no intel-2000 so this extra "e" is crucial. Maybe the name of the file can be config-intel-e2000.yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config-intel-e2000.yaml will be fine

Makefile Outdated
compile: get build

build:
@echo " > Building binaries..."
@CGO_ENABLED=0 go build -o ${PROJECTNAME} ./cmd/...
@CGO_ENABLED=0 GOARCH=$(GOARCH) go build -o ${PROJECTNAME} ./cmd/...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark also proposed to add GOOS #377 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// build time check that struct implements interface
var _ EvpnObject[*pb.LogicalBridge] = (*LogicalBridge)(nil)

// NewLogicalBridge creates new Logica Bridge object from protobuf message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LogicaL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


subscribers := eventbus.EBus.GetSubscribers("logical-bridge")
if len(subscribers) == 0 {
log.Println("NewLogicalBridge(): No subscribers for Logical Bridge objects")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot understand if missing subscribers is an error here... If it is an error, we need to return here. If it is not, do we need to print, since it is not an error?
Here, we treat it as an error https://github.com/opiproject/opi-evpn-bridge/pull/377/files#r1562602607

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I guess we missed that. @atulpatel261194 can you please return an error also in these cases here as we do in the other cases where we do not have subscribers ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mardim91 , @artek-koltun New functions for each object just returns the domain specific object. we handle the error in infra db if there is no subscribers while create, delete and update of perticular object.

Copy link
Contributor Author

@mardim91 mardim91 Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atulpatel261194 If we handle the no sunscribers as error we should return the error also in the NewObject functions. There is no point allowing zero subscribers in the NewObjects and then retunr the error later in the CreateLB function for example. Let's handle the zero subscribers as error for now everywhere

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, fixed

Copy link
Contributor

@sandersms sandersms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good with the comments addressed and associated issues opened.

@sandersms sandersms merged commit d33fbef into opiproject:main Apr 25, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants